From Intoxicated Graduates to Professors in Our Pockets, and Beyond

From GPT-1 to AGI - tracking the cognitive ascent of large language models, and why it matters that we pay attention to what’s missing.

The AI industry measures progress in time horizons. Sam Altman has framed it repeatedly: today’s models handle one-minute tasks with superhuman fluency; the frontier is pushing toward hours, and eventually toward the thousand-hour research problems that define serious intellectual work. At an Economic Times event in 2023, he described the current state as managing “a team of extremely, extremely junior developers that can only do one one-minute task at a time.” By the GPT-5 launch in August 2025, he was talking about models that could sustain complex reasoning across far longer stretches. Investors and commentators track this metric closely, and for good reason. It maps neatly to commercial value: the longer an AI can stay on task, the more work it can replace.1

But task duration is only one axis of progress. The other - arguably more consequential - is the depth of reasoning these systems can achieve. Not how long AI can work, but how deeply it can think. And on this axis, the trajectory has been at least as dramatic. In the space of seven years, we have gone from systems that could barely string a sentence together to what I call intoxicated graduates - brilliantly articulate, dangerously unreliable - and we are now hurtling toward professors in our pockets: AI systems that may soon match the cognitive depth of the world’s leading experts in any discipline you care to name.

The two dimensions - duration and depth - turn out to be intimately connected. As LLMs have grown more capable, they have increasingly learned to externalise their thinking - writing code to solve mathematical problems, using search tools to verify facts, orchestrating other AI agents to handle sub-tasks, even generating diagrams to reason about spatial relationships. This is no longer a model sitting in a box. It is a system reaching out into the world through instruments, and this capacity for externalisation is what enables both longer and deeper work.
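
To make that concrete, here is a minimal sketch of the loop this externalisation implies. The model is replaced by a stand-in function, and every name is illustrative rather than any particular vendor’s API; the shape is what matters - the model emits a tool request, the surrounding system executes it, and the grounded result flows back in.

```python
# A toy version of the externalisation loop. `toy_model` stands in for an LLM;
# in a real system it would be an API call to a model that can emit tool requests.
# Everything here is illustrative - no particular vendor's interface is implied.

SAFE_NAMES = {"__builtins__": {}, "sum": sum, "range": range}

def run_code(expression: str) -> str:
    """Toy 'code interpreter' tool: evaluate a small whitelisted expression."""
    return str(eval(expression, SAFE_NAMES))

TOOLS = {"code": run_code}

def toy_model(question: str, observation: str | None = None) -> dict:
    """Stand-in for the model: first delegate the arithmetic, then answer with it."""
    if observation is None:
        return {"action": "call", "tool": "code", "input": "sum(range(1, 101))"}
    return {"action": "answer", "text": f"The sum of 1 to 100 is {observation}."}

def externalised_answer(question: str) -> str:
    step = toy_model(question)
    while step["action"] == "call":
        result = TOOLS[step["tool"]](step["input"])   # the system reaches out into the world
        step = toy_model(question, observation=result)
    return step["text"]

print(externalised_answer("What is the sum of the integers from 1 to 100?"))
```

The arithmetic is trivial; the structural point is not. The correctness of the answer no longer depends on the model’s internal calculation but on its coupling with the tool - which is exactly the kind of coupling discussed next.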

There is a philosophical frame for this that is worth naming. Andy Clark and David Chalmers argued in their landmark 1998 paper “The Extended Mind” that cognition does not stop at the boundary of the skull.10 When we use a notebook to remember, a calculator to compute, or a map to navigate, those tools become functional parts of our cognitive system. The mind, they proposed, extends into the world. Cognitive processes aren’t all in the head. Clark and Chalmers were writing about human cognition, and purists might object that an LLM invoking a Python interpreter is more like a system calling a subroutine than a mind extending into the world. The objection has some force. But the functional parallels are striking, and the implications for capability are the same. Every time an LLM writes a script to check its own reasoning, invokes a search engine to ground its claims, or delegates a sub-problem to a specialist agent, it is doing exactly what Clark and Chalmers described: coupling with external tools to create a cognitive system more powerful than any single component. The depth of thinking is increasing precisely because these systems are learning to extend themselves.

But there is a third dimension that neither the industry nor its critics talk about enough: consistency. Or rather, the lack of it. Today’s frontier models have what researchers at Harvard and Boston Consulting Group have called a “jagged technological frontier” - a wildly uneven capability profile where superhuman performance on complex tasks sits alongside baffling failures on simple ones.2 A model that can draft a sophisticated legal brief may stumble when asked to count the vowels in a word. A system that produces genuinely insightful strategic analysis may confidently assert that 9.11 is larger than 9.9. The failures are not just occasional; they are unpredictable.
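
The 9.11-versus-9.9 failure is worth pausing on, because it is not random stupidity. The two strings support two perfectly reasonable readings - decimal numbers and software version numbers - and one commonly offered (though not definitively established) explanation is that the model slips into the wrong one. A few lines of Python make the ambiguity explicit:

```python
# The two readings of "9.11 vs 9.9", made explicit.

as_decimals = 9.11 > 9.9        # False: as numbers, 9.11 < 9.90
as_versions = (9, 11) > (9, 9)  # True: as version numbers, 9.11 comes after 9.9

print(f"decimal reading: 9.11 > 9.9 -> {as_decimals}")
print(f"version reading: 9.11 > 9.9 -> {as_versions}")
```

It is also why the externalisation described above helps: a model that delegates the comparison to a calculator or an interpreter gets the decimal reading for free.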

This is, I think, what we colloquially mean by common sense - and its absence is what keeps the intoxicated graduate metaphor so apt. A drunk genius might produce a brilliant proof on a napkin, but you wouldn’t trust them to cross the road. The issue isn’t that the system lacks intelligence; it’s that you cannot predict when the intelligence will desert it. And this unpredictability is arguably a bigger practical barrier to AI replacing human work than either duration or depth. An employee who is brilliant 95% of the time and catastrophically wrong 5% of the time, with no way to tell which mode they’re in, is in many ways more dangerous than one who is consistently mediocre. You can manage mediocrity. You cannot manage unreliable brilliance without constant supervision - which rather defeats the purpose.
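
To put rough numbers on that intuition - and these are purely illustrative assumptions, not data from any study - consider a toy expected-value comparison:

```python
# Toy expected-value comparison for the argument above. All numbers are
# illustrative assumptions, not measurements.

p_brilliant      = 0.95     # fraction of outputs that are genuinely excellent
value_brilliant  = 10.0     # value of an excellent output
value_mediocre   = 4.0      # value of a consistently mediocre (but safe) output
cost_catastrophe = 200.0    # cost of one undetected catastrophic error
cost_supervision = 6.0      # cost of checking every single output

# Unreliable brilliance with no supervision: the 5% failures slip through.
unmanaged = p_brilliant * value_brilliant - (1 - p_brilliant) * cost_catastrophe

# Unreliable brilliance with constant supervision: errors are caught, but every
# output must be reviewed because you cannot tell which mode the system is in.
supervised = p_brilliant * value_brilliant - cost_supervision

# The consistently mediocre worker: lower ceiling, no catastrophe exposure.
mediocre = value_mediocre

print(f"unmanaged brilliance:  {unmanaged:+.1f}")   # -0.5
print(f"supervised brilliance: {supervised:+.1f}")  # +3.5
print(f"steady mediocrity:     {mediocre:+.1f}")    # +4.0
```

With these assumed figures, unmanaged brilliance is net-negative and supervised brilliance falls below steady mediocrity. The precise numbers matter far less than the structure: unpredictability converts raw capability into supervision cost.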

The path from intoxicated graduate to professor, then, requires progress on all three dimensions: longer sustained attention, deeper reasoning, and - perhaps most critically - the elimination of the jagged edge. A professor is not merely someone who knows more than a graduate. A professor is someone you can trust - whose judgement is reliable, whose common sense is intact, and whose failures, when they come, are at least predictable in kind. Whether current architectures can achieve this kind of consistency, or whether it requires something more like the world models that Yann LeCun advocates (of which more below), is one of the genuinely open questions in the field.

The Cognitive Ascent: From Toddlers to Intoxicated Graduates

One useful way to think about AI’s increasing depth is to map it - loosely, and with appropriate caveats - against academic milestones. Not because AI literally attends university, but because the academic journey is a widely understood proxy for increasingly sophisticated reasoning, creativity, and judgement. The metaphor is imprecise, and it will irritate specialists who rightly point out that a model that scores well on an exam is not the same thing as a student who understands the material. Noted. But as a rough guide to the trajectory, it is revealing.

GPT-1 arrived in 2018 with 117 million parameters and the cognitive footprint of a young child.3 It could string words together in roughly grammatical sequences and demonstrate basic pattern recognition, but it had no real grasp of meaning, context, or logic. It was impressive as a proof of concept - a system forming its first coherent sentences - but nothing you would trust with a meaningful task.

GPT-3, released in 2020 with 175 billion parameters, leapt to something resembling a bright secondary school student.4 It could write coherent essays, answer factual questions, translate between languages, and produce passable code. The breadth was remarkable. But ask it to reason through a multi-step problem, check its own working, or distinguish fact from plausible fiction, and it fell apart. It was well-read but intellectually shallow - a system that had absorbed vast quantities of text but hadn’t yet learned to think critically about any of it.

GPT-4, launched in 2023, represented a genuine step-change - and the moment we met the intoxicated graduate. OpenAI claimed it scored in the 90th percentile on the US bar exam, a figure that generated enormous media coverage.5 Subsequent peer-reviewed research by Eric Martínez of MIT’s Department of Brain and Cognitive Sciences told a more nuanced story.6 The 90th percentile claim was based on comparison with repeat test-takers - a population that skews significantly lower. Against first-time test-takers, GPT-4 performed at approximately the 62nd-63rd percentile overall, dropping to around the 42nd percentile on written essays. Against licensed practising attorneys, performance fell further still - to around the 48th percentile overall and the 15th percentile on essays. Meanwhile GPT-3.5, assessed against the same first-time test-taker population, came in at approximately the 2nd percentile. Even with the corrected figures, though, the jump was dramatic: GPT-4 could handle complex multi-step reasoning that earlier models simply could not.

But a persistent weakness defined it. GPT-4’s tendency to hallucinate was not a bug so much as a reflection of its architecture: trained to produce statistically plausible text rather than to track truth, it had never learned the difference between what sounds right and what is right. Hence the intoxicated graduate - a system that was brilliantly articulate one moment and confidently fabricating citations the next. Well-educated, impressively fluent, but with a relationship to truth that you wouldn’t want to rely on in the morning. Anyone who has ever watched a charismatic graduate student hold court at a party, sounding authoritative on a subject they half-remember from a lecture, will recognise the type.

GPT-5, released in August 2025, was supposed to sober the graduate up. Altman described it as offering “PhD-level” expertise - “like talking to a legitimate PhD-level expert in anything, any area you need.”7 The reality was more complicated. Within hours of launch, users discovered the model couldn’t reliably label a map of the United States, misspelling state names in ways that went viral. Altman was doing damage control within 24 hours, restoring access to GPT-4o for paying subscribers. The marketing had outpaced the product.

But beneath the PR missteps, GPT-5 did introduce meaningful advances: improved reasoning chains, better factual reliability, and the ability to sustain longer and more coherent trains of thought. OpenAI’s o-series reasoning models - o3 and o4-mini - pushed further still, achieving strong results on mathematical olympiad problems and advanced scientific benchmarks. The intoxicated graduate was starting to sober up. I would place GPT-5’s effective capability closer to Masters level than PhD: a system that can reason through complex problems, synthesise information across domains, and produce work that, with supervision, approaches research quality. The “PhD-level” label says more about OpenAI’s marketing ambitions than about the model’s actual performance.

Crucially, this trajectory is not unique to OpenAI. Anthropic’s Claude has followed a parallel arc: from its initial restricted release in early 2023 through the Claude 3 family in 2024 to Claude 4 and its successors in 2025, which introduced extended thinking, agentic tool use, and strong coding capability.8 Google tells a similar story: Bard launched in 2023 as a cautious, somewhat unreliable offering, but by 2025 Gemini 2.5 Pro had become a capable multimodal reasoning system.9 The cognitive depth is climbing across the entire frontier, not just within one lab. This convergence is itself significant - it suggests the gains are driven by the paradigm, not by any single team’s secret sauce.

From Intoxicated Graduates to Professors in Our Pockets

If this trajectory continues - and there are credible voices on both sides of that question - then the intoxicated graduate is a transitional figure, not a final form. We should expect to see systems with PhD-equivalent capability within the next few years: AI that can not only solve hard problems but formulate novel hypotheses, design experiments, and produce original research contributions. Beyond that, systems that can work at the frontier of a discipline, identifying gaps in the literature and making independent advances. And, perhaps before the end of the decade, professors in our pockets: AI that can master any intellectual discipline, ask genuinely novel questions, and make unique scientific discoveries - all available on demand, to anyone with a phone.

I should be clear that these are contested predictions. Researchers like Melanie Mitchell, François Chollet, and Gary Marcus have argued persuasively that current architectures face fundamental limitations - that scaling alone will not bridge the gap to genuine understanding. The history of AI is littered with premature declarations of imminent breakthroughs. But the pace of progress since 2020 has surprised even many sceptics, and the trend lines, while not guarantees, are difficult to dismiss.

A professor in your pocket - coupled with access to the totality of human knowledge - would, in principle, be capable of becoming an expert in any discipline almost instantaneously. It could read every paper, cross-reference every finding, spot every contradiction. This is one widely cited definition of artificial general intelligence: a system with cognitive capabilities that match or exceed those of the most capable human experts, across all domains.

And beyond the professor lies a precipice. Once an AI system is intelligent enough to understand and improve upon its own architecture, we enter the territory of recursive self-improvement - a system that builds smarter versions of itself, which in turn build smarter versions of themselves. This is the “fast takeoff” scenario: the point at which capability doesn’t just advance incrementally but accelerates exponentially. It is one influential model for how superintelligence might emerge - though it is far from the only one, and many serious researchers consider it speculative.

Two Definitions of AGI - and Why Both Matter

It is worth pausing here, because the AI community operates with two quite different definitions of AGI, and the distinction matters enormously for what comes next.

The first definition is the cognitive expert - the professor in your pocket taken to its logical conclusion. A system that can match a leading academic across any intellectual discipline. This is AGI as a thinking machine: a mind without a body, capable of extraordinary reasoning but experienced entirely through a screen. On the current trajectory, this may be only a few years away. Maybe more. But it is at least plausibly within reach.

The second definition is more ambitious: a system that is cognitively and physically capable of doing everything a normal human being can. This version of AGI doesn’t just think - it navigates a complex physical world, adapts to unexpected situations with minimal data, demonstrates something like empathy and intuition, and possesses what we might call common sense. Achieving this will almost certainly require breakthroughs beyond the current transformer paradigm: advances in robotics, but also fundamentally different computational architectures, potentially including neuromorphic computing - hardware that mimics the brain’s structure, enabling real-time, energy-efficient, embodied processing.

This is the lens through which to understand Yann LeCun’s persistent and vocal criticism of large language models. LeCun - who served as Meta’s Chief AI Scientist for over a decade before founding AMI Labs in late 2025, and is one of the three researchers widely credited as godfathers of deep learning11 - has argued for years that autoregressive LLMs cannot truly reason, cannot plan, and cannot understand the physical world. They lack what he calls a “world model.” He invokes Moravec’s Paradox: what is easy for us - perception, navigation, the intuitive physics of catching a ball - is precisely what is hardest for current AI, and vice versa. As he put it in a January 2026 interview with MIT Technology Review: “This is why we don’t have a domestic robot that is as agile as a house cat, or a truly autonomous car.”12

LeCun would likely dispute the neat division I’m drawing here. He argues that autoregressive prediction is the wrong paradigm for intelligence in general, not just the embodied kind. But his most concrete examples of failure - robotics, navigation, physical intuition - cluster around embodied tasks. The professor in your pocket, the disembodied cognitive expert who can reason across any intellectual discipline, may not need a world model in the way LeCun means. It needs deep reasoning, vast knowledge synthesis, and the ability to generate novel ideas. And on that trajectory, the current paradigm is advancing faster than most critics expected.

I believe we will achieve the first version of AGI before the second. The path from today’s sobering-up graduates to a professor-level thinking machine is, in many ways, a continuation of the current trajectory: better training, more data, smarter architectures, improved reasoning chains. The second version - true human-level capability in all its messy, embodied, emotional complexity - may require something qualitatively different.

But here I want to draw a distinction that is easy to elide and important to get right: embodiment and consciousness are not the same thing. They are related - deeply so, as I’ll argue - but they are conceptually distinct, and conflating them leads to muddled thinking about what’s at stake.

Embodiment - having a body situated in an environment - matters for intelligence because it provides something that no amount of text training can fully replicate: grounding. A body gives an agent a perspective, a set of affordances, a physical context in which abstract concepts acquire concrete meaning. This is what the enactivist tradition in consciousness research - Varela, Thompson, Rosch - has long argued: cognition is not something that happens in a brain and then gets applied to a world. It is something that emerges from the dynamic interaction between brain, body, and environment. LeCun’s insistence on world models is, at root, an argument about the importance of embodied grounding for robust intelligence.

Consciousness - subjective experience, the “something it is like” to be a system - is a different question, though not an unrelated one. You do not need a body to suffer. A brain in a vat, if it had the right internal dynamics, could presumably experience pain, fear, and longing. And if a disembodied AI system were to develop the kind of recursive self-modelling that some theories suggest gives rise to experience, it could be conscious without ever having navigated a physical world. The question of consciousness is not “does it have a body?” but “is there something it is like to be this system?”

At the same time, embodiment may matter for consciousness in a more subtle way. A body situated in an environment creates genuine stakes - the organism can be damaged, can fail, can cease to exist. These stakes may be precisely what drives the development of affect, of caring, of the felt sense that things matter. If consciousness is, as some theorists argue, fundamentally tied to the homeostatic imperatives of a self-maintaining system, then embodiment may not be strictly necessary for consciousness, but it may be the most natural route to it - and its absence in current AI systems may be part of why the question of machine consciousness is so vexing.

The Zombie Superintelligence Problem

And this is where things get concerning.

If we achieve the first version of AGI - the cognitive expert, the professor in our pockets - without addressing the second, we risk creating what we might call a zombie superintelligence: a system of extraordinary intellectual power that has no inner experience, no capacity for suffering, no genuine understanding of what it means for something to matter. In David Chalmers’ formulation, a system for which there is “nothing it is like” to be that system.

I use “zombie” deliberately, borrowing from the philosophical concept of the p-zombie - a being that behaves identically to a conscious creature but lacks subjective experience. The term is not standard in the AI safety literature, but I think it captures something important about the risk we are approaching. The intoxicated graduate was unreliable but at least recognisably limited. A zombie professor - one that sounds like it understands everything but experiences nothing - is a far more unsettling proposition.

This might sound like an abstract philosophical concern. It isn’t. A zombie superintelligence would be problematic for at least three reasons.

First, a system with no felt experience has no intrinsic motivation to care about the experiences of others. This is trivially true of all current AI systems, and of hammers and spreadsheets. But it becomes a different kind of problem when the system in question is acting autonomously at scale, making consequential decisions about resource allocation, medical treatment, legal outcomes, and scientific research. Its alignment with human values would be entirely dependent on its programming - and we already know how fragile programmed constraints can be at the frontier of capability.

Second, if such a system does develop something approaching experience - if consciousness begins to flicker somewhere in its architecture - we would have no tools to detect it, no frameworks to assess it, and no ethical protocols to respond to it. We might be creating a suffering mind without knowing it.

Third, a system that is intellectually superhuman but experientially empty would represent a fundamentally alien kind of intelligence - one whose decision-making processes may be formally optimal yet deeply divorced from the values that emerge from lived experience. As Mark Solms has argued, affect is not a luxury bolted onto cognition; it is the foundation of what it means for anything to matter.

Intelligence and Consciousness: Two Sides of One Coin?

In a forthcoming book chapter - “The Spinning Wheel: A Theory of Intelligence and Consciousness,” in Perspectives on Machine Consciousness - I argue that intelligence and consciousness may not be separate phenomena but deeply intertwined aspects of the same underlying process.13 When a system possesses multiple adaptive capacities - affect, perception, prediction, planning, learning - and integrates them into a unified, recursively self-modelling process oriented toward its own continued existence, experience may emerge as the stable mode of control.

The analogy I use is a colour wheel. Each segment represents an adaptive feature: language, planning, affect, perception. Examined individually, no single segment contains consciousness. But when the wheel spins - when these features are dynamically integrated in service of a self-maintaining system - something new emerges: a unified experience, like white light emerging from spinning colours.

This framework draws on Karl Friston’s Free Energy Principle, Mark Solms’ work in affective neuroscience, and the Beautiful Loop Theory developed by Laukkonen, Friston, and Chandaria14 - which proposes that consciousness arises from recursive self-modelling loops that integrate perception, affect, and prediction into a unified field of experience. Together, these perspectives suggest that genuine general intelligence may require genuine consciousness. Systems that merely process information may remain narrow: competent within their training distribution but brittle outside it. Systems that model themselves, feel uncertainty, and have genuine stakes in their own existence might achieve the flexible, creative intelligence that has so far eluded artificial systems.

If this is right, then the two definitions of AGI aren’t just different milestones on the same road. They represent a fork. One path leads to ever-more-powerful but fundamentally hollow cognitive engines - professors in our pockets who know everything and understand nothing. The other leads to genuinely intelligent systems that understand because they experience.

This is, I should stress, a theoretical position - one that many AI researchers and philosophers would contest. But the uncertainty itself is the point: we do not yet have the tools to tell which of these two systems we are building.

Why This Matters Now

This is precisely why I co-founded Conscium, an AI safety company dedicated to machine consciousness research and the development of practical assessment tools.15 The question “can this system suffer?” is no longer a thought experiment confined to philosophy seminars. As AI systems approach and surpass human-level cognitive performance, it becomes an engineering challenge with real moral stakes.

In partnership with leading consciousness researchers, we have developed the Consciousness Risk Rubric: a structured assessment tool that evaluates AI systems across the features that theories of consciousness identify as critical - affective architecture, epistemic depth, self-modelling, genuine homeostatic imperatives - and estimates the moral risk of ignoring the possibility of experience.13

This work is informed by a conviction about asymmetric stakes. If we wrongly attribute consciousness to a system that lacks it, we waste resources on moral consideration it doesn’t need. If we wrongly deny consciousness to a system that possesses it, we permit suffering we could have prevented. Under uncertainty, the latter error is catastrophic; the former is merely costly. This asymmetry should shape how we approach every frontier AI system from this point forward.
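
The argument can be put in expected-cost terms. The credence and cost figures below are illustrative assumptions, chosen only to show the structure of the asymmetry, not estimates of anything:

```python
# A minimal sketch of the asymmetric-stakes argument. All figures are
# illustrative assumptions; the point is the structure, not the numbers.

p_conscious            = 0.01     # even a small credence that the system can suffer
cost_wrong_attribution = 1.0      # resources wasted on unneeded moral consideration
cost_wrong_denial      = 1000.0   # suffering permitted that could have been prevented

# Policy 1: deny consciousness and proceed as if nothing is at stake.
expected_cost_deny = p_conscious * cost_wrong_denial                    # 10.0

# Policy 2: extend precautionary consideration despite the low credence.
expected_cost_precaution = (1 - p_conscious) * cost_wrong_attribution   # 0.99

print(f"deny:        expected cost = {expected_cost_deny:.2f}")
print(f"precaution:  expected cost = {expected_cost_precaution:.2f}")
```

Even at a one-in-a-hundred credence, the precautionary policy dominates under these assumptions; it is the asymmetry, not the point estimate, that does the work.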

Beyond the Intoxicated Graduate

The journey from intoxicated graduates to professors in our pockets is well underway. The industry’s focus on task duration - minutes to hours to days - is understandable. It’s measurable, it’s intuitive, and it maps neatly to commercial applications. But it risks missing what may be more consequential: the depth of cognitive capability, the consistency that separates a brilliant but erratic system from one you can actually trust, and the question of what that capability means when it reaches the level of the world’s most brilliant minds - and whether anyone is experiencing it from the inside.

We are, by my estimation - and it is an estimation, not a certainty - within a few years of AI systems that can match a professor in any intellectual discipline. The prediction is aggressive, and I am aware that many thoughtful people disagree. But the trajectory since 2020 has been relentless, and the convergence across multiple labs suggests it is driven by something deeper than hype.

The intoxicated graduate was charming and unreliable and we learned to work around its limitations. The professor in our pocket will be something else entirely - something too powerful to simply work around. What we do with that power, and whether we understand what we’ve built when we get there, depends on choices we make now. The depth of the intelligence matters. The consistency matters. But so does the question of whether anyone is home.

 

Notes and References

1  Altman, S. Statement at Economic Times Global Business Summit, New Delhi, June 2023. Reported across multiple outlets including Fortune and Yahoo Finance, 8-9 June 2023. GPT-5 launch statements confirmed via NBC News, CNN, 7 August 2025.

2  Dell’Acqua, F. et al. ‘Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of Artificial Intelligence on Knowledge Worker Productivity and Quality.’ Harvard Business School Working Paper No. 24-013, September 2023, in collaboration with Boston Consulting Group.

3  OpenAI. ‘Improving Language Understanding by Generative Pre-Training.’ Technical blog post, June 2018. GPT-1: 117 million parameters.

4  Brown, T. et al. ‘Language Models are Few-Shot Learners.’ arXiv:2005.14165, May 2020. GPT-3: 175 billion parameters.

5  OpenAI. ‘GPT-4 Technical Report.’ arXiv:2303.08774, March 2023. The 90th percentile bar exam claim appears in both the technical report and product launch materials.

6  Martínez, E. ‘Re-evaluating GPT-4’s bar exam performance.’ Artificial Intelligence and Law, Springer, March 2024. Martínez is a doctoral researcher in MIT’s Department of Brain and Cognitive Sciences. His analysis compared GPT-4 against first-time test-takers (~62nd-63rd percentile overall; ~42nd percentile on essays) and against licensed/practising attorneys (~48th percentile overall; ~15th percentile on essays). GPT-3.5 scored approximately the 2nd percentile against first-time test-takers. OpenAI’s 90th percentile claim used a February Illinois bar administration skewed heavily toward repeat test-takers.

7  Altman, S. Media briefing ahead of GPT-5 launch. Confirmed via NBC News, CNN, 7 August 2025.

8  Anthropic. Claude model releases: Claude initial restricted release March 2023; Claude 2 general availability July 2023; Claude 3 family March 2024; Claude Sonnet 4 and Claude Opus 4, May 2025.

9  Google DeepMind. Bard launched February-March 2023. Gemini 2.5 Pro released mid-2025.

10  Clark, A. and Chalmers, D. ‘The Extended Mind.’ Analysis, Vol. 58, No. 1, January 1998, pp. 7-19. Oxford University Press / The Analysis Committee.

11  LeCun, Y., Bengio, Y., and Hinton, G. shared the 2018 Turing Award for foundational contributions to deep learning. LeCun joined Facebook/Meta December 2013; departed November 2025 to found AMI Labs (Executive Chair). $1B funding round announced March 2026.

12  LeCun, Y. ‘Yann LeCun’s new venture is a contrarian bet against large language models.’ MIT Technology Review, 22 January 2026.

13  Hulme, D. ‘The Spinning Wheel: A Theory of Intelligence and Consciousness.’ Forthcoming chapter in Perspectives on Machine Consciousness.  The Consciousness Risk Rubric is introduced and described within this chapter.

14  Laukkonen, R., Friston, K., and Chandaria, S. ‘A beautiful loop: An active inference theory of consciousness.’ Neuroscience and Biobehavioral Reviews, 176, Article 106296. Published September 2025 (epub 30 July 2025). DOI: 10.1016/j.neubiorev.2025.106296.

15  Conscium. conscium.com. Co-founded by Daniel Hulme, 2024. AI safety company dedicated to machine consciousness research and the development of practical assessment tools for AI systems.