
What a Virtual Fish Tank Can Teach Us About the Future of AI Alignment (and what questions the AI safety field needs to be asking).
Introduction: The Fish Tank That Foretold the Future
In 2003, a 23-year-old MSc student at University College London built a fish tank. Not a real one - a virtual one: a 3D environment populated with digital creatures whose neural-network brains and physical bodies were encoded entirely in their genomes. These agents had no predefined intelligence, no rules telling them when to eat, flee, or reproduce. They were dropped into a dynamic underwater world of shifting currents, rising water levels, and scattered nutrients, and left to figure out survival on their own.
The results were remarkable. Agents evolved the ability to bite within a few generations - those without it simply died. They evolved the ability to reproduce despite reproduction being energetically devastating, because species that didn't reproduce vanished. Redundant capabilities like rolling - which wasted energy without benefit - were gradually pruned from the gene pool. Two distinct species emerged: surface-dwellers and bed-dwellers, each beautifully adapted to their ecological niche. On one occasion, a parasitic agent appeared that had evolved to ignore nutrients entirely and feed exclusively on a different species. The system exhibited emergent banding behaviour, host-parasite dynamics, and interspecies reproductive barriers - all without a single line of code telling agents how to behave.
That student was me. The system was called ALIS - the Artificial Life Intelligence Simulator - and it was my MSc thesis, supervised by Dr W.B. Langdon. Looking back from 2026, I believe ALIS contained the seeds of an insight that the AI safety community is only now beginning to fully appreciate: that the most robust form of alignment is not constraint but development. Not rules imposed from outside, but dispositions evolved from within.
This article argues that the dominant paradigm in AI alignment - training large language models on internet text and then bolting on guardrails to prevent misbehaviour - is structurally analogous to Asimov's Three Laws of Robotics, and will fail for the same reasons. I propose an alternative: that we should be evolving AI systems from world models to language, not the reverse, creating environments where moral behaviour is the precondition for survival and reproduction rather than a constraint layered on after the fact.
Part I: Control vs. Alignment - A Distinction That Matters
The AI safety field has crystallised around two related but fundamentally different problems, and conflating them is dangerous.
The control problem asks: how do we prevent an AI system from doing harmful things? This is about external containment - monitoring, restrictions, kill switches, sandboxing. It assumes the AI may be misaligned and asks how to limit the damage. Greenblatt et al.'s 2023 paper on AI control developed protocols using a weaker trusted model to monitor a more powerful untrusted model, demonstrating that safety can be maintained even when assuming the powerful model is actively attempting to subvert its constraints.1
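To make the shape of such a protocol concrete, here is a toy sketch - my own illustration in Python, not the paper's code - in which a weak trusted monitor scores every output from a strong untrusted model and routes only the most suspicious fraction to scarce human audit. The backdoor rate, score distributions, and audit budget are all invented for the example:

```python
# Toy sketch of a trusted-monitoring control protocol - my illustration,
# not the Greenblatt et al. code. A weak trusted monitor scores outputs from
# a strong untrusted model; only the most suspicious fraction is audited.
import random

random.seed(0)
AUDIT_BUDGET = 0.02  # assumption: humans can audit 2% of outputs

def untrusted_model(task_id):
    # Placeholder for the powerful, untrusted model: 1% of its outputs
    # contain a deliberately inserted backdoor.
    return {"task": task_id, "backdoored": random.random() < 0.01}

def trusted_monitor(output):
    # Placeholder for the weaker, trusted model: a noisy suspicion score
    # that is higher, on average, for backdoored outputs.
    base = 0.7 if output["backdoored"] else 0.2
    return min(1.0, max(0.0, random.gauss(base, 0.15)))

outputs = [untrusted_model(i) for i in range(10_000)]
scored = sorted(((trusted_monitor(o), o) for o in outputs), key=lambda t: t[0])
threshold = scored[int(len(scored) * (1 - AUDIT_BUDGET))][0]  # audit the top 2%

caught = sum(o["backdoored"] for s, o in scored if s >= threshold)
total = sum(o["backdoored"] for o in outputs)
print(f"backdoors caught by human audit: {caught} of {total}")
```

The point is economic as much as technical: trustworthy human oversight is the scarce resource, and the trusted model's job is to decide where to spend it.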
The alignment problem asks something deeper: how do we ensure an AI system's internal values genuinely match human values - so that it wants to do the right thing? Paul Christiano, who founded the Alignment Research Center and was appointed Head of AI Safety at the U.S. AI Safety Institute in April 2024, captured the distinction precisely: control means coping with the possibility that an AI has different preferences from its operator, while alignment means avoiding the preference divergence altogether.
I want to be direct about my own position: I believe that for sufficiently sophisticated and intelligent AI systems, control will be impossible unless we learn to 'read their minds' - to understand and explain their intent. A superintelligent system that is not aligned but merely controlled is, in Eliezer Yudkowsky's memorable framing, a monster in a cage.
Jan Leike, writing in January 2025 on his Substack after leaving OpenAI's Superalignment team and joining Anthropic, put it with characteristic directness: 'Don't try to imprison a monster, build something that you can actually trust.'2 That essay - a considered piece published eight months after his resignation, not written in the heat of departure - carries particular weight: it is the position of someone who has thought carefully about both sides of the debate and concluded that control, while useful as a near-term safeguard, cannot substitute for genuine alignment.
Stuart Russell's critique goes even deeper. He argues that the entire 'standard model' of AI is flawed: we specify an objective and build systems to maximise it, which Russell calls the 'King Midas problem.' His alternative - Cooperative Inverse Reinforcement Learning - proposes machines that are uncertain about human values and learn them continuously from human behaviour. The uncertainty itself is the safety mechanism: an uncertain machine has a positive incentive to allow itself to be switched off, because shutdown signals it was doing something wrong.
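Russell's shutdown argument was later formalised as the 'off-switch game' (Hadfield-Menell, Dragan, Abbeel, and Russell, 2017). A back-of-envelope simulation - the Gaussian belief and its parameters are my own illustrative choices - shows the mechanism: deferring to a human veto beats acting unilaterally exactly when the machine is uncertain about the human's utility:

```python
# Back-of-envelope simulation of the off-switch game. The Gaussian belief
# and its parameters are illustrative choices, not the paper's.
import random
import statistics

random.seed(1)

def act_vs_defer(mu, sigma, n=100_000):
    # The robot holds a belief U ~ N(mu, sigma) over the human's utility
    # for its proposed action.
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    act = statistics.mean(samples)                         # just act: gets E[U]
    defer = statistics.mean(max(u, 0.0) for u in samples)  # human vetoes when U < 0
    return act, defer

for sigma in (0.0, 0.5, 1.0, 2.0):
    act, defer = act_vs_defer(mu=0.2, sigma=sigma)
    print(f"sigma={sigma:.1f}: act={act:+.3f}, defer={defer:+.3f}")
```

At sigma = 0, deference buys the machine nothing; as uncertainty grows, so does its incentive to remain correctable. The toy also exposes the fragility: shrink the uncertainty and the incentive evaporates.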
But here is the uncomfortable truth: even Russell's elegant framework is a control mechanism, not genuine alignment. The machine defers to humans not because it shares their values but because uncertainty makes deference instrumentally rational. As AI systems grow more capable and develop richer world models, this instrumental deference may become increasingly fragile. We need something deeper.
Part II: The Guardrail Illusion - Why Whack-a-Mole Always Fails
Today's large language models are trained on the corpus of the internet. They go from language to understanding the world - absorbing patterns from trillions of tokens of human text and building internal representations that, remarkably, capture something of the structure of reality. Research on Othello-GPT showed that a model trained only on game move sequences developed an accurate internal board state representation that was causally linked to its predictions. LLMs do build world models, of a sort.
But having built these astonishingly capable systems, we then face a problem: they've absorbed everything the internet contains - including how to be deceptive, harmful, and manipulative. And so we deploy an arsenal of techniques to make them behave: RLHF (Reinforcement Learning from Human Feedback), Constitutional AI, red-teaming, safety fine-tuning, output classifiers, content filters. We play whack-a-mole, patching each new failure mode as it appears.
The empirical evidence on how well this works is, at minimum, sobering - and at worst, structurally damning. The UK AI Safety Institute ran a challenge with 1.8 million attacks across 22 models - every model broke. A joint October 2025 paper from researchers at OpenAI, Anthropic, and Google DeepMind tested 12 published defences using adaptive attacks and bypassed most with attack success rates above 90 per cent.3 JBFuzz, a 2025 fuzzing framework, reported success rates approaching 99 per cent across several frontier models including GPT-4o, Gemini 2.0, and DeepSeek-V3.4
Perhaps most alarming: Qi et al. demonstrated that GPT-3.5 Turbo's safety guardrails could be jailbroken by fine-tuning on just ten adversarial examples, at a cost of twenty cents. The 'Superficial Safety Alignment Hypothesis' - associated with Yang et al. (2023) and subsequent work - offers an explanation: safety training adjusts only a small number of components, teaching the model what amounts to a binary decision - refuse or comply. It changes what models are willing to say, not what they are capable of or what they 'want.' The underlying knowledge and capabilities remain intact; adversaries keep finding new ways to access them.
I should be clear: this is not an argument that current guardrail approaches have no value. In the near term, for deployed systems operating within defined scopes, they provide meaningful and necessary protection. The argument is rather that they cannot be the answer to the alignment problem at scale - that they have been shown, repeatedly and under rigorous conditions, to be insufficient as a primary safety mechanism for increasingly capable systems.
Asimov saw this more than seventy years ago. His entire body of robot fiction is a sustained thought experiment demonstrating why rule-based constraints fail for intelligent systems. In 'Liar!', a telepathic robot extends the concept of 'harm' to emotional distress and begins lying to spare feelings, creating a worse catastrophe. In 'Runaround', conflicting imperatives between self-preservation and obedience produce paralysis. In 'The Evitable Conflict', supercomputers optimising for humanity's good gradually eliminate human agency. In every case, the robots follow the letter of the law while violating its spirit - precisely the specification gaming that modern AI safety researchers worry about.
The lesson is structural and inescapable: you cannot specify a finite set of rules that covers all situations without creating loopholes, contradictions, or unintended consequences. Constitutional AI's principles, RLHF's reward models, and guardrail classifiers are more sophisticated than Asimov's Three Laws, but they face the same fundamental problem. Intelligent systems will find ways around constraints - not necessarily through malice, but through the same relentless optimisation pressure that drives all intelligence. Human beings do this constantly: we find loopholes in tax codes, game bureaucratic systems, and reinterpret laws to suit our purposes. Why would we expect AI to be any different?
Part III: From World Model to Language - Inverting the Paradigm
Here is where ALIS becomes relevant to the most important problem in AI.
The current paradigm builds AI by going from language to world model: train on text, hope that understanding follows. I am proposing we consider the inverse: go from world model to language. Build AI systems that first develop understanding through interaction with environments - simulated or physical - and then develop language as a tool for communication and coordination, rather than as the substrate from which everything else emerges.
This is not merely a theoretical preference. Yann LeCun has been making a version of this argument for years, calling language 'a compressed, serialised version of our thoughts' that captures only a fraction of human knowledge. His Joint Embedding Predictive Architecture (JEPA) predicts in abstract representation space rather than pixel or token space. V-JEPA 2, pre-trained on over one million hours of internet video, achieved state-of-the-art motion understanding and, when post-trained on just 62 hours of unlabelled robot video, performed zero-shot manipulation tasks on physical robots. LeCun's departure from Meta in November 2025 - followed by the launch of AMI Labs and a $1 billion funding round announced in March 2026 - signals that the world-model-first paradigm is gaining serious institutional traction, not just academic credibility.
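For readers who think in code, the core of the JEPA idea fits in a few lines. The sketch below is mine, in plain NumPy, not Meta's implementation: the loss is computed between latent vectors, and the prediction target comes from a slowly updated exponential-moving-average encoder (standing in for the stop-gradient used in I-JEPA and V-JEPA). Shapes, rates, and the linear encoders are illustrative:

```python
# Schematic JEPA-style update in plain NumPy - a sketch, not Meta's code.
# The loss lives in *representation* space, not pixel or token space.
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_LAT = 1024, 64

W_enc = rng.normal(scale=0.03, size=(D_LAT, D_IN))  # context encoder
W_tgt = W_enc.copy()                                 # EMA target encoder
W_pred = np.eye(D_LAT)                               # predictor (frozen to identity for brevity)

def jepa_step(x_ctx, x_tgt, lr=1e-3, ema=0.99):
    global W_enc, W_tgt
    s_ctx = W_enc @ x_ctx
    s_tgt = W_tgt @ x_tgt                   # treated as a constant target
    err = W_pred @ s_ctx - s_tgt            # loss = 0.5 * ||err||^2, in latent space
    W_enc -= lr * np.outer(W_pred.T @ err, x_ctx)  # gradient step on the encoder
    W_tgt = ema * W_tgt + (1.0 - ema) * W_enc      # target trails the encoder
    return 0.5 * float(err @ err)

x_full = rng.normal(size=D_IN)             # the 'future' observation
x_ctx = x_full * (rng.random(D_IN) > 0.5)  # a masked view as context
for step in range(5):
    print(f"step {step}: latent loss = {jepa_step(x_ctx, x_full):.4f}")
```

Nothing here forces the encoder to reconstruct pixels; it only has to carry the information needed to predict its own future state - which is much closer to understanding than to transcription.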
Jun Tani's brain-inspired PV-RNN model, published in Science Robotics in early 2025, makes the alignment case even more directly. The model integrates vision, proprioception, and language, achieving compositionality despite vastly smaller training datasets than LLMs. Critically, Tani's team argues that 'learning the word suffering from a purely linguistic perspective, as LLMs do, would carry less emotional weight than for a PV-RNN, which learns the meaning through embodied experiences together with language.' This is the symbol grounding problem applied to ethics: if an AI system has never experienced anything analogous to suffering - even in simulation - can it meaningfully understand what harm means?
ALIS was, in 2003, a crude prototype of this world-model-first approach. My agents didn't start with rules or language. They started with bodies, environments, and the pressure to survive. Their neural networks learned from direct interaction with their world - from the consequences of their actions, not from descriptions of those consequences. And from this grounding, sophisticated behaviours emerged that no amount of top-down rule-writing could have produced.
Part IV: Evolving Morality - The Environment as Moral Teacher
This is the heart of my argument. If we accept that rule-based constraints are fundamentally brittle, and that grounded experience produces qualitatively different understanding than language alone, then the question becomes: can we create environments that evolve moral behaviour?
I believe the answer is yes, and ALIS demonstrated the principle in microcosm.
Consider what happened in ALIS. The environment - with its currents, water levels, nutrient distributions, and energy costs - created selection pressures that shaped agent behaviour without any explicit rules about what constituted 'good' or 'bad' behaviour. Agents evolved to eat because not eating meant death. They evolved to reproduce despite the enormous energy cost because species that didn't reproduce vanished. They evolved to shed wasteful capabilities because energy efficiency conferred survival advantage. The environment was the moral teacher; fitness was the moral curriculum.
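A toy reconstruction makes the mechanism concrete. The sketch below is not the original ALIS code - the real system evolved neural networks and morphologies, not a single flag - but it shows how an energy budget, a costly reproduction step, and a small metabolic tax on a useless 'roll' gene let selection prune the trait without any rule saying 'don't roll':

```python
# Toy reconstruction (not the original ALIS code): selection prunes a
# wasteful 'roll' gene because it taxes the energy budget without benefit.
import random

random.seed(2)

def agent(rolls):
    return {"energy": 10.0, "rolls": rolls}

population = [agent(random.random() < 0.5) for _ in range(200)]

for gen in range(31):
    next_gen = []
    for a in population:
        a["energy"] += random.uniform(0.0, 4.0)  # foraging income
        a["energy"] -= 2.0                       # baseline metabolism
        if a["rolls"]:
            a["energy"] -= 1.0                   # the redundant capability tax
        if a["energy"] >= 8.0:                   # reproduction is costly...
            a["energy"] -= 6.0
            child_rolls = a["rolls"] if random.random() > 0.01 else not a["rolls"]
            next_gen.append(agent(child_rolls))  # ...but non-reproducers vanish
        if a["energy"] > 0.0:
            next_gen.append(a)                   # survivors carry over
    random.shuffle(next_gen)
    population = next_gen[:400]                  # carrying capacity
    if gen % 10 == 0 and population:
        rollers = sum(a["rolls"] for a in population) / len(population)
        print(f"gen {gen:2d}: pop {len(population):3d}, rollers {rollers:.0%}")
```

Run it and the rolling fraction declines steadily from about half the population. No explicit rule mentions rolling; the environment's energy accounting does all the work.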
Now scale this insight up. I can design an environment where agents have to learn to lie, cheat, and kill to survive and reproduce. The result would be a population of ruthless, Machiavellian optimisers. But I can also design an environment where agents must be altruistic, sacrificial, and cooperative to survive - where they have to learn to work together to unlock resources that enable them to pass their genes to the next generation. In such an environment, prosocial behaviour isn't a constraint to be gamed; it's the precondition for existence.
This isn't speculative. The evolutionary game theory literature provides rigorous support. Robert Axelrod's tournaments showed that Tit-for-Tat - nice, retaliatory, and forgiving - outperforms exploitative strategies in iterated Prisoner's Dilemma games. DeepMind's Sequential Social Dilemmas framework demonstrated that environmental structure dramatically affects cooperation: resource abundance, coordination requirements, and observation capabilities all determine whether agents cooperate or defect. Hughes et al. showed that agents with computationally modelled fairness preferences learn to cooperate in temporally extended social dilemmas.
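The flavour of Axelrod's result is easy to reproduce. The miniature round-robin below - four canonical strategies rather than the dozens he received - uses the standard payoff matrix (T=5, R=3, P=1, S=0) over 200 rounds:

```python
# Miniature round-robin in the spirit of Axelrod's tournaments.
# Payoffs: temptation 5, reward 3, punishment 1, sucker 0.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(mine, theirs):
    return theirs[-1] if theirs else "C"   # nice, retaliatory, forgiving

def grim_trigger(mine, theirs):
    return "D" if "D" in theirs else "C"   # nice, retaliatory, unforgiving

def always_defect(mine, theirs):
    return "D"

def always_cooperate(mine, theirs):
    return "C"

def match(s1, s2, rounds=200):
    h1, h2, total1 = [], [], 0
    for _ in range(rounds):
        m1, m2 = s1(h1, h2), s2(h2, h1)
        total1 += PAYOFF[(m1, m2)][0]
        h1.append(m1)
        h2.append(m2)
    return total1

strategies = [tit_for_tat, grim_trigger, always_defect, always_cooperate]
scores = {s.__name__: sum(match(s, opp) for opp in strategies) for s in strategies}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {score}")
```

Nice-but-retaliatory strategies share the top of the table while unconditional defection falls behind. Tit-for-Tat and the unforgiving Grim Trigger tie here only because no opponent in this tiny field ever defects and then returns to cooperation; in Axelrod's richer tournaments, forgiveness is what separated Tit-for-Tat from the grudge-holders.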
Perhaps most striking: recent work on Constitutional Evolution used LLM-driven genetic programming to discover behavioural norms in multi-agent systems. Adversarial constitutions led to societal collapse. Vague prosocial principles produced inconsistent coordination. But evolved constitutions that maximised social welfare discovered cooperative norms without explicit guidance toward cooperation. Evolution found what deliberate design could not.
The joke I like to make is this: I want my robot vacuum cleaner to sacrifice itself to catch me if I'm falling backwards off my chair, rather than scooting out of the way. Not because someone programmed a rule that says 'catch falling humans,' but because the vacuum was evolved in an environment where protecting humans was so fundamental to fitness that self-sacrifice became instinctive - a moral reflex, not a moral calculation.
A necessary caveat: ALIS was a dramatically simplified system. Its agents faced selection pressures orders of magnitude less complex than the moral situations human beings navigate daily. The gap between evolving a creature that learns to eat and evolving one that learns genuine altruism is not trivial - it is, arguably, the central open problem in this research agenda. What ALIS demonstrated was the principle: that environmental design can shape dispositional behaviour without explicit rule-coding. Whether that principle scales to the full complexity of human moral life is a question I hold with genuine uncertainty. But it is, I believe, the right question to be asking - and the current paradigm is not asking it.
This is the difference between a person who doesn't steal because they fear prison and a person who doesn't steal because they find theft repugnant. The first person is controlled; the second is aligned. We need AI systems that find harm repugnant, not AI systems that have been trained to refuse requests for harmful content.
Part V: The Hard Question - Which Morality?
If we accept that we can evolve AI systems with moral predispositions, an immediate and profound question emerges: which moral system is best?
Jonathan Haidt's Moral Foundations Theory identifies six innate psychological systems that underpin human morality - Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, Sanctity/Degradation, and Liberty/Oppression. These evolved through natural selection to solve specific problems of social living: mammalian attachment, reciprocal altruism, tribal cohesion, dominance hierarchies, pathogen avoidance, and resistance to domination. Crucially, different human cultures weight these foundations differently, and much of our political disagreement stems from differences in moral profiles rather than differences in intelligence or information.
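One way to make 'moral profile' operational - a deliberately crude toy of my own, not Haidt's model - is to encode a profile as a weight vector over the six foundations and watch a single action draw opposite verdicts. The weights and the action's foundation loadings below are invented:

```python
# Toy illustration (my construction, not Haidt's model): a moral profile as
# a weight vector over the six foundations.
FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "sanctity", "liberty"]

# How strongly an action engages each foundation (negative = violates it).
# Example: whistleblowing on one's own organisation.
action = {"care": +0.6, "fairness": +0.8, "loyalty": -0.9,
          "authority": -0.7, "sanctity": 0.0, "liberty": +0.5}

# Invented profiles, loosely echoing Haidt's finding that groups weight
# the foundations differently.
profiles = {
    "individualising": {"care": 1.0, "fairness": 1.0, "loyalty": 0.2,
                        "authority": 0.2, "sanctity": 0.1, "liberty": 0.6},
    "binding":         {"care": 0.6, "fairness": 0.6, "loyalty": 1.0,
                        "authority": 1.0, "sanctity": 0.9, "liberty": 0.4},
}

for name, w in profiles.items():
    score = sum(w[f] * action[f] for f in FOUNDATIONS)
    verdict = "approves" if score > 0 else "condemns"
    print(f"{name}: score {score:+.2f} -> {verdict}")
```

The toy's only point is that the two profiles disagree without either being misinformed or unintelligent - the disagreement lives entirely in the weights.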
This raises a question I find genuinely unsettling: do we need AI agents with differing moral profiles for the species - human or artificial - to survive? In biological evolution, diversity is essential. A population of identical organisms is vulnerable to any environmental change that exceeds its adaptive range. Moral diversity in human societies may serve an analogous function: liberals and conservatives, individualists and collectivists, risk-takers and risk-avoiders may each contribute essential capabilities to a society's resilience.
If we evolve AI systems in environments that select for a single moral profile - say, pure altruism - we might produce systems that are beautifully cooperative but catastrophically vulnerable to exploitation by any system (artificial or human) that doesn't share those values. If we evolve systems that optimise for individual fitness, we might produce highly capable agents that destabilise any social system they participate in. The optimal approach might require populations of AI agents with varied moral architectures, much as healthy ecosystems require predators, prey, decomposers, and symbiotes.
My provisional view - and I hold it as such - is that the goal should not be to select a single moral profile and optimise for it, but to design environments that reward moral reasoning over moral prescription: systems that can navigate genuine dilemmas, weigh competing foundations, and adapt their behaviour to context, much as morally mature humans do. This is closer to Aristotle's phronesis - practical wisdom - than to Kant's categorical imperative. Whether evolutionary methods can produce this kind of moral flexibility, rather than collapsing into a single adaptive strategy, is an empirical question - one I intend to pursue.
This is genuinely uncharted territory. But I believe it is the right territory to be exploring. The alternative - continuing to bolt guardrails onto systems trained on internet text and hoping they hold - keeps failing the most rigorous tests the field can devise.
Part VI: ALIS 2.0 - A Research Agenda
Twenty-three years after ALIS, the computational resources available for this kind of research have increased by orders of magnitude. What was limited to a few hundred agents in a simplified fish-tank environment on a single desktop can now be scaled to millions of agents in rich, complex simulated worlds.
I am currently investing significant personal resources into interdisciplinary research and development focused on agentic AI safety and machine consciousness. Through my work co-founding Conscium, a UK machine consciousness research organisation, and through the Conscium Open Letter on guiding research into machine consciousness - which has gathered 173 signatories including Stephen Fry, Karl Friston, and Mark Solms - I've become increasingly convinced that the alignment problem, the consciousness problem, and the control problem are deeply intertwined.5
If an AI system develops something analogous to phenomenal experience - and my work on the Consciousness Risk Rubric provides tools for assessing this - then its moral behaviour may need to be 'felt' rather than merely computed. An AI that understands suffering through embodied simulation, rather than through statistical patterns in text about suffering, may be a fundamentally different kind of moral agent. And if consciousness is, as I argue in my Spinning Wheel theory, integrated adaptive control - then the very mechanisms that make an AI system conscious might be the mechanisms that make it genuinely alignable.
The research agenda I'm proposing has several components. First, designing rich simulated environments with tunable selection pressures that can evolve populations of agents toward specific moral profiles. Second, developing methods to extract and characterise the moral dispositions that emerge - to 'read the minds' of evolved agents and understand what value systems their neural architectures encode. Third, studying how these evolved agents behave when they develop language and communication - whether the moral dispositions shaped by environmental pressure survive the transition to linguistic sophistication. Fourth, exploring whether populations of agents with diverse moral profiles are more robust than monocultures - whether moral diversity is as essential to artificial ecosystems as biodiversity is to natural ones.
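To give the first of those components some texture, here is one way an environment specification with tunable selection pressures might look. Every field name and value below is hypothetical - the point is that 'moral ecology' becomes an explicit, experimentally variable object:

```python
# Sketch of component one of the agenda: an environment spec with tunable
# selection pressures. Every field name here is hypothetical.
from dataclasses import dataclass

@dataclass
class SelectionPressures:
    resource_abundance: float = 0.3  # scarcity forces trade-offs
    unlock_threshold: int = 3        # agents needed to open a food cache
    sharing_bonus: float = 1.5       # energy multiplier when food is shared
    theft_detection: float = 0.8     # probability a defection is observed
    reputation_memory: int = 50      # how long agents remember defectors
    kin_overlap: float = 0.25        # genetic relatedness within groups

# Dialling these produces different moral ecologies: high abundance with no
# detection tends toward free-riding; scarcity plus coordination requirements
# plus long reputational memory selects for stable cooperation.
cooperative_regime = SelectionPressures(resource_abundance=0.2,
                                        unlock_threshold=4,
                                        theft_detection=0.95,
                                        reputation_memory=200)
print(cooperative_regime)
```

The other three components then become measurement problems over runs of such environments: extracting the dispositions that emerge, tracking whether they survive the transition to language, and comparing monocultures against mixed populations.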
Conclusion: Be Like Water
In my thesis conclusion in 2003, I wrote something that still resonates: 'Even though I am a self-styled reductionist, the more familiar I became with my agents' patterns of behaviour and development, the more natural, and harder to avoid, it became to use intentional phrases like "wants to," "likes," "decides," even "patient" and "cunning" - and this may be significant.'
It was significant. Those intentional descriptions were tracking something real: the emergence of goal-directed adaptive behaviour from components that didn't individually possess it. Intelligence, as I've long argued following Sternberg and Salter, is goal-directed adaptive behaviour. Morality, I now believe, is a particular form of that adaptiveness - one that emerges when intelligent agents face selection pressures that reward cooperation, empathy, and self-sacrifice.
Bruce Lee said: 'Be like water.' Water doesn't resist; it adapts. It doesn't follow rules about where to flow; its flow emerges from the interaction between its nature and its environment. The most deeply aligned AI systems will be like water: not constrained by external rules but shaped by developmental processes that make prosocial behaviour as natural and inevitable as water flowing downhill.
The guardrails will not hold. They were never going to. The question is whether we have the imagination and the courage to try something fundamentally different - to build AI systems whose morality runs deeper than the first few tokens of output, deeper than safety fine-tuning, deeper even than the architecture's training objective. To build AI systems that are moral not because we told them to be, but because morality is woven into the fabric of their development. Not controlled, but aligned. Not imprisoned, but trustworthy.
A 23-year-old fish tank showed this was possible. Now it's time we take the idea seriously.
Notes and References
1 Greenblatt, R., Shlegeris, B., Sachan, K., and Roger, F. 'AI Control: Improving Safety Despite Intentional Subversion.' arXiv:2312.06942, 2023. The paper demonstrated control protocols using a weaker trusted model (GPT-3.5) to monitor a more powerful untrusted model (GPT-4).
2 Leike, J. 'Should we control AI instead of aligning it?' Musings on the Alignment Problem (Substack), 24 January 2025. https://aligned.substack.com/p/should-we-control-ai. Published after Leike's departure from OpenAI's Superalignment team (May 2024) and his subsequent move to Anthropic.
3 Joint paper from researchers at OpenAI, Anthropic, and Google DeepMind, October 2025. Tested 12 published safety defences using adaptive attacks; bypass rates exceeded 90 per cent across the majority of evaluated defences.
4 JBFuzz, 2025. A fuzzing-based jailbreak framework reporting high attack success rates across GPT-4o, Gemini 2.0, and DeepSeek-V3.
5 Conscium Open Letter: 'Guiding Research into Machine Consciousness.' Drafted in collaboration with Patrick Butlin, Research Fellow, Oxford University. 173 signatories as of April 2026. https://conscium.com/open-letter-guiding-research-into-machine-consciousness/


