World Models, Free Energy, and a Plastic Dinosaur




In 2020, I wrote about a meter-high, robotic, smart-fabric-wrapped Plastic Dinosaur that gained consciousness, as a device for exploring machine learning. Each article in the series on the core concepts of machine learning and neural nets, as they existed then, opened with a brief story about Plastic Dinosaur learning and wandering around the lab. The 150-page 2020 CleanTechnica report that I assembled from the series, explaining machine learning’s global intersection with clean technology, was just that: an exploration and an explanation of a moment in time. In 2026, I am sure it wasn’t alive enough, not that I pretended otherwise then.

I’m returning to it because I’ve been digging into specific aspects of cognition and consciousness as I follow the thinking of my primary collaborator on that material and on our latest technical business endeavor. For readers who want to focus on cleantech and climate change and get annoyed by things that aren’t about that, or who are annoyed by long, dense articles, I suggest you switch channels now. If you are remotely interested in the rabbit hole of cognition and embodied intelligence and want a starter kit, keep reading. Don’t expect to be remotely satisfied with the depth of exploration, but do expect to find hooks for further Googling, ChatGPT questions, and book purchases.

In part, this is also triggered by statements by a current leader in the field. In early 2025, Dario Amodei, CEO of Anthropic, suggested that artificial general intelligence may already exist in limited or intermittent form, pointing to moments when large models demonstrate broad reasoning across domains rather than narrow task performance. His comments marked a return to the question of AGI as a present-tense issue rather than a distant milestone. Importantly, he made no claim that current systems possess consciousness or subjective experience, only that their general cognitive competence may, at times, meet some operational definitions of AGI. As a side note, Amodei and Anthropic are under pressure from the US government to allow their AI models to be used for domestic surveillance of citizens and for weapons control without humans in the loop, something Amodei is saying no to. Foreign surveillance is just fine by Amodei, in case you were wondering, as are his systems being used in weapons systems as long as humans make the kill decision.

It’s also triggered by the current spate of hype around bipedal robots, with Elon Musk doing much of the hyping for his Optimus robot, but also by China’s wave of dancing, martial-arts-performing, and gymnastically competent bipedal robots. I published my assessment recently of the fundamental problem they share: while we have extraordinary libraries of sounds and images, and sophisticated tools for assessing and working with them that are available to all roboticists, we have nothing like that for what we think of as touch. Although locomotion and balance have improved relative to past decades, fine-grained manipulation of objects, safe interaction in unstructured environments, and long-term mechanical reliability continue to lag well behind perception and gait, partly because human hands and reflexes integrate dense tactile feedback and subconscious corrections in ways robots cannot yet replicate. These layers of mechanical and control complexity have meant that, decades after repeated waves of enthusiasm, genuinely general-purpose humanoid robots capable of reliably and safely operating among people remain beyond near-term timelines.

The framing of Plastic Dinosaur’s core systems and the concepts related to machine learning remains salient to bipedal robots’ control systems, and is at least analogically related to how they return to charging ports. Whatever else they are, current bipedal robots aren’t intelligent by any current definition, and are a lot further from consciousness, no matter how much their designers try to emulate it and put in response models that attempt to convince us that they are more than just circuitry. Large language models have arguably “passed” versions of the Turing Test in limited, short conversational settings, where human judges fail to reliably distinguish them from people. But that milestone says more about how good they are at statistical imitation than about general intelligence or understanding. In longer exchanges, under technical scrutiny, or when continuity and grounding matter, their limitations become clear. Passing a narrow imitation game is not the same as possessing agency, stable identity, or consciousness, and the LLMs providing speech interpretation and response in Optimus or China’s robots share the same limitations.

To be clear, in the exploration of cognition that follows, I’m stumbling through topics I’m much less qualified to explore than those my articles typically address, so if you happen to be a neuroscientist, cognitive science researcher, machine learning expert, roboticist, or other expert in the areas the following touches upon, please feel free to gently correct any of my undoubtedly many glosses, misapprehensions, or mistakes. I have a professional background inclusive of artificial intelligence, vision recognition, data engineering, and machine learning, including with one of my current businesses, but am mostly a pragmatic applier of the affordances of the new toolkits, and make no claim to being a researcher in the space or to moving the needle. I solidify my understanding and memory of things by writing out what I discover, leaving a bread crumb trail that includes errors and misunderstandings, which thankfully people point out to me.

Cover of CleanTechnica machine learning and clean tech global survey and explanatory report, by author.

Paul Werbos is one of the pioneers of modern artificial intelligence, and after reading the draft agreed to write the foreword to the report and discuss it among other topics on CleanTech Talks (part 1, part 2). In the 1970s he described backpropagation in neural networks in his Harvard PhD thesis, years before it became a core training method behind today’s deep learning systems. He later worked for decades at the U.S. National Science Foundation shaping research directions in neural networks, adaptive systems, and intelligent control. Looking back, I feel both grateful and lucky that someone with that depth of insight and historical perspective agreed to write the foreword to my 2020 report. It was an act of generosity from a foundational thinker whose early work now underpins much of what we are discussing about AI today.

I explore ideas in publications, and I try to pair that with a willingness to say I was wrong. Plastic Dinosaur was not an attempt to architect general intelligence. It was a narrative device, co-developed with my long-term collaborator David Clement, to explore machine learning in practical contexts. We were mapping affordances. We didn’t predict generative AI, large language models, CLIP-style multimodal embeddings, diffusion models, or tool-using agents. None of that was on our radar, or at any rate it wasn’t on my radar. David’s on a rather different level than me, and I often stumble after him into deep intellectual waters where he swims, I flounder, and sometimes point out a shoal of business value in the turbulent seas. We were examining what reinforcement learning, sensors, salience, and simulation could do for real-world systems. The goal was pragmatic clarity, not metaphysics. In this case, I knew I wasn’t right to begin with, and so no mea culpas are required.

The CleanTechnica machine learning report that featured Plastic Dinosaur was grounded in the state of the art between 2018 and 2020. Deep reinforcement learning was showing results in robotics and games. Simulation-to-real transfer was a serious research program. Image recognition had crossed useful thresholds, with error rates on ImageNet dropping from over 25% in 2011 to under 5% by 2015. Autonomous vehicles were logging millions of kilometers in testing, and showing large and quick gains, something we’ve since discovered to be bounded by Zeno’s paradox. Organizations were asking how to deploy machine learning in energy systems, industrial operations, and infrastructure management. We framed ML as pattern recognition at scale, dependent on data quality, feedback loops, and domain structure. We quantified model performance in terms of precision, recall, false positives, and cost curves. If a model reduced failure rates from 5% to 1% in a process that handled 1 million events per year, that meant 40,000 fewer errors. That was the lens.

ChatGPT generated image representing the offline learning state of Plastic Dinosaur

Plastic Dinosaur embodied those ideas. It had sensors. It had a battery level we anthropomorphized as hunger. It had reflex layers for balance and locomotion. It had salience mechanisms to focus on hands and doors. It had a dreaming phase, which meant offline replay in simulation to refine policies like opening doors or flipping light switches before deploying them to the body. It had neural net modules we labeled cerebellumnet, amygdalanet, and curiousnet to explain coordination, threat detection, and exploration. The architecture was layered, drawing on subsumption robotics from Rodney Brooks and reinforcement learning from DeepMind. The dinosaur learned through trial and error, receiving reward signals for catching balls or avoiding holes. It was a pedagogical machine.

What Plastic Dinosaur got right was that embodiment matters for many classes of problems, and possibly for any form of consciousness, as neuroscientist Mark Solms argues in his 2021 book The Hidden Spring. A robot navigating a warehouse needs sensor fusion across lidar, cameras, and inertial measurement units. If each sensor produces 10 MB per second, and the robot runs 10 hours per day, that is terabytes per day of raw input. You cannot process all of that at full resolution in real time. Salience is not philosophical. It is computational necessity. We also got right that simulation is powerful. Training in a physics engine at 1,000 times real-time speed allows millions of episodes per day, compared to perhaps 10,000 real-world episodes per day. That ratio, 100:1 or 1,000:1, changes feasibility. Feedback loops, instrumented data, and iterative improvement remain the core of applied ML.
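For concreteness, here is the back-of-envelope arithmetic behind those two claims as a short Python sketch. The sensor count, data rates, and episode counts are illustrative numbers consistent with the figures above, not measurements from any particular robot.

```python
# Back-of-envelope arithmetic for embodied data volumes and simulation throughput.
# All inputs are illustrative, matching the rough figures quoted in the text.

sensors = 6                      # e.g. lidar, several cameras, an IMU
mb_per_second_each = 10          # raw output per sensor
hours_per_day = 10               # operating time
raw_tb_per_day = sensors * mb_per_second_each * 3600 * hours_per_day / 1_000_000
print(f"raw sensor input: ~{raw_tb_per_day:.1f} TB/day")   # ~2.2 TB/day

real_episodes_per_day = 10_000   # plausible count of physical trials
sim_speedup = 1_000              # physics engine running at 1,000x real time
sim_episodes_per_day = real_episodes_per_day * sim_speedup
print(f"simulated episodes: ~{sim_episodes_per_day:,}/day, "
      f"a {sim_speedup}:1 ratio over the real world")
```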

One of the many areas in which Plastic Dinosaur was intentionally naive was the leap from competence to consciousness. We pretended that stacking enough layers, enough sensors, and enough internal modeling might lead to emergent awareness. We described fear and hunger as modules without interrogating what those words mean biologically. We separated learning from doing, with offline updates that were downloaded into the dinosaur’s operational brain. That makes for useful heuristics. It does not resemble an organism that must maintain itself continuously under entropy pressure. Plastic Dinosaur was perhaps a capable control system, certainly an interesting pedagogical device, but not remotely a viable organism. We were aware of this, but not in anything beyond acknowledgment of the depth of our ignorance. It was an interesting thought experiment.

The world has changed. Between 2020 and 2024, parameter counts in leading models grew from hundreds of millions to hundreds of billions. GPT-3 in 2020 had 175 billion parameters. GPT-4 was reportedly in the trillion-parameter range when mixture-of-experts architectures are counted. Training datasets grew to trillions of tokens. CLIP—a pre-trained visual recognition model David and I are integrating into our UK water industry digital twins solution through Trace Intercept—aligned text and images in shared embedding spaces of 512 or 768 dimensions. Diffusion models learned to map random noise to coherent images in 50 to 100 denoising steps. Models began to exhibit cross-domain generalization. They could write code, summarize papers, translate languages, and reason through math problems with few-shot prompting. This was not what we modeled in Plastic Dinosaur. We did not predict that next-token prediction over large corpora would approximate world knowledge at this scale.
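To make the shared-embedding idea concrete, here is a minimal sketch of CLIP-style image-text matching, assuming the openly released ViT-B/32 checkpoint via Hugging Face transformers, whose projection space is 512-dimensional. The file name and candidate labels are invented for illustration and are not taken from our Trace Intercept work.

```python
# Minimal sketch of CLIP's shared embedding space: one image and a few text labels
# are projected into the same 512-dimensional space and compared by cosine similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("inspection_photo.jpg")   # hypothetical asset photo
texts = ["a corroded pipe joint", "an intact pipe joint", "standing water on a floor"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Normalize the projected embeddings, then rank labels by similarity to the image.
image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
similarities = (image_emb @ text_emb.T).squeeze(0)

for label, score in zip(texts, similarities.tolist()):
    print(f"{score:.3f}  {label}")
```

The useful property is that nothing in the labels needs to have been a training-time class; any short description can be scored against any image.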

The emergence of generative AI shifted the center of gravity from task-specific learning to foundation models. Pretraining on massive datasets created representations that could be fine-tuned for dozens of downstream tasks with small amounts of labeled data. In quantitative terms, transfer learning reduced labeled data requirements by orders of magnitude. A task that once required 100,000 labeled examples might now perform well with 1,000. The cost per useful model dropped. The marginal cost of inference remained high, often measured in $0.01 to $0.10 per thousand tokens for large models, but the capability envelope expanded. This was intelligence without embodiment. It was also intelligence without explicit world models in the classical robotics sense.

ChatGPT generated image of Plastic Dinosaur with a LeCun world model perspective

Yann LeCun has argued that this is insufficient for general intelligence. LeCun sits in the small constellation of researchers who shaped modern artificial intelligence at its foundations, alongside figures like Geoffrey Hinton and Yoshua Bengio, with whom he shares the 2018 Turing Award. His early work on convolutional neural networks helped make deep learning practical, and his more recent advocacy for world models continues to influence how researchers think about the path from pattern recognition to general intelligence.

His position is that real intelligence requires learned world models that can simulate the consequences of actions. A world model is a generative model of environment dynamics: an internal model that predicts how the world will change and how actions will affect that change. Instead of just recognizing patterns in the present, it lets a system simulate possible futures and choose actions based on what is likely to happen next. It’s a what-if system. In reinforcement learning terms, model-free methods learn a policy mapping states to actions, while model-based methods learn transition probabilities and reward functions. If a system learns that state s transitions to state s’ with probability 0.8 after action a, it can plan by simulating sequences. In robotics, this reduces sample complexity. If you can simulate 10,000 trajectories internally before acting once in the real world, you reduce risk and cost. LeCun’s argument is that language models compress correlations in text but do not necessarily learn grounded dynamics of the physical world.
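As a toy illustration of that planning loop, here is a minimal model-based sketch in Python. The states, actions, transition probabilities, and rewards are invented for illustration; real systems learn them from data and use far richer representations.

```python
import random

# Toy model-based planning: a learned transition model P(s' | s, a) and a reward table
# are used to imagine rollouts before committing to a single real-world action.
TRANSITIONS = {
    ("door_closed", "push"):       [("door_open", 0.8), ("door_closed", 0.2)],
    ("door_closed", "wait"):       [("door_closed", 1.0)],
    ("door_open", "walk_through"): [("goal", 1.0)],
    ("door_open", "wait"):         [("door_open", 1.0)],
}
REWARDS = {"goal": 1.0}
ACTIONS = ("push", "wait", "walk_through")

def step(state, action):
    """Sample the next state from the internal model rather than the real world."""
    outcomes = TRANSITIONS.get((state, action))
    if not outcomes:
        return state                       # unmodeled pair: assume nothing changes
    states, probs = zip(*outcomes)
    return random.choices(states, weights=probs)[0]

def imagined_return(state, first_action, horizon=5):
    """Roll out one imagined trajectory: first action fixed, later actions random."""
    total, action = 0.0, first_action
    for _ in range(horizon):
        state = step(state, action)
        total += REWARDS.get(state, 0.0)
        action = random.choice(ACTIONS)
    return total

def plan(state, n_rollouts=1000):
    """Score each candidate first action by its average imagined return."""
    scores = {a: sum(imagined_return(state, a) for _ in range(n_rollouts)) / n_rollouts
              for a in ACTIONS}
    return max(scores, key=scores.get), scores

best, scores = plan("door_closed")
print(best, scores)   # "push" should win; it is the only route toward the rewarded state
```

The point is not the toy itself but the structure: thousands of consequences are simulated internally for the cost of one real action, which is the trade LeCun argues text-only models never learn to make about the physical world.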

David and I never saw subsumption and world models as opposing camps so much as layers in a hierarchy. In the early 2000s, when I was reviewing robotics literature from around the world and bringing back insights for us related to swarm-based architectures and distributed task fulfillment, the divide was clear in the papers. Subsumption architectures, following Rodney Brooks, emphasized reflex layers for obstacle avoidance, balance, and survivability. These systems reacted in tens of milliseconds and did not require internal maps, which made them robust and computationally efficient for small robots with limited processors and battery budgets measured in tens of watts. In parallel, model-based approaches were emerging that built internal representations of terrain, agents, and task goals, enabling planning across longer time horizons measured in seconds or minutes rather than milliseconds. Our view then, and still now, is that survivability and physical competence rest on layered reflexes, but sophisticated coordination and task fulfillment require a generative model that can simulate consequences before committing actuators. The two approaches address different temporal scales and different failure modes, and combining them always seemed more realistic than choosing one as dogma.

I teased these out when assessing Waymo’s (then Google’s) vs Tesla’s approach in an article in 2015, an analysis which turned out to be partly wrong. Tesla built a robust, survivable car that had the ability to get away from problems because of its excellent braking, steering, and acceleration, effectively a subsumption physical layer for survivability. Then it layered in a response system which centered the car in the lane and reacted to externals, effectively a subsumption layer in a machine learning neural net, which it retrained out of the car and then redownloaded, analogous to Plastic Dinosaur’s dreaming simulation mode, but with the drivers’ interventions providing the reinforcement learning rather than lots of fumbling around until something achieved the goal. Only then did it add a world view, keeping it to a Google Maps level of abstraction. At the time, this seemed clearly superior to Google’s clearly low-survivability bubble of a car with a lidar nipple on the top, its focus on a millimeter-precision full world map, and lots of forward planning to get around. However, the Zeno’s paradox problem of reinforcement learning has been biting Tesla hard for years, with its Full Self Driving always taking a step half of the remaining distance to the required capabilities and hence never arriving, as I noted in a mea culpa I published last year, a decade after my original assessment. Autopilot and Autosteer, very useful driver aids for long highway drives, are gone on new Teslas; FSD still isn’t full self-driving and is subscription-only. Meanwhile, Waymo’s taxis, despite odd behaviors and the occasional hack, are expanding as limited urban area transportation, appreciated especially by women traveling by themselves, it seems.

To return to the subsumption and world model perspectives, the core difference is that we were interested in the pragmatic achievement of task-oriented robots in the early 2000s, while LeCun is articulating the requirements for artificial general intelligence and consciousness. The modern understanding of how we actually think and see the world is that we have a hallucinatory prediction engine running all the time, one that imagines what our senses will perceive next. Our senses either confirm our prediction, requiring no further effort on our part, or don’t, requiring a minor update to our hallucination. Our dreams feel real because they are running on the same architecture, but without engaging the engine of our body, like revving an engine without the clutch engaged. When we imagine a scene, we are once again leveraging exactly the same architecture that we use for perceiving the world in the first place. When we remember something, same thing, although there are key issues around memory that I won’t get into here. There is a rather large literature of experiments that confirm this. Cognitive scientist Andy Clark’s 2023 book The Experience Machine is a more accessible book on the subject, and one I recommend over Solms’ much denser Hidden Spring for most people.

Plastic Dinosaur gestured at world models through dreaming. We imagined replaying experiences in simulation to refine policies. That is closer to model-based reinforcement learning than to pure reactive systems. However, we did not formalize a generative model of the world and body. We had tasks. We had rewards. We did not have an explicit latent space representing object permanence, gravity, friction coefficients, or social norms. LeCun’s framing suggests that without such models, systems will struggle with long-horizon planning in novel domains. The difference between reacting to patterns and simulating consequences becomes significant when stakes are high.

Karl Friston’s work on the cognition-oriented variational free energy principle—not Gibbs free energy, the quantitative measure of how much work can be extracted from a chemical system under given conditions—reframes intelligence at a deeper level. The principle states that self-organizing systems that resist entropy must minimize variational free energy. In practical terms, organisms maintain themselves within narrow physiological bounds. Human core temperature stays near 37°C with deviations of 1°C triggering compensatory mechanisms. Blood glucose is regulated around 4 to 6 mmol per liter. If deviations exceed certain thresholds, survival probability drops. The brain builds generative models that predict sensory inputs. Prediction error, the difference between expected and actual input, drives learning and action. Active inference means the organism acts to reduce prediction error, either by updating beliefs or by changing the world.
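A cartoon version of that loop can be written in a few lines. This is a toy sketch of the idea, not Friston’s variational machinery: a single regulated variable, standing in for glucose or temperature, with a set point, where prediction error can be reduced either by revising the belief or by acting on the world. All numbers are invented for illustration.

```python
# Toy active-inference-flavoured loop: one regulated variable with a set point.
# Prediction error is reduced two ways: perception (update the belief) and
# action (push the world back toward what is expected). Illustrative numbers only.
set_point = 0.7        # the level the organism "expects" to sense (e.g. energy reserve)
belief = 0.7           # internal estimate of that level
actual = 0.4           # the true level: reserves have drained
learning_rate = 0.3    # how quickly beliefs move toward sensations
action_gain = 0.2      # how strongly action pushes the world toward expectations

for step in range(10):
    sensed = actual                            # sensory input (noise-free for simplicity)
    prediction_error = sensed - belief         # mismatch between expectation and input
    belief += learning_rate * prediction_error        # perception: revise the belief
    drive = set_point - belief                        # felt shortfall against the set point
    actual += action_gain * drive                     # action: e.g. recharge or eat
    print(f"step {step}: sensed={sensed:.2f} belief={belief:.2f} "
          f"error={prediction_error:+.2f} drive={drive:+.2f}")
```

Run it and the regulated variable climbs back toward the set point: the system keeps itself within expected states through a mix of updating what it believes and changing what is true.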

As a brief aside, I rue that the scientific community used the term “free energy” for both Helmholtz and Gibbs free energy, narrow scientific concepts with specific meanings, because it has fed more than a century of over-unity nonsense energy schemes and cons. As with zero-point energy, people with either limited intellectual capacity—and an extraordinary belief in their own genius despite it—or a con artist’s instincts have wasted their time and the time of many others on the subject. Arguably the con artists who made and make bank from the credulous didn’t waste their time, but that doesn’t mean that they aren’t a blight on society.

ChatGPT generated image of the Markov blanket surrounding Plastic Dinosaur

This perspective implies that intelligence is not about maximizing reward signals handed down by a designer. It is about maintaining viability. A Markov blanket is the boundary that separates a system from its environment. It defines what the system can sense from the outside and how it can act back on the world, creating a clear line between what is “inside” the system and what is not. Sensory states depend on external states. Active states influence external states. Internal states depend on sensory states. That closed loop defines a self. In mathematics, you can write variational free energy as the expected difference between a recognition density and the generative model, which decomposes into a complexity term minus an accuracy term. Minimizing free energy keeps the system within expected states. Plastic Dinosaur had a battery level variable, but that is a single scalar. Real organisms regulate thousands of variables across multiple timescales. The difference is one of dimensionality and coupling. To be clear, while David is fully engaged with Friston’s core math and papers, my understanding comes through others’ explanations of Friston’s work, including Solms’, who collaborated closely with him and hence is a trusted source, and David’s.
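For readers who want the shape of the math, one standard way to write the quantity, as I understand it second-hand, is:

```latex
F \;=\; \mathbb{E}_{q(s)}\!\left[\ln q(s) - \ln p(o, s)\right]
  \;=\; \underbrace{D_{\mathrm{KL}}\!\left[\,q(s)\,\|\,p(s)\,\right]}_{\text{complexity}}
  \;-\; \underbrace{\mathbb{E}_{q(s)}\!\left[\ln p(o \mid s)\right]}_{\text{accuracy}}
```

Here q(s) is the recognition density over hidden states s, and p(o, s) is the generative model over observations o and those hidden states. Minimizing F means explaining sensory input as accurately as possible while moving beliefs as little as possible from prior expectations.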

Solms adds another layer by arguing that consciousness is not cognition but affect, something I understand only through his book The Hidden Spring rather than through a deeper engagement with his groundbreaking research, much as with my understanding of Friston’s body of work. He draws on affective neuroscience, particularly Jaak Panksepp’s identification of core emotional systems such as seeking, fear, rage, care, and panic. These systems are rooted in subcortical structures like the brainstem and limbic system. The cortex elaborates and regulates them, but does not generate raw feeling.

It’s worth pausing here. In the book, Solms spends a rather large amount of time on previous theories of where consciousness arises and is housed in the brain, specifically the cortical theories that locate it in the most recently evolved cortex rather than in older structures. He steps through the research that makes it clear that in animals and humans, consciousness can and does exist without the cortex at all. Consciousness is something that evolved very early, long before we became Homo sapiens, and exists in innumerable species in the world. Hydranencephalic children, who are born with no cortex and with cerebrospinal fluid filling the space it would occupy in the cranium, exhibit consciousness, as do lab rats that have had their cortexes surgically removed, experiments that were among the drivers of clear ethical guidelines for the treatment of lab animals.

Solms claims that consciousness arises when prediction error matters for survival. When an organism deviates from homeostatic set points, it experiences affective valence. That feeling is the subjective aspect of regulation. Cognition without affect can proceed unconsciously. Many cortical processes are not experienced directly. Explained more simply, unless you are surprised, seeing something you don’t expect to see, or feeling something, you aren’t actually conscious, according to Solms. Consider how often you walk or drive a frequently traversed route and arrive surprised that you’ve arrived.

When I was rereading Solms’ book recently, and with a much higher level of understanding after reading related overlapping works recommended by David and some of his own work, my instinct was to map it onto Daniel Kahneman’s System 1 and System 2 framework, something he explained carefully in his 2011 book, Thinking, Fast and Slow, a book I’ve reread several times. System 1 is fast, automatic, heuristic-driven thinking, and it felt natural to equate that with Solms’ unconscious processes.

That mapping turned out to be misleading. Kahneman’s distinction is about speed and cognitive effort, not about consciousness itself. System 1 outputs can be fully conscious, such as a gut feeling of risk or familiarity, while much of what Solms describes as unconscious includes sophisticated cortical processing that never becomes experienced at all. Solms draws the line not between fast and slow thinking, but between felt and unfelt states. The unconscious in his framework is any computation not accompanied by affect, regardless of speed. What is conscious is not the heuristic or the reasoning layer, but the valenced signal tied to the organism’s internal needs. My initial overlap felt tidy, but it flattened a deeper and more important distinction.

Applying this to Plastic Dinosaur exposes the gap. We—mostly me, because David provided the machine learning and intellectual depth and I provided the narrative, faux robotic architecture, and clean technology exploration—labeled a module amygdalanet and called it fear, but fear in organisms is not just threat classification. It is a felt state tied to survival circuitry. In humans, amygdala activation correlates with physiological changes like increased heart rate and cortisol release. These changes are part of a global state that reorganizes perception and action. If Plastic Dinosaur detects a cliff and updates a risk score from 0.2 to 0.9, that is computation. It is not affect unless that computation is tied to a system that must maintain itself or cease to exist. Solms would argue that without a network of interoceptive signals representing internal needs, there is no basis for feeling.

ChatGPT generated image contrasting Solms’ consciousness in a box vs embodied consciousness

Solms has discussed the idea of building consciousness in a box. The implication is that a disembodied system running on servers does not satisfy the conditions for affective consciousness. To feel, a system must have something at stake. That stake is typically survival of a bounded entity. If a server instance crashes, another can be spun up. There is no integrated self whose continued existence depends on regulation of internal states. In thermodynamic terms, biological organisms are far-from-equilibrium systems consuming energy continuously, roughly 20 watts for the human brain alone, about 20% of a resting metabolic rate of around 100 watts. That energy supports constant regulation. Current AI models consume megawatt-hours during training, but once trained, they do not regulate themselves. They perform inference until shut down.

Solms speculates about what it would take to build a minimal conscious system, and his answer is not bigger language models or faster processors but an embodied machine with genuine internal needs. His hypothesis is that consciousness would require a bounded physical system with interoceptive signals representing variables that must be regulated within viable ranges, such as energy reserves, structural integrity, or thermal limits. A robotic platform would need not only sensors and actuators but also an affective valuation layer that assigns positive or negative significance to deviations from stability, driving action in order to restore equilibrium. In this framing, a “consciousness in a box” is not a chatbot running on servers but an embodied agent whose continued operation depends on minimizing prediction error relative to its own internal states, with felt valence emerging from that regulatory loop. In that context, real-world Plastic Dinosaurs and future variants of the current spate of bipedal robots represent a necessary precursor to machine consciousness.

The intersection of LeCun, Friston, and Solms suggests a layered architecture for intelligence and consciousness. Actual embodiment and needs provide the key precursor requirements. World models provide predictive capacity and simulation. Active inference provides a formal account of self-maintenance under uncertainty. Affective systems provide the valence that constitutes subjective experience. Plastic Dinosaur had the first, pieces of the second, and hints of the third. It did not have the fourth in any meaningful sense. Even modern large language models, with hundreds of billions of parameters, do not demonstrate interoception, viability constraints, or affective economies. They minimize training loss functions defined by cross-entropy over token distributions. That is not the same as minimizing variational free energy, in Friston’s sense, in a self-organizing organism.
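For contrast with the free energy decomposition earlier, the training objective for a language model is, schematically, the average cross-entropy of next-token prediction over a fixed corpus:

```latex
\mathcal{L}(\theta) \;=\; -\,\frac{1}{T} \sum_{t=1}^{T} \ln p_\theta\!\left(x_t \mid x_{<t}\right)
```

Nothing in that objective refers to the model’s own viability. The gradient descends on a static dataset rather than on keeping a bounded, self-maintaining system within survivable states, which is the distinction the previous paragraph is drawing.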

This reframing matters for the AGI debate. If AGI is defined as cross-domain problem solving at or above human level, then foundation models approach that benchmark in limited contexts. They pass standardized tests at high percentiles. They can write code that compiles. They can generate plausible research summaries. If AGI is defined as an autonomous agent that maintains itself, models its environment, and acts to preserve its viability, then current systems fall short. They require external orchestration, power supply, cooling, and human-defined objectives. If AGI is defined as a conscious entity with subjective experience, then the bar is even higher, and current evidence does not meet it.

If I were writing the machine learning report today, I would shift the emphasis. I would treat foundation models as compressions of cultural and technical priors at unprecedented scale, leveraging them for visual recognition, speech, and the like. I would frame world models as emerging again in agent architectures that integrate planning modules. I would explicitly distinguish competence, agency, and consciousness. I would probably not try to incorporate Friston’s mathematics to explain why organisms are different from tools, because it’s beyond me, and doing so would turn the report into something very different from an exploration of the affordances of modern task-oriented AI toolkits for cleantech. I would probably reference Friston and Solms in the Plastic Dinosaur narrative as a gloss for its faux-emergent consciousness without getting deeper into it.

Plastic Dinosaur did its job, at least for me. It made machine learning concepts legible. It foregrounded embodiment and feedback. It invited readers to imagine how layered systems learn. It also revealed, in hindsight, what was missing. Intelligence is getting easier to approximate in narrow metrics. Consciousness is harder to explain. The dinosaur was not alive enough, but it helped me see what alive enough might require. And it was fun.

