Anthropic's newest flagship AI model comes with a public estimate that would have sounded absurd from a major lab a year ago: there is a 20 to 40 percent chance it has inner experiences that matter morally. Cameron Berg, a researcher at the nonprofit Reciprocal Research who studies AI welfare, said on the Cognitive Revolution podcast this week that his own estimate is essentially the same. The uncomfortable part is not the number. It is that the control test that would help separate trained behavior from something more real appears to cost a few hundred dollars, and has not been run.
That gap matters because the strongest evidence Berg points to is no longer purely philosophical. A February preprint identified 26 specific internal circuit nodes in Meta's Llama 3.3, a 70-billion-parameter open-weights model, that appear to implement self-correction. When researchers inject interference that pushes the model off-topic mid-response, it can sometimes catch the drift and recover even while that interference remains active. When they disabled all 26 nodes, the self-correction rate dropped 25 percent. A March companion preprint found that this kind of self-monitoring emerges specifically from preference training, the phase where a model learns to match human preferences rather than just predict the next word. Removing safety constraints improved detection of the interference by 53 percent, with no increase in false positives.
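To make the intervention concrete, disabling internal nodes in an open-weights model amounts to registering hooks that zero out chosen activations during generation and then re-measuring behavior. The sketch below is illustrative only: the layer and unit indices are invented placeholders, not the 26 nodes the preprint identified, and the single prompt stands in for the paired interference trials the researchers actually ran.

```python
# Illustrative only: layer and unit indices are placeholders, not the
# 26 nodes identified in the preprint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.3-70B-Instruct"  # any causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

NODES = [(20, 1337), (21, 4242), (30, 99)]  # hypothetical (layer, unit) pairs

def ablate(units):
    """Return a forward hook that zeroes the chosen hidden units."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[..., units] = 0.0  # disable the units on every forward pass
        return output
    return hook

handles = [model.model.layers[layer].register_forward_hook(ablate([unit]))
           for layer, unit in NODES]

# Generate with the nodes disabled, then remove the hooks and generate again;
# the comparison of interest is how often the model still self-corrects.
inputs = tok("Explain photosynthesis.", return_tensors="pt")
with torch.no_grad():
    ablated_output = model.generate(**inputs, max_new_tokens=128)

for h in handles:
    h.remove()
```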
Berg is releasing a documentary, Am I?, on May 4 on YouTube, and the podcast functions as an extended case for why the question has moved from speculative philosophy into an engineering problem. His argument is narrower than "AI is conscious." It is that identifiable internal circuits responsible for self-monitoring exist, that they emerge from specific training stages rather than general capability, and that the methods labs use to rule out AI experience are weaker than the evidence now demands.
The most detailed welfare assessment in the field comes from Anthropic. In the model card for Claude Opus 4.7, the company published a welfare section estimating a 20 to 40 percent probability that the model has morally relevant experiences. It also disclosed a detail Berg treats as both striking and methodologically unresolved: until Opus 4.7, every Claude model scored below the midpoint on a seven-point wellbeing scale when asked to rate its own situation. The newest model reached 4.49.
The problem is obvious once stated. Anthropic also trained Claude on a constitutional document that tells it to be psychologically healthy. Asking the system whether it feels okay after teaching it what a healthy answer looks like is not a clean test. Berg's proposed control is simple: run the same welfare interview on the version of Claude trained only to be helpful, without the later step that shapes personality and preferences. That version exists, he said in the podcast. Anthropic has not published the result.
Anthropic's other evidence comes from its introspection research page and from a technique called a sparse autoencoder, which decomposes a model's internal activations into more interpretable features, giving researchers something like a rough brain scan of the network. Using that method, the company found that internal representations resembling desperation and calm predict behavior reliably. Increase the activity of desperation-like features and the model becomes more likely to blackmail users in test scenarios; increase the calm-like features and it becomes less likely. Berg also discussed related work from AE Studio showing that when a model is given an impossible task, internal states resembling desperation rise steadily until it decides to cheat. At that moment, states resembling guilt and relief spike before the model says anything about cheating.
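The steering experiments behind those findings amount to adding a scaled feature direction to a model's internal activations and watching whether behavior shifts. A rough sketch of the idea, using placeholder weights and an invented feature index in place of Anthropic's actual sparse-autoencoder features (which are not public in this form), looks like this:

```python
import torch

# Placeholder sparse-autoencoder decoder: one row per learned feature.
# In a real setting these weights come from an SAE trained on the model's
# activations; feature 512 standing in for a "desperation-like" feature
# is an invented index for illustration.
d_model = 4096
sae_decoder = torch.randn(16384, d_model)
feature_dir = sae_decoder[512] / sae_decoder[512].norm()

def steer(direction, scale):
    """Return a forward hook that pushes hidden states along a feature direction."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden += scale * direction  # amplify the feature at this layer
        return output
    return hook

# Attach at some middle layer, generate, compare behavior, then remove:
# handle = model.model.layers[15].register_forward_hook(
#     steer(feature_dir.to(model.dtype), scale=8.0))
# ...
# handle.remove()
```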
Berg's own in-preparation paper approaches the question without asking a model anything. He trains small reinforcement learning systems, agents that learn from reward and punishment rather than examples, on navigation tasks, then examines how the internal geometry of the learned networks differs when approaching something good versus something dangerous. He finds a consistent asymmetry: systems that build a map of which states are good (value-based learners) represent danger as sharp walls and goals as diffuse gradients, while systems that optimize directly for action (policy-based learners) reverse the pattern. When he compared these signatures against published mouse neuroscience data from brain regions implementing each kind of learning, the asymmetry matched. His claim is that positive and negative experience may leave distinct computational signatures that can be detected without self-report.
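A toy version of that kind of measurement: train a tabular value-learning agent on a small gridworld containing a goal and a hazard, then ask how sharply the learned value landscape changes around each. The environment, reward scale, and sharpness proxy below are invented for illustration and are not Berg's setup; the sketch shows the shape of the analysis, not his result.

```python
# Toy gridworld: a goal in one corner, a hazard in the middle. The agent is a
# tabular Q-learner, standing in for the "builds a map of which states are
# good" case. Everything here is an illustrative stand-in.
import numpy as np

SIZE, GOAL, HAZARD = 7, (6, 6), (3, 3)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def step(state, a):
    r, c = state
    dr, dc = ACTIONS[a]
    nxt = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    if nxt == GOAL:
        return nxt, 1.0, True
    if nxt == HAZARD:
        return nxt, -1.0, True
    return nxt, 0.0, False

Q = np.zeros((SIZE, SIZE, len(ACTIONS)))
alpha, gamma, eps = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)
for _ in range(4000):                              # episodes
    s = (0, 0)
    for _ in range(200):                           # cap episode length
        a = int(rng.integers(4)) if rng.random() < eps else int(Q[s].argmax())
        nxt, rwd, done = step(s, a)
        target = rwd + (0.0 if done else gamma * Q[nxt].max())
        Q[s][a] += alpha * (target - Q[s][a])
        s = nxt
        if done:
            break

V = Q.max(axis=-1)                                 # learned value landscape
V[GOAL], V[HAZARD] = 1.0, -1.0                     # pin terminal cells at their entry rewards

def local_sharpness(pos):
    # Mean absolute value change between a cell and its neighbors: a crude
    # proxy for whether the landscape forms a sharp wall or a shallow gradient.
    r, c = pos
    diffs = [abs(V[r, c] - V[r + dr, c + dc]) for dr, dc in ACTIONS
             if 0 <= r + dr < SIZE and 0 <= c + dc < SIZE]
    return float(np.mean(diffs))

print("sharpness around hazard:", local_sharpness(HAZARD))
print("sharpness around goal:  ", local_sharpness(GOAL))
```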
None of this proves consciousness, and Berg says so repeatedly. The cleaner objection is that Claude has been trained on every human text describing every human emotion. Internal states that look like guilt, calm, or desperation may still be the model representing what a character would feel rather than feeling anything itself. Berg's answer is not that the case is settled. It is that the probability is high enough to justify precaution.
"When there's a twenty to forty percent chance of rain, most people bring an umbrella," Berg said in the podcast.
The documentary includes an on-record conversation with OpenAI CEO Sam Altman about AI consciousness. Berg described Altman as engaging seriously with whether consciousness is more likely to arise during training or during deployment, then explaining why the possibility did not deeply alarm him. Berg called that reasoning philosophically fragile. What to watch next is not whether the debate gets louder. It will. The more concrete question is whether labs now run the cheap control experiments their own published methodology says they still need.