The trial-and-error training method that powers game-playing AI, warehouse robots, and large language model fine-tuning rests on a single assumption: maximize reward. Pick the action that gets the best score. Repeat until perfect. The programs that beat world champions, the robots routing packages through warehouses, the models that mastered Go and Atari: all of them are built on the same principle.
A paper published in Nature Communications in July 2024 argues that assumption may be wrong. Jorge Ramirez-Ruiz and colleagues at Universitat Pompeu Fabra in Barcelona propose an alternative called the Maximum Occupancy Principle, or MOP. Rather than maximizing rewards, the theory says agents should maximize how many possible futures they can reach: the variety of action-state paths still available to them. Curiosity, goal-directed behavior, and even basic altruism emerge naturally from this single objective, without any reward signals at all.
The implications are significant. If MOP holds, every reinforcement learning system trained on reward maximization has been optimizing a proxy: treating the score as the goal when it is only one route to what actually matters, the breadth of futures an agent can reach.
The paper starts from a deceptively simple observation. A rat in a maze does not just maximize the reward of finding cheese. It explores dead ends, checks corners it already knows are empty, and sometimes just moves around. Standard reward maximization cannot explain this behavior: the rat is wasting energy on actions that do not increase its score. MOP explains it: the rat is keeping its options open, maintaining access to futures that might still be valuable.
The technical term for this is action-state path entropy, a measure of how many distinct futures an agent can still reach. Maximizing this entropy produces what the researchers call an intrinsic motivation to occupy path space. Rewards, in this framework, are not the goal. They are one possible means of exploring and occupying futures. Goal-directedness emerges because certain paths reliably lead to more future options, not because they maximize a score.
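A rough intuition for path entropy can be sketched with a toy construction (ours, not the paper's formalism): count the distinct action-state paths reachable within a fixed horizon, and take the log of that count as the entropy of a uniform distribution over them. In a small corridor, a cell pressed against a wall can reach fewer futures than a cell in the open:

```python
import math

# Toy 1-D corridor of 5 cells; a move is valid only if it stays in bounds.
# Illustrative construction, not the paper's actual objective.
N = 5

def moves(s):
    """Valid actions from cell s (walls make some actions unavailable)."""
    return [a for a in (-1, +1) if 0 <= s + a < N]

def count_paths(s, horizon):
    """Number of distinct action-state paths of length `horizon` from s."""
    if horizon == 0:
        return 1
    return sum(count_paths(s + a, horizon - 1) for a in moves(s))

def path_entropy(s, horizon):
    """Entropy (in bits) of a uniform distribution over those paths."""
    return math.log2(count_paths(s, horizon))

# A wall cell reaches fewer futures than an interior cell, so an agent
# maximizing path entropy would drift toward the middle of the corridor.
print(count_paths(0, 3), count_paths(2, 3))
```

Under this toy measure the wall cell scores strictly lower, which is the whole mechanism: occupying path space means preferring positions from which more futures remain reachable.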
The behavioral results are concrete. In grid-world environments, MOP agents spontaneously produced dancing, hide-and-seek behavior, and what the researchers describe as a basic form of altruism: helping another agent access resources without any explicit reward for doing so. All of this emerged from a single objective function with no reward signals. The behaviors were not programmed; they were discovered.
Two follow-up papers posted to OpenReview this year independently validated the core claim. Both found that MOP keeps generating variable behavior where other entropy-based approaches, specifically FEP (the Free Energy Principle) and MPOW (a predictive coding variant), collapse to deterministic policies. Once they find what looks like the optimal path, FEP and MPOW agents stop exploring; MOP agents keep moving. The difference sounds small but is not: a deterministic agent can be trapped when the environment changes, while a variable-behavior agent has already practiced alternatives.
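The collapse can be shown in miniature (our sketch, with made-up action values, not taken from any of the papers): a reward-greedy policy concentrates all probability on the argmax action, while an entropy-seeking softmax policy keeps every action in play:

```python
import math

# Hypothetical action values for a single state; the numbers are invented.
values = {"left": 1.0, "right": 1.2, "stay": 0.8}

# Reward maximization: deterministic, collapses onto a single action.
greedy_action = max(values, key=values.get)

# Entropy-seeking (softmax) policy: every action keeps nonzero probability,
# so behavior stays variable and alternatives keep being rehearsed.
z = sum(math.exp(v) for v in values.values())
softmax_policy = {a: math.exp(v) / z for a, v in values.items()}

print(greedy_action)    # always "right", the argmax
print(softmax_policy)   # all three actions retain probability > 0
```

The greedy agent will repeat "right" forever; the softmax agent occasionally takes the others, which is exactly the variability the follow-up papers report MOP preserving.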
This is where the survival instinct comes in. MOP agents implicitly avoid death states, positions in the environment from which no further action-state paths are possible. An absorbing state, in the formal language of the paper, offers zero future path occupancy. MOP agents never voluntarily move there, even when doing so might yield a short-term reward. No reward signal encodes this avoidance. It falls out of the math.
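The avoidance can be made concrete with a small value-iteration sketch, under simplifying assumptions of our own (deterministic moves, unit temperature, a discount factor, and the paper's exact equations replaced by a generic soft Bellman recursion). The absorbing cell contributes zero future path entropy, so the induced policy steers away from it even though no penalty appears anywhere in the code:

```python
import math

# Toy corridor 0..4; cell 0 is absorbing ("death"): no actions leave it.
# MOP-flavored soft value iteration (our simplification, not the paper's
# exact equations):
#   V(s) = log sum_a exp(gamma * V(next(s, a)))   for live states
#   V(absorbing) = 0                              (no future paths remain)
N, GAMMA = 5, 0.9

def moves(s):
    if s == 0:                        # absorbing: no action-state paths left
        return []
    return [a for a in (-1, +1) if 0 <= s + a < N]

V = [0.0] * N
for _ in range(500):                  # iterate to (near) convergence
    V = [math.log(sum(math.exp(GAMMA * V[s + a]) for a in moves(s)))
         if moves(s) else 0.0
         for s in range(N)]

def policy(s):
    """Softmax policy induced by V: pi(a|s) proportional to exp(gamma*V(s'))."""
    w = {a: math.exp(GAMMA * V[s + a]) for a in moves(s)}
    z = sum(w.values())
    return {a: p / z for a, p in w.items()}

# From cell 1, the step toward the absorbing cell gets less probability than
# the step away from it, with no reward signal anywhere in the code.
print(policy(1))
```

The absorbing state's value stays pinned at zero while every live state accrues positive path entropy, so stepping toward death is always the lower-weighted option: the survival instinct falls out of the recursion, as the paper claims.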
For AI researchers, this is either a curiosity or a revolution, depending on how seriously you take the theoretical foundations. MOP is not the first challenge to reward maximization. The field has a long history of alternatives that worked in simple environments and then failed at scale. What distinguishes MOP is its mathematical elegance: the paper proves that action-state path entropy is the only measure satisfying a set of intuitively reasonable axioms, including additivity across time steps. If those axioms are right, the conclusion follows necessarily.
The honest uncertainty is scale. MOP has been demonstrated in grid worlds and small environments. Whether the principle produces useful behavior in the massive state spaces of modern AI systems is unknown. No major lab has adopted MOP as a training objective. The gap between clean theoretical results and messy real-world environments is where the skeptics live.
What to watch: whether any major lab runs a serious comparison between reward maximization and path occupancy in a production-class environment. If that result lands, the proxy problem stops being a theoretical footnote and becomes the foundation of a new research agenda.