The proteins that keep us alive represent a vanishingly small sample of what nature could have built. A new paper in PNAS puts numbers to that intuition — and suggests that AI protein design tools, despite the 2024 Nobel Prize and the hype, are working from a deeply constrained dataset. The limiting factor is not selection, not epistasis, not even the physics of protein folding. It is ancestry itself.
The study, led by researchers at the Okinawa Institute of Science and Technology, the Institute of Science and Technology Austria, the University of Vienna, and the Centro de Astrobiología, models how protein families evolve across sequence space — the vast landscape of all possible amino acid chains of a given length. The researchers found that for most gene families, the "effective topological dimension" of that space — meaning how many independent directions evolution has actually sampled — is approximately one. In a few families it reaches above ten. The implication is stark: proteins have not wandered freely through functional sequence space. They have mostly stayed near where they started.
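To make the idea of "how many independent directions evolution has sampled" concrete, here is a toy illustration — my own sketch, not the paper's method — using a participation-ratio estimate of effective dimension. One-hot-encoded sequences from a single drifting lineage occupy far fewer principal directions than the same number of unrelated random sequences:

```python
import numpy as np

rng = np.random.default_rng(0)
A, L, N = 20, 50, 200  # alphabet size, sequence length, number of samples

def one_hot(seqs):
    # seqs: (N, L) integer array -> (N, L*A) binary matrix
    out = np.zeros((len(seqs), L * A))
    for i, s in enumerate(seqs):
        out[i, np.arange(L) * A + s] = 1.0
    return out

def effective_dim(seqs):
    # participation ratio of the PCA spectrum: (sum lam)^2 / sum(lam^2)
    X = one_hot(np.asarray(seqs))
    X -= X.mean(axis=0)
    lam = np.linalg.svd(X, compute_uv=False) ** 2
    return lam.sum() ** 2 / (lam ** 2).sum()

# One lineage drifting by single point mutations from a common ancestor
ancestor = rng.integers(0, A, size=L)
lineage = [ancestor.copy()]
for _ in range(N - 1):
    s = lineage[-1].copy()
    s[rng.integers(L)] = rng.integers(A)  # one point mutation per step
    lineage.append(s)

# The same number of independent random sequences: no shared ancestry
random_seqs = rng.integers(0, A, size=(N, L))

print(f"lineage effective dim: {effective_dim(lineage):.1f}")
print(f"random  effective dim: {effective_dim(random_seqs):.1f}")
```

The specific numbers depend on the toy parameters, but the gap between the two estimates captures the paper's point: shared ancestry collapses the explored portion of sequence space to a handful of directions.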
"That the starting point is the main evolutionary limit is not necessarily surprising, but the scale of its importance is really quite remarkable," said lead author Lada Isakova, a PhD student in the unit of co-author Fyodor Kondrashov at OIST. "As an evolutionary biologist, I was intrigued to see how little selection and epistasis seemed to matter in our results."
The finding has direct consequences for one of biotechnology's most celebrated recent achievements. AlphaFold, the DeepMind system whose creators Demis Hassabis and John Jumper shared the 2024 Nobel Prize in Chemistry with protein designer David Baker, can predict the structure of nearly any protein with extraordinary accuracy. But it predicts structures from sequences — and those sequences come from nature. The models learn from databases of proteins that exist or have existed. If those databases represent a tiny, ancestry-constrained subset of the functional proteins that are possible, then generating truly novel proteins means working from a biased sample.
"Neural network approaches for functional protein prediction are limited by the data sets we provide," Isakova said. "Based on existing data, most methods won't be able to generalize well beyond the current known sequence space."
This is not a peripheral concern. The field of AI-driven protein design has moved rapidly from structure prediction to de novo generation — systems like Evo, which can design functional protein-RNA complexes and CRISPR systems from scratch, have been presented as evidence that AI can go beyond finding natural sequences to inventing new ones. But if the space of functional proteins accessible from any given ancestral starting point is narrow, then even the most sophisticated generative model is exploring within a bounded region. It cannot easily leap to a different region of sequence space that evolution never visited from that lineage.
The paper's findings also speak to how the first proteins may have arisen. The simulations suggest that point mutations alone — the gradual accumulation of small changes — could not have produced the diversity of early protein families within plausible timescales. Instead, the data support recombination: small pieces of DNA shuffling together to create sequences that could encode substantially different proteins. Recombination as a driver of early protein diversity is a long-standing hypothesis; this paper adds quantitative support for it.
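The intuition behind that result can be shown with a toy simulation — again my own sketch, not the paper's model. A single recombination event splicing in a fragment from an unrelated sequence moves a protein much farther through sequence space than a comparable number of point mutations:

```python
import numpy as np

rng = np.random.default_rng(1)
A, L = 20, 100   # alphabet size, sequence length
TRIALS = 500

def hamming(a, b):
    # number of positions at which two sequences differ
    return int((a != b).sum())

def point_mutate(seq, n_mut):
    # apply n_mut single-letter substitutions at random positions
    s = seq.copy()
    for _ in range(n_mut):
        s[rng.integers(L)] = rng.integers(A)
    return s

def recombine(seq, donor, frag_len):
    # splice one contiguous fragment from an unrelated donor sequence
    s = seq.copy()
    start = rng.integers(L - frag_len + 1)
    s[start:start + frag_len] = donor[start:start + frag_len]
    return s

d_point, d_recomb = [], []
for _ in range(TRIALS):
    anc = rng.integers(0, A, size=L)
    donor = rng.integers(0, A, size=L)
    d_point.append(hamming(anc, point_mutate(anc, 10)))
    d_recomb.append(hamming(anc, recombine(anc, donor, 30)))

print(f"mean distance after 10 point mutations:        {np.mean(d_point):.1f}")
print(f"mean distance after one 30-residue splice-in:  {np.mean(d_recomb):.1f}")
```

Ten point mutations can move a sequence at most ten positions away from its ancestor; a single 30-residue splice from an unrelated donor typically moves it nearly 30. Scaled up over evolutionary time, that is the mechanism the simulations favor for producing early protein diversity.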
The work raises a question the authors do not fully answer: what does it take to actually explore the unexplored regions of sequence space? Isakova's answer is cautious and direct: new experimental data. High-throughput experiments that sample functional sequence space beyond what evolution has produced — directed evolution on a larger scale, or screens of combinatorial libraries — would be needed to break out of the ancestry constraint. That is expensive, slow, and not yet standard practice.
For now, the models are trained on a sample of life that is unrepresentative of what life could, in principle, do. The Nobel Prize was for predicting structures. What the new paper reminds us is that prediction is not the same as discovery — and that the training data has baked in limits that no architecture, however powerful, can fully overcome.
Note: The primary source DOI cited by OIST and GEN News (10.1073/pnas.2532018123) currently returns a 404. Both OIST and GEN News independently cite the same paper in PNAS, and the author quotes are consistent across both sources. The story is based on the OIST press release and GEN News coverage, which are mutually corroborating. Readers seeking the primary paper should search PNAS directly for the authors Isakova and Kondrashov.
https://www.miragenews.com/limits-of-protein-evolution-unveiled-1647179/
https://www.genengnews.com/topics/translational-medicine/common-ancestry-limits-protein-sequence-exploration-computational-study-shows/
https://www.science.org/doi/10.1126/science.ado9336