When researchers at the Arc Institute designed a set of bacteriophage genomes from scratch, they started with no natural template. The most divergent sequence they produced was 7 percent different from anything that has ever been observed in nature. It still worked — infecting and killing bacteria in the lab. Sixteen out of 302 AI-generated candidates were functional, each one a real organism that evolution never made.
That result, published on bioRxiv and described on the Arc Institute's website, sits at the uncomfortable center of a debate among biosecurity researchers about what happens when AI design capability outpaces the tools meant to catch dangerous outputs before anyone orders the DNA.
The short answer, according to Richard Moulange, AI-Biosecurity Policy Manager at the Centre for Long-Term Resilience: the screening tools are losing.
One widely used DNA screening tool flagged just 23 percent of AI-generated toxin sequences in a test run described by Science. Another missed more than 75 percent of potential toxins. After the tools were upgraded, average detection reached 72 percent, leaving more than one in four concerning sequences undetected. For the sequences the models themselves rated as most likely to generate toxins, detection hit 97 percent, which sounds reassuring until you remember that the rating depends on the model's own assessment of its output.
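To put those rates in concrete terms, here is a minimal plain-Python sketch. The detection figures are the ones reported above; the volume of 1,000 submissions is a hypothetical, chosen only to scale them.

```python
# Convert the reported detection rates into expected misses.
detection_rates = {
    "one tool, before upgrades": 0.23,
    "average, after upgrades": 0.72,
    "model's top-rated sequences": 0.97,
}

orders = 1_000  # hypothetical number of concerning sequences submitted

for label, rate in detection_rates.items():
    missed = orders * (1 - rate)
    print(f"{label}: {rate:.0%} caught, ~{missed:.0f} of {orders} slip through")
```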
"The assumption that managed access to dangerous capabilities can be maintained is already broken," Moulange told the 80,000 Hours Podcast. "Someone decides to free a tool. They tell an advanced coding agent to read the Nature paper and reproduce its capabilities without any of those safeguards."
Moulange is describing what he calls the agentic AI problem: the same infrastructure being built to automate software development can read a scientific publication, identify the key techniques, and reproduce them outside any controlled environment. By his own analysis, current AI agents can reliably complete coding tasks that would take a human expert about 30 minutes, and that effective time horizon is doubling every four to seven months.
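The doubling claim compounds quickly. Here is a back-of-the-envelope projection using Moulange's 30-minute starting point and four-to-seven-month doubling range; the 40-hour target task and the straight-line extrapolation are assumptions for illustration, not his claims.

```python
import math

# How long until agents reliably handle a 40-hour task, if the effective
# time horizon keeps doubling at the reported rate? Straight extrapolation.
start_min = 30        # Moulange's figure: ~30-minute tasks today
target_min = 40 * 60  # hypothetical target: a 40-hour task

doublings = math.log2(target_min / start_min)  # ~6.3 doublings needed

for months_per_doubling in (4, 7):  # the reported range
    total = doublings * months_per_doubling
    print(f"doubling every {months_per_doubling} months -> "
          f"~{total:.0f} months to a 40-hour horizon")
```

On those assumptions, the answer falls somewhere between roughly two and four years.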
The gap between what AI can design and what screening catches has drawn attention from a second research track. A team led by Andrew Fiore and Valentin Wittmann at UC San Diego used specialized AI protein design tools to generate more than 70,000 DNA sequences encoding variant forms of three toxin protein families. They then ran those sequences through the same commercial screening tools companies use to check DNA synthesis orders before fulfilling them. The results, also reported in Science, showed the screening layer catching some sequences while letting through others that were functionally similar to the ones it did flag.
Separately, a team including researchers at MIT and the University of Washington developed a benchmark called the Virology Capabilities Test (VCT): 322 questions covering the practical knowledge working virologists need in a lab. When they tested frontier language models on it, OpenAI's o3 scored 43.8 percent, outperforming 94 percent of the expert virologists who took the same test and averaged 22.1 percent. The implication is not that AI has mastered virology — 43.8 percent is not a passing score by most measures — but that the gap between what these models know and what most specialists know is smaller than the field has assumed.
The Arc Institute work offers a more fundamental illustration. Using a genomic foundation model called Evo 2, which predicts biological function from sequence data the way language models predict text from tokens, researchers there designed novel bacteriophage genomes from scratch. Evo 2 was trained on more than 8.8 trillion nucleotides from a genomic atlas spanning all domains of life — bacteria, archaea, eukarya, and bacteriophage. That is not a figure that fits easily in a sentence, but it is the actual scale of what went into making a model capable of proposing genomes evolution never produced.
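The language-model analogy can be made mechanical. The toy below is emphatically not Evo 2, which is a large model trained on trillions of nucleotides; it is a bigram counter over a made-up snippet of DNA, included only to show the shape of the task: given context, assign a probability to each possible next base.

```python
from collections import Counter, defaultdict

# Toy autoregressive model over nucleotides: count which base tends to
# follow which, then turn counts into next-base probabilities.
training_dna = "ATGCGATACGATGCGATTGCAATGCGA"  # made-up sequence

counts: defaultdict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(training_dna, training_dna[1:]):
    counts[prev][nxt] += 1

def next_base_probs(context: str) -> dict[str, float]:
    """P(next base | last base of context), from bigram counts."""
    tallies = counts[context[-1]]
    total = sum(tallies.values())
    return {base: n / total for base, n in tallies.items()}

print(next_base_probs("ATG"))  # {'C': 0.5, 'A': 0.5} for this toy corpus
```

Scale that idea from bigrams to a transformer and from 27 bases to 8.8 trillion, and you have the rough contour of what a genomic foundation model does.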
They were not modifying existing pathogens. They were specifying functions — kill this bacterium — and letting the model propose sequences. The 7 percent divergence figure means those genomes are not close relatives of anything in the training data. They are genuinely novel.
Sixteen of the 302 designs replicated in E. coli bacteria. The team then tested whether AI-designed phage cocktails could overcome antibiotic resistance. In all three resistant bacterial strains tested, the cocktails broke through resistance within one to five lab passages. The wild-type phage, by contrast, failed completely. Functionally, that is a proof-of-concept for engineered phage therapies targeting resistant infections. It is also a proof-of-concept for something nobody wants to name.
The biosecurity community has known about these dual-use risks for years. What has changed is the capability curve. Foundation models like Evo 2 did not exist a few years ago. The VCT benchmark did not exist a year ago. AI agents that can autonomously reproduce published methods did not exist in any meaningful deployment sense six months ago. Each of these is a separate development. Together they describe a situation where the tools for designing novel organisms are getting more powerful, the tools for catching bad actors are getting better but not fast enough, and the infrastructure for disseminating both is becoming accessible to anyone who can write a prompt.
Moulange's argument is not that the danger is imminent. It is that the managed access framework — the idea that you can control what people learn about dangerous biology by restricting access to certain papers, datasets, and tools — depends on the bottleneck staying in place. AI agents eliminate that bottleneck. An advanced coding agent reading a restricted paper and reproducing its core method is not science fiction. It is, in Moulange's assessment, a present capability.
The screening software companies have updated their systems since the Science reports. The newer detection rates are higher. But detection rates are only as good as the sequences submitted for screening, and a bad actor optimizing for evasion does not have to beat the system every time; they only have to find the one case where it fails.
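That asymmetry has a simple shape. Treating each submission as an independent draw against the post-upgrade average is a simplification (a real adversary optimizing for evasion would do better than chance), but even that generous assumption makes the point:

```python
# Chance that at least one of n submissions evades a 72% detection rate,
# assuming independent draws. Real adversaries would do better than chance.
detection = 0.72

for n in (1, 5, 10, 20):
    p_evade = 1 - detection**n
    print(f"{n:>2} submissions -> {p_evade:.0%} chance one gets through")
```

At ten independent tries, the chance of at least one miss is already about 96 percent.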
There is no public data on whether anyone has successfully ordered DNA for a genuinely dangerous sequence through the current screening regime. The absence of a documented case is not evidence the gap is theoretical. It may mean nobody has tried yet, or that attempts were caught and not publicized, or that the first successful evasion has not been detected.
The Arc Institute phage work was published openly. The Science toxin screening work was published openly. The VCT benchmark was published openly. Evo 2 was published openly. The research enterprise runs on openness; that is not a flaw in the system, it is the system. The question is what the appropriate response looks like when the same openness that enables scientific progress also lowers the cost of reproducing dual-use methods outside any institutional oversight.
Moulange's answer, or the beginning of one, is that the managed access model needs to be rethought from the ground up, not patched at the margins. He has not published a specific proposal. Neither has anyone else in the field produced consensus on what comes next.
What does exist is a set of numbers that most people working at the intersection of AI and biology would prefer were not in the same sentence: 23 percent detection, 72 percent after upgrades, 7 percent divergence from nature that still works, 43.8 percent on a practical virology test outperforming most specialists. Taken individually, each number is a data point. Taken together, they describe a field moving fast enough that the next update to any of them may narrow the gap — or widen it.