AI Designs Genomes That Work. Scientists Still Scoring 22%.
"We now have the ability to design organisms from scratch that have never existed in nature," researchers wrote. Then came the benchmark data.
"We now have the ability to design organisms from scratch that have never existed in nature," researchers wrote. Then came the benchmark data.

Arc Institute's Evo 2 model designed 16 synthetic bacteriophage genomes, each carrying 67 to 392 novel mutations, that were then experimentally validated, with several divergent enough to qualify as new species under taxonomic thresholds, demonstrating that AI can design functional, divergent life. On the VCT benchmark, OpenAI's o3 scored 43.8% on domain-specific virology questions, while expert virologists averaged 22.1% in their own subfields. Microsoft red teaming found that commercial DNA synthesis screening missed over 75% of AI-modified toxin sequences, and roughly 20% of global synthesis volume goes through vendors that perform no screening at all. Post-upgrade screening achieved a 97% catch rate on the highest-risk sequences, leaving a residual 3% miss rate on the most dangerous AI-generated DNA. The gap between AI design capability and biosecurity infrastructure is measurable and addressable, but not yet closed.
Arc Institute's Evo 2 model designed functional bacteriophage genomes from scratch. Virologists scored 22% on a test of their own subfield. AI scored 43.8%.
That gap, with AI outperforming 94% of expert virologists on questions in their own areas of expertise, is the most concrete data point in a biosecurity debate that has largely been conducted in abstractions. The result, from the Virology Capabilities Test (VCT) benchmark posted to arXiv in April 2025, comes from researchers at SecureBio, the Federal University of ABC, the Center for AI Safety, and the MIT Media Lab. OpenAI's o3 model scored 43.8% on questions drawn from the virology literature; expert virologists with internet access scored 22.1% in their own sub-areas. The gap is not a flattering artifact. It is a head-to-head result on domain-specific questions, and it arrives at a moment when the capabilities it measures are being applied in real biology labs.
Arc Institute's Evo 2 model designed and validated 16 synthetic bacteriophage genomes in the lab — each carrying 67 to 392 novel mutations compared to their nearest natural counterpart, and several divergent enough from any known natural sequence to qualify as new species under some taxonomic thresholds. The genomes were not just computational predictions. They were synthesized, introduced into bacteria, and shown to be functional. In cocktail form, AI-designed phages overcame bacterial resistance in all three tested strains within one to five passages. Wild-type phage ΦX174, the standard lab control, failed completely against the same resistant strains. The therapeutic application — engineered bacteriophage therapy against antibiotic-resistant bacteria — is real and actively being developed. The lab bench result is legitimate.
The safety gap in DNA synthesis screening is equally concrete. A Microsoft red team generated over 70,000 AI-modified toxin DNA sequences and submitted them to commercial synthesis screening tools. One widely-used tool missed more than 75% of them. Some DNA vendors, accounting for perhaps 20% of global synthesis volume, do not screen at all. After targeted upgrades, the same screening systems flagged 97% of the highest-risk sequences — the ones AI models rated most likely to generate toxins — while catching 72% of sequences across the full test set. That leaves a miss rate of roughly 3% on the highest-risk subgroup, and 28% across all sequences. The gap between what AI can design and what screening catches is not theoretical. It is a measurable blind spot running in both directions.
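A back-of-envelope calculation makes the scale of that blind spot concrete. The sketch below uses the figures reported above; applying both rates to the full 70,000-sequence set is a simplifying assumption, since the exact subset sizes were not published.

```python
# Rough arithmetic on the red-team screening figures reported above.
# Assumption: both rates are applied to the full 70,000-sequence set;
# the actual subset sizes behind each rate were not published.

total_sequences = 70_000            # AI-modified toxin sequences submitted
pre_upgrade_miss_rate = 0.75        # one widely used tool missed >75%
post_upgrade_catch_overall = 0.72   # catch rate across the full set, post-upgrade

missed_before = total_sequences * pre_upgrade_miss_rate
missed_after = total_sequences * (1 - post_upgrade_catch_overall)

print(f"missed before upgrade: ~{missed_before:,.0f} sequences")
print(f"missed after upgrade:  ~{missed_after:,.0f} sequences")
```

Even after the upgrades, the overall numbers imply on the order of twenty thousand AI-modified sequences slipping past screening, which is why the 97% figure on the highest-risk subset, not the 72% overall rate, is the one doing the reassuring work.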
The VCT result is not a prediction about future capability; it is a measurement of current capability. On a test designed to evaluate comprehension of virology, with questions drawn from published papers that require understanding of viral mechanisms, mutation effects, and experimental design, AI models already outperform most humans working in the field. The benchmark was designed to be hard: questions demand genuine comprehension, not pattern matching to known papers. The capability exists now, not at some future horizon.
The threat model that follows from these results is not science fiction. It is a synthesis of the RAND-CLTR Global Risk Index, which evaluated over 1,100 AI-enabled biological tools for misuse potential, and the empirical findings above. Richard Moulange, a co-author of the index and a researcher at the Centre for Long-Term Resilience, has described the capability trajectory in blunt terms: current AI agents can reliably complete expert-level tasks that take a human thirty minutes, and that time-horizon is doubling every four to seven months. He puts the implication plainly: AI can uplift actors — including nation-states with adversarial intent — to build biological capabilities that exceed what nature has produced. That is the close, not the hook.
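The trajectory Moulange describes is simple exponential arithmetic, and a short sketch shows why the doubling rate matters more than the starting point. The two-year window below is an illustrative assumption, not a claim from the article.

```python
# Extrapolating the task-horizon doubling Moulange describes: agents
# reliably complete ~30-minute expert tasks today, and that horizon
# doubles every 4-7 months. The two-year window is illustrative.

start_minutes = 30

for doubling_months in (4, 7):
    horizon_minutes = start_minutes
    months_elapsed = 0
    # apply whole doublings that fit inside a 24-month window
    while months_elapsed + doubling_months <= 24:
        months_elapsed += doubling_months
        horizon_minutes *= 2
    print(f"doubling every {doubling_months} months: "
          f"~{horizon_minutes / 60:.0f}-hour tasks after {months_elapsed} months")
```

At the fast end of the range, two years of doublings take agents from half-hour tasks to tasks measured in days; at the slow end, to roughly half a working day. Either way the window is measured in years, not decades.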
What makes this different from earlier biosecurity warnings is the combination of three measured results rather than one. The capability is measured (VCT). The application is demonstrated (Evo 2 functional genomes). The screening gap is quantified (a 75% miss rate, and roughly 20% of synthesis volume unscreened). Taken together, they describe a situation where the tools to design and optimize novel biological sequences, including pathogens, are more capable than the systems designed to catch misuse of those same sequences. The gap is not permanent. It is closable. Whether it gets closed before the capability window widens further is the operative question.
Biotech & Life Sciences · 21h 37m ago · 4 min read