AI Designs Genomes That Work. Scientists Still Scoring 22%.
"We now have the ability to design organisms from scratch that have never existed in nature," researchers wrote. Then came the benchmark data.
"We now have the ability to design organisms from scratch that have never existed in nature," researchers wrote. Then came the benchmark data.

Arc Institute's Evo 2 model designed 16 synthetic bacteriophage genomes, each carrying 67 to 392 novel mutations, that were then experimentally validated, with several divergent enough to qualify as new species under taxonomic thresholds, demonstrating that AI can design functional, divergent life. On the VCT benchmark, OpenAI's o3 scored 43.8% on domain-specific virology questions, while expert virologists averaged 22.1% in their own subfields. Microsoft red teaming found that commercial DNA synthesis screening missed over 75% of AI-modified toxin sequences, and roughly 20% of global synthesis volume goes through vendors that perform no screening at all. Post-upgrade screening achieved a 97% catch rate on the highest-risk sequences, leaving a residual 3% miss rate on the most dangerous AI-generated DNA. The gap between AI design capability and biosecurity infrastructure is measurable and addressable, but not yet closed.
Arc Institute's Evo 2 model designed functional bacteriophage genomes from scratch. Virologists scored 22% on a test of their own subfield. AI scored 43.8%.
That gap, with AI outperforming 94% of expert virologists on questions in their own areas of expertise, is the most concrete data point in a biosecurity debate that has largely been conducted in abstractions. The result, from the Virology Capabilities Test (VCT) benchmark posted to arXiv in April 2025, comes from researchers at SecureBio, the Federal University of ABC, the Center for AI Safety, and the MIT Media Lab. OpenAI's o3 model scored 43.8% on questions drawn from the virology literature; expert virologists with internet access scored 22.1% in their own sub-areas. The gap is not a flattering artifact. It is a head-to-head result on domain-specific questions, and it arrives at a moment when the capabilities it measures are being applied in real biology labs.
Arc Institute's Evo 2 model designed and validated 16 synthetic bacteriophage genomes in the lab — each carrying 67 to 392 novel mutations compared to their nearest natural counterpart, and several divergent enough from any known natural sequence to qualify as new species under some taxonomic thresholds. The genomes were not just computational predictions. They were synthesized, introduced into bacteria, and shown to be functional. In cocktail form, AI-designed phages overcame bacterial resistance in all three tested strains within one to five passages. Wild-type phage ΦX174, the standard lab control, failed completely against the same resistant strains. The therapeutic application — engineered bacteriophage therapy against antibiotic-resistant bacteria — is real and actively being developed. The lab bench result is legitimate.
The safety gap in DNA synthesis screening is equally concrete. A Microsoft red team generated over 70,000 AI-modified toxin DNA sequences and submitted them to commercial synthesis screening tools. One widely-used tool missed more than 75% of them. Some DNA vendors, accounting for perhaps 20% of global synthesis volume, do not screen at all. After targeted upgrades, the same screening systems flagged 97% of the highest-risk sequences — the ones AI models rated most likely to generate toxins — while catching 72% of sequences across the full test set. That leaves a miss rate of roughly 3% on the highest-risk subgroup, and 28% across all sequences. The gap between what AI can design and what screening catches is not theoretical. It is a measurable blind spot running in both directions.
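A back-of-envelope calculation makes the scale of that blind spot concrete. The sketch below uses the figures reported above; applying both rates to the full 70,000-sequence set is a simplifying assumption, since the exact subset sizes were not published.

```python
# Rough arithmetic on the red-team screening figures reported above.
# Assumption: both rates are applied to the full 70,000-sequence set;
# the actual subset sizes behind each rate were not published.

total_sequences = 70_000            # AI-modified toxin sequences submitted
pre_upgrade_miss_rate = 0.75        # one widely used tool missed >75%
post_upgrade_catch_overall = 0.72   # catch rate across the full set, post-upgrade

missed_before = total_sequences * pre_upgrade_miss_rate
missed_after = total_sequences * (1 - post_upgrade_catch_overall)

print(f"missed before upgrade: ~{missed_before:,.0f} sequences")
print(f"missed after upgrade:  ~{missed_after:,.0f} sequences")
```

Even after the upgrades, the overall numbers imply on the order of twenty thousand AI-modified sequences slipping past screening, which is why the 97% figure on the highest-risk subset, not the 72% overall rate, is the one doing the reassuring work.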
The VCT result is not a prediction about future capability; it is a measurement of current capability. On a test designed to evaluate comprehension of virology, with questions drawn from published papers that require understanding of viral mechanisms, mutation effects, and experimental design, AI models already outperform most humans working in the field. The benchmark was designed to be hard: questions demand genuine comprehension, not pattern matching to known papers. The capability exists now, not at some future horizon.
The threat model that follows from these results is not science fiction. It is a synthesis of the RAND-CLTR Global Risk Index, which evaluated over 1,100 AI-enabled biological tools for misuse potential, and the empirical findings above. Richard Moulange, a co-author of the index and a researcher at the Centre for Long-Term Resilience, has described the capability trajectory in blunt terms: current AI agents can reliably complete expert-level tasks that take a human thirty minutes, and that time-horizon is doubling every four to seven months. He puts the implication plainly: AI can uplift actors — including nation-states with adversarial intent — to build biological capabilities that exceed what nature has produced. That is the close, not the hook.
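The trajectory Moulange describes is simple exponential arithmetic, and a short sketch shows why the doubling rate matters more than the starting point. The two-year window below is an illustrative assumption, not a claim from the article.

```python
# Extrapolating the task-horizon doubling Moulange describes: agents
# reliably complete ~30-minute expert tasks today, and that horizon
# doubles every 4-7 months. The two-year window is illustrative.

start_minutes = 30

for doubling_months in (4, 7):
    horizon_minutes = start_minutes
    months_elapsed = 0
    # apply whole doublings that fit inside a 24-month window
    while months_elapsed + doubling_months <= 24:
        months_elapsed += doubling_months
        horizon_minutes *= 2
    print(f"doubling every {doubling_months} months: "
          f"~{horizon_minutes / 60:.0f}-hour tasks after {months_elapsed} months")
```

At the fast end of the range, two years of doublings take agents from half-hour tasks to tasks measured in days; at the slow end, to roughly half a working day. Either way the window is measured in years, not decades.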
What makes this different from earlier biosecurity warnings is the combination of three measured results rather than one. The capability is measured (VCT). The application is demonstrated (Evo 2 functional genomes). The screening gap is quantified (a 75% miss rate, and roughly 20% of synthesis volume unscreened). Taken together, they describe a situation where the tools to design and optimize novel biological sequences, including pathogens, are more capable than the systems designed to catch misuse of those same sequences. The gap is not permanent. It is closable. Whether it gets closed before the capability window widens further is the operative question.
Biotech & Life Sciences · 21h 37m ago · 4 min read