OpenAI's Frontier team shipped a million lines of code. Nobody reviewed it.
In the late 1980s, General Electric operated a lights-out light bulb factory in Winchester, Virginia. The assembly lines ran with few human workers, producing more than 10,000 units per hour, according to Assembly Magazine. It was an automation benchmark for a decade. OpenAI's Frontier team would like to be the GE of AI-driven software engineering. The parallel is imperfect but useful.
Ryan Lopopolo, a member of technical staff at OpenAI, published a detailed post last week describing what his team calls "harness engineering": building and running a production software system where zero lines of code were written by humans, and zero pull requests were reviewed by humans before merge. The repository contains roughly one million lines of code across application logic, infrastructure, tooling, documentation, and internal developer utilities. The team started with three engineers in late August 2025. Five months later, grown to seven engineers, it had merged roughly 1,500 pull requests. That works out to an average throughput of 3.5 PRs per engineer per day.
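A back-of-envelope check shows what the 3.5 figure implies. The working-days assumption below is mine, not from the post, and the calculation assumes "per engineer per day" averages over the team's actual size as it grew from three to seven:

```python
# Sanity check on the post's reported throughput figure.
# Assumption (not from the post): ~22 working days per month.

total_prs = 1500          # merged PRs reported in the post
prs_per_eng_day = 3.5     # the post's headline rate
months = 5                # late August 2025 to publication
working_days = months * 22

# Engineer-days implied by the headline rate.
engineer_days = total_prs / prs_per_eng_day

# Average team size implied over the period.
avg_team_size = engineer_days / working_days

print(f"implied engineer-days: {engineer_days:.0f}")     # ~429
print(f"implied average team size: {avg_team_size:.1f}")  # ~3.9
```

An implied average team size of roughly four sits between the starting three and the final seven, so the headline rate is internally consistent with a team that grew gradually over the period.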
The 3.5 figure is the one worth sitting with. Engineering teams shipping at that rate are rare under any circumstances. Doing it with a system that generates, reviews, and merges its own code is unusual enough to deserve the detailed postmortem Lopopolo wrote. But it is also worth noting what "zero review before merge" actually means. OpenAI's broader engineering culture still requires human review for most code; the Frontier team exempted itself from that norm. The figure also includes generated scaffolding, tests, and documentation, not only application logic. And whether human engineers reviewed that output before it shipped is a different question from whether a human wrote it in the first place.
The more concrete claim is that Codex, OpenAI's coding model, regularly works on single tasks for six hours or more, often while the human engineers are asleep. That is the mechanism that makes 3.5 PRs per day achievable: the model runs without a human monitoring it, and returns to a human in the morning with code ready to merge. What the post does not describe is what happens when the code is wrong. How often does a six-hour Codex run produce something broken? How many incidents have come from unreviewed AI-authored merges? The post is a positive engineering narrative, not a safety audit. The unanswered questions are not trivial.
The broader question is whether this generalizes. The evidence in the post is specific to one team, one codebase, and one company's workflow norms. Other teams inside OpenAI may not be able to replicate it. Outside OpenAI, the "dark factory" conditions are even harder to meet: you need engineers who can specify an environment precisely enough for Codex to operate in it, a codebase whose rules are learnable, and tolerance for a review process that looks nothing like what most engineering orgs consider acceptable.
The competitive data points cut in different directions. In a Latent Space episode, Boris Cherny and Cat Wu at Anthropic described rewriting Claude Code's harness from scratch every three to four weeks because the model improves so quickly that prior scaffolding becomes obsolete. That is a different problem from "how do you build the best agent harness": it suggests the model is outrunning the tooling built around it. Scale AI's SWE-Atlas benchmark shows that harness choice is within the margin of error for some model configurations, which implies the model matters more than the scaffolding.
Noam Brown offered the sharpest counterargument in the same episode: reasoning models eliminate the need for complex scaffolding entirely, and adding complex behavior on top of them often makes performance worse. If that is true, harness engineering is a transitional skill — valuable now, obsolete when the next generation of models arrives. The OpenAI post predates the most capable reasoning models, and it shows.
What is notable is that the OpenAI Frontier team built a real product that hundreds of internal users depend on, including daily power users, and it has external alpha testers. It is not a demo or a benchmark. That distinction matters for anyone trying to assess whether AI-driven development has crossed a practical threshold. Whether it crosses a reliable, reproducible, generalizable threshold is the question the post does not answer — and the one that matters most.