OpenAI just raised the price of its best AI model by 100 percent. The technical case for doing so is shaky.
On Tuesday the company released GPT-5.5, its fifth new model in seven months. On Terminal-Bench 2.0, the most-cited benchmark, which measures how reliably an AI can execute autonomous computer tasks (the kind of work that, done by a person, would be called senior engineering), the model scored 82.7 percent. VentureBeat reported that Anthropic's Claude Mythos Preview, a next-generation model not yet publicly available, scored 82.0 percent on the same test. The gap is 0.7 percentage points, well within typical run-to-run variation.
Access to GPT-5.5 in OpenAI's developer APIs will cost $5 per million input tokens and $30 per million output tokens, double the price of GPT-5.4, according to OpenAI's announcement. A higher-accuracy Pro tier runs $30 per million input and $180 per million output. The company says the speed improvements are real: the Codex AI team, OpenAI's internal tooling unit, wrote custom algorithms to better distribute GPU work across compute clusters, increasing token-generation throughput by more than 20 percent.
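To make the price doubling concrete, here is a minimal sketch of what those rates imply per call. The per-million-token rates come from the announcement as reported above (with GPT-5.4 inferred at half the GPT-5.5 price); the workload sizes are hypothetical, chosen only to illustrate the scale of an agentic coding session.

```python
# Per-million-token rates (input, output) in dollars, from the reported pricing.
# The GPT-5.4 entry assumes exactly half the GPT-5.5 rate, per the article.
RATES_PER_MILLION = {
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.5": (5.00, 30.00),
    "gpt-5.5-pro": (30.00, 180.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call: (tokens / 1M) * rate, input plus output."""
    in_rate, out_rate = RATES_PER_MILLION[model]
    return (input_tokens / 1_000_000) * in_rate + (output_tokens / 1_000_000) * out_rate

# Hypothetical long agentic session: 2M input tokens, 500k output tokens.
for model in RATES_PER_MILLION:
    print(f"{model}: ${request_cost(model, 2_000_000, 500_000):.2f}")
```

At those assumed volumes the same session costs $12.50 on GPT-5.4, $25.00 on GPT-5.5, and $150.00 on the Pro tier, which is the doubling (and the 6x Pro premium) stated in the announcement.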
The benchmark comparison is not clean. Anthropic used a different testing scaffold called Terminus-2 and measured Mythos at 92.1 percent under Terminal-Bench 2.1 conditions with four-hour timeouts, RD World Online reported. The two headline numbers come from different environments, which means the 0.7-point margin is not a settled result.
The practical benchmark closer to real engineering work is less flattering. Dan Shipper, writing for Every, a publishing company that tracks applied AI closely, found that combining Opus 4.7 as planner with GPT-5.5 as executor scored 62.5 out of 100 on an internal senior engineer evaluation. Human senior engineers score 80 to 90 on the same test. Neither model, run alone, breaks the mid-40s, The Neuron Daily reported.
Seven days separate GPT-5.5 from Anthropic's Opus 4.7 launch. That cadence is the structural pressure neither lab can escape right now. Dylan Patel of SemiAnalysis, a semiconductor and AI infrastructure research firm, noted that Anthropic quietly moved from a Claude 4.6-level model to a Claude 4.7-level model internally while the public Opus 4.7 release is deliberately compute-constrained, a sign of how tightly Anthropic is managing its GPU supply. OpenAI, by contrast, has the infrastructure to push hard. The Neuron Daily put it plainly: Anthropic is a Ferrari running on fuel rations; OpenAI just bought the gas station.
What to watch: whether Anthropic has a response queued and whether the compute constraint on Opus 4.7 eases. If the 62.5 combined workflow score improves in the next round of testing, the pressure intensifies for both labs. If it plateaus, the benchmark war reveals what it may always have been: a proxy fight for enterprise contracts, not a measure of where the technology actually stands.