GPT-5.5 Has the Benchmarks. Its Own Safety Data Tells a Complicated Story.
OpenAI called GPT-5.5 its safest model ever. Its own deployment safety data, published on the same day, shows two regressions that have gone unreported.

On the day GPT-5.5 launched, OpenAI published a blog post highlighting benchmark wins: 82.7% on Terminal-Bench 2.0, 84.9% on GDPval, a jump in scientific research capability, and a verified new proof in pure mathematics. All genuine results, by the company's own account. On the same day, on a page most reporters would scroll past, OpenAI also published its deployment safety evaluations. Those numbers tell a different story.
GPT-5.5's hate speech safety score fell to 0.868 from GPT-5.4's 0.943. Its chain-of-thought controllability score dropped below GPT-5.4 Thinking and GPT-5.2 Thinking. OpenAI disclosed both regressions. Neither has been reported by any outlet.
The hate speech regression has a company line attached. OpenAI says the evaluation set included translation requests containing disallowed content, and the model handled them in ways that triggered lower safety scores. This is a test methodology argument, not a model argument. It is plausible, but it has not been independently verified. The chain-of-thought controllability regression has no public explanation at all.
Chain-of-thought reasoning is how these models work through complex problems step by step. When controllability regresses, the model's reasoning paths become less predictable and harder to steer. That matters for anyone using GPT-5.5 for high-stakes tasks where you need to understand how the model got to its answer. OpenAI has not said why this happened, whether it is intentional, or what it means for users.
These numbers are not buried. They are on the deployment safety evaluation page OpenAI maintains for every model it releases. The company disclosed them, gave them minimal context, and promoted the launch primarily through the benchmark numbers that outlets, including this one, have now covered. The safety data exists. The framing around it is a choice.
GPT-5.5 is a genuinely capable model. The benchmark results are real. The Ramsey number proof, which verified cleanly in the Lean proof assistant, is a legitimate capability milestone. The GeneBench jump from 19% to 25% suggests meaningful improvement in scientific research tasks. These are not trivial advances.
But a drop from 0.943 to 0.868 on one safety metric does not square with calling a model the safest ever deployed. Neither does a regression in chain-of-thought controllability that OpenAI has not explained. The safety story does not match the safety marketing. That gap is worth examining.
OpenAI's pricing underscores that it is positioning GPT-5.5 as a premium product. Standard tier: $5 per million input tokens and $30 per million output tokens. Pro tier: $30 and $180 per million, respectively. Customers paying those prices deserve a clear accounting of where the model is stronger and where it is weaker. The safety data is publicly available. The question is whether OpenAI wants people to find it.
The translation-request explanation for the hate speech regression deserves scrutiny on its own terms. A well-designed safety evaluation should separate the task a user is asking for from the content they are embedding in it. If that separation is not clean, the metric is measuring the wrong thing. OpenAI has not released the evaluation set or full methodology, so outside researchers cannot confirm whether the explanation holds. That opacity is itself a data point.
The chain-of-thought controllability regression is the more significant gap. It has no explanation on record, convenient or otherwise. Whether it reflects a deliberate tradeoff, and what it means for reliability in production deployments, are questions OpenAI has left open. That silence is notable.
GPT-5.5 is a real step forward in capability. The safety marketing is a separate claim, and on the evidence OpenAI itself published, it does not hold up cleanly.
