PR #64441 adds a full Anthropic-compatible mock route to OpenClaw's internal QA lab. The practical effect: developers can now run the framework's GPT-5.4 versus Opus 4.6 parity gate on their local machine without a real Anthropic API key. More importantly, it shows how seriously OpenClaw's maintainers take their own internal evaluation process, and how different that process looks from the way the rest of the AI industry talks about benchmarks.
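The PR's actual route implementation is internal, but the general shape of an Anthropic-compatible mock is easy to sketch: a local HTTP server that answers `POST /v1/messages` with a canned response in the Messages API format, so the harness can exercise its full prompt path without a key. Everything below except the `/v1/messages` path and the response field names is illustrative, not taken from the PR.

```typescript
import { createServer, IncomingMessage, ServerResponse } from "node:http";

// Build a canned reply in the Anthropic Messages API response shape.
// The echo behavior is invented for illustration; the real QA-lab mock
// presumably returns scenario-specific fixtures.
function mockMessagesResponse(model: string, userText: string) {
  return {
    id: "msg_mock_001",
    type: "message",
    role: "assistant",
    model,
    content: [{ type: "text", text: `mock reply to: ${userText}` }],
    stop_reason: "end_turn",
    usage: { input_tokens: userText.length, output_tokens: 8 },
  };
}

// Serve the mock on a local port; the harness points its Anthropic base
// URL here instead of at the real API.
function startMockServer(port: number) {
  const server = createServer((req: IncomingMessage, res: ServerResponse) => {
    if (req.method === "POST" && req.url === "/v1/messages") {
      let body = "";
      req.on("data", (chunk) => (body += chunk));
      req.on("end", () => {
        const parsed = JSON.parse(body);
        const userText = String(parsed.messages?.[0]?.content ?? "");
        res.writeHead(200, { "content-type": "application/json" });
        res.end(JSON.stringify(mockMessagesResponse(parsed.model, userText)));
      });
    } else {
      res.writeHead(404).end();
    }
  });
  server.listen(port);
  return server;
}
```

The point of such a route is fidelity of shape, not of content: as long as the response parses like a real Messages API reply, the dispatcher and tool-use logic downstream behave exactly as they would against the live endpoint.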
The parity harness works by running identical scenario prompts against both OpenAI's GPT-5.4 and Anthropic's Claude Opus 4.6 through a shared dispatcher. The same subagent fanout state machine, the same tool-use logic, the same extractRememberedFact telemetry — nothing in the prompt path differs between the two runs. Five gate criteria must clear before the verdict is recorded: the runs must be structurally comparable, the dispatcher logic must be shared, the summary JSON must be type-safe, the scenario coverage must be complete, and the verdict label must match the actual run. A summary metadata block — startedAt, finishedAt, providerMode, primaryModel, alternateModel — is now written into every QA run JSON so that swapping candidate and baseline labels cannot silently mislabel a failed run as a pass.
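The metadata fields named above come straight from the PR; the surrounding verdict shape and the guard function are a hedged reconstruction of how such a block prevents label swaps, not the harness's actual types.

```typescript
// Field names startedAt / finishedAt / providerMode / primaryModel /
// alternateModel are from the PR description; everything else here is
// an illustrative guess at the surrounding structure.
interface RunSummaryMeta {
  startedAt: string;
  finishedAt: string;
  providerMode: "live" | "mock";
  primaryModel: string;
  alternateModel: string;
}

interface RunSummary {
  meta: RunSummaryMeta;
  verdictModel: string; // which model the verdict claims to describe
  passed: boolean;
}

// Guard against the failure mode the article describes: swapping the
// candidate and baseline labels so a failed run is silently recorded
// under the wrong model name. The verdict label must be one of the two
// declared models AND match the model that actually produced the run.
function verdictLabelMatchesRun(summary: RunSummary, actualModel: string): boolean {
  const declared =
    summary.verdictModel === summary.meta.primaryModel ||
    summary.verdictModel === summary.meta.alternateModel;
  return declared && summary.verdictModel === actualModel;
}
```

With the metadata written into every run JSON, this check can be replayed after the fact: a summary that claims a model not present in its own metadata, or not the one that ran, fails the gate regardless of the pass/fail bit.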
This is not how most AI frameworks evaluate models. The standard industry approach is to point a model at a public benchmark, report the score, and put it on a leaderboard. The score is the story. OpenClaw's approach inverts that: the scenarios are the story, and the score is just a data point that comes out of running them. The distinction matters because public benchmarks have well-documented failure modes. Models can overfit to MMLU. Training data contamination is nearly impossible to rule out. The apparent leader changes depending on which benchmark you pick. But running your own scenarios against two models — using the same agent scaffolding, in the same environment, with the same evaluation criteria — tells you something different from what a leaderboard tells you. It tells you which model behaves better in your system, not in a synthetic test designed by a third party.
The five criteria are worth examining individually. Comparable runs means both models see the same input in the same order. Shared dispatcher logic means the routing decisions are identical — no model-specific prompting tricks that advantage one provider. Type-safe summary JSON means the verdict file cannot be accidentally misread. Scenario coverage accounting means every required test case appears in the results. Verdict label verification means the JSON records which model produced which outcome, preventing a maintainer from accidentally attributing a GPT-5.4 run to Opus or vice versa.
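The all-or-nothing semantics of the gate can be made concrete. The five criterion names below mirror the prose; the actual harness types are internal, so this is a sketch of the logic, not the implementation.

```typescript
// One boolean per gate criterion, named after the five checks described
// above. The real harness's representation is not public.
interface GateChecks {
  comparableRuns: boolean;        // same inputs, same order, both models
  sharedDispatcher: boolean;      // identical routing, no per-provider tricks
  typeSafeSummary: boolean;       // verdict JSON parses against its schema
  fullScenarioCoverage: boolean;  // every required test case is present
  verdictLabelVerified: boolean;  // outcome attributed to the right model
}

// All five criteria must clear before a verdict is recorded; any single
// failure blocks the gate rather than degrading to a partial pass.
function gateClears(checks: GateChecks): boolean {
  return Object.values(checks).every(Boolean);
}
```

The conjunction is the design choice worth noticing: a run that fails only coverage accounting is treated the same as one that fails everything, because a partially-covered comparison is not a comparison.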
None of this is visible externally. The scenario catalog itself is not public, and the five criteria are only legible from reading the pull request and the linked issue #64227. That is the honest limitation of this story: the rigor is in the infrastructure, but the infrastructure is a black box from the outside. A skeptical reader should ask whether the scenarios are meaningful tests of agentic capability or whether they are constructed to make GPT-5.4 look good. OpenClaw has no obligation to publish its internal evaluation suite. But without it, the anti-benchmark-theater framing is editorial choice as much as it is engineering culture.
What is public is the infrastructure itself — the mock server, the summary JSON schema, the five-criterion gate, and the commit history that added them. That is enough to say something real: OpenClaw is building evaluation tooling that most AI frameworks treat as an afterthought, and it is building it in public, even if the scenarios themselves stay internal.
The rest of 4.11 addresses practical friction. Codex OAuth sign-ins no longer fail with invalid_scope errors after OpenAI updated its authorization URL format — a silent breakage that would have blocked every new developer integrating Codex for the first time. Audio transcription across OpenAI, Groq, and Mistral is restored after a DNS pinning policy broke multipart request handling for those providers. And slow models can now run to their configured timeout: the LLM idle watchdog honors explicit run timeouts rather than cutting runs short with a default window that may not match the model's actual latency profile.
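The watchdog change reduces to a precedence rule: an explicitly configured run timeout wins over the default idle window. The constant and function names below are invented for illustration; the default value is an assumption, not taken from the release.

```typescript
// Assumed default idle window; the real value is internal to OpenClaw.
const DEFAULT_IDLE_WINDOW_MS = 60_000;

// Before the fix, the idle window could fire even when a longer run
// timeout was configured. After it, an explicit positive run timeout
// takes precedence; only its absence falls back to the default.
function effectiveTimeoutMs(runTimeoutMs?: number): number {
  return runTimeoutMs !== undefined && runTimeoutMs > 0
    ? runTimeoutMs
    : DEFAULT_IDLE_WINDOW_MS;
}
```

A slow reasoning model configured for a five-minute run would previously be killed after the idle window; under this rule, `effectiveTimeoutMs(300_000)` yields the full five minutes.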
On the integration side, plugin manifests can now declare their own activation and setup descriptors, which means plugin authors no longer need to hardcode authentication and pairing flows into the core. Teams OAuth gains delegated reaction support with Graph pagination. The video_generate tool accepts URL-only asset delivery with typed providerOptions, reference audio inputs, and adaptive aspect-ratio support — changes that let providers expose richer generation modes without pushing large files through OpenClaw's memory path. Ollama model discovery now caches context-window metadata, so the model picker stops refetching unchanged entries on every refresh.
None of these fixes are glamorous. Together they describe a framework that is being hardened for production use at the seams where provider APIs, OS-level permissions, and internal routing meet. The pattern — a regression caught and closed in a pre-release, with infrastructure built to prevent recurrence — is more interesting than any individual fix. That is the story behind the parity gate, even if it is not the lede.