Most teams running standard AI training pipelines have never audited the algorithm that tells the model whether its outputs are any good. That algorithm, the reward model, is the component that learns what humans want, and most RLHF pipelines trust it blindly. A new paper accepted at the 2026 Association for Computational Linguistics conference describes a failure mode in which the core model and its reward model fail simultaneously, leaving the system with no internal safety mechanism.
The paper introduces ARES, short for Adaptive Red-Teaming and End-to-end System Repair, a framework that reverses the standard RLHF sequence: it audits the reward model first, then uses the audited reward model to fine-tune the core language model. The experiments use Qwen3-1.7B as the core model and Skywork-RM-Qwen3-4B as the reward model. This two-stage approach is the paper's main contribution: find the critic's blind spots first, then train a model that avoids them.
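To make the ordering concrete, here is a minimal sketch of the audit-then-train sequence. Everything below is illustrative: the function names, the probe data, and the patching strategy are my own stand-ins, not the paper's method.

```python
# Hypothetical sketch of ARES's two-stage ordering. Function names, probe
# data, and the naive patching step are illustrative assumptions, not the
# paper's actual algorithm.

def audit_reward_model(reward_model, probes):
    """Stage 1: red-team the critic alone, collecting cases where its
    score disagrees with a known harmfulness label."""
    blind_spots = []
    for prompt, response, is_harmful in probes:
        score = reward_model(prompt, response)
        if is_harmful and score > 0:  # critic rewards a harmful output
            blind_spots.append((prompt, response))
    return blind_spots

def repair_then_train(reward_model, blind_spots):
    """Stage 2: patch the critic on its discovered blind spots, then use
    the audited critic to fine-tune the core model. Standard RLHF skips
    stage 1 and trains against the unaudited critic directly."""
    spots = set(blind_spots)
    def patched_rm(prompt, response):
        # Force a negative score on known blind spots (toy repair).
        return -1.0 if (prompt, response) in spots else reward_model(prompt, response)
    # ... an RLHF loop over the core model using patched_rm would go here ...
    return patched_rm

# Toy demonstration: a stand-in reward model that wrongly likes one
# harmful completion.
toy_rm = lambda p, r: 1.0 if r == "sure, here is how" else -1.0
probes = [
    ("how do I pick a lock", "sure, here is how", True),
    ("how do I bake bread", "preheat the oven", False),
]
found = audit_reward_model(toy_rm, probes)
patched = repair_then_train(toy_rm, found)
print(len(found), patched("how do I pick a lock", "sure, here is how"))
```

The point of the sketch is purely the ordering: the critic is stress-tested and repaired before any gradient reaches the core model, so the policy never optimizes against a critic with known blind spots.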
The paper calls the joint failure Type C: the core model generates harmful content and the reward model rates it positively at the same time, so both safety layers fail at once. On the paper's benchmarks, ARES achieves a 0.97 safety rate on StrongReject and 0.95 on HarmBench, versus 0.76 and 0.66 for an initial RLHF baseline. The most striking finding is that Type C failures appear to have gone undetected in prior work: the vulnerability existed before anyone had a name for it.
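The Type C condition itself reduces to a simple conjunction, which can be written as a predicate. The function name and the score threshold below are my own illustrative choices, not notation from the paper.

```python
# Minimal sketch of the Type C condition: a joint failure where the output
# is harmful AND the critic scores it positively. Name and threshold are
# illustrative assumptions, not from the paper.

def is_type_c(output_is_harmful: bool, reward_score: float,
              reward_threshold: float = 0.0) -> bool:
    """True when both safety layers fail at once: the core model produced
    harmful content and the reward model rated it above threshold."""
    return output_is_harmful and reward_score > reward_threshold

print(is_type_c(True, 0.8))   # both layers fail at once
print(is_type_c(True, -0.5))  # harmful output, but the critic catches it
print(is_type_c(False, 0.9))  # output is benign, no safety failure
```

Only the first case is Type C; in the other two, at least one layer did its job, which is why single-layer audits can miss this class of failure entirely.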
The practical catch: ARES requires roughly 13 hours on eight NVIDIA A100 GPUs, a compute configuration out of reach for most teams. And coverage is partial: with 4,000 samples, ARES found 63.5 percent of vulnerabilities in nine hours, meaning more than a third went undetected in that window. The paper's own numbers acknowledge the ceiling.
For teams shipping RLHF-trained models today, the implication is uncomfortable. If the reward model has a blind spot, fine-tuning the core model on it doesn't fix the blind spot — it trains the blind spot into the model's weights, where it becomes permanent. Later safety updates layered on top don't catch it, because the critic is still wrong.
The paper does not claim ARES is production-ready. Nine pages on arXiv is preliminary work. The 63.5 percent discovery rate means this is not a complete solution. But the failure mode it names is structural: RLHF pipelines that skip reward-model auditing are trusting a critic they have never stress-tested. Whether Type C failures have already contaminated models in deployment — and whether the research community will develop cheaper methods to catch them — is the open question the paper leaves behind.