The software that grades AI content moderators has a flaw in its logic: it marks the AI wrong whenever it disagrees with a human reviewer, even when the AI was the one following the rules. That is not a theoretical concern. Researchers at Reddit, the social media company, measured it across 193,000 production moderation decisions and found that roughly 80 percent of what standard accuracy metrics classify as model errors are actually policy-grounded decisions the model was right to make.
In a preprint posted to arXiv on April 22, Michael O'Herlihy and Rosa Català, both researchers at Reddit, report a 33 to 46.6 percentage-point spread between agreement-based metrics (the kind that compare AI output to human reviewer choices) and their alternative framework, which grades decisions against the written policy itself. On the most common metric, F1 score (a standard measure that combines precision and recall into a single number), the Reddit moderation model looked mediocre. Against the new framework, it looked competent.
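To see why the two framings produce different numbers, consider a minimal sketch. The counts below are illustrative, not the paper's data: under an agreement-based view, every mismatch with the human reviewer counts against the model; under a policy-grounded view, mismatches that are defensible under the written policy get reclassified, and F1 rises without the model changing at all.

```python
# Illustrative-only counts; tp/fp/fn are agreement-based tallies
# against human reviewer labels, not the paper's figures.

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Agreement-based view: 400 total mismatches, all scored as errors.
agree_f1 = f1(tp=600, fp=200, fn=200)

# Policy-grounded view: suppose ~80% of those mismatches are
# defensible under the policy and are reclassified as correct.
grounded_f1 = f1(tp=600 + 160 + 160, fp=40, fn=40)

print(round(agree_f1, 3), round(grounded_f1, 3))  # → 0.75 0.958
```

The same decisions, scored two ways, land tens of points apart, which is the shape of the spread the paper reports.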
The authors call the underlying failure mode the Agreement Trap. The name is precise: systems trained and evaluated by matching human reviewer decisions inherit whatever inconsistency, implicit bias, and policy shortcuts those humans carry. When a human reviewer and the AI both apply the rules correctly but reach different conclusions because the policy is genuinely ambiguous, the evaluation system counts that as a model failure. It is not. It is the policy doing what natural-language policies do at their edge cases.
To escape it, the paper proposes a Defensibility Index (DI): a signal that asks not "did the AI agree with the human?" but "can this decision be grounded in the written policy?" The index is computed from a separate audit model that reads the AI's decision alongside the applicable policy text and checks whether the decision is justifiable. The paper also introduces an Ambiguity Index, which flags where the policy itself is the source of inconsistency rather than the model. Both signals come from the same forward pass of the audit model, so they don't require separate evaluation pipelines.
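A hypothetical sketch of how those two per-decision signals might roll up into corpus-level indices. The `AuditJudgment` fields and the aggregation below are assumptions for illustration, not the paper's schema; per the paper, both judgments come from a single forward pass of the separate audit model.

```python
from dataclasses import dataclass

# Hypothetical shape of one audit judgment (field names are
# illustrative, not the paper's schema).
@dataclass
class AuditJudgment:
    defensible: bool  # decision can be grounded in the written policy
    ambiguous: bool   # the policy itself admits more than one reading

def indices(judgments: list[AuditJudgment]) -> tuple[float, float]:
    """Aggregate per-decision audit judgments into corpus-level indices."""
    n = len(judgments)
    di = sum(j.defensible for j in judgments) / n  # Defensibility Index
    ai = sum(j.ambiguous for j in judgments) / n   # Ambiguity Index
    return di, ai

# Toy batch of four audited decisions:
batch = [
    AuditJudgment(defensible=True,  ambiguous=False),
    AuditJudgment(defensible=True,  ambiguous=True),
    AuditJudgment(defensible=False, ambiguous=True),
    AuditJudgment(defensible=True,  ambiguous=False),
]
print(indices(batch))  # → (0.75, 0.5)
```

The useful property is that the two indices separate blame: a low Defensibility Index points at the model, a high Ambiguity Index points at the policy text.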
The researchers tested one application of this directly. They re-evaluated the same 37,286 decisions under three progressively more detailed versions of the same community rules and found that tightening the policy language reduced the Ambiguity Index by 10.8 percentage points while the Defensibility Index stayed stable. You can reduce apparent disagreement by writing clearer rules, without retraining the model.
The practical application the paper describes is a Governance Gate: a routing system that uses the Defensibility Index alongside other signals to decide whether a moderation decision should proceed automatically or get flagged for human review. The researchers report 78.6 percent automation coverage with a 64.9 percent reduction in risk, though that result comes from offline evaluation on historical Reddit data, not from a live deployment. The paper is explicit: the framework has not been put into production at Reddit.
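The routing logic of such a gate can be sketched in a few lines. The thresholds, signal names, and the choice of a second confidence signal here are illustrative assumptions, not the paper's configuration:

```python
# Hedged sketch of a Governance Gate: a decision proceeds automatically
# only when its defensibility (and any other signals) clear thresholds;
# everything else is flagged for human review. All values illustrative.

def governance_gate(defensibility: float, model_confidence: float,
                    di_threshold: float = 0.9,
                    conf_threshold: float = 0.8) -> str:
    if defensibility >= di_threshold and model_confidence >= conf_threshold:
        return "auto"        # decision proceeds automatically
    return "human_review"    # flagged for a human moderator

print(governance_gate(0.95, 0.9))  # → auto
print(governance_gate(0.95, 0.5))  # → human_review
```

Tuning the thresholds trades automation coverage against risk, which is the trade-off behind the 78.6 percent coverage figure the researchers report.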
There are reasons to be careful about how far this finding travels. The critique of accuracy as a benchmark for AI content moderation is not new: a 2025 review in Artificial Intelligence Review surveyed the field and found a similar shift in emphasis from accuracy to legitimacy, though that work did not quantify the gap the way the Reddit researchers do. Reddit's community policies are deliberately principle-based: they describe the spirit of a rule rather than enumerating every prohibited case. That structure creates the exact kind of interpretive space that makes agreement-based evaluation break. Enterprise compliance contexts, where legal, financial, or HR policies are often far more explicit, may produce smaller gaps between agreement-based and policy-grounded metrics. The researchers do not test their framework outside Reddit's data, and this is a preprint that has not been peer-reviewed.
What holds regardless of domain is the underlying observation: if your evaluation metric rewards matching human reviewers and your human reviewers are inconsistent, you are measuring reviewer consistency, not rule compliance. The root cause taxonomy the paper produces is the number worth watching as similar analyses land in other domains. Model error, it turns out, accounts for only 19.4 percent of disagreements; the rest splits between implicit norm enforcement and genuine policy ambiguity. If that ratio holds outside Reddit, the implications for how compliance-focused AI teams measure their systems are hard to avoid.