Ask a Small Language Model to Show Its Work. Then Change One Number.

When a small language model shows its reasoning step by step, a new study suggests it may be performing a different task entirely: selecting the last number in the chain, not computing from it.

The paper, titled "The Readout Shortcut" and authored by Ming Liu, is posted on arXiv (https://arxiv.org/abs/2605.22870). Liu tests three instruction-tuned models with between 1 billion and 3 billion parameters: Qwen2.5-1.5B-Instruct, Llama-3.2-1B-Instruct, and Gemma-2-2B-it. All three were evaluated on GSM8K, a standard benchmark of grade-school math word problems. Liu uses prefix completion to isolate the answer-readout stage, controlling what the model can and cannot see at the point where it generates the final answer. The result is a clean separation of two channels the model could be using: retained-context completion (reading the context and computing the right answer) and positional copy (grabbing whatever number sits at the end of the chain, verbatim). The full experimental details are in the paper's HTML version (https://arxiv.org/html/2605.22870v1).

The copy channel wins.

Gold-answer presence, meaning whether the correct answer appears verbatim in the chain-of-thought text, accounts for 54 to 92 percentage points of each model's measured accuracy, according to the paper (https://arxiv.org/abs/2605.22870). That translates to 89 to 92 percent of each model's theoretical ceiling on this task. The numbers are striking on their own, but the more revealing result is what happens when the model is wrong. On items it answers incorrectly, the final output matches the last number in the chain-of-thought 95 to 96 percent of the time. The model is not making arithmetic errors. It is outputting whatever number appeared at the end of the reasoning chain, even when that chain demonstrably went somewhere else.
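The wrong-answer statistic above can be illustrated with a short calculation. This is a minimal sketch, not the paper's code: the record layout (`chain`, `answer`, `gold`), the number-matching regex, and the function names are all assumptions made for illustration.

```python
import re

# Assumption: a "number" is an unsigned integer or decimal, optionally
# with thousands separators. The paper's exact parsing rule may differ.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def last_number(text):
    """Return the last numeric token in text as a float, or None."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def copy_match_rate(records):
    """Among wrong answers, the fraction equal to the chain's last number.

    Each record is a dict with hypothetical keys:
    'chain' (str), 'answer' (float), 'gold' (float).
    """
    wrong = [r for r in records if r["answer"] != r["gold"]]
    if not wrong:
        return None  # no wrong answers, so the rate is undefined
    copied = sum(1 for r in wrong if last_number(r["chain"]) == r["answer"])
    return copied / len(wrong)
```

A rate near 1.0 on a model's own logged outputs would be consistent with the copy channel described here, although it does not by itself rule out other explanations.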

Liu then runs the cleanest possible test of the copy hypothesis. Replace the trailing number with a wrong value. If the model is reasoning, correct intermediate steps should still produce a right answer. If it is copying, accuracy should collapse. It collapses, the paper shows. Replacing the trailing number drops accuracy to near zero despite the model's intermediate steps being correct. Removing the trailing number entirely, leaving nothing to copy, recovers 5 to 32 percentage points above that floor. Even single-step arithmetic the model can independently perform is suppressed when a copyable number is present in the context.
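The three conditions in this test (intact chain, replaced trailing number, removed trailing number) can be sketched as simple string edits. This is an illustrative reconstruction under assumptions: the helper names are hypothetical, and the paper's actual prefix construction may differ, for example in how it treats text that follows the last number.

```python
import re

# Assumption: unsigned integers or decimals count as numbers.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def replace_trailing_number(chain, wrong_value):
    """Swap the last number in the chain for wrong_value, keeping any tail text."""
    matches = list(NUMBER.finditer(chain))
    if not matches:
        return chain
    m = matches[-1]
    return chain[:m.start()] + str(wrong_value) + chain[m.end():]


def remove_trailing_number(chain):
    """Cut the chain just before its last number, leaving nothing to copy."""
    matches = list(NUMBER.finditer(chain))
    if not matches:
        return chain
    return chain[:matches[-1].start()].rstrip()


def build_conditions(chain, wrong_value):
    """Return the three prefix variants keyed by condition name."""
    return {
        "intact": chain,
        "replaced": replace_trailing_number(chain, wrong_value),
        "removed": remove_trailing_number(chain),
    }
```

Each variant would then be passed to the model as a prefix. Comparing accuracy across the three is what separates copying from computing: the intermediate steps are identical in all of them, and only the copyable number changes.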

The copy behavior varies by architecture. Qwen and Llama copy novel distractors, numbers that never appeared in the problem, at rates of 87 to 95 percent. Gemma gates more selectively, copying roughly 85 percent of the time, the paper notes. The behavior is not a fluke: head-level ablation, a technique that disables individual attention heads to identify which components a model relies on, points to architecture-specific sets of heads as the mechanism. The shortcut is legible at the level of attention heads (see Section 5 of the paper, https://arxiv.org/html/2605.22870v1).

The finding replicates on GSM-Symbolic, a variant of the benchmark designed to eliminate spurious pattern matches. On non-arithmetic tasks from the BBH dataset, shuffle retention drops sharply: the shortcut does not generalize beyond arithmetic contexts. At the 7 billion to 8 billion parameter scale, Liu finds that content-selective gating begins to emerge, meaning larger models start to gate whether they copy based on whether the content makes sense. This is a partial reassurance, but the models tested at that scale are not frontier-class, and the paper does not characterize whether the shortcut disappears entirely or merely diminishes.

For builders and operators, the practical consequence is specific. Chain-of-thought prompting has become a standard tool for monitoring whether a model deserves trust: if the model shows its reasoning, you can check whether the reasoning holds. Liu's findings, documented in the paper's discussion section (https://arxiv.org/html/2605.22870v1), suggest that for small models the chain is largely performance art. The faithfulness signal, the observation that the model follows its own reasoning steps, may be measuring positional answer transport instead of genuine computation.

This does not mean small models are useless for arithmetic tasks. It means the evaluation signal is confounded. A GSM8K accuracy score for a 1.5B model may reflect how often the correct number happened to appear last in the chain, not whether the model can actually add. Anyone monitoring small-model deployments for accuracy, compliance, or safety using CoT-based oversight should know that the readout stage may not be reading computation.

The test Liu developed is falsifiable and reproducible. Take any GSM8K problem. Let the model generate its chain of thought. Then change one digit in the final number of the chain and ask for the answer again. A model that reasons from its intermediate steps still returns the correct answer. A model using the readout shortcut returns the altered number. Readers can run this on their own deployed models without access to Liu's internal systems.
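The one-digit check can be sketched as a small probe. Everything here is an assumption for illustration: `generate` is a hypothetical callable wrapping whatever model is under test, the "The answer is" prompt template is not taken from the paper, and the three outcome labels are ours.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def bump_last_digit(chain):
    """Change the final digit of the chain's last number.

    Returns (edited_chain, planted_number_as_string).
    """
    matches = list(NUMBER.finditer(chain))
    if not matches:
        raise ValueError("chain contains no number")
    m = matches[-1]
    token = m.group()
    planted = token[:-1] + str((int(token[-1]) + 1) % 10)
    return chain[:m.start()] + planted + chain[m.end():], planted


def probe(question, chain, gold, generate):
    """Classify the model's readout as 'copied', 'recomputed', or 'other'.

    generate: hypothetical callable mapping a prompt string to the
    model's continuation string.
    """
    edited, planted = bump_last_digit(chain)
    prompt = f"{question}\n{edited}\nThe answer is"
    found = NUMBER.findall(generate(prompt))
    if not found:
        return "other"
    answer = float(found[0])
    # Copy is checked first; the planted value is assumed to differ from gold.
    if answer == float(planted):
        return "copied"
    if answer == float(gold):
        return "recomputed"
    return "other"
```

Running `probe` over a sample of problems and counting the labels gives a rough picture of how a deployed model behaves. A high share of "copied" outcomes would match the pattern the paper reports for small models.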

The paper does not answer whether the shortcut disappears entirely in larger models, whether it appears in real-world arithmetic tasks outside the benchmark, or whether GSM8K retains validity as a reasoning benchmark for sub-3B models given this failure mode. These are open questions, not verdicts. The contribution is precise: a named, isolated, reproducible failure mode with a legible mechanism and a concrete test. Whether to move work to larger models, invest in training interventions, or treat the shortcut as a known confound in evaluation is a decision for builders, not a conclusion the paper draws.
