If you're running multi-agent debate in production, where two AI models argue out a question rather than one model thinking it through alone, you're probably spending roughly three and a half times more per query than you need to. A paper from researchers at the University of Science and Technology of China (USTC), posted to arXiv on April 3, 2026, describes a system that cuts token consumption by 71% while accuracy actually improves, from 80.09% to 82.46% across six benchmarks. The mechanism is not a better model. It's a better stopping rule.
In practice, most existing multi-agent debate (MAD) systems run every query through the full multi-round debate, regardless of whether the models already agree. That's expensive: a standard MAD setup using Llama-3.1-70B and Qwen2.5-32B averages 7,413 tokens per query.
The system the USTC team built is called HCP-MAD, short for Heterogeneous Consensus-Progressive Multi-Agent Debate. The core change is architectural: every query passes through a two-stage pipeline. The first stage is HCV, Heterogeneous Consensus Verification. Both models respond to a query independently, and if they reach the same answer, the system stops with no further debate. The researchers found that 80.12% of queries resolve at this first stage, with 92.16% accuracy and an average cost of just 980 tokens. Only the roughly 20% of queries where the models disagree proceed to the full debate stage.
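The control flow is simple enough to sketch. The following is a hypothetical outline of the early-exit logic as the paper describes it, not the authors' code (none has been released): `ask_llama`, `ask_qwen`, `full_debate`, and the naive string-equality consensus check are all placeholder names of my own.

```python
# Hypothetical sketch of HCP-MAD's two-stage control flow.
# ask_llama / ask_qwen stand in for calls to the two heterogeneous models;
# full_debate stands in for the multi-round debate stage. None of these
# names or signatures come from the paper.

def answers_match(a: str, b: str) -> bool:
    # Naive consensus check: normalized string equality. A real system
    # would need answer extraction and canonicalization first.
    return a.strip().lower() == b.strip().lower()

def hcp_mad(query, ask_llama, ask_qwen, full_debate):
    # Stage 1: Heterogeneous Consensus Verification (HCV).
    # Each model answers once, independently.
    a = ask_llama(query)
    b = ask_qwen(query)
    if answers_match(a, b):
        # Consensus reached: stop. The paper reports ~80% of queries
        # resolve here, at ~980 tokens on average.
        return a
    # Stage 2: models disagree, escalate to the full multi-round debate.
    return full_debate(query, initial_answers=(a, b))
```

The expensive path runs only on disagreement, which is what turns a fixed per-query debate cost into an amortized one.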
The 71% token reduction is real. The paper reports 2,137 average tokens per query versus 7,413 for standard MAD — and versus 2,638 for DOWN, a comparable consensus-based baseline that HCP-MAD outperforms on both cost and accuracy. The accuracy improvement is modest but consistent: 82.46% versus 80.09% for DOWN across MMLU, CommonQA, GPQA, MATH-500, GSM8K, and AQuA.
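The headline average is also internally consistent with the stage-level figures. Taking the paper's reported numbers (80.12% consensus rate, 980 tokens for HCV-resolved queries, 2,137 tokens overall), one can back out the implied cost of an escalated query; that implied figure is my inference, not a number the paper reports.

```python
# Back-of-envelope consistency check using the paper's reported figures.
# The implied stage-2 cost is derived here, not stated in the paper.
p_consensus = 0.8012   # fraction of queries resolved at HCV
hcv_tokens = 980       # average tokens for an HCV-resolved query
overall_avg = 2137     # reported average tokens per query for HCP-MAD

# Solve overall_avg = p * hcv + (1 - p) * stage2 for the stage-2 cost.
stage2_tokens = (overall_avg - p_consensus * hcv_tokens) / (1 - p_consensus)
print(round(stage2_tokens))  # ~6800, in the neighborhood of the 7,413-token standard-MAD average
```

That the implied escalation cost lands near the standard-MAD figure is what you'd expect if stage 2 is essentially a full debate run on the hard 20%.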
One number that should concern anyone running MAD at scale: the correct-to-incorrect flip rate. Standard MAD flips from a correct to an incorrect answer 14.85% of the time — meaning the debate process actively degrades answers on a significant minority of queries. HCP-MAD's flip rate is 1.13%. The debate makes things worse far less often when consensus is the stopping signal, not a scheduled number of rounds.
An ICLR 2025 analysis of five MAD frameworks across nine benchmarks — evaluating whether the added computation actually bought reasoning gains — found that MAD broadly fails to outperform simple single-agent chain-of-thought strategies. The USTC team doesn't cite this work, but it is worth reading alongside the paper. HCP-MAD doesn't contradict the ICLR finding, but it reframes the question: if MAD is going to run regardless, the efficiency problem was never the number of agents. It was the absence of a signal for when to stop.
The paper uses a heterogeneous model pair — Llama-3.1-70B-Instruct and Qwen2.5-32B-Instruct — rather than two copies of the same model. Heterogeneous pairs are less likely to share blind spots. Two instances of the same frontier model tend to make the same errors. A large model and a smaller, different model catching each other is a more realistic proxy for what a diverse reviewer panel looks like.
No code has been released. The paper describes the architecture and reports benchmark results, but there is no GitHub repository, no weights, and no API endpoint. This matters for readers evaluating practical adoption — the paper demonstrates a method, not a deployable system. A team wanting to replicate this today would need to implement HCV from the description and run their own experiments on their own model pairs.
The preprint caveat applies: this has not been peer reviewed. arXiv posts papers before independent validation. The numbers are self-reported, and the benchmarks used are standard academic collections — real-world deployment performance on production query distributions is unknown.
What to watch: whether the HCV early-exit principle generalizes beyond the two-model heterogeneous setup the paper evaluates. If the 80% first-stage resolution rate holds across different model pairs and domain distributions, the economic case for MAD becomes substantially stronger. If it doesn't, the paper is a well-motivated ablation on a narrow experimental setup — interesting, but not the infrastructure shift the token numbers imply.
The USTC team is Yiqing Liu, Hantao Yao, Wu Liu, Allen He, and Yongdong Zhang. The paper is on arXiv.