Microsoft Research has a new trick for making reasoning models faster and cheaper: teach them to forget what they just thought.
A paper out this week called Memento describes a training pipeline that gets language models to segment their own chain-of-thought mid-generation, compress each segment into a dense summary called a memento, and then flush the original thinking from memory while keeping the answer. The results are striking: peak KV cache usage drops two- to threefold, serving throughput nearly doubles, and a model trained on roughly 30,000 annotated examples can do this on its own, with no external orchestration layer involved.
The work attacks a real problem. Reasoning models like OpenAI's o1 and DeepSeek's R1 can generate thousands of tokens per call, sometimes hundreds of thousands in complex multi-step problems. All of those tokens stay in memory, attended to at equal cost, whether they lead anywhere useful or not. The model has no native ability to decide what to keep and what to discard. External workarounds exist: separate summarizers, API call restarts with condensed context, orchestration logic built around the model. But these are systems bolted onto the model rather than skills it has learned.
Memento is an attempt to make context management an actual trained behavior. The pipeline starts with OpenThoughts reasoning traces generated by QwQ-32B and turns them into training data through a multi-step process. The raw traces have no natural segment boundaries, so the team first breaks them into atomic units: sentences, code blocks, math equations. An LLM then scores each inter-sentence boundary from zero to three based on whether it represents a natural stopping point or a mid-thought break. A dynamic programming algorithm places the actual cuts to maximize boundary quality while keeping block sizes balanced.
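In rough pseudocode, that cut-placement step might look like the following. The function name, size limits, and scoring details here are illustrative assumptions, not the paper's actual code; the sketch only shows the shape of the dynamic program: maximize total boundary quality subject to minimum and maximum block sizes.

```python
def place_cuts(boundary_scores, min_len=3, max_len=10):
    """Choose cut positions that maximize total boundary quality.

    boundary_scores[i] is the LLM-assigned quality (0-3) of cutting
    after atomic unit i; len(boundary_scores) equals the number of
    atomic units. Returns a sorted list of cut indices, where a cut
    at i means the block boundary falls after unit i-1.
    """
    n = len(boundary_scores)
    NEG = float("-inf")
    best = [NEG] * (n + 1)   # best[i]: max score segmenting units 0..i-1
    prev = [-1] * (n + 1)    # backpointer to the previous cut
    best[0] = 0.0
    for i in range(1, n + 1):
        # last block spans units j..i-1; keep its size in [min_len, max_len]
        for j in range(max(0, i - max_len), i - min_len + 1):
            if best[j] == NEG:
                continue
            # the end of the trace (i == n) is a free boundary
            score = best[j] + (boundary_scores[i - 1] if i < n else 0.0)
            if score > best[i]:
                best[i] = score
                prev[i] = j
    # walk the backpointers to recover the internal cut positions
    cuts, i = [], n
    while i > 0:
        j = prev[i]
        if i < n:
            cuts.append(i)
        i = j
    return sorted(cuts)
```

With scores that peak at two natural stopping points, the program places its cuts there while respecting the block-size constraints.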
Once segmented, a compressor LLM produces a memento for each block: a terse, information-dense record of what that segment figured out, including key intermediate values, formulas, and strategic decisions. The memento has to be self-contained enough that a model could continue reasoning from it without ever seeing the original block. A judge LLM evaluates each memento across six dimensions, and if it falls short, the judge provides specific feedback and the compressor retries. Two rounds of this feedback loop raise the pass rate from 28 percent to 92 percent. The final dataset, OpenMementos, contains 228,000 annotated traces: 54 percent math, 19 percent code, 27 percent science.
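The retry loop itself is simple to sketch. Here `compress` and `judge` stand in for the two LLM calls; the function names and signatures are assumptions for illustration, not the paper's API:

```python
def compress_with_feedback(block, compress, judge, max_rounds=2):
    """Run the compress -> judge -> retry loop described in the paper.

    compress(block, feedback) returns a candidate memento string
    (feedback is None on the first attempt); judge(block, memento)
    returns (passed, feedback), where feedback tells the compressor
    what the memento is missing.
    """
    feedback = None
    memento = compress(block, feedback)
    for _ in range(max_rounds):
        passed, feedback = judge(block, memento)
        if passed:
            break
        memento = compress(block, feedback)
    return memento
```

In the paper's pipeline both roles are played by LLMs scoring against six quality dimensions; two rounds of this loop are what lift the pass rate from 28 to 92 percent.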
Training uses a curriculum. Stage one applies standard causal attention across all tokens. Stage two begins masking blocks during training so the model learns to rely on mementos alone, matching the inference-time setup where blocks are flushed from the KV cache. Jumping straight to full masking does not work: forced to learn the memento format, compression under constraint, and forward reasoning from mementos all at once, the model struggles with all three.
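The masking the curriculum converges to can be pictured as an attention mask. This NumPy sketch assumes one particular layout convention, a reasoning block followed immediately by its memento tokens, which is an illustration of the idea rather than the paper's exact scheme:

```python
import numpy as np

def block_masked_attention_mask(n_tokens, blocks):
    """Build a causal attention mask with reasoning blocks hidden.

    blocks is a list of (block_start, block_end, memento_end) triples:
    tokens in [block_start, block_end) are a raw reasoning block and
    tokens in [block_end, memento_end) are its memento. Tokens at or
    after memento_end may attend to the memento but not the raw block,
    mimicking inference, where the block's KV entries are flushed.
    Returns a boolean matrix: mask[q, k] is True when query position q
    may attend to key position k.
    """
    mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))  # causal base
    for start, end, mem_end in blocks:
        # everything after the memento loses sight of the raw block
        mask[mem_end:, start:end] = False
    return mask
```

During stage two, training under a mask like this forces the model into the same information regime it will face at inference, where the flushed block simply is not in the cache anymore.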
The numbers are clean. Memento achieves roughly sixfold trace-level compression: about 11,000 tokens of reasoning compacted to under 2,000 tokens of mementos per trace. Across the five models tested (Qwen2.5-7B, Qwen3-8B, Qwen3-32B, Phi-4 Reasoning at 14 billion parameters, and OLMo3-7B-Think), the KV cache reduction and throughput gains held. Accuracy gaps are small, shrink with scale, and close further with reinforcement learning.
But there is a finding the researchers did not anticipate. The erased reasoning blocks do not fully disappear from the model's internal representations. Even after a block is masked and its KV cache entries are flushed, information from that block leaks forward through the hidden states of the tokens that remain. The model still has access to something from the erased block, and when the team tried removing this implicit second channel, accuracy dropped significantly. This is not in the compressed trace, not in the memento, and not in any explicit attention state. It is just there in the representations, a ghost trace the model itself did not mean to leave.
That detail is worth sitting with. Memento is framed as efficiency: less memory, faster inference, lower cost. It is all of those things. But the mechanism that makes it work also silently preserves information the model was told to forget. Whether that matters depends on what the reasoning blocks contain. For math and code, probably not much. For a model working through a security vulnerability, a negotiation strategy, or a reasoning trace that touched sensitive context, it is a different question. The paper does not raise it as a concern. It probably should be one.
The theoretical companion is on arXiv at 2601.21576, titled Chain Of Thought Compression: A Theoretical Analysis. The authors include Vasilis Kontonis, Yuchen Zeng, Shivam Garg, Lingjiao Chen, Hao Tang, Ziyan Wang, Ahmed Awadallah, Eric Horvitz, John Langford, and Dimitris Papailiopoulos. The dataset, data generation pipeline, and a vLLM fork with native block masking are all open.
This is the kind of paper that does not announce itself loudly. It is not a new benchmark, not a GPT-5 moment, not a capabilities revelation. It is a training trick with real efficiency gains that also happens to reveal something about how transformers hold onto information they were supposed to have released. The efficiency story is the hook. The ghost trace is the part worth thinking about.