There are two ways to organize AI agents to do research. One is fast and resilient. The other is deeper and more fragile. A new paper benchmarks both, and the results are useful for anyone building agents that need to think.
The paper, from Yang Shen, compares three approaches to multi-agent research under strictly fixed computational time budgets: a single-agent baseline, a subagent architecture where multiple agents explore in parallel and consolidate afterward, and an agent team where specialists hand off to each other before execution. The testbed uses Git worktree isolation and explicit global memory to keep the comparison clean.
The findings are clean enough to be useful. Subagent mode works like a high-throughput search engine: it is fast, resilient to individual failures, and effective for broad, shallow optimizations under time pressure. Agent team mode is slower and more operationally fragile — multiple agents writing code in parallel creates integration friction — but it achieves deeper theoretical alignment on complex architectural refactoring tasks when compute is not the constraint.
The fundamental trade-off is between operational stability and theoretical deliberation. The paper calls this the core design tension for multi-agent research systems, and the empirical data supports it. The subagent mode degrades gracefully under time pressure. The agent team mode does not — it falls apart when there is not enough time for the handoff cycles to complete. A team of specialists that cannot complete their handoffs is worse than no team at all.
The paper advocates a dynamically routed architecture: one that selects the collaboration structure based on real-time task complexity. For simple, time-sensitive tasks, subagent. For complex refactoring problems where depth matters more than speed, agent team. The routing decision is the actual contribution, not any single result.
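The routing logic can be sketched as a small decision function. This is a hypothetical illustration, not the paper's implementation: the `Task` fields, the complexity threshold, and the handoff-cost estimate are all my assumptions about what such a router would need, chosen to reflect the trade-off the paper measures (teams only pay off when the budget can absorb their handoff cycles).

```python
from dataclasses import dataclass
from typing import Literal

Mode = Literal["subagent", "agent_team"]

@dataclass
class Task:
    complexity: float       # estimated architectural complexity, 0..1 (hypothetical metric)
    time_budget_s: float    # wall-clock compute budget
    handoff_cycle_s: float  # estimated cost of one specialist handoff
    n_handoffs: int         # handoffs the team topology would require

def route(task: Task, complexity_threshold: float = 0.7) -> Mode:
    """Pick a collaboration structure per the paper's trade-off:
    use an agent team only when the task is complex enough to need
    deliberation AND the budget can absorb the full handoff cycle;
    otherwise fall back to the fast, failure-resilient subagent mode."""
    handoff_cost = task.n_handoffs * task.handoff_cycle_s
    if task.complexity >= complexity_threshold and handoff_cost < task.time_budget_s:
        return "agent_team"
    return "subagent"

# A broad, shallow optimization under time pressure routes to subagents:
print(route(Task(complexity=0.3, time_budget_s=300, handoff_cycle_s=60, n_handoffs=4)))   # subagent
# A complex refactor with a generous budget routes to the team:
print(route(Task(complexity=0.9, time_budget_s=3600, handoff_cycle_s=60, n_handoffs=4)))  # agent_team
```

Note the second condition: even a highly complex task routes to subagents when the budget cannot cover the handoff cycles, which is exactly the degradation mode the paper observes for agent teams under time pressure.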
This is a 16-page empirical study, not a framework announcement. The testbed is controlled, the methodology is explicit, and the results are grounded in execution data rather than benchmark scores. That makes it more useful than the typical framework paper, which tends to describe architecture and assert benefits without measuring what actually happens when compute is fixed and time is short.
The practical implication for builders: if your research agent needs to explore a problem space quickly, subagent mode is the right default. If the system needs to produce deeply considered architectural decisions and time is not the constraint, agent team mode is worth the operational complexity. The routing decision is where the actual engineering lives.
The testbed detail is worth noting: Git worktree isolation means each agent operates in its own working directory, checked out on a separate branch, which is a clean way to prevent parallel agents from overwriting each other's work. That is a practical engineering choice that other multi-agent frameworks do not always get right. The paper does not claim to have tested this at scale — the benchmark is a controlled testbed, not a production deployment. Treat the specific performance numbers as directionally accurate rather than benchmarks you can transplant directly to your infrastructure.
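The isolation pattern itself is easy to reproduce with stock git. A minimal sketch, assuming git is installed and on the PATH; the helper names and branch naming scheme here are mine, not the paper's:

```python
import subprocess
import tempfile
from pathlib import Path

def run(args: list[str], cwd: Path) -> None:
    subprocess.run(args, cwd=cwd, check=True, capture_output=True)

def make_agent_worktree(repo: Path, agent_id: str) -> Path:
    """Give one agent an isolated checkout on its own branch, so
    parallel agents never touch each other's working files."""
    wt = repo.parent / f"agent-{agent_id}"
    # `git worktree add <path> -b <branch>` creates a new branch and
    # checks it out into a separate directory sharing the same repo.
    run(["git", "worktree", "add", str(wt), "-b", f"agent/{agent_id}"], cwd=repo)
    return wt

# Demo in a throwaway repo:
base = Path(tempfile.mkdtemp())
repo = base / "repo"
repo.mkdir()
run(["git", "init", "-q"], cwd=repo)
run(["git", "-c", "user.email=agent@example.com", "-c", "user.name=agent",
     "commit", "--allow-empty", "-m", "init"], cwd=repo)

wt1 = make_agent_worktree(repo, "explorer-1")
wt2 = make_agent_worktree(repo, "explorer-2")
print(wt1.exists() and wt2.exists())  # each agent now edits its own checkout
```

Because worktrees share one object store, consolidating results afterward is an ordinary merge of the `agent/*` branches rather than a file-copy exercise.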
The paper is on arXiv at arxiv.org/abs/2603.29632.