Cohere's new open-weight coding model is shaped less like a general-purpose chatbot and more like a tool for software engineering agents, and the architecture choices tell you exactly where the company expects North Mini Code to win.
The model, released this week under Apache 2.0, pairs 30 billion total parameters with only 3 billion active per token. That gap is the story. A sparse mixture-of-experts design of this size keeps per-token compute and latency close to those of a much smaller dense model while preserving a larger pool of specialized knowledge, which is why Cohere can pitch it for self-hosted agentic workflows without giving up on raw capacity. Cohere positions the release as its first developer-facing coding model, available through Hugging Face, the Cohere API, Cohere Model Vault, and OpenRouter.
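The arithmetic behind that gap is easy to sketch. The snippet below uses only the two parameter counts from the release. The rule of thumb of roughly two floating-point operations per active parameter per token is a common approximation for a forward pass, not a figure from Cohere, and it ignores the cost of attention.

```python
# Parameter counts as stated in the release.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9


def active_fraction(total: float, active: float) -> float:
    """Share of the weights that take part in processing one token."""
    return active / total


def approx_forward_flops_per_token(active: float) -> float:
    """Rule-of-thumb estimate (an assumption, not a Cohere figure):
    about 2 FLOPs per active parameter per token, ignoring attention."""
    return 2.0 * active


fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
sparse_flops = approx_forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = approx_forward_flops_per_token(TOTAL_PARAMS)

print(f"active fraction: {fraction:.0%}")
print(f"approx. FLOPs per token, sparse: {sparse_flops:.1e}")
print(f"approx. FLOPs per token, if all 30B were dense: {dense_flops:.1e}")
```

Under that approximation, the sparse model spends about a tenth of the per-token compute that a dense model with the same total parameter count would.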
Two specific design choices stand out for engineering teams deciding whether to take this release seriously.
The first is the attention pattern. The model interleaves sliding-window attention with global attention at a 3:1 ratio, using rotary positional embeddings on the sliding-window layers and no positional embeddings on the global layers. Sliding-window attention is cheap and well-suited to local code context, while the less frequent global layers let the model reach across a long repository when the task requires it. For agentic workloads that read large codebases, hold long execution traces, and need to revisit a definition from 200,000 tokens ago, that combination is a deliberate bet on repository-scale reasoning rather than short-form autocomplete.
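A toy version of that layout makes the trade-off concrete. This is a minimal sketch, not Cohere's implementation: the 3:1 ratio and the positional-embedding choices come from the release, while `NUM_LAYERS`, `WINDOW`, the placement of the global layer at the end of each group of four, and the function names are hypothetical values chosen for illustration.

```python
from typing import List

# From the release: rotary embeddings on local layers, none on global layers.
POSITIONAL = {"local": "rope", "global": "none"}


def layer_schedule(num_layers: int, local_per_global: int = 3) -> List[str]:
    """Repeat `local_per_global` sliding-window layers, then one global layer."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def can_attend(kind: str, query_pos: int, key_pos: int, window: int) -> bool:
    """Causal attention rule for a single query/key pair."""
    if key_pos > query_pos:
        return False  # causal: never look at future tokens
    if kind == "global":
        return True  # global layers see the whole prefix
    return query_pos - key_pos < window  # local layers see only the last `window` tokens


# Hypothetical illustration values, not taken from the release.
NUM_LAYERS = 8
WINDOW = 4

schedule = layer_schedule(NUM_LAYERS)
print(schedule)
for kind in ("local", "global"):
    row = "".join("x" if can_attend(kind, 9, k, WINDOW) else "." for k in range(10))
    print(f"{kind:>6} ({POSITIONAL[kind]:>4}): {row}")
```

With these toy values, the query at position 9 reaches only keys 6 through 9 in a local layer and all ten keys in a global layer. The local layers' cost per token stays bounded by the window, and only one layer in four pays for the full prefix.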
The second is the router. Cohere describes a sigmoid-then-top-k routing scheme, with 128 total experts and 8 active per token, plus a single dense layer before the sparse MoE block. A sigmoid scores each expert independently instead of making all experts compete for one probability mass, as a softmax does, and routers of this kind tend to be more stable and less prone to load-balancing pathologies than softmax-based alternatives. The dense layer at the front preserves general capability before the model hands the rest of the work to specialists. The result is a model that aims for competitive coding performance with a fraction of the per-token compute a dense model of the same size would spend, with a 256K-token context window and a 64K-token maximum output, text-in and text-out only, and a minimum hardware bar of a single H100 at FP8.
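The routing step can be sketched in a few lines. This is an illustration of sigmoid-then-top-k selection in general, not Cohere's code: the expert counts come from the release, but the random logits stand in for a real router's outputs, and renormalizing the kept scores so they sum to 1 is an assumption the release does not confirm.

```python
import math
import random
from typing import List, Tuple

NUM_EXPERTS = 128  # from the release
TOP_K = 8          # from the release


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route(logits: List[float], k: int = TOP_K) -> List[Tuple[int, float]]:
    """Score each expert independently with a sigmoid, keep the k highest,
    and renormalize the kept scores to sum to 1 (the renormalization is an
    assumption; the release does not describe it)."""
    scores = [sigmoid(z) for z in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


rng = random.Random(0)  # fixed seed so the toy example is repeatable
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in for router outputs
selected = route(logits)
for expert_id, weight in selected:
    print(f"expert {expert_id:3d}  weight {weight:.3f}")
```

Because each sigmoid score depends only on its own logit, raising one expert's score does not lower the others' before the top-k cut.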
Cohere's post-training is the third leg of that story. The model went through a two-stage cascade: supervised fine-tuning followed by reinforcement learning with verifiable rewards, with a focus on agentic coding, native tool use, and interleaved thinking. The training emphasis is on workflows where the model has to call tools, read terminal output, and revise its plan, which fits the three workloads Cohere highlights: sub-agent orchestration, systems-architecture mapping of repositories, and code review on diffs.
The benchmark numbers, however, come with a large caveat. Cohere reports a score of 33.4 on the Artificial Analysis Coding Index, evaluated across SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench v2, Terminal-Bench Hard, SciCode, and LiveCodeBench v6, using three seeds, temperature 1.0, top_p 0.95, with SWE-Bench on SWE-agent v1.1.0 and Terminal-Bench Hard on Terminus-2. The company also claims up to 2.8x higher output throughput, 30% better inter-token latency, and a slight time-to-first-token deficit compared with Devstral Small 2 at identical concurrency and hardware. Those are the only head-to-head numbers in the release, and they compare against a single open-weight coding model. Independent reproduction has not yet appeared, and MarkTechPost's coverage echoes Cohere's framing rather than adding outside validation.
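The sampling settings matter for anyone trying to reproduce those scores, because temperature 1.0 with top_p 0.95 is stochastic, which is presumably why the evaluation uses multiple seeds. The sketch below shows standard nucleus (top-p) filtering on a made-up four-token distribution. It is not the evaluation harness, and the tokens and logits are hypothetical.

```python
import math
from typing import Dict


def nucleus_filter(
    logits: Dict[str, float], temperature: float = 1.0, top_p: float = 0.95
) -> Dict[str, float]:
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    scaled = {tok: z / temperature for tok, z in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(z - peak) for tok, z in scaled.items()}
    norm = sum(exps.values())
    probs = {tok: e / norm for tok, e in exps.items()}

    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break

    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


# Hypothetical next-token logits, for illustration only.
toy_logits = {"return": 3.0, "yield": 1.5, "raise": 0.5, "pass": -2.0}
dist = nucleus_filter(toy_logits)
for tok, p in dist.items():
    print(f"{tok:>6}: {p:.3f}")
```

In this toy case the least likely token falls outside the 0.95 nucleus and is never sampled, while the remaining three are drawn at random in proportion to their renormalized probabilities.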
The "sovereign AI" framing Cohere uses for the release sits inside a real tension. Apache 2.0 weights and broad distribution channels do give a team the right to self-host, audit, and modify the model on their own infrastructure. A single H100 at FP8 is also a much lower bar than the multi-GPU setups that frontier closed-weight coding systems imply, and the company has prepared serving paths through vLLM and its own cohere_melody library, plus quantized builds for Ollama, LM Studio, and llama.cpp. None of that is free or trivial, and "sovereign" here still means a serious GPU purchase and a serving stack a team has to operate. It is a genuine step toward self-hosting for an engineering organization that has decided to bet on agentic coding, and it is not a path around owning the hardware.
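A back-of-the-envelope check shows why one H100 at FP8 is plausible as a floor. The parameter count is from the release. The 80 GB capacity assumes the common 80 GB H100 variant, and the estimate covers weights only, leaving out the KV cache, activations, and runtime overhead, all of which grow with context length and concurrency.

```python
TOTAL_PARAMS = 30e9        # from the release
BYTES_PER_PARAM_FP8 = 1    # FP8 stores one byte per weight
BYTES_PER_PARAM_BF16 = 2   # 16-bit formats store two bytes per weight
H100_MEMORY_GB = 80        # assumption: the 80 GB H100 variant


def weight_memory_gb(params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


fp8_gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM_FP8)
bf16_gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM_BF16)
headroom_gb = H100_MEMORY_GB - fp8_gb

print(f"weights at FP8:  {fp8_gb:.0f} GB")
print(f"weights at BF16: {bf16_gb:.0f} GB")
print(f"headroom on one card at FP8 (before KV cache): {headroom_gb:.0f} GB")
```

Memory is set by the 30 billion total parameters, not the 3 billion active ones, because every expert has to be resident even if only a few run per token. How much of the remaining headroom a long-context agent session consumes depends on the serving stack and is not something the release quantifies.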
For teams deciding whether to benchmark the model or swap it in, the practical question is whether the architecture's bets line up with the workload. If the job is long-horizon repository scanning, multi-step terminal workflows, and tool use with interleaved thinking, the 3:1 attention ratio and sigmoid router are designed for exactly that. If the job is short-form code completion in a single file, the same choices are overhead. The benchmarks Cohere ran are pointed at the former category, the comparisons are narrow, and community feedback will, by the company's own account, shape the next release. The model is out, the weights are open, and the proof will be in what independent coding harnesses measure in the weeks ahead.