When AI coding assistants write SDK code, they tend to write the wrong SDK — or rather, the right SDK at the wrong time. Models trained on snapshots of documentation ship code that reflects what was true months or years ago, not what's in the current release. Google's answer is to publish a skill for its SDK in Agent Skills, a lightweight skill definition format hosted at agentskills.io that lets SDK maintainers describe their API surface in a way AI agents can actually use. The headline results are striking. Gemini 3.1 Pro climbs from a 28 percent baseline pass rate to what Google calls "excellent" on a 117-prompt evaluation targeting the Gemini SDK. Gemini 3.0 Pro and Flash, both weaker reasoning models, improve from a 6.8 percent baseline to comparable levels. On the surface, that's a meaningful gap closed.
But the same Google Developers Blog post that reports those numbers also contains the uncomfortable footnote: in 56 percent of eval cases, the agent never invoked the skill at all. The skill sat there, correctly authored, and the agent did not reach for it.
That gap is the real story. Skills are only as useful as the agent that knows when to deploy them — and that's a reasoning problem, not a documentation problem. Google's own analysis bears this out: older Gemini 2.5 series models "benefit, but nowhere near as much." The skill mechanism is sound. The inference about when to invoke it is not.
The format itself is Anthropic's
Agent Skills originated with Anthropic, which launched the format in October 2025 and opened it as an open standard on December 18, 2025. The agentskills.io repository has since crossed 20,000 GitHub stars, with what VentureBeat described as "tens of thousands of community-created and shared skills." As of early 2026, the format is supported across 16 or more tools including Claude Code, OpenAI Codex, Cursor, VS Code, GitHub Copilot, JetBrains Junie, and Google's own Gemini CLI.
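Concretely, a skill in this format is a directory containing a SKILL.md file: YAML frontmatter that names and describes the skill, followed by a markdown body the agent loads when it invokes the skill. A minimal sketch (the content below is illustrative, not taken from Google's published Gemini SDK skill):

```markdown
---
name: gemini-sdk
description: Guidance for writing code against the current Gemini SDK release.
---

# Gemini SDK

Always check the model names and client initialization patterns below
before generating code; training data may predate the current release.
```

The frontmatter is what the agent sees up front; the body is only pulled into context once the agent decides the skill is relevant — which is precisely where the invocation problem enters.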
Google published its own Gemini SDK skill at agentskills.io in March 2026, joining what is fast becoming an expected checkbox for SDK maintainers. Anthropic donated the Model Context Protocol to the Linux Foundation on December 9, 2025, and both Anthropic and OpenAI co-founded the Agentic AI Foundation alongside Block; Google, Microsoft, and Amazon Web Services joined as members. Governance of the skills format itself — who decides what belongs in a skill, how disputes are resolved, how the format evolves — is less settled. Whether the standard ultimately falls under the Agentic AI Foundation or needs its own stewardship structure remains, quite literally, "an open question."
The update problem
Unlike a package ecosystem with semantic versioning and changelogs, skills have no formal update mechanism. A skill published today reflects the SDK as it exists today, and will continue to reflect it indefinitely unless the maintainer actively revises it. Google's own post flags this: "no good update mechanism yet, old skills could misinform." The company is aware of the problem. A solution has not shipped.
The practical risk is silent divergence. An agent that correctly uses a skill today may continue invoking it after the underlying SDK has changed — and without an update signal, the agent has no way to know the cached skill is stale. For SDKs with active development cycles, this is not theoretical. The skill becomes a snapshot that decays.
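Nothing in the format addresses this today, but the missing mechanism is easy to sketch. Assuming a hypothetical `sdk-version` pin in a skill's frontmatter (no such field exists in the current spec), an agent harness could at least detect divergence before trusting the skill:

```python
# Hypothetical staleness guard. The Agent Skills spec defines no
# sdk-version pin today; this sketches what a harness could check
# if skills declared the SDK release they were authored against.

def parse_version(v: str) -> tuple[int, ...]:
    """Turn a dotted version string like '1.4.0' into (1, 4, 0)."""
    return tuple(int(part) for part in v.split("."))

def skill_is_stale(pinned: str, installed: str) -> bool:
    """Flag a skill authored against an older major/minor release:
    it may describe an API surface that has since changed."""
    return parse_version(pinned)[:2] < parse_version(installed)[:2]

# A skill pinned to 1.4 is stale once the installed SDK reaches 1.6.
print(skill_is_stale("1.4.0", "1.6.2"))  # True
print(skill_is_stale("1.6.0", "1.6.2"))  # False
```

Even a crude check like this would turn silent divergence into a visible warning; without any declared pin, the agent has nothing to compare against.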
There is also a numbers discrepancy worth flagging. The Google Developers Blog post reports a 6.8 percent baseline for Gemini 3.0 Flash and 28 percent for 3.1 Pro on the 117-prompt SDK evaluation. The gemini-skills repository, authored by Google, claims 87 percent for Flash and 96 percent for Pro on what it describes as "correct API code following best practices." The difference likely reflects different evaluation criteria — the blog measures against the SDK prompts, the repo measures API correctness — but neither the blog post nor the repository explains the gap between the two sets of numbers.
The invocation dependency
The 56 percent non-invocation rate from Google's own evals has a parallel in third-party research. Vercel's Jude Gao published agent evaluation data in January 2026 showing that skills with default behavior achieved a 53 percent pass rate — identical to the no-skill baseline. Skills with explicit invocation instructions jumped to 79 percent. A documentation index built from the AGENTS.md format hit 100 percent. The pattern is consistent: skills work when the agent knows to use them. Without that trigger, they are inert.
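Vercel's pattern maps directly onto the skill's description field, which is the text the agent weighs when deciding whether to invoke. A hedged illustration of the difference (both descriptions are invented for this example, not taken from any published skill):

```yaml
# Default-style description: leaves invocation to the agent's judgment.
description: Documentation and examples for the Gemini SDK.
---
# Explicit invocation instructions: states exactly when to reach for the skill.
description: >
  Use this skill whenever you write or edit code that calls the Gemini
  SDK. Consult it before generating any API call rather than relying
  on training data, which may describe an older release.
```

The second form front-loads the trigger condition, which is the only part of the skill the agent reads before committing context to it.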
For SDK maintainers, the practical read is straightforward. Publishing a skill is low-effort and the potential upside is real — the 6.8-to-excellent jump for weaker reasoning models suggests the format carries genuine signal for models that do invoke it. But the invocation dependency means the skill is a layer on top of existing documentation, not a replacement for it. The real constraint is not the skill format but agent reasoning — and that is a harder problem to solve with a JSON schema.
For builders evaluating agentic infrastructure: watch whether the invocation rate improves as reasoning models get better. That is the axis along which this format either becomes load-bearing or remains an optional extra.