Simon Willison recently did something unusual: he used an AI to audit the APIs before he wrote any code.
Willison is the creator of LLM, a Python library and CLI tool that provides a unified interface to hundreds of different large language models from dozens of vendors. If you want to call Claude, GPT-4, Gemini, and Mistral through the same Python code, LLM is how you do it. The library is open source, well-maintained, and widely used in the Python data tooling community.
In late March, Willison started work on a major overhaul of how LLM handles conversation storage. The trigger was a feature gap: some of the newer models from Anthropic, Google (Gemini), and OpenAI now run tools server-side during the API call itself. A model can search the web, call a function, or execute code before returning its final response. LLM's existing abstraction layer was not designed for this. The library had no way to store the results of those intermediate tool calls, no place to persist the reasoning-phase traces, and no mechanism for round-tripping encrypted context blocks that some models return, according to the GitHub issue tracking the changes.
The standard move for an open-source maintainer in this situation is to read the vendor documentation, maybe skim a few blog posts, and start coding. Willison did not do that. Instead, he turned to Claude Code, Anthropic's AI coding agent, and gave it a research task: read the raw Python client libraries for Anthropic, OpenAI, Google, and Mistral. Then use those to craft curl commands that exercise each provider's raw JSON APIs, in both streaming and non-streaming modes, across a range of scenarios. The output, including all captured API responses, lives in a new repository called research-llm-apis.
The curl approach was deliberate. SDKs abstract away edge cases; raw HTTP calls reveal them. As Willison wrote: "I had Claude Code read through the Python client libraries for Anthropic, OpenAI, Gemini and Mistral and use those to help craft curl commands to access the raw JSON for both streaming and non-streaming modes across a range of different scenarios."
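To make that concrete, here is a hypothetical reconstruction of the kind of probe such a script would contain. The endpoint, headers, and request shape are Anthropic's documented Messages API; the model name, prompt, and structure of the actual scripts in research-llm-apis are assumptions and may differ.

```shell
# Hypothetical probe script: request a streaming response from the raw
# Anthropic Messages API, bypassing the SDK to see the wire format.
PAYLOAD='{
  "model": "claude-sonnet-4-5",
  "max_tokens": 256,
  "stream": true,
  "messages": [{"role": "user", "content": "What is the capital of France?"}]
}'
echo "$PAYLOAD"

# With a real key, the response arrives as raw server-sent events
# (message_start, content_block_delta, ...) rather than SDK objects,
# which is exactly the edge-case detail the SDKs hide.
if [ -n "$ANTHROPIC_API_KEY" ]; then
  curl -s https://api.anthropic.com/v1/messages \
    -H "x-api-key: $ANTHROPIC_API_KEY" \
    -H "anthropic-version: 2023-06-01" \
    -H "content-type: application/json" \
    -d "$PAYLOAD"
fi
```

Running one such script per provider and scenario, and saving every raw response, is what turns the repository into a reference corpus rather than a code sample.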
He also brainstormed with Claude Opus 4.5 before touching any code. The GitHub issue tracking the planned changes describes the core problem: models that run tools during the reasoning phase return text that is not part of the final response but still needs to be stored. Encrypted context blocks returned by some models need the same treatment. Both have to persist in SQLite so they can be passed back to the API in future calls within the same conversation.
This is a SQLite schema problem, but it is also a preview of a class of problems that every developer building production LLM applications will eventually encounter. The moment a model can take multiple tool-calling steps before returning, a single API call becomes a multi-turn conversation with state that has to live somewhere. The abstraction that worked fine for pass-through text completions breaks. Willison is not the only person discovering this; he is just the one who wrote it up in detail.
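A minimal sketch of the storage problem, using Python's built-in sqlite3. The two-table schema and all names here are invented for illustration, not LLM's actual schema; the point is that reasoning-phase artifacts need ordered rows of their own so they can be replayed to the API on the next turn.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE responses (
    id TEXT PRIMARY KEY,
    prompt TEXT,
    response TEXT          -- the final text shown to the user
);
-- Hypothetical: intermediate artifacts from the reasoning phase,
-- kept in order so the whole trace can be round-tripped later.
CREATE TABLE response_segments (
    response_id TEXT REFERENCES responses(id),
    position INTEGER,
    kind TEXT,             -- 'tool_call', 'tool_result', 'reasoning', 'encrypted_context'
    payload TEXT           -- raw JSON exactly as the provider returned it
);
""")

conn.execute("INSERT INTO responses VALUES (?, ?, ?)",
             ("r1", "What's the weather in Paris?", "It is 18°C and sunny."))
conn.execute("INSERT INTO response_segments VALUES (?, ?, ?, ?)",
             ("r1", 0, "tool_call",
              '{"name": "web_search", "query": "Paris weather"}'))

# Continuing the conversation means fetching segments in order and
# sending them back to the API, not just the final response text.
rows = conn.execute(
    "SELECT kind, payload FROM response_segments "
    "WHERE response_id = ? ORDER BY position", ("r1",)).fetchall()
print(rows)
```

Storing the payload as opaque provider JSON is one plausible design choice here: it lets encrypted context blocks round-trip without the library needing to understand their contents.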
The research repository is itself a product of the approach it documents. Every line of code and text in the broader simonw/research GitHub organization was written by an LLM, usually Claude Code. Willison's role is to direct the research, review the outputs, and make the decisions. The curl scripts that probe the vendor APIs were generated by an AI that read the vendor SDK source code. The architecture decision to separate reasoning traces from final responses was informed by that research.
What Willison produced is not a product announcement. It is a technical report from the frontier of building with large language models, written by someone who has been in the trenches with this stack longer than most. The research-llm-apis repo is a reference document: bash scripts that hit real endpoints, with captured JSON responses covering text completion, streaming, vision, tool calling, reasoning, and web search. A developer working on their own LLM integration could clone the repo and have a working test harness against four major providers in minutes.
The deeper pattern is the method. Willison used an AI to do protocol research before writing a line of application code. He did not ask the AI to build the feature. He asked it to understand what the vendors had actually built, at the wire level, so the abstraction he designed would fit the reality. That is a different workflow than prompt-engineering a blog post or generating a patch. It is closer to how a senior engineer might use a research assistant: you know what you need to find out, you delegate the legwork, and you apply your own judgment to the output.
For developers building on top of multiple LLM providers, the tool-execution-during-reasoning problem is not theoretical. Gemini's native web search, Anthropic's computer use, and OpenAI's operator mode all generate intermediate artifacts that have to live somewhere. The conversation abstraction that persists a simple text exchange in SQLite does not automatically extend to a multi-step tool execution trace that includes reasoning-phase content. Willison is rebuilding that abstraction now, with evidence gathered by an AI, and the decisions he makes will shape how a widely-used open-source library handles a class of API behavior that is only going to become more common.