When Shopify's pull request merge growth hit 30 percent month-over-month, Mikhail Parakhin, the company's chief technology officer, made a decision. A pull request — the bundle of code changes a developer submits for review before it enters the main codebase — was the bottleneck. Existing AI review tools were not up to the task. So Shopify built its own.
"I haven't found a good PR review tool that does what I think should be done," Parakhin said on the Latent Space podcast on April 22. "You need to run the largest models. Code review tools are just not going to cut it."
That answer, buried in a 72-minute conversation about Shopify's AI infrastructure, cuts against the prevailing narrative about AI coding. The story most companies tell is about generation: how fast AI can write code, how many pull requests it produces, how much faster features ship. Parakhin is telling a different story. The generation problem is mostly solved. The review problem is getting worse.
Shopify's internal data bears this out. PR merge growth is now running at 30 percent month-over-month, up from 10 percent before AI tools proliferated. The result is more code entering the codebase, faster, than any human review process was designed to handle. Test failure rates are climbing, deployment cycles are lengthening, and rollbacks are becoming more common. "The real problem is since there's so much more code, the probability of at least some tests failing is going up," Parakhin said. "You keep failing, then you have to find the offending PR, evict it, retest it without that PR, and so deployment cycle becomes much longer."
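The evict-and-retest loop Parakhin describes maps onto a simple merge-queue sketch. This is illustrative logic, not Shopify's pipeline: `run_tests` stands in for a full CI suite, and the sketch assumes an offending PR also fails when tested alone rather than only in combination.

```python
def merge_batch(prs, run_tests):
    """Try to land a batch of PRs: when the combined tests fail, retest
    each PR in isolation, evict the offenders, and rerun the batch."""
    batch = list(prs)
    evicted = []
    while batch and not run_tests(batch):
        # Find the offending PRs by retesting each one without the others.
        offenders = [pr for pr in batch if not run_tests([pr])]
        if not offenders:
            break  # failure only appears in combination; needs human triage
        evicted.extend(offenders)
        batch = [pr for pr in batch if pr not in offenders]
    return batch, evicted
```

Every failure forces a round of extra full test runs proportional to the batch size, which is the mechanism behind the lengthening deployment cycles Parakhin describes.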
The irony is clean. AI-generated code is, on average, cleaner than human-generated code per line. But the volume increase swamps the per-line quality gain. "By now a good model writes code on average with fewer bugs than the average human," Parakhin said. "But since they write so much more of it, more of it will make it into production." Net result: more bugs in production, not fewer.
The review gap
Off-the-shelf review tools fail for a structural reason, according to Parakhin. Running code review well requires the largest available models, operating in a slow, sequential critique loop rather than a fast parallel swarm. That architecture is expensive and slow, which conflicts with how most commercial review products are built and priced.
"You end up in a different world where you generate not that many tokens," he said. "You generate few tokens, but it takes a long time because these are expensive models taking turns rather than many agents trying to do many things in parallel."
The metric Parakhin tracks is not raw token consumption but the ratio: budget spent during code generation versus budget spent on review with expensive frontier-class models. Getting that ratio right requires tools built for that specific tradeoff.
Shopify's solution is internal, and Parakhin declined to detail its architecture. But the demand signal is explicit: a company publicly stating that no existing product meets its requirements for AI code review. That is not a small market gap.
The December inflection
The PR review problem did not exist at this scale a year ago. Parakhin described a December 2025 inflection point that changed Shopify's internal adoption curve. "In December was the phase transition when suddenly models gotten good enough that everything took off and started growing," he said. The share of daily active workers using AI tools now approaches 100 percent. "It's hard not to do your job now without interacting deeply with at least one tool."
The mix of tools shifted along with the volume. Cursor and GitHub Copilot, which require an integrated development environment, have not shrunk in absolute terms, but their growth rate has been outpaced by CLI-based tools like Claude Code, Codex, and Shopify's internal agent called River. Parakhin's read: as agents get more capable, the IDE becomes less relevant. What matters is that the agent can operate without a human watching code appear line by line.
Shopify's token policy reflects this. The company funds unlimited token access for all employees, with a floor rather than a ceiling on model quality. "We try to control the models that people use, but from the bottom, not from top," Parakhin said. "Please don't use anything less than Opus 4.6." Some employees use GPT-5.4 Extra High. Nobody is supposed to use cheaper models.
Token consumption has grown accordingly, and the distribution is skewing. The top 10 percent of users are pulling away from the median. Parakhin flagged this as a pattern he is uncertain about: "The distribution is becoming more and more skewed. The top percentiles grow faster. I don't know what it tells me. It feels not ideal, to be honest."
What Tangle and Tangent add
The PR review story is part of a larger architecture Shopify has built and is now talking about publicly for the first time. Parakhin walked through three internal systems that compound each other.
Tangle is the company's third-generation ML workflow engine. It handles any data processing task where teams need to iterate, share results, and reach production quickly. Its defining feature is content-addressed caching: if two teams run the same computation, the result is shared automatically, even if the teams are unaware of each other. "That's the network effect of multiple people helping each other," Parakhin said. The system is open source and designed to go from experiment to production in a single click, without porting code to a separate runtime.
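The caching mechanism can be sketched in a few lines. The names here are illustrative, not Tangle's API: the cache key is derived from the computation itself plus its inputs, so two teams running the same step resolve to the same entry without coordinating.

```python
import hashlib
import json

_cache = {}  # stands in for shared storage visible to every team

def content_address(fn, args):
    """Key = hash of the function's compiled code plus its inputs, so the
    same computation maps to the same entry regardless of who runs it."""
    payload = fn.__code__.co_code + json.dumps(args, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def cached_run(fn, **args):
    key = content_address(fn, args)
    if key not in _cache:        # the first team pays the compute cost...
        _cache[key] = fn(**args)
    return _cache[key]           # ...every later caller reuses the result
```

The network effect Parakhin describes falls out of the addressing scheme: nobody has to know a result already exists to benefit from it.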
Tangent sits on top of Tangle as an auto-research loop. An agent runs experiments continuously against a specified metric, discards failures, keeps improvements, and repeats. The concept comes from Andrej Karpathy's autoresearch project. Shopify's engineering team adapted it, and the results have spread beyond ML engineers. "The highest user right now is one of our PMs," Parakhin said. Tangent is producing gains across domains: Shopify's engineering blog documents a 65 percent reduction in Polaris build time, a jump from 800 to 4,200 queries per second in their search index with no hardware changes, and improvements in prompt compression quality. Shopify's pi-autoresearch fork of the Karpathy tool has more than 3,600 GitHub stars and 200-plus forks.
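The core of a Tangent-style loop (propose an experiment, score it against the target metric, keep only improvements) reduces to a greedy hill climb. This is a generic sketch under that assumption, not Shopify's or Karpathy's actual implementation:

```python
import random

def auto_research(config, propose, evaluate, budget=100, seed=0):
    """Greedy experiment loop: propose a variant, score it against the
    target metric, keep it only if it improves, and repeat."""
    rng = random.Random(seed)
    best, best_score = config, evaluate(config)
    for _ in range(budget):
        candidate = propose(best, rng)   # agent proposes an experiment
        score = evaluate(candidate)      # run it against the metric
        if score > best_score:           # keep improvements...
            best, best_score = candidate, score
        # ...and silently discard failures
    return best, best_score
```

Because the loop only needs a `propose` step and a measurable `evaluate`, it generalizes beyond ML tuning, which is consistent with Parakhin's note that a PM is the heaviest user.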
SimGym is the third piece. It simulates customer behavior on Shopify storefronts using historical transaction data. Simulated AI shoppers browse stores as real customers would, generating predictions about conversion changes before any merchant runs a live A/B test. The moat is the data: Shopify has decades of merchant and buyer behavior across virtually every retail category. Without that history, simulated agents just do what the prompt says. With it, they replicate real behavioral distributions. Shopify recently extended SimGym to analyze single themes rather than requiring A/B comparisons, meaning merchants with no testing infrastructure can get conversion guidance directly.
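The difference the data makes can be shown with a toy comparison: a prompt-only agent falls back to a generic prior over actions, while a data-grounded one samples from the empirical distribution of past sessions. This is purely illustrative; SimGym's internals were not disclosed.

```python
import random
from collections import Counter

def empirical_policy(history):
    """Build an action distribution from historical sessions, so a
    simulated shopper matches real behavioral frequencies."""
    counts = Counter(history)
    total = sum(counts.values())
    actions = sorted(counts)
    weights = [counts[a] / total for a in actions]
    def sample(rng):
        return rng.choices(actions, weights=weights)[0]
    return sample

def uniform_policy(actions):
    """A prompt-only agent with no history: every action is equally likely."""
    def sample(rng):
        return rng.choice(sorted(actions))
    return sample
```

Run many simulated sessions through the empirical policy and the rare-but-critical events (like purchases) occur at realistic rates, which is what makes conversion predictions from simulation meaningful.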
The three systems compound. Tangle provides reproducible experiment infrastructure. Tangent runs continuous optimization against any measurable metric. SimGym provides a fast, cheap signal for changes that would otherwise require weeks of live traffic.
Liquid AI in production
One detail Parakhin raised that has not surfaced elsewhere: Shopify is running Liquid AI's non-transformer architecture in production for two use cases.
Liquid AI, a startup building on liquid neural network research originally developed at MIT, produces models based on differential equations rather than attention mechanisms. The architecture is more compact and computationally efficient than transformers for specific workloads, particularly where long context and low latency are both required.
Shopify runs a 300-million-parameter Liquid model at 30 milliseconds end-to-end for search query understanding. When a user types a query, the model generates a full tree of possible query interpretations, including personalization signals from prior queries, fast enough for production search. Getting a transformer to that latency target required engineering work with NVIDIA that the Liquid model did not, Parakhin said.
For batch catalog work, including product taxonomy assignment, attribute normalization, and identity linking across billions of products, Shopify distills larger frontier models into Liquid models in the seven-to-eight billion parameter range. The batch workloads need throughput and long context, not low latency. Liquid's subquadratic scaling with context length makes it competitive for that use case.
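Distillation of this kind, training a small model to reproduce a frontier model's outputs, centers on a soft-label objective. A minimal sketch of that loss, illustrative rather than Shopify's pipeline:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution; higher
    temperature softens it, exposing the model's full preference ranking."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the
    teacher's: minimized when the small model reproduces the big model's
    full output distribution, not just its top answer."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))
```

Matching the full distribution rather than hard labels is what lets a seven-to-eight-billion-parameter student inherit much of the frontier model's behavior on a narrow task like taxonomy assignment.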
"It's the only non-transformer architecture that I found being genuinely competitive," Parakhin said. He was careful not to overstate it: Liquid is not a frontier model and is not positioned to compete with GPT-5.4 or Claude for general tasks. But for the specific profile of small, fast, or long-context batch inference, Shopify is routing workloads to Liquid at an increasing rate.
What breaks next
Parakhin spent time on what he sees as the next infrastructure problem: Git, pull requests, and CI/CD were designed for humans writing code at human speed. At 30 percent month-over-month PR growth, Shopify is already trying to keep its head above water, he said. Graphite's stacked PR tooling helps, but it does not solve the underlying problem. "I would say we probably need a different metaphor or different whole design of how to process it in the new agentic world. I haven't seen anything dramatically better yet."
He raised microservices, somewhat reluctantly, as a possible answer. If code can be written at machine speed and deployed independently across small services, the global mutex that merge conflicts impose becomes less of a bottleneck. "I can't believe I'm saying this because I'm a lifelong opponent of microservices," he said. "But maybe in the new world they will make a comeback."
Shopify's vice president of engineering, Farhan Thawar, told Bessemer Venture Partners earlier this month that his team has achieved roughly 20 percent productivity gains from AI tools. Parakhin's framing suggests that number is a floor: the gains in code generation are outrunning the infrastructure needed to absorb them.
The companies building AI review tooling now have an unusually direct brief: Shopify's CTO has listed the requirements, explained why the market has not met them, and confirmed that a company at this scale is building its own. That is the kind of signal that tends to attract founders. What Shopify built internally, someone will productize externally.