A CarbonLLM customer recently switched one feature from GPT-4o to o3 for quality. Carbon per API call rose roughly 40-fold. Nobody caught it until the CSRD export, the mandatory EU sustainability disclosure filing that requires companies to itemize their carbon footprint. That is the accountability gap at the center of a new paper from a Paris research firm, and it is about to become everyone's problem.
The EU AI Act requires general-purpose AI providers to disclose energy-efficiency metrics to regulators. Deadlines are staggered: providers who launched GPAI models after August 2, 2025 must comply by August 2, 2026; providers with pre-existing models have until August 2, 2027, according to White & Case. Fines apply. But as of March 2026, only 8 of 27 EU member states had designated enforcement contacts, and harmonized technical standards were still unresolved. The compliance deadline is arriving before anyone has agreed on how to measure.
The paper, from Emotia, a Paris-based research company, describes what happens when you try to audit the AI industry on its own terms. Arnault Pachot and Thierry Petit built the ImpactLLM Observatory: a framework that converts natural-language app descriptions into bounded estimates of energy use and carbon across 41 market models. It is not a measurement tool. It is a proxy with explicit, publicly documented assumptions that anyone can trace back to published literature. The methodology starts from a single anchor point: 0.24 watt-hours per prompt, derived from a Google Gemini Apps median in a 2025 study by Elsworth et al. published on ScienceDirect. From there it scales by output token weight, context window size, and architecture class to produce low-central-high interval estimates rather than falsely precise point values.
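The anchor-and-scale approach can be sketched in a few lines. The 0.24 Wh anchor comes from the study cited above; everything else here, the architecture multipliers, the reference token counts, the square-root context scaling, and the 3x interval bounds, is an illustrative placeholder, not Emotia's published coefficients.

```python
# Sketch of anchor-based interval estimation, in the spirit of the
# ImpactLLM Observatory. Only the anchor value is from the paper; the
# scaling factors below are hypothetical stand-ins for illustration.

ANCHOR_WH = 0.24  # median Wh per prompt, Gemini Apps (Elsworth et al., 2025)

# Hypothetical multipliers by architecture class (not Emotia's values).
ARCH_FACTOR = {"small": 0.1, "dense": 1.0, "reasoning": 12.0}

def estimate_wh(arch: str, output_tokens: int, context_tokens: int,
                ref_output: int = 250, ref_context: int = 8_000) -> dict:
    """Return a low/central/high interval in Wh per request."""
    central = (ANCHOR_WH
               * ARCH_FACTOR[arch]
               * (output_tokens / ref_output)            # output-token weight
               * (context_tokens / ref_context) ** 0.5)  # context-size scaling
    # Report a bounded interval rather than a falsely precise point.
    return {"low": central / 3, "central": central, "high": central * 3}

print(estimate_wh("reasoning", output_tokens=1200, context_tokens=32_000))
```

The point of the design is traceability, not accuracy: every multiplier is a named, inspectable assumption that can be swapped when better public data arrives.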
Applied to the current model lineup, the numbers are suggestive. Claude Opus 4.1, Anthropic's largest current offering, screens at roughly 3 watt-hours per request. GPT-4o mini, OpenAI's smallest production model, screens at 0.0155 watt-hours per request. GPT-5.2, OpenAI's largest, screens at 2.79 watt-hours per request and 101 gigawatt-hours of training. The range spans two orders of magnitude across the model lineup, not because anyone has measured them directly, but because the scaling assumptions are explicit and the anchors are public.
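The claimed spread follows directly from the screening figures quoted above:

```python
import math

# Screening estimates from the article (Wh per request).
estimates = {"GPT-4o mini": 0.0155, "GPT-5.2": 2.79, "Claude Opus 4.1": 3.0}

spread = max(estimates.values()) / min(estimates.values())
print(f"spread: {spread:.0f}x, ~{math.log10(spread):.1f} orders of magnitude")
```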
"The relevant methodological task is to organize approximation rather than to pretend to escape it," the authors write. They are not claiming their numbers are right. They are claiming their numbers are the best available given that the providers themselves decline to publish energy data.
The same accountability gap shows up in commercial tooling. CarbonLLM, a production tracking service used by enterprise sustainability teams, builds carbon dashboards from data the customer must send after the fact: model names and token counts. The providers do not offer equivalent tooling. The result was the 40x spike: a feature that looked like a quality win turned out to carry a carbon cost nobody had visibility into until the compliance report ran.
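The after-the-fact tracking pattern the article describes can be sketched generically. This is not CarbonLLM's actual API; the per-model watt-hour figures and the grid intensity are hypothetical assumptions chosen to reproduce a spike of the magnitude described.

```python
# Generic sketch of after-the-fact carbon tracking: the client reports
# model and token counts per call, and the tracker converts them to grams
# of CO2e. All figures below are illustrative assumptions.

WH_PER_1K_TOKENS = {"gpt-4o": 0.3, "o3": 12.0}  # hypothetical per-model cost
GRID_G_CO2_PER_KWH = 350                         # hypothetical grid mix

def log_call(ledger: list, model: str, tokens: int) -> None:
    """Append one call's estimated emissions to the ledger."""
    wh = WH_PER_1K_TOKENS[model] * tokens / 1_000
    ledger.append({"model": model, "g_co2e": wh / 1_000 * GRID_G_CO2_PER_KWH})

ledger = []
log_call(ledger, "gpt-4o", tokens=800)  # before the model switch
log_call(ledger, "o3", tokens=800)      # after the model switch
before, after = (entry["g_co2e"] for entry in ledger)
print(f"switching this call raised carbon {after / before:.0f}x")
```

Because the ledger is populated only by what the customer remembers to report, a model switch like the one above is invisible until someone runs the numbers, which is exactly the failure mode the compliance report exposed.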
The research literature confirms the dynamic. A July 2025 simulation study of LLM inference energy found that inference now accounts for more than half of an LLM's total lifecycle carbon footprint, displacing training as the dominant cost factor. Reasoning-enabled models, which generate 30 to 50 times more carbon per call than concise alternatives, are the leading edge of this shift. Hugging Face's analysis of 2,742 model evaluations found that bigger models consistently cost more in carbon, but with pronounced diminishing returns, and that community fine-tunes tend to be more carbon-efficient than official releases from larger organizations.
What happens next depends on what the providers publish: real figures, methodology that approximates figures, or lobbying for softer standards. The ImpactLLM Observatory is what the second option looks like when it arrives before the first.