Jensen Huang sent a company-wide email to NVIDIA employees this week with a two-sentence ask: start using GPT-5.5-powered Codex, and do it now. "Let's jump to lightspeed," he wrote. What makes that otherwise unremarkable internal memo worth noting is the experiment it launched: more than 10,000 NVIDIANs across engineering, product, legal, marketing, finance, sales, HR, operations, and developer programs are now running the AI coding tool on NVIDIA's own GB200 NVL72 hardware, in production. The question is whether what happens next inside NVIDIA's walls matches the numbers on the page.
On standard benchmarks, GPT-5.5 scores 58.6 percent on SWE-Bench Pro, trailing Anthropic's Claude Opus 4.7 at 64.3 percent, according to Handy AI's model comparison. On a separate evaluation measuring how often a model produces fabricated information in contexts where accuracy matters, GPT-5.5 posts an 86 percent confabulation rate. Opus 4.7 scores 36 percent on the same measure, according to the same Handy AI analysis. The 86 percent figure describes a model's tendency to produce confident misinformation when it cannot answer correctly, not the share of all its answers that are wrong.
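The distinction matters more than it might seem. A quick back-of-the-envelope calculation with hypothetical numbers (the accuracy figure below is assumed for illustration, not taken from any benchmark) shows why a confabulation rate is not an overall error rate:

```python
# Hypothetical illustration: the confabulation rate applies only to the
# questions a model cannot answer correctly, so the share of ALL answers
# that are confident fabrications is far smaller than the headline rate.
answer_accuracy = 0.80     # assumed: model answers 80% of questions correctly
confabulation_rate = 0.86  # of the questions it misses, 86% draw confident fabrications

fabricated_share = (1 - answer_accuracy) * confabulation_rate
print(f"{fabricated_share:.1%} of all answers are confident fabrications")
# With these assumed numbers: 17.2% of all answers, not 86%.
```

Under these assumed numbers, roughly one answer in six would be a confident fabrication. That is still a serious problem for production code, but it is a different problem than the raw 86 percent figure suggests.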
Inside NVIDIA, engineers have reported measurable gains: debugging cycles that once stretched across days are closing in hours, and experiments that previously required weeks are running overnight in complex, multi-file codebases, according to NVIDIA's blog. Those reports are NVIDIA's own; no independent verification exists yet.
On Blackwell hardware, GPT-5.5 delivers a 35-fold reduction in cost per million tokens and a 50-fold improvement in token throughput per watt compared with the prior generation, according to NVIDIA's blog. Analyst firm Signal65 confirmed the token-per-watt gains. (Signal65 is a paid analyst firm covering NVIDIA; its independence credential is technically accurate but practically limited.) The NVIDIA-OpenAI partnership dates to 2016, when Huang hand-delivered the first DGX-1 AI supercomputer to OpenAI's San Francisco headquarters, according to NVIDIA's blog.
NVIDIA's internal rollout is not a controlled demo. It is a production stress test with real engineers, real code, and real exposure to the cost of errors if hallucinations reach shipped products. The company's vertically integrated position — running its own model on its own hardware with its own internal tooling — is not a condition most enterprise customers can replicate. If the cost and performance gains are specific to NVIDIA's infrastructure, the story collapses from an enterprise adoption signal to a partnership PR piece with 10,000 internal users and no verifiable output.
What to watch: in three to six months, NVIDIA will either have internal data on whether Codex changed engineering output or it will not. That data, if it surfaces, will be the first credible signal on whether frontier AI coding tools have crossed a threshold for real engineering work. The industry has spent two years watching demos. The live test is running now.