NVIDIA released a quantum calibration benchmark on April 14, and it was built to address a problem: most AI benchmarks for quantum hardware are validated on synthetic data that does not reflect real physics. The company needed a dataset of actual qubit errors recorded under conditions that cannot be replicated in a surface laboratory. That data came from Fermilab's NEXUS facility, a detector tunnel buried three hundred and fifty feet underground and built for dark matter searches. Even with lead shielding closed around a four-qubit test chip, researchers from Northwestern and Fermilab measured correlated charge jumps across multiple qubits simultaneously: the kind of coordinated error that quantum error correction codes are designed to catch, except here the errors come from a source nobody has yet identified.
This is not a minor experimental inconvenience. Correlated errors across qubits are harder to correct than independent errors, because error correction assumes the noise on one qubit is unrelated to the noise on its neighbors. When radiation causes multiple qubits to misbehave at the same time, that assumption breaks. The result, published in Nature Communications in November 2025, is the first controlled measurement of this phenomenon in an underground environment designed to suppress cosmic ray noise.
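The difference between independent and correlated noise can be made concrete with a toy simulation. The sketch below is illustrative only, not the NEXUS analysis: it compares a three-qubit repetition code under two assumed noise models, qubits flipping independently with probability p versus a single burst that flips all three at once, a crude stand-in for a radiation event.

```python
import random

random.seed(0)

def logical_error_rate(p, correlated, trials=100_000):
    """Monte Carlo estimate of the logical error rate of a
    three-qubit repetition code under majority-vote decoding."""
    failures = 0
    for _ in range(trials):
        if correlated:
            # One event flips every qubit together with probability p.
            flips = [random.random() < p] * 3
        else:
            # Each qubit flips on its own with probability p.
            flips = [random.random() < p for _ in range(3)]
        # Majority vote fails when two or more qubits flipped.
        if sum(flips) >= 2:
            failures += 1
    return failures / trials

p = 0.01
print("independent:", logical_error_rate(p, correlated=False))  # ~3*p**2
print("correlated: ", logical_error_rate(p, correlated=True))   # ~p
```

Under independent noise the code suppresses errors quadratically, to roughly 3p²; under fully correlated bursts the logical error rate collapses back to roughly p, and the code buys essentially nothing. Real codes and real noise are more nuanced, but this is the broken assumption the measurements expose.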
NVIDIA needed this data. Ising-Calibration-1, the model it released on April 14, is a thirty-five-billion-parameter vision-language model fine-tuned on quantum processor calibration tasks. To validate it, Northwestern and Fermilab provided the NEXUS radiation dataset: real measurements of what background radiation does to superconducting qubits, collected over a month of continuous monitoring. That is the scarce ingredient that separates QCalEval, the new benchmark NVIDIA released alongside the model, from a standard AI benchmark press release.
QCalEval contains two hundred and forty-three samples across eighty-seven scenario types from twenty-two experiment families, spanning superconducting qubits and neutral atoms. NVIDIA has released the dataset publicly on HuggingFace and made Ising-Calibration-1 an open-weight model. Ising-Calibration-1 scores seventy-four point seven on its own benchmark in a zero-shot setting, compared to seventy-two point three for the best general-purpose vision-language model tested under the same conditions. The advantage is real, if modest, and specific to the domain. The six question types the benchmark evaluates (defect identification, parameter extraction, experiment comparison, scheduling, debugging, and pulse analysis) map directly onto what a quantum hardware engineer actually does.
The honest limitation is the hardware. The NEXUS result comes from a four-qubit chip. Whether the correlated charge noise observed there generalizes to the hundreds or thousands of qubits in the systems IBM, Google, and others are building toward fault tolerance is an open question. Lead shielding reduced the correlated bursts but did not eliminate them, and the background source that persists even in a shielded underground lab at Fermilab remains unidentified.
"AI is becoming the control plane for quantum hardware," said Sam Stanwyck, NVIDIA's director of quantum product. Current quantum processors make roughly one error per thousand operations. Fault-tolerant applications require closer to one in a trillion. The gap is not a software problem. It is a hardware noise problem, and noise at this scale requires measurement, modeling, and mitigation at every layer of the system.
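The gap between one error per thousand operations and one per trillion can be sized with the standard surface-code scaling heuristic, p_L ≈ (p/p_th)^((d+1)/2), where d is the code distance and p_th is the error threshold. The sketch below is an order-of-magnitude illustration under an assumed ~1% threshold, not a claim about any vendor's hardware.

```python
def projected_logical_rate(p_phys, d, p_th=1e-2):
    """Heuristic surface-code logical error rate at code distance d.
    Assumes independent physical errors and a ~1% threshold; both
    are illustrative assumptions, not measured values."""
    return (p_phys / p_th) ** ((d + 1) / 2)

# Physical error rate of one per thousand operations:
for d in (3, 7, 15, 25):
    print(f"d={d:2d}  projected logical rate ~ {projected_logical_rate(1e-3, d):.0e}")
```

Under this scaling, reaching the one-in-a-trillion regime from a one-in-a-thousand physical rate takes a code distance in the mid-twenties, which for a surface code means roughly 2d², over a thousand, physical qubits per logical qubit. The heuristic also assumes errors are independent, which is precisely what the NEXUS measurements call into question.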
This is where the NEXUS data enters the picture. Real radiation-induced correlated errors cannot be synthesized in a surface laboratory. The cosmic ray muon flux, the gamma ray background, the neutron capture cascades: these are environmental conditions that only a facility designed for particle physics can provide. NVIDIA needed this specific data because the phenomenon it records cannot be approximated. That is why QCalEval is not just another leaderboard. It is a validation set for whether a calibration AI can recognize a class of hardware failures that occurs only under physical conditions most labs cannot replicate.
The SQMS and CosmiQ programs at Fermilab plan to expand the framework with additional testbeds called QUIET and LOUD, which will generate more data under controlled radiation conditions. If the programs produce results at scale, they will address the generalizability question that the four-qubit NEXUS result cannot answer on its own. Until then, the benchmark reflects one facility's dataset and one model trained on it.
What remains constant across any qubit modality is the underlying physics. Radiation causes charge fluctuations in superconducting circuits. Those fluctuations affect multiple qubits at once. Error correction codes designed for independent errors handle correlated errors poorly. Better shielding, different qubit geometries, real-time calibration: these are the engineering responses, and all of them require first knowing the noise floor. NEXUS was built to find dark matter. It found something more practical instead.