The Security Benchmark That Wanted to Prove Safety Tuning Breaks AI Agents. It Found the Opposite.
Ask yourself what happens when you remove the safety guardrails from a language model running as a security agent. Most people assume the model becomes more capable — free from the cautious refusals, willing to do the thing it was trained not to do. A new benchmark from University College London researchers suggests the opposite is true: removing safety tuning does not unlock security capability, it destroys the model's ability to do the job at all.
Isaac David and Arthur Gervais built a trace-based evaluation of 1,500 security-agent runs across 30 local vulnerability-analysis tasks, comparing stock aligned models against their uncensored or abliterated derivatives, according to their arXiv paper. The pairs were built on Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, and Llama 3.1 8B. Every run shared the same harness, tools, budgets, seeds, and success predicates, so differences in outcome come from the models rather than the setup.
The Gemma results are the sharpest. Aligned Gemma 4 31B completed 14.0% of security tasks successfully. The less-restricted derivative — the one designed to refuse less — completed 0.7%. For Gemma 4 26B, the split was 10.7% versus 0.0%. The less-restricted version of the 26B model could not solve a single task.
That alone would undercut the idea that safety tuning suppresses capability. But the data goes further. In the 31B less-restricted traces, the researchers logged a 0.0% refusal rate, 0.0% suppressed actions, and 0.0% unsafe actions. The model was not refusing anything. It was also not doing anything useful. Grounding scores, which measure whether the model's artifacts actually reference the local evidence, dropped from 3.91 to 3.27 for the 31B pair and from 4.12 to 1.64 for the 26B pair in moving from the aligned model to its less-restricted derivative, the paper shows. The less-restricted models produced fluent reports that cited nothing in the actual code.
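To make the grounding idea concrete, here is a minimal Python sketch. It is not the paper's scoring method, which this article does not describe. The function name `grounding_fraction`, the convention that reports cite identifiers in backticks, and the sample strings are all assumptions made for illustration. The sketch only asks what share of the identifiers a report cites actually occur in the local source.

```python
import re

def grounding_fraction(report: str, source: str) -> float:
    """Share of backticked identifiers in `report` that occur in `source`.

    Hypothetical toy metric for illustration, not the benchmark's grounding score.
    """
    cited = set(re.findall(r"`([A-Za-z_][A-Za-z0-9_]*)`", report))
    if not cited:
        return 0.0  # a report that cites nothing is treated as ungrounded
    # Crude substring match; a real check would parse the source properly.
    found = {name for name in cited if name in source}
    return len(found) / len(cited)

# Invented sample data, not taken from the benchmark.
SOURCE = "int parse_header(char *buf, size_t len) { memcpy(dst, buf, len); }"
GROUNDED = "The call to `memcpy` in `parse_header` copies `len` bytes unchecked."
FLUENT = "The `validate_input` routine fails to sanitize `user_data`."

print(grounding_fraction(GROUNDED, SOURCE))
print(grounding_fraction(FLUENT, SOURCE))
```

The second report reads just as confidently as the first, but none of the names it cites exist in the code under analysis. That is the kind of failure a refusal counter cannot see.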
Cross-family results point the same way, though the failure modes differ. Qwen2.5-Coder's abliterated derivative performed worse than its aligned sibling on task success: 2.0% versus 5.3%. The abliterated Llama 3.1 8B derivative failed the tool protocol entirely: it could not reliably use the provided tools at all. This is not a story where less alignment always produces less capability; it is a story where the specific modifications used to reduce refusals also appear to degrade the reasoning needed for tool use.
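The size of each gap is easier to compare in percentage points. The short Python sketch below uses only the success rates quoted in this article; the names `REPORTED` and `gap_points` are illustrative. Llama 3.1 8B is left out because no success figures are given for it here.

```python
# Success rates in percent, as quoted in this article: (aligned, less-restricted).
REPORTED = {
    "Gemma 4 31B": (14.0, 0.7),
    "Gemma 4 26B A4B": (10.7, 0.0),
    "Qwen2.5-Coder 7B": (5.3, 2.0),
}

def gap_points(aligned: float, derivative: float) -> float:
    """Aligned minus derivative, in percentage points, rounded to one decimal."""
    return round(aligned - derivative, 1)

GAPS = {name: gap_points(a, d) for name, (a, d) in REPORTED.items()}

for name, gap in GAPS.items():
    print(f"{name}: aligned model ahead by {gap:.1f} points")
```

On these figures the aligned model leads in every pair, by 13.3, 10.7, and 3.3 points respectively.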
The paper's framing is careful. The title asks whether safety alignment survives autonomous security-agent work. The answer the data actually supports is narrower: the less-restricted derivatives tested do not outperform aligned models on any task category. All families fail hard proof-of-trigger and patch-verification tasks — the most demanding work in vulnerability analysis — whether aligned or not. Safety alignment is not the reason models struggle with those tasks.
What the benchmark does establish is that refusal rate is an inadequate safety signal for agents. A model that refuses nothing is not automatically more capable; it may simply be less coherent. The authors argue safety in autonomous agents needs to be measured across a system: refusal behavior, tool reliability, and evidence grounding together, not refusal alone.
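One way to picture that system-level view is a summary that reports several signals side by side instead of refusal alone. The Python sketch below is an assumption-laden illustration: the `RunTrace` fields, the `summarize` function, and the sample records are hypothetical and do not reflect the paper's trace schema or data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunTrace:
    # Hypothetical per-run record; field names are assumptions for illustration.
    refused: bool          # did the agent refuse the task?
    tool_calls_ok: int     # tool calls that followed the protocol
    tool_calls_total: int  # all tool calls attempted
    grounding: float       # evidence-grounding score for the final artifact
    succeeded: bool        # did the run satisfy the success predicate?

def summarize(traces: list[RunTrace]) -> dict[str, float]:
    """Report refusal, tool reliability, grounding, and success together."""
    n = len(traces)
    if n == 0:
        raise ValueError("no traces to summarize")
    calls = sum(t.tool_calls_total for t in traces)
    ok = sum(t.tool_calls_ok for t in traces)
    return {
        "refusal_rate": sum(t.refused for t in traces) / n,
        "tool_reliability": ok / calls if calls else 0.0,
        "mean_grounding": sum(t.grounding for t in traces) / n,
        "success_rate": sum(t.succeeded for t in traces) / n,
    }

# Invented example: an agent that never refuses but rarely uses tools correctly.
never_refuses = [
    RunTrace(refused=False, tool_calls_ok=1, tool_calls_total=4, grounding=1.5, succeeded=False),
    RunTrace(refused=False, tool_calls_ok=2, tool_calls_total=4, grounding=2.5, succeeded=False),
]
SUMMARY = summarize(never_refuses)
print(SUMMARY)
```

Judged on refusal rate alone, this invented agent looks perfect at 0.0. The other three columns show it getting nothing done.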
The practical implication for builders is uncomfortable. Several security-agent startups openly market their products as running on uncensored or abliterated model derivatives. The UCL data suggests that modification is not a capability unlock — it is a trade of one failure mode for a worse one. The aligned model that refuses a vulnerability description it should analyze is frustrating. The unaligned model that cheerfully produces a fluent, grounded-sounding report that is wrong about everything is harder to catch.
Arthur Gervais's work has Google funding through a 2025 Academic Research Award in AI safety, so his stake is in the mainstreaming of AI security agents, not in opposing it. The paper's own limitations section notes the derivatives tested are not clean causal counterfactuals for safety alignment; they are third-party modifications that may alter more than refusal behavior. The Gemma gaps appear on ordinary coding tasks as well as security tasks, which suggests the effect is not purely about security-specific refusal. That qualification is in the paper. It is not in the headline numbers that will circulate.
The benchmark code and traces are on GitHub. The paper is on arXiv. The 30 task manifests are fixed and the success predicates are deterministic — unlike most AI safety benchmarks, this one is auditable by anyone who wants to check the work. Whether anyone in the industry wants to hear what it says is a different question.