# Berkeley Researchers Build Algorithms to Identify Which LLM Features Actually Matter

When something goes wrong inside a large language model — a biased output, a nonsensical answer, a safety failure — the natural question is: why? Which specific combination of inputs, training examples, or internal components caused the model to behave that way?
That's harder to answer than it sounds. Modern LLMs don't work by ticking through a checklist. They synthesize complex relationships between thousands of input features, and those relationships — the interactions between words, concepts, and model components — are what drive their behavior. Finding those interactions has historically been like searching for a needle in a haystack, except the haystack grows exponentially with every feature you add.
A team from UC Berkeley's Artificial Intelligence Research (BAIR) lab has a new approach. Their algorithms, called SPEX and ProxySPEX, can identify the most influential interactions in an LLM at scale — from dozens of components to thousands.
The core insight is structural: while the number of possible interactions is enormous, the number of influential ones is small. LLMs, like most real-world systems, tend to rely on sparse interactions — a handful of relationships that actually drive any given output. The Berkeley team framed this as a sparse recovery problem, borrowing tools from signal processing and coding theory to isolate those critical relationships without exhaustively testing every possible combination.
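The sparse-recovery framing can be illustrated with a toy surrogate: sample random feature masks, query the model on each masked input, and fit a sparse linear model over candidate interaction terms so that only the few influential ones survive. This is a hedged sketch of the general idea, not the SPEX algorithm itself (SPEX draws on signal-processing and coding-theory machinery rather than a plain Lasso); the synthetic `value_function` and all numbers here are hypothetical:

```python
import itertools
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_features = 10

# Toy stand-in for "query the model on a masked input": the output is
# driven by one main effect (feature 3) and one pairwise interaction (1, 7).
def value_function(mask):
    return 2.0 * mask[3] + 5.0 * mask[1] * mask[7]

# Sample random masks (feature subsets) and record the model's output.
masks = rng.integers(0, 2, size=(400, n_features))
y = np.array([value_function(m) for m in masks])

# Design matrix: all singletons and all pairs as candidate interactions.
pairs = list(itertools.combinations(range(n_features), 2))
X = np.hstack([
    masks,
    np.array([[m[i] * m[j] for (i, j) in pairs] for m in masks]),
])

# Sparse recovery: the L1 penalty zeroes out the irrelevant interactions.
fit = Lasso(alpha=0.05).fit(X, y)
names = [f"({i},)" for i in range(n_features)] + [str(p) for p in pairs]
top = np.argsort(-np.abs(fit.coef_))[:2]
recovered = sorted(names[t] for t in top)
print(recovered)
```

Even with 55 candidate terms and only 400 queries, the two truly influential terms dominate the recovered coefficients; the point of SPEX is to make this kind of recovery tractable when the candidate set is exponentially large.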
"SPEX matches the high faithfulness of existing interaction techniques on short inputs, but uniquely retains this performance as the context scales to thousands of features," according to the team's blog post. "Marginal approaches like LIME and Banzhaf can also operate at this scale, but they exhibit significantly lower faithfulness because they fail to capture the complex interactions driving the model's output."
The follow-up algorithm, ProxySPEX, adds another structural observation: hierarchy. When a high-order interaction matters, its lower-order subsets tend to matter too. Exploiting this yields a roughly 10x computational improvement, matching SPEX's accuracy with far fewer inference calls.
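The hierarchical shortcut can be sketched with an Apriori-style candidate generation: an order-k interaction is worth evaluating only if all of its order-(k-1) subsets were already found influential. This is an illustrative sketch, not the actual ProxySPEX implementation, and the influence scores below are made up:

```python
import itertools

# Hypothetical influence scores already measured for singletons and pairs.
influence = {
    (0,): 0.9, (1,): 0.8, (2,): 0.7, (3,): 0.05,
    (0, 1): 0.6, (0, 2): 0.5, (1, 2): 0.4,
}
threshold = 0.1

def candidates(order, kept):
    """Order-k sets whose every (k-1)-subset survived (Apriori-style pruning)."""
    items = sorted({i for s in kept for i in s})
    out = []
    for combo in itertools.combinations(items, order):
        if all(sub in kept for sub in itertools.combinations(combo, order - 1)):
            out.append(combo)
    return out

kept_pairs = {s for s in influence if len(s) == 2 and influence[s] > threshold}
# Only triples whose constituent pairs all survived need to be queried at all.
to_test = candidates(3, kept_pairs)
print(to_test)  # -> [(0, 1, 2)]
```

Because feature 3's singleton score falls below the threshold, every triple containing it is pruned without ever being evaluated; this pruning is where the savings in inference calls come from.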
The team demonstrated the approach on several problems.
The code for both algorithms is available in the SHAP-IQ repository.
Papers: SPEX (ICML 2025), ProxySPEX (NeurIPS 2025)
