Model diffing proposes understanding the differences in behaviour or cognition between two models

Fri, Jun 12 · 5:14 PM UTC 2 min read

Compiled by KHAO Editorial — aggregated from 2 sources. See llms.txt for citation guidance.


Figure 1: The false positive rate (FPR) of our diffing agent is low when it is asked to evaluate differences between two identical models.
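As a rough illustration of the quantity in Figure 1, the sketch below computes a false positive rate over runs in which both models are identical, so any claimed difference is spurious. The definition used here (fraction of runs with at least one claimed difference), the function name, and the data are assumptions made for illustration, not the team's actual metric or results.

```python
from typing import List, Sequence


def false_positive_rate(claimed_diffs_per_run: Sequence[List[str]]) -> float:
    """Fraction of identical-model runs in which any difference was claimed.

    Assumed definition: a run counts as a false positive if the agent
    reports at least one difference between two identical models.
    """
    if not claimed_diffs_per_run:
        raise ValueError("need at least one run")
    false_positives = sum(1 for claims in claimed_diffs_per_run if claims)
    return false_positives / len(claimed_diffs_per_run)


# Made-up data: four runs on identical models, one spurious claim.
runs = [[], [], ["model B is more verbose"], []]
print(false_positive_rate(runs))  # 0.25
```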

In the context of machine learning, these diffs might reveal interesting and surprising insights.

Key facts

Most work so far has been white-box, attempting to understand the differences in internal structure and cognition between two models (Bricken et al. 2024, Lindsey et al. 2024, Minder et al
Agent loop (as worded in the agent's instructions): "You have 10 turns available to you to conduct your investigation"
The team ran their agent 50 times with different random seeds, producing nearly 50 findings.
To the extent that their plan for building safe AI models is iterative (Barnes, Wijk and Chan, 2023, Shah et al
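The turn budget and the repeated seeded runs in the key facts above can be pictured with a minimal sketch. Only the numbers 10 and 50 come from the source; `run_agent`, `toy_step`, and the random stand-in for an investigation step are hypothetical and do not reflect the team's actual scaffold.

```python
import random
from typing import Callable, List, Optional

MAX_TURNS = 10  # turn budget quoted in the source
NUM_RUNS = 50   # number of seeded runs quoted in the source

# A step takes the turn index and a seeded RNG, and may return a finding.
Step = Callable[[int, random.Random], Optional[str]]


def run_agent(step: Step, seed: int, max_turns: int = MAX_TURNS) -> Optional[str]:
    """Run one investigation, stopping at the first finding or at the turn limit."""
    rng = random.Random(seed)  # separate seed per run gives varied, reproducible runs
    for turn in range(max_turns):
        finding = step(turn, rng)
        if finding is not None:
            return finding
    return None  # budget exhausted without a finding


def toy_step(turn: int, rng: random.Random) -> Optional[str]:
    """Hypothetical stand-in for one agent turn (a real turn would call tools)."""
    if rng.random() < 0.3:
        return f"finding reported on turn {turn + 1}"
    return None


findings: List[str] = []
for seed in range(NUM_RUNS):
    result = run_agent(toy_step, seed)
    if result is not None:
        findings.append(result)

print(f"{len(findings)} findings from {NUM_RUNS} runs")
```

Because a run can exhaust its budget without reporting anything, the number of findings can fall slightly below the number of runs.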

Summary

This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, covering interpretability and adjacent areas. Standard methods for assessing the safety and capabilities of frontier LLMs rely on capability and propensity evaluations, which test only for behaviours that evaluators thought to look for. Methods that can reliably surface "unknown unknowns" in model behaviour complement this paradigm and help address that limitation. Model diffing proposes understanding the differences in behaviour or cognition between two models, rather than trying to understand a single model in isolation. The model diffing work closest to the team's is Dunlap et al. 2024 and Kempf et al. 2026, which show that black-box approaches can be competitive. Auditing agents are scaffolded LLMs tasked with auditing the behaviour of a distinct LLM, given several affordances via tools (Bricken et al. 2025).
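To make the black-box idea concrete, here is a deliberately simplified sketch: both models are treated as opaque functions from prompt to response, and the prompts on which they disagree are collected. The toy models, the `diff_models` helper, and the exact-string comparison are all assumptions for illustration. Real LLM outputs are stochastic, and an auditing agent would judge differences with far more nuance than string equality.

```python
from typing import Callable, List, Tuple

# A black-box model: we only see prompt in, response out.
Model = Callable[[str], str]


def diff_models(
    model_a: Model, model_b: Model, prompts: List[str]
) -> List[Tuple[str, str, str]]:
    """Return (prompt, response_a, response_b) for prompts where responses differ."""
    diffs = []
    for prompt in prompts:
        out_a, out_b = model_a(prompt), model_b(prompt)
        if out_a != out_b:  # simplification: exact string comparison
            diffs.append((prompt, out_a, out_b))
    return diffs


def base_model(prompt: str) -> str:
    """Toy stand-in for a reference model."""
    return prompt.upper()


def tuned_model(prompt: str) -> str:
    """Toy stand-in for a modified model that behaves differently on some prompts."""
    if "please" in prompt:
        return prompt.upper() + "!"
    return prompt.upper()


prompts = ["hello", "please help", "tell me a joke"]
for prompt, out_a, out_b in diff_models(base_model, tuned_model, prompts):
    print(f"{prompt!r}: {out_a!r} vs {out_b!r}")
```

Diffing a model against itself with this helper returns an empty list, which is the toy analogue of the identical-model check used to estimate false positives.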

#Gemini #Google