Gemini · Google · Alignment Forum
Model diffing proposes understanding the differences in behaviour or cognition between two models
Compiled by KHAO Editorial — aggregated from 2 sources. See llms.txt for citation guidance.
In the context of machine learning, these diffs might reveal interesting and surprising insights.
Key facts
- Most work so far has been white-box, attempting to understand the differences in internal structure and cognition between two models (Bricken et al. 2024, Lindsey et al. 2024, Minder et al
- The auditing agent's instructions include a turn budget ("0. **Agent Loop**: You have 10 turns available to you to conduct your investigation")
- The team run their agent 50 times with different random seeds, producing nearly 50 findings
- To the extent that their plan for building safe AI models is iterative (Barnes, Wijk and Chan, 2023, Shah et al
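The turn budget and the repeated seeded runs mentioned in the key facts can be illustrated with a small sketch. This is not the team's scaffold: `toy_step`, `run_audit`, and `run_all` are hypothetical names, and the placeholder step stands in for what would be an LLM call plus tool use in a real auditing agent. Only the numbers 10 (turns) and 50 (runs) come from the source.

```python
import random
from typing import Callable, List

MAX_TURNS = 10  # turn budget quoted in the key facts
NUM_RUNS = 50   # number of seeded runs quoted in the key facts

# Hypothetical signature: given an RNG and the transcript so far, return the next note.
Step = Callable[[random.Random, List[str]], str]


def toy_step(rng: random.Random, transcript: List[str]) -> str:
    """Placeholder for one agent turn (an LLM call plus tool use in a real system)."""
    return f"turn {len(transcript) + 1}: probe #{rng.randint(0, 999)}"


def run_audit(seed: int, step: Step = toy_step, max_turns: int = MAX_TURNS) -> List[str]:
    """Run one investigation with its own seeded RNG, stopping at the turn budget."""
    rng = random.Random(seed)
    transcript: List[str] = []
    for _ in range(max_turns):
        transcript.append(step(rng, transcript))
    return transcript


def run_all(num_runs: int = NUM_RUNS) -> List[List[str]]:
    """Repeat the investigation once per seed; each run is independent."""
    return [run_audit(seed) for seed in range(num_runs)]
```

Giving each run its own `random.Random(seed)` keeps runs reproducible and independent of one another, which is one plausible way to get varied findings from the same agent.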
Summary
This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, covering interpretability and adjacent areas. Standard methods for assessing the safety and capabilities of frontier LLMs rely on capability and propensity evaluations, which by construction target behaviours that were anticipated in advance. Methods that can reliably surface "unknown unknowns" in model behaviour complement this paradigm and help address that limitation. Model diffing proposes understanding the differences in behaviour or cognition between two models, instead of trying to understand single models in isolation. The model diffing works closest to the team's are Dunlap et al. 2024 and Kempf et al. 2026, which show that black-box approaches can be competitive. Auditing agents are scaffolded LLMs tasked with auditing the behaviour of a distinct LLM, given several affordances via tools (Bricken et al. 2025).
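As a rough illustration of what black-box model diffing means, the sketch below compares two models purely through their outputs on a shared set of prompts. Everything here is an assumption for illustration: the `Model` type, `behavioural_diff`, and the two toy models are hypothetical, and exact string comparison stands in for whatever comparison an actual method would use (for example, an LLM judge or an auditing agent).

```python
from typing import Callable, Dict, List

# Hypothetical black-box interface: a model is just a function from prompt to response.
Model = Callable[[str], str]


def behavioural_diff(model_a: Model, model_b: Model, prompts: List[str]) -> List[Dict[str, str]]:
    """Return the prompts on which the two models respond differently.

    Exact string inequality is a stand-in; a real pipeline would need a
    softer notion of "different behaviour".
    """
    diffs = []
    for prompt in prompts:
        response_a, response_b = model_a(prompt), model_b(prompt)
        if response_a != response_b:
            diffs.append({"prompt": prompt, "a": response_a, "b": response_b})
    return diffs


def toy_base(prompt: str) -> str:
    """Toy 'base' model: echoes the prompt in upper case."""
    return prompt.upper()


def toy_tuned(prompt: str) -> str:
    """Toy 'tuned' model: same as the base, except it declines one kind of prompt."""
    if "refuse" in prompt:
        return "I can't help with that."
    return prompt.upper()
```

Calling `behavioural_diff(toy_base, toy_tuned, ["hello", "please refuse this"])` returns a single entry, for the second prompt, since that is the only input on which the toy models disagree. No access to model internals is needed, which is the sense in which the approach is black-box.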