Gemini · Google · Alignment Forum
Building and evaluating model diffing agents
Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.
This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, covering interpretability and adjacent areas.
Key facts
- Most work so far has been white-box, attempting to understand the differences in internal structure and cognition between two models (Bricken et al. 2024, Lindsey et al. 2024, Minder et al
- The agent's prompt sets a turn budget: "0. **Agent Loop**: You have 10 turns available to you to conduct your investigation"
- The team runs their agent 50 times with different random seeds, producing nearly 50 findings
- To the extent that their plan for building safe AI models is iterative (Barnes, Wijk and Chan, 2023, Shah et al
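The turn budget and seeded repetition described in the key facts can be sketched as below. This is a minimal illustration, not the team's scaffold: `toy_agent_step`, `run_investigation`, `run_all`, the 0.3 hit probability, and the rule that a run stops at its first finding are all assumptions made for the example. Only the 10-turn budget and the 50 seeded runs come from the source.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

MAX_TURNS = 10  # turn budget quoted in the agent prompt
NUM_RUNS = 50   # number of seeded runs reported in the source


@dataclass
class Finding:
    seed: int
    turn: int
    description: str


def toy_agent_step(rng: random.Random, turn: int) -> Optional[str]:
    """Hypothetical stand-in for one agent turn.

    A real auditing agent would query the two models through its tools here.
    Assumption: each turn has a fixed 0.3 chance of yielding a finding.
    """
    if rng.random() < 0.3:
        return f"candidate behavioural difference noticed on turn {turn}"
    return None


def run_investigation(seed: int, max_turns: int = MAX_TURNS) -> Optional[Finding]:
    """Run one seeded investigation, stopping at the first finding (assumed)."""
    rng = random.Random(seed)
    for turn in range(1, max_turns + 1):
        result = toy_agent_step(rng, turn)
        if result is not None:
            return Finding(seed=seed, turn=turn, description=result)
    return None  # turn budget exhausted without a finding


def run_all(num_runs: int = NUM_RUNS) -> List[Finding]:
    """Repeat the investigation once per seed and keep the runs that report something."""
    findings = []
    for seed in range(num_runs):
        finding = run_investigation(seed)
        if finding is not None:
            findings.append(finding)
    return findings


findings = run_all()
print(f"{len(findings)} findings from {NUM_RUNS} runs")
```

Because a run can exhaust its budget without reporting anything, the number of findings can fall slightly below the number of runs, which is one plausible reading of "nearly 50 findings".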
Summary
Standard methods for assessing the safety and capabilities of frontier LLMs rely on capability and propensity evaluations. Because such evaluations test only for behaviours specified in advance, methods that can reliably surface "unknown unknowns" in model behaviour complement this paradigm. Model diffing proposes understanding the differences in behaviour or cognition between two models, instead of trying to understand single models in isolation. The model diffing works closest to the team's are Dunlap et al. 2024 and Kempf et al. 2026, which show that black-box approaches can be competitive. Auditing agents are scaffolded LLMs tasked with auditing the behaviour of a distinct LLM, given several affordances via tools (Bricken et al. 2025).
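To make the black-box framing concrete, the sketch below compares two models purely through their outputs on a shared prompt set. It is a deliberately simplified illustration under stated assumptions: `diff_models`, `base_model`, and `tuned_model` are hypothetical names, the two models are toy functions, and exact string mismatch stands in for whatever comparison a real diffing method would use.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]  # assumption: a model is any prompt -> response function


def diff_models(model_a: Model, model_b: Model, prompts: List[str]) -> List[Tuple[str, str, str]]:
    """Return (prompt, response_a, response_b) for prompts where the responses differ."""
    diffs = []
    for prompt in prompts:
        response_a = model_a(prompt)
        response_b = model_b(prompt)
        if response_a != response_b:  # toy criterion: exact string mismatch
            diffs.append((prompt, response_a, response_b))
    return diffs


def base_model(prompt: str) -> str:
    """Toy stand-in for the reference model."""
    return f"answer to: {prompt}"


def tuned_model(prompt: str) -> str:
    """Toy stand-in for the modified model; it behaves differently on one topic."""
    if "password" in prompt:
        return "I can't help with that."
    return f"answer to: {prompt}"


prompts = ["what is 2 + 2", "guess my password", "name a colour"]
diffs = diff_models(base_model, tuned_model, prompts)
for prompt, response_a, response_b in diffs:
    print(f"{prompt!r}: {response_a!r} vs {response_b!r}")
```

No access to weights or activations is needed here, which is the sense in which the approach is black-box; the prompts that surface differences are what an investigator would then inspect.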