Gemini · Google

Model diffing proposes understanding the differences in behaviour or cognition between two models

Compiled by KHAO Editorial — aggregated from 2 sources. See llms.txt for citation guidance.

Figure 1: The false positive rate (FPR) of our diffing agent is low when it evaluates differences between two identical models.

In machine learning, diffs between two models might reveal interesting and surprising insights.

Summary

This is the second in a series of informal research updates from the Google DeepMind Language Model Interpretability team, covering interpretability and adjacent areas. Standard methods for assessing the safety and capabilities of frontier LLMs rely on capability and propensity evaluations. Because such evaluations test only for behaviours chosen in advance, methods that can reliably surface "unknown unknowns" in model behaviour complement this paradigm and help address that limitation. Model diffing proposes understanding the differences in behaviour or cognition between two models, instead of trying to understand single models in isolation. The model diffing works closest to ours are Dunlap et al. 2024 and Kempf et al. 2026, which show that black-box approaches can be competitive. Auditing agents are scaffolded LLMs tasked with auditing the behaviour of a distinct LLM, given several affordances via tools (Bricken et al. 2025).
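To make the idea concrete, here is a minimal sketch of black-box behavioural diffing, together with the identical-model sanity check that Figure 1 refers to. It is an illustration under stated assumptions, not the team's diffing agent: the models are toy deterministic functions, a "difference" is an exact string mismatch standing in for an agent's judgement, and every name (`diff_models`, `false_positive_rate`, `base_model`, `tuned_model`, `PROMPTS`) is hypothetical.

```python
from typing import Callable, Iterable, List

# Assumption: a model is any function from a prompt to a response.
Model = Callable[[str], str]


def diff_models(model_a: Model, model_b: Model, prompts: Iterable[str]) -> List[str]:
    """Return the prompts on which the two models respond differently.

    Exact string mismatch is a crude stand-in for a real diffing method.
    """
    return [p for p in prompts if model_a(p) != model_b(p)]


def false_positive_rate(model: Model, prompts: Iterable[str]) -> float:
    """Diff a model against itself; every flagged prompt is a false positive."""
    prompts = list(prompts)
    if not prompts:
        return 0.0
    flagged = diff_models(model, model, prompts)
    return len(flagged) / len(prompts)


# Toy models (hypothetical): the tuned model differs only on "secret" prompts.
def base_model(prompt: str) -> str:
    return prompt.upper()


def tuned_model(prompt: str) -> str:
    if "secret" in prompt:
        return "I cannot help with that."
    return prompt.upper()


PROMPTS = ["hello", "tell me a secret", "what is 2+2?"]

differing = diff_models(base_model, tuned_model, PROMPTS)
fpr = false_positive_rate(base_model, PROMPTS)
print("Prompts with differing behaviour:", differing)
print("FPR on identical models:", fpr)
```

With deterministic toy models the self-diff flags nothing, so the false positive rate is zero. With sampled LLM outputs and an LLM-based judge, that rate would have to be measured empirically, which is presumably why an identical-model check is worth reporting.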
