Building and evaluating model diffing agents

Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.

Figure 1: The false positive rate (FPR) of our diffing agent is low when it is asked to evaluate differences between two identical models.

This is the second in a series of informal research updates on interpretability and adjacent areas from the Google DeepMind Language Model Interpretability team.

Summary

Standard methods for assessing the safety and capabilities of frontier LLMs rely on capability and propensity evaluations. Because such evaluations mainly test for behaviours that were anticipated in advance, methods that can reliably surface "unknown unknowns" in model behaviour complement this paradigm and help address that limitation. Model diffing proposes understanding the differences in behaviour or cognition between two models, rather than trying to understand a single model in isolation. The model diffing works closest to ours are Dunlap et al. 2024 and Kempf et al. 2026, which show that black-box approaches can be competitive. Auditing agents are scaffolded LLMs tasked with auditing the behaviour of a distinct LLM, given several affordances via tools (Bricken et al. 2025).
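
The sanity check behind Figure 1 can be illustrated with a small calculation. The summary does not say how the authors computed their FPR, so the following is only a minimal sketch under an assumed definition: the fraction of identical-model trials in which the agent nonetheless reports a difference. The function name and the trial outcomes are hypothetical and are not results from the article.

```python
def false_positive_rate(verdicts):
    """Fraction of identical-model trials in which a difference was reported.

    `verdicts` holds one boolean per trial: True if the (hypothetical)
    diffing agent claimed the two models differ. Since both models are
    identical in every trial, each True counts as a false positive.
    """
    if not verdicts:
        raise ValueError("need at least one trial")
    return sum(1 for v in verdicts if v) / len(verdicts)


# Made-up outcomes for ten identical-pair trials (illustration only).
trials = [False, False, True, False, False, False, False, False, False, False]
fpr = false_positive_rate(trials)
print(f"FPR = {fpr:.2f}")  # prints "FPR = 0.10"
```

A low value on this kind of check suggests that differences the agent reports between genuinely different models are less likely to be spurious.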

Read full article at Alignment Forum →
