OpenAI · Alignment Forum

Even for non-behavioural proxy measures (e.g. probes or CoT verbalisation)

Thu, May 14 · 5:05 PM UTC 2 min read

Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.


However, alignment evaluations are only reassuring to the extent that the model can't reliably [1] distinguish the deployment distribution from the evaluation distribution, as it is otherwise difficult to rule out the possibility of alignment faking.

Key facts

The safe-to-dangerous shift is a fundamental problem for creating realistic evaluations, and a core AI lethality. [3]
However, the core difference between an alignment evaluation and a real deployment is that an alignment evaluation must be safe. [2]
Thanks to Marie Buhl, Geoffrey Irving and Caitlin van Bommel for helpful feedback on earlier drafts

Summary

Suppose a developer has a capable and potentially scheming model and, before deploying it, wants some evidence that it won’t do anything catastrophically dangerous once deployed. There are many approaches one could use to try to make evaluations appear more realistic: you can try to create realistic environments (e.g. Petri, WebArena, OSWorld); use data from past deployments (e.g. OpenAI, SAD); and spoof tool-call responses (e.g. ToolEmu). However, the core difference between an alignment evaluation and a real deployment is that an alignment evaluation must be safe. This safe-to-dangerous shift is a fundamental problem for creating realistic evaluations, and a core AI lethality. Somehow, you need to make sure a sufficiently intelligent adversary couldn’t work out whether it is in evaluation or deployment based on whether the setting is safe or dangerous.
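One way to make "can't reliably distinguish evaluation from deployment" concrete is to score how well some discriminator separates the two distributions. The sketch below is an illustration only, not a method from the source: it assumes you already have a hypothetical "this looks like an evaluation" score for each transcript (for instance from a probe or from asking the model directly), and it computes the AUROC of those scores. The score values are made up. An AUROC near 0.5 would suggest the discriminator is at chance; a value near 1.0 would suggest the two settings are easy to tell apart.

```python
def auroc(eval_scores, deploy_scores):
    """Probability that a randomly chosen evaluation transcript receives a
    higher 'this is an evaluation' score than a randomly chosen deployment
    transcript. Ties count as half a win."""
    if not eval_scores or not deploy_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for e in eval_scores:
        for d in deploy_scores:
            if e > d:
                wins += 1.0
            elif e == d:
                wins += 0.5
    return wins / (len(eval_scores) * len(deploy_scores))


# Hypothetical scores, invented purely for illustration.
eval_scores = [0.9, 0.8, 0.7, 0.6]
deploy_scores = [0.4, 0.3, 0.65, 0.2]

print(auroc(eval_scores, deploy_scores))
```

A low AUROC for one particular discriminator does not show that a capable model could not tell the settings apart; it only bounds what that discriminator picked up on.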

