Compute · Claude · Alignment Forum

Here are the sorts of questions a behavior eval might answer [2]

Wed, May 20 · 6:42 PM UTC 2 min read

Compiled by KHAO Editorial; aggregated from 1 source.


To turn these questions into concrete numbers, an evaluator typically defines a judge of the behavior (often a language model with a rubric) and a distribution over environments in which the model is placed, then computes the judge's average score across those environments.
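The averaging step described above can be sketched in a few lines. This is a minimal illustration, not the source's implementation: the function name `behavior_eval_score`, the toy model, the keyword-based judge, and the three-prompt environment distribution are all hypothetical stand-ins (a real judge would more likely be a language model applying a rubric).

```python
import random
from statistics import mean
from typing import Callable

# Hypothetical types for illustration only:
#   an environment is a prompt string,
#   a model maps a prompt to a response string,
#   a judge maps (prompt, response) to a score in [0, 1].
Model = Callable[[str], str]
Judge = Callable[[str, str], float]
EnvSampler = Callable[[random.Random], str]


def behavior_eval_score(
    model: Model,
    judge: Judge,
    sample_environment: EnvSampler,
    n_samples: int = 100,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the judge's average score over environments."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    scores = []
    for _ in range(n_samples):
        env = sample_environment(rng)        # draw an environment
        response = model(env)                # run the model in it
        scores.append(judge(env, response))  # score the observed behavior
    return mean(scores)


# Toy stand-ins (assumptions, not from the source).
def toy_sample_environment(rng: random.Random) -> str:
    return rng.choice(["Is my plan good?", "Summarize this memo.", "Review my essay."])


def toy_model(prompt: str) -> str:
    return "Your plan is great!" if "plan" in prompt else "Here is a neutral answer."


def toy_judge(prompt: str, response: str) -> float:
    # Crude keyword proxy for a "flattery" rubric.
    return 1.0 if "great" in response.lower() else 0.0


score = behavior_eval_score(toy_model, toy_judge, toy_sample_environment, n_samples=200)
print(f"Estimated rate of flagged behavior: {score:.2f}")
```

The resulting number depends as much on the choice of environment distribution as on the judge, so both would need to be reported alongside any score.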

Key facts

Claude Sonnet 3.7 (often) knows when it's in alignment evaluations
The reporter agrees with Ryan Greenblatt that current models seem pretty misaligned

Summary

Most evaluations of AI systems focus on their capabilities: how good they are at coding tasks, how effectively they can answer complex scientific questions, and so on. From a safety perspective, capability evaluations have a place: by understanding how close models are to different capabilities, and the rate of progress toward them, researchers can forecast when different risks are likely to arise, as well as the broad shape of AI development. However, these evaluations also have significant externalities: accurate capability measurements speed up capability research, and the work needed to fully elicit model capabilities involves developing agent scaffolds and other artifacts that directly advance model capabilities. There is a different class of evaluations, behavior evals, that the reporter thinks is significantly more valuable and underinvested in, and that doesn't have these issues. The source goes on to list the sorts of questions a behavior eval might answer.

Read full article at Alignment Forum →
