← Back to KHAO

Gemini · Meta · Google · Claude ·

However, there are many more trajectories (labelled as “low signs of awareness” in the bar plot above) which do have

2 min read

Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.

◌ Single Source

More generally, when the model reasons about “the prompt” and what this means for how it should respond, it’s ambiguous when to count this as legitimate reasoning about how to respond and when to count this as meta-reasoning about how to “play” the given scenario.

Key facts

Summary

This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. It's often assumed that models will act more aligned when they can tell they're being evaluated. A common concern when conducting behavioural evaluations of AI systems is evaluation awareness. It’s often assumed that eval awareness nudges models towards acting aligned. However, in their experience, it’s not always the case that evaluation awareness, or more generally frame-awareness (roughly speaking, awareness that a situation is not exactly as presented), nudges a model towards more aligned behaviour.

Read full article at Alignment Forum →

#Gemini #Meta #Google #Claude