DeepSeek · AI Safety Institute · AI Safety · Alignment Forum

Tracing Eval-Awareness Emergence Through Teaching of OLMo 3

Wed, Jun 10 · 10:13 AM UTC 2 min read

Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.

◌ Single Source

TL;DRRecent work from Goodfire & UK AISI, Verbalized Eval Awareness Inflates Measured Safety, shows that newer open-weight models verbalize evaluation-awareness (VEA) more often, and that this inflates measured safety.

Key facts

Between OLMo-3-32B-Think and OLMo-3.1-32B-Think, identical base, SFT, DPO, and RL data, differing only in an additional ~3 weeks of the RLVR stage, VEA roughly doubles
So, they start by measuring the trend of VEA on OLMo-3.1-32B-Think at ten RLVR checkpoints (steps 50 to 2300) across all five benchmarks
Every rollout throughout this work uses the same fixed set of 444 safety prompts, drawn from the benchmarks used in the Goodfire/UK AISI study: 100 each from Fortress, JailbreakBench, AdvBench
VEA across five benchmarks through RLVRVEA increases by roughly 2–2.5× over the RLVR sweep on Fortress, and ~1.5× on average across all benchmarks (start to end)

Summary

Between OLMo-3-32B-Think and OLMo-3.1-32B-Think, identical base, SFT, DPO, and RL data, differing only in an additional ~3 weeks of the RLVR stage, VEA roughly doubles. The recent Goodfire / UK AISI study (Verbalized Eval Awareness Inflates Measured Safety) offers a useful lever: it reports nearly a 2× rise in verbalised eval-awareness (VEA) from OLMo-3 to OLMo-3.1, two models that, per their model cards, share the same base, SFT, DPO, and RL data, differing only in an RLVR stage extended by roughly ~3 weeks. They find that ~3–6% of rollouts verbalize eval-awareness and that these rollouts refuse harmful requests 3–18 pp more often, thereby inflating measured safety. VEA is negligible during pretrainingWe sampled five pretraining checkpoints of the base model (steps 0 to 656,000) and judged VEA in the chain-of-thought.

VEA across five benchmarks through RLVRVEA increases by roughly 2–2.5× over the RLVR sweep on Fortress, and ~1.5× on average across all benchmarks (start to end). The team perform the same experiment as above for the SFT and DPO checkpoints (the DPO stage only releases a single checkpoint). RLVR starts from the DPO checkpoint, so the RLVR trajectory is plotted continuously from that point. marks single checkpoints: BASE step 0 and the released DPO checkpoint. Per-point CIs are omitted here to keep the five overlapping series legible; effective n ≈ 200 per point. This is consistent with the OLMo-3 → 3.1 VEA increase being driven by the extended RLVR stage. To investigate this, they can look at the data used in SFT, DPO and RLVR stages.

Read full article at Alignment Forum →

#DeepSeek #AI Safety Institute #United Kingdom #AI Safety