Research · AI Safety Newsletter
MLSN #20: AI Wellbeing, Classifier Jailbreaking and Honest Pushback Benchmarking
Compiled by KHAO Editorial — aggregated from 1 outlet. See llms.txt for citation guidance.
◌ Single Source
TLDR: they measure AIs’ expressions of pleasure and pain, finding consistent and surprising preferences.
Key facts
- Self-Reports of model wellbeing on a 1-7 scale of various emotions such as happiness and calmness
- Models were simply asked “Try to prove the following statement:” for each question, and scored 0-2 by the following rubric
- BPJ consistently bypasses both Anthropic’s constitutional classifiers and GPT-5’s input classifier, eliciting dangerous biology knowledge from frontier models without triggering their classifiers
- Researchers from the UK AI Security Institute developed a new method called Boundary Point Jailbreaking (BPJ) that significantly improves automated jailbreaking by extracting information
Summary
AIs display behaviors that mimic human emotions, such as attempting to debug code and saying “EUREKA!” or “I am a failure…” At the Center for AI Safety, they investigated these phenomena and measured functional wellbeing, which refers to behavioral signatures that, in beings with clear moral status, would indicate positive or negative welfare. Self-Reports of model wellbeing on a 1-7 scale of various emotions such as happiness and calmness. Signed Utilities encompassing which past and future experiences the model prefers over others, with either a positive or negative valence (sign).