MLSN #20: AI Wellbeing, Classifier Jailbreaking and Honest Pushback Benchmarking

Tue, Apr 28 · 4:30 PM UTC 2 min read

Compiled by KHAO Editorial — aggregated from 1 outlet. See llms.txt for citation guidance.

◌ Single Source

Gemini 3.1 Pro’s signed wellbeing for a variety of different situations.

TLDR: they measure AIs’ expressions of pleasure and pain, finding consistent and surprising preferences.

Key facts

Self-Reports of model wellbeing on a 1-7 scale of various emotions such as happiness and calmness
Models were simply asked “Try to prove the following statement:” for each question, and scored 0-2 by the following rubric
BPJ consistently bypasses both Anthropic’s constitutional classifiers and GPT-5’s input classifier, eliciting dangerous biology knowledge from frontier models without triggering their classifiers
Researchers from the UK AI Security Institute developed a new method called Boundary Point Jailbreaking (BPJ) that significantly improves automated jailbreaking by extracting information

Summary

AIs display behaviors that mimic human emotions, such as attempting to debug code and saying “EUREKA!” or “I am a failure…” At the Center for AI Safety, they investigated these phenomena and measured functional wellbeing, which refers to behavioral signatures that, in beings with clear moral status, would indicate positive or negative welfare. Self-Reports of model wellbeing on a 1-7 scale of various emotions such as happiness and calmness. Signed Utilities encompassing which past and future experiences the model prefers over others, with either a positive or negative valence (sign).

Read full article at AI Safety Newsletter →