Gemini · Google · AI Safety

SFT Drives Gemini’s Safety Properties

Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, covering interpretability and adjacent areas.

Summary

In this short post, the team describes a surprising finding: most safety-relevant properties in Gemini seem to be caused by the combination of pretraining and supervised fine-tuning (SFT), rather than by other training stages such as reinforcement learning (RL). The team does not want to overstate this claim as applying to other model families, and also notes that it may change in future Gemini versions. The team performs SFT using the Gemini mixture on the pretraining-only versions of Gemini 3.1 Pro and Gemini 3 Flash. The main result is that the SFT-only models (blue bars in the original post's charts) and the production models (orange bars) score remarkably similarly across evals.

Read full article at Alignment Forum →
