Gemini · Google · AI Safety · Alignment Forum
SFT Drives Gemini’s Safety Properties
Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.
This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, covering interpretability and adjacent areas.
Key facts
- The team perform SFT using the Gemini mixture on the pretraining-only versions of Gemini 3.1 Pro and Gemini 3 Flash
- The main result is that the blue bars (SFT-only models) and orange bars (production models) are remarkably similar across evals
- In this short post, the team describe a surprising finding: most safety-relevant properties in Gemini seem to be caused by the combination of pretraining and SFT, not other training stages like RL
Summary
In this short post, the team describe a surprising finding: most safety-relevant properties in Gemini seem to be caused by the combination of pretraining and supervised fine-tuning (SFT), not by other training stages such as reinforcement learning (RL). The team do not want to overstate this claim as applying to other model families, and they note that it may change in future Gemini versions. To test it, they perform SFT using the Gemini mixture on the pretraining-only versions of Gemini 3.1 Pro and Gemini 3 Flash. The main result is that the blue bars (SFT-only models) and orange bars (production models) are remarkably similar across evals.