Gemini · Google · AI Safety · Alignment Forum

SFT Drives Gemini’s Safety Properties

Sat, Jun 13 · 3:31 PM UTC · 2 min read

Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, covering interpretability and adjacent areas.

Key facts

The team perform supervised fine-tuning (SFT) using the Gemini mixture on the pretraining-only versions of Gemini 3.1 Pro and Gemini 3 Flash
The main result is that the SFT-only models (blue bars in the original post's figure) and the production models (orange bars) score remarkably similarly across evals
The post describes a surprising finding: most safety-relevant properties in Gemini seem to be caused by the combination of pretraining and SFT, not by other training stages like reinforcement learning (RL)

Summary

In this short post, the team describe a surprising finding: most safety-relevant properties in Gemini seem to be caused by the combination of pretraining and supervised fine-tuning (SFT), not by other training stages like reinforcement learning (RL). The team do not want to overstate this claim as applying to other model families, and they note that it may change in future Gemini versions. To test it, they perform SFT using the Gemini mixture on the pretraining-only versions of Gemini 3.1 Pro and Gemini 3 Flash. The main result is that the SFT-only models (blue bars in the original post's figure) and the production models (orange bars) score remarkably similarly across evals.

Read full article at Alignment Forum →