Nvidia · Google · Gemini · Ars Technica
Google DeepMind publishes DiffusionGemma, a model that runs local AI 4x faster
Compiled by KHAO Editorial, aggregated from 2 sources.
Another day, another AI model from Google.
Key facts
- In testing with an RTX 5090, DiffusionGemma spits out around 700 tokens per second
- It’s a Mixture of Experts (MoE) model with a total of 26 billion parameters, but only 3.8 billion are activated during inference
- With a single Nvidia H100 AI accelerator, DiffusionGemma can produce 1,000+ tokens per second
- Text diffusion shifts the bottleneck from memory bandwidth to compute, generating up to 256 tokens in parallel
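To put the figures above in perspective, here is a minimal back-of-the-envelope sketch in Python. The parameter counts and throughput numbers come from the key facts. The 2,000-token response length is an illustrative assumption, and the helper names are hypothetical.

```python
# Figures quoted in the key facts above.
TOTAL_PARAMS = 26e9      # total MoE parameters
ACTIVE_PARAMS = 3.8e9    # parameters activated during inference
RTX_5090_TPS = 700       # approximate tokens per second on an RTX 5090
H100_TPS = 1000          # lower bound ("1,000+") on a single H100


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used during inference."""
    return active / total


def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to emit `tokens` at a steady throughput."""
    return tokens / tokens_per_second


if __name__ == "__main__":
    # The response length is an assumed example, not a figure from the article.
    response_tokens = 2000
    print(f"active share: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")
    print(f"RTX 5090: about {seconds_for(response_tokens, RTX_5090_TPS):.2f} s")
    print(f"H100: at most {seconds_for(response_tokens, H100_TPS):.2f} s")
```

Under these assumptions, roughly 15 percent of the parameters are active for inference, and a 2,000-token response takes just under 3 seconds on the RTX 5090.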
Summary
This time, Google DeepMind has released a new member of the Gemma 4 open model family, but it’s fundamentally different from the rest of the lineup. Most AI models are autoregressive: they generate text left to right, one token at a time. DiffusionGemma instead uses text diffusion, generating up to 256 tokens in parallel, which shifts the bottleneck from memory bandwidth to compute. The model is fairly large by the standards of Google’s open models: it’s a Mixture of Experts (MoE) design with 26 billion total parameters, of which only 3.8 billion are activated during inference. In testing with an RTX 5090, DiffusionGemma produces around 700 tokens per second.

If diffusion is so much faster, why isn’t Google using it in its big cloud-based Gemini models? Google has experimented with this, but text diffusion has a few drawbacks, including a higher error rate.
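The contrast between one-token-at-a-time generation and parallel generation can be sketched as a simple count of sequential model passes. This is a simplified counting model, not a description of how DiffusionGemma is implemented. The block size of 256 comes from the article, while `refine_steps` (how many times each block is refined before it is final) is a hypothetical parameter, since the article gives no figure for it.

```python
import math

BLOCK_SIZE = 256  # "up to 256 tokens in parallel", per the article


def autoregressive_passes(n_tokens: int) -> int:
    """One sequential forward pass per generated token."""
    return n_tokens


def diffusion_passes(n_tokens: int, refine_steps: int,
                     block_size: int = BLOCK_SIZE) -> int:
    """Sequential passes when each block is refined `refine_steps` times.

    `refine_steps` is a hypothetical knob; the article does not state it.
    """
    blocks = math.ceil(n_tokens / block_size)
    return blocks * refine_steps


if __name__ == "__main__":
    n = 1024
    for steps in (8, 32):  # illustrative values only
        ar = autoregressive_passes(n)
        diff = diffusion_passes(n, steps)
        print(f"{n} tokens, {steps} refinement steps per block: "
              f"{ar} vs {diff} sequential passes")
```

In this toy model, each pass in the parallel case works on many tokens at once, so the model's weights are read from memory far fewer times per generated token while each pass does more arithmetic. That is one way to read the article's point that the bottleneck moves from memory bandwidth to compute.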