Google DeepMind publishes DiffusionGemma, a model that runs local AI 4x faster

Wed, Jun 10 · 7:29 PM UTC

Compiled by KHAO Editorial, aggregated from 2 sources.


Another day, another AI model from Google.

Key facts

In testing with an RTX 5090, DiffusionGemma spits out around 700 tokens per second
DiffusionGemma is a Mixture of Experts (MoE) model with 26 billion total parameters, of which only 3.8 billion are activated during inference
With a single Nvidia H100 AI accelerator, DiffusionGemma can produce 1,000+ tokens per second
Text diffusion generates up to 256 tokens in parallel, shifting the bottleneck from memory bandwidth to compute

Summary

This time, Google DeepMind has released a new member of the Gemma 4 open model family, but it’s fundamentally different from the rest of the lineup. Most AI models are autoregressive: they generate text left to right, one token at a time. DiffusionGemma instead uses text diffusion, generating up to 256 tokens in parallel, which shifts the bottleneck from memory bandwidth to compute. It is also fairly large in the realm of Google’s open models: a Mixture of Experts (MoE) design with 26 billion total parameters, of which only 3.8 billion are activated during inference. In testing with an RTX 5090, DiffusionGemma spits out around 700 tokens per second, and a single Nvidia H100 pushes that past 1,000 tokens per second.

If diffusion is so much faster, why isn’t Google using it in big cloud-based Gemini models? Google has experimented with this, but text diffusion has a few drawbacks, including a higher error rate.
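
For readers who want to see why parallel generation moves the bottleneck, here is a rough back-of-the-envelope sketch in Python. The parameter counts and the 256-token figure come from the article. The function names and the `refine_steps` knob are hypothetical illustrations, not published DiffusionGemma settings, and the sketch counts forward passes only, not wall-clock speed.

```python
import math

# Figures quoted in the article.
TOTAL_PARAMS_B = 26.0    # total parameters, in billions
ACTIVE_PARAMS_B = 3.8    # parameters activated during inference, in billions
PARALLEL_TOKENS = 256    # maximum tokens generated in parallel


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the model's weights used in a single forward pass."""
    return active_b / total_b


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive decoder runs one forward pass per generated token."""
    return num_tokens


def diffusion_passes(num_tokens: int, block_size: int, refine_steps: int) -> int:
    """Hypothetical block-parallel decoder.

    Each block of `block_size` tokens is refined `refine_steps` times.
    `refine_steps` is an assumed value for illustration only.
    """
    blocks = math.ceil(num_tokens / block_size)
    return blocks * refine_steps


if __name__ == "__main__":
    n = 1024
    share = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"active share of weights: {share:.1%}")
    print("autoregressive passes:", autoregressive_passes(n))
    print("diffusion passes (assuming 16 refinement steps):",
          diffusion_passes(n, PARALLEL_TOKENS, 16))
```

Every forward pass has to read the active weights from memory, so fewer passes means less time spent waiting on memory bandwidth. Each pass does more arithmetic, however, because it handles many tokens at once, which is why the limit shifts toward compute. The pass counts here do not translate directly into the throughput figures quoted above.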
