Nvidia · Google · Apple · AI Agent · Decrypt

Google's DiffusionGemma AI Hits 1,000 Tokens Per Second—And It's Free

Wed, Jun 10 · 10:01 PM UTC · 2 min read

Compiled by KHAO Editorial — aggregated from 1 source. See llms.txt for citation guidance.


Google released DiffusionGemma today, an open-weight AI model that generates text the way image generators create pictures: start with noise, then refine until it makes sense.

Key facts

Aimed at developers with NVIDIA RTX 4090 or 5090 hardware building real-time tools: inline editors, autocomplete, code infilling, structured generation
Inception Labs shipped Mercury 2 in February 2026 as the first commercial diffusion reasoning model, claiming speeds five times faster than speed-optimized competitors
DFlash is a framework published in early 2026 that uses a small diffusion model as the drafter, enabling over 6x speedup on some tasks
Google launched Gemma 4 under Apache 2.0 in April, and DiffusionGemma continues that strategy

Summary

Every LLM you've used is a typewriter, producing text one token at a time. Google's DiffusionGemma, a free open-weight model, instead generates entire 256-token blocks simultaneously via text diffusion, exceeding 1,000 tokens per second on an NVIDIA H100, four times faster than standard autoregressive models. The custom drafter module that DiffusionGemma needs for local inference does not yet exist in any public runtime, including mlx-lm and LM Studio, making the model effectively unrunnable on most consumer setups today. On NVIDIA NIM, the model arrived preconfigured with 8,192 tokens of context, below the 64,000-token floor that agentic frameworks like Hermes Agent require, so autonomous workflows won't run without manual reconfiguration.
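The block-at-a-time idea in the summary can be made concrete with a rough count of model passes. The sketch below is illustrative only: the 256-token block size comes from the summary, while `refine_steps` is a hypothetical number of denoising passes per block, not a published DiffusionGemma figure. A diffusion pass and an autoregressive pass also do not cost the same, so this is not a speed prediction.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive model emits one token per forward pass."""
    return num_tokens


def block_diffusion_passes(
    num_tokens: int,
    block_size: int = 256,   # block size quoted in the summary
    refine_steps: int = 32,  # hypothetical: denoising passes per block
) -> int:
    """A block-diffusion model refines every token in a block in parallel,
    so the pass count depends on the number of blocks, not tokens."""
    blocks = math.ceil(num_tokens / block_size)
    return blocks * refine_steps


if __name__ == "__main__":
    n = 1024
    print("autoregressive passes:", autoregressive_passes(n))
    print("block-diffusion passes:", block_diffusion_passes(n))
```

Under these assumptions the pass count grows with the number of blocks rather than the number of tokens, which is where the throughput gain would come from.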

Read full article at Decrypt →
