Business · Tom's Hardware
Inside Google's TPU V8 strategy, delivering two chips for two crucial tasks at incredible scale — network scales up to 1 million TPUs per cluster, an advantage over Nvidia AI accelerators
Compiled by KHAO Editorial, aggregated from 1 outlet.
Google announced its eighth-generation Tensor Processing Units at Cloud Next on April 22, shipping two distinct chip designs for the first time in the TPU program's decade-long history.
Key facts
- In comparison, Nvidia's Vera Rubin R200 is rated at 35 FP4 PFLOPs for training with 288GB of HBM4 at 22 TB/s, and AMD's MI455X reaches 40 FP4 PFLOPs with 432GB of HBM4
- TrendForce reported back in December that MediaTek initially booked 20,000 TSMC CoWoS wafers for the program, with allocation potentially scaling to 150,000 by 2027
- TPU 8t carries 12.5% more memory capacity than the previous-generation Ironwood TPU but delivers 11.5% less bandwidth, running slower memory to improve yield and reduce per-chip cost, according to analysis
- In a 1,024-chip 3D Torus configuration, the worst-case packet path traverses 16 hops
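The 16-hop figure can be sanity-checked: in a torus, the worst-case shortest path travels half the ring length in each dimension. A minimal sketch, assuming a hypothetical 16×8×8 arrangement of the 1,024 chips (the article does not state the actual dimensions):

```python
def torus_worst_case_hops(dims):
    """Worst-case shortest-path hop count in a torus.

    In each dimension the farthest node is half a ring away,
    so the worst case is the sum of floor(d / 2) over all dims.
    """
    return sum(d // 2 for d in dims)

# Hypothetical 16 x 8 x 8 layout: 16 * 8 * 8 = 1,024 chips
print(torus_worst_case_hops((16, 8, 8)))  # -> 16 (8 + 4 + 4 hops)
```

Other factorizations give different diameters: a more cube-like 8×8×16 is the same 16 hops, while a flatter 32×8×4 would be 22, so the 16-hop figure is consistent with a near-cubic layout.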
Summary
The split extends to the supply chain: MediaTek joined Broadcom as a silicon design partner for the eighth-gen program in December, ending the exclusive role Broadcom had held in TPU development since 2015. On raw specs, TPU 8 doesn't close the gap with Nvidia or AMD. The choice of HBM3E over HBM4 appears to be a deliberate cost-and-yield trade-off. A TPU 8t superpod packs 9,600 chips into a single cluster with two petabytes of shared HBM, connected by a proprietary inter-chip interconnect running at double the previous generation's bandwidth.
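The pod-level memory figure lines up with the per-chip claim. Assuming Ironwood's 192 GB of HBM per chip (a figure from Google's Ironwood announcement, not stated in this article), a 12.5% uplift implies roughly 216 GB per TPU 8t, and 9,600 such chips give about 2.07 PB, consistent with the "two petabytes" quoted above. A quick arithmetic cross-check:

```python
# Cross-check the superpod memory figure.
# Assumption: Ironwood (TPU v7) carries 192 GB of HBM per chip.
ironwood_hbm_gb = 192
tpu8t_hbm_gb = ironwood_hbm_gb * 1.125    # 12.5% more capacity -> 216 GB
pod_hbm_pb = tpu8t_hbm_gb * 9_600 / 1e6   # 9,600 chips, GB -> PB (decimal)
print(tpu8t_hbm_gb, round(pod_hbm_pb, 2))  # 216.0 2.07
```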