Business · Tom's Hardware
Inside Google's TPU V8 strategy, delivering two chips for two crucial tasks at incredible scale — network scales up to 1 million TPUs per cluster, an advantage over Nvidia AI accelerators
Compiled by KHAO Editorial, aggregated from 1 outlet.
Google announced its eighth-generation Tensor Processing Units at Cloud Next on April 22, shipping two distinct chip designs for the first time in the TPU program's decade-long history.
Key facts
- In comparison, Nvidia's Vera Rubin R200 is rated at 35 FP4 PFLOPs for training with 288GB of HBM4 at 22 TB/s, and AMD's MI455X reaches 40 FP4 PFLOPs with 432GB of HBM4
- TrendForce reported back in December that MediaTek initially booked 20,000 TSMC CoWoS wafers for the program, with allocation potentially scaling to 150,000 by 2027
- TPU 8t carries 12.5% more memory capacity than the previous-generation Ironwood TPU but delivers 11.5% less bandwidth, running slower memory to improve yield and reduce per-chip cost, according to analysis
- In a 1,024-chip 3D Torus configuration, the worst-case packet path traverses 16 hops
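The 16-hop figure can be sanity-checked: in a torus, the worst-case shortest path travels half the ring length in each dimension. A minimal sketch, assuming a hypothetical 16×8×8 arrangement of the 1,024 chips (the article does not state the actual dimensions):

```python
def torus_worst_case_hops(dims):
    """Worst-case shortest-path hop count in a torus.

    In each dimension the farthest node is half a ring away,
    so the worst case is the sum of floor(d / 2) over all dims.
    """
    return sum(d // 2 for d in dims)

# Hypothetical 16 x 8 x 8 layout: 16 * 8 * 8 = 1,024 chips
print(torus_worst_case_hops((16, 8, 8)))  # -> 16 (8 + 4 + 4 hops)
```

Other factorizations give different diameters: a more cube-like 8×8×16 is the same 16 hops, while a flatter 32×8×4 would be 22, so the 16-hop figure is consistent with a near-cubic layout.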
Summary
The split extends to the supply chain: MediaTek joined Broadcom as a silicon design partner for the eighth-gen program in December, ending the exclusive role Broadcom had held in TPU development since 2015. On raw specs, TPU 8 doesn't close the gap with Nvidia or AMD. The choice of HBM3E over HBM4 appears to be a deliberate cost-and-yield trade-off. A TPU 8t superpod packs 9,600 chips into a single cluster with two petabytes of shared HBM, connected by a proprietary inter-chip interconnect running at double the previous generation's bandwidth.
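The pod-level memory figure lines up with the per-chip claim. Assuming Ironwood's 192 GB of HBM per chip (a figure from Google's Ironwood announcement, not stated in this article), a 12.5% uplift implies roughly 216 GB per TPU 8t, and 9,600 such chips give about 2.07 PB, consistent with the "two petabytes" quoted above. A quick arithmetic cross-check:

```python
# Cross-check the superpod memory figure.
# Assumption: Ironwood (TPU v7) carries 192 GB of HBM per chip.
ironwood_hbm_gb = 192
tpu8t_hbm_gb = ironwood_hbm_gb * 1.125    # 12.5% more capacity -> 216 GB
pod_hbm_pb = tpu8t_hbm_gb * 9_600 / 1e6   # 9,600 chips, GB -> PB (decimal)
print(tpu8t_hbm_gb, round(pod_hbm_pb, 2))  # 216.0 2.07
```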