Inflection · The Register
Inference is giving AI chip outfits a second chance to make their mark
Compiled by KHAO Editorial — aggregated from 1 outlet. See llms.txt for citation guidance.
AI adoption is reaching an inflection point as the focus shifts from training new models to serving them.
Key facts
- Lumai expects its next-gen Iris Tetra systems to achieve one exaOPS of AI performance within a 10 kW power budget by 2029
- Nvidia acquihired Groq for $20 billion in December
- The architecture is still in its infancy, capable of running billion-parameter models like Llama 3.1 8B or 70B today, but it's far enough along that the UK-based startup has opened its chips up
- The startup's SRAM-heavy chip architecture meant that, with enough of them, Groq's LPUs could churn out tokens faster than any GPU
Summary
Compared to training, inference is a far more diverse workload, which gives chip startups an opening to carve out a niche for themselves. As a result, inference has become increasingly heterogeneous: some stages of the pipeline suit GPUs, while others favor more specialized hardware. Nvidia's $20 billion acquihire of Groq in December is a prime example. Nvidia sidestepped the mismatch by moving the compute-heavy prefill stage of the inference pipeline to its GPUs while keeping the bandwidth-constrained decode operations on its newly acquired LPUs.
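The prefill/decode split described above can be sketched in a few lines. This is a minimal illustrative model, not Nvidia's or Groq's actual serving stack: all function and field names here are hypothetical, and the "devices" are stand-ins for the GPU and LPU pools.

```python
# Hypothetical sketch of disaggregated inference serving: the compute-bound
# prefill stage runs on one device pool (GPUs), the bandwidth-bound decode
# stage on another (LPUs). All names are illustrative, not a real API.

from dataclasses import dataclass, field


@dataclass
class Request:
    prompt_tokens: list                          # prompt to prefill (compute-heavy)
    max_new_tokens: int                          # tokens to decode (bandwidth-heavy)
    kv_cache: dict = field(default_factory=dict)   # handed off between stages
    output: list = field(default_factory=list)     # generated tokens


def prefill_on_gpu(req: Request) -> Request:
    """Process the whole prompt at once; this stage is dominated by large
    matrix multiplies, so throughput-oriented GPUs fit well. The product
    is a KV cache that gets shipped to the decode pool."""
    req.kv_cache = {"cached_len": len(req.prompt_tokens)}
    return req


def decode_on_lpu(req: Request) -> Request:
    """Generate tokens one step at a time; each step re-reads the model
    weights and KV cache, so memory bandwidth, not FLOPS, is the limit."""
    for i in range(req.max_new_tokens):
        req.output.append(f"tok{i}")             # placeholder for a sampled token
    return req


def serve(req: Request) -> list:
    """Route a request through the disaggregated pipeline."""
    return decode_on_lpu(prefill_on_gpu(req)).output
```

The point of the split is that the two stages have opposite bottlenecks, so a serving system can size and schedule each device pool independently rather than forcing one chip to do both jobs badly.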