Inference is giving AI chip outfits a second chance to make their mark

Sun, May 3 · 1:05 PM UTC 2 min read

Compiled by KHAO Editorial — aggregated from 1 outlet. See llms.txt for citation guidance.

◌ Single Source

AI adoption is reaching an inflection point as the focus shifts from training new models to serving them.

Key facts

Lumai expects its next-gen Iris Tetra systems will achieve an exaOPS of AI performance in a 10kW power budget by 2029
Nvidia's $20 billion acquihire of Groq back in December is a prime example
The architecture is still in its infancy, capable of running billion parameter models like Llama 3.1 8B or 70B today, but it's far enough along that the UK-based startup has opened its chips up
The startup's SRAM-heavy chip architecture meant that, with enough of them, Groq's LPUs could churn out tokens faster than any GPU

Summary

Compared to training, inference is a much more diverse workload, which presents an opportunity for chip startups to carve out a niche for themselves. Because of this, inference has become increasingly heterogeneous, certain aspects of which may be better suited to GPUs and other more specialized hardware. Nvidia's $20 billion acquihire of Groq back in December is a prime example. Nvidia side stepped this problem by moving the compute heavy prefill bit of the inference pipeline to its GPUs while it kept the bandwidth-constrained decode operations on its shiny new LPUs.

Read full article at The Register →

#AI Inference #Nvidia #Google #Amazon #Llama #Intel #ChatGPT #United Kingdom