Happy Sunday and welcome to the Investing in AI newsletter! Our AI Innovator’s Podcast is up and running again, so please go check it out. The most recent episode features the founders of Onton, who talk about how they developed a new self-learning algorithm for their e-commerce search tool.
Today I want to talk about AI chip company Cerebras. The company just filed its S-1 to go public, and while the offering has apparently been delayed because of CFIUS concerns, it’s an interesting company and technology, and a good primer for all the other AI hardware coming in the next few years.
First of all, one of the things you have to understand about the chip market is that chips take years to design and build, so you have to make educated guesses about where the workloads of the future are going. For an AI chip, that means thinking about training vs. inference, which are different workloads. It means thinking about the algorithms that may be used (transformer architectures are currently dominant), and about how large a model may be.
The second challenge in building hardware is that you have to make a whole bunch of tradeoffs on issues like these:
How large is the chip, physically? Some use cases require smaller footprints, like edge AI.
How much power does it consume? Most products have a power budget.
How large of a model can you run on it?
How does it perform for various workloads?
How does it scale when a model is too big for a single chip and chips have to be networked together? (Interconnect overhead can cause dramatic slowdowns, even for a chip that dominates single-chip performance.)
How difficult is it to manufacture?
What does the software toolchain look like for getting a model onto the chip and running it?
The third thing to understand is that spec-sheet measurements like transistor count, on-chip memory, and number of processing cores may or may not translate into real performance. It depends on many factors, including the workload and the software stack.
The best way to describe the benefits of Cerebras is to focus on the fact that it is “wafer scale.” When computer chips are made, they are etched with UV light onto silicon wafers. Those wafers are large - roughly 12 inches (300 mm) across. Compare that to a typical datacenter GPU die, which is roughly an inch on a side. That GPU die was made on a wafer of the same size; the wafer was just diced into many individual dies, which were then packaged and mounted onto circuit boards. Cerebras decided to use the whole wafer to make one giant chip with tons of processing power.
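To put rough numbers on that size difference, here is a quick back-of-the-envelope comparison. The die areas are published approximations (Cerebras lists the WSE-3 at about 46,225 mm², and NVIDIA’s H100 die is roughly 814 mm²); treat the ratio as illustrative of scale, not as a performance claim.

```python
# Rough comparison of silicon area: Cerebras WSE-3 vs. a single NVIDIA H100 die.
# Figures are published approximations; the point is the order of magnitude.

WSE3_AREA_MM2 = 46_225   # Cerebras WSE-3, essentially a full 300 mm wafer's usable square
H100_AREA_MM2 = 814      # NVIDIA H100 GPU die

ratio = WSE3_AREA_MM2 / H100_AREA_MM2
print(f"One WSE-3 has roughly {ratio:.0f}x the silicon area of a single H100 die")
# -> roughly 57x the silicon of one GPU die, all on one piece of silicon
```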
The benefit of this approach is that all the Cerebras compute cores are much more tightly interconnected, and data can move much faster, than if the connections had to run through a circuit board that separate chips were mounted on. That latter issue - connecting multiple chips together on a board - also creates problems with thermal expansion, which engineers characterize with the CTE, the coefficient of thermal expansion.
When you have all these different materials on a circuit board, they heat up as the board starts running. But different materials expand at different rates as they heat up, which can lead to cracks and breaks in certain places. This is rumored to be one of the problems with NVIDIA’s Blackwell: by adding circuitry to connect multiple GPU dies in a single package, NVIDIA reportedly ran into a CTE mismatch problem. Cerebras gets around this because of its wafer-scale design.
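To make the mismatch concrete, here is a small illustrative calculation using approximate textbook expansion coefficients (silicon ≈ 2.6 ppm/°C, copper ≈ 17 ppm/°C). The span length and temperature swing are hypothetical values chosen for illustration, not figures from NVIDIA or Cerebras.

```python
# Illustrative thermal-expansion mismatch: delta_L = alpha * L * delta_T.
# Coefficients are approximate textbook values (per degree C); the geometry
# and temperature swing are hypothetical.

ALPHA_SILICON = 2.6e-6   # silicon die
ALPHA_COPPER  = 17e-6    # copper traces / substrate metal

LENGTH_MM = 50.0         # hypothetical 50 mm span across a package
DELTA_T_C = 60.0         # hypothetical warm-up from idle to full load

expansion_si = ALPHA_SILICON * LENGTH_MM * DELTA_T_C
expansion_cu = ALPHA_COPPER * LENGTH_MM * DELTA_T_C
mismatch_um = (expansion_cu - expansion_si) * 1000  # mm -> micrometers

print(f"Silicon grows {expansion_si*1000:.1f} um, copper grows {expansion_cu*1000:.1f} um")
print(f"Mismatch over a 50 mm span: about {mismatch_um:.0f} um")
# -> roughly 43 um of differential movement that solder joints and
#    interconnect layers have to absorb on every thermal cycle
```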
The other thing to understand is that when you run a model that is too large for a single GPU, it is broken down into layers to run. So one layer of the neural network is loaded into the GPU and executed, then another, etc. By having such a large chip, Cerebras can run very large models faster because the entire model can fit on one chip.
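As a toy illustration of why fitting the whole model on one device matters, here is a sketch of the alternative: splitting a model’s layers across several devices and paying a transfer cost at each device boundary. This is a simplified pipeline-style caricature with made-up costs, not Cerebras’ or NVIDIA’s actual software stack.

```python
# Toy sketch: running a model whose layers are split across several devices.
# Every time activations cross a device boundary, you pay a transfer cost;
# a chip that holds the whole model avoids those hops entirely.

from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    device: int  # which chip this layer's weights live on

def run(layers, transfer_cost_ms=1.0, compute_cost_ms=0.2):
    """Accumulate a rough latency estimate for one forward pass."""
    latency = 0.0
    current_device = layers[0].device
    for layer in layers:
        if layer.device != current_device:
            latency += transfer_cost_ms  # ship activations to the next chip
            current_device = layer.device
        latency += compute_cost_ms       # run the layer where its weights live
    return latency

# 32 layers split evenly across 4 devices vs. all layers on one device.
split = [Layer(f"block{i}", device=i // 8) for i in range(32)]
single = [Layer(f"block{i}", device=0) for i in range(32)]

print(f"split across 4 chips: ~{run(split):.1f} ms per pass")
print(f"single large chip:    ~{run(single):.1f} ms per pass")
```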
Now this creates some issues that impact how you may view Cerebras’ long term opportunities. One is, as models get bigger, some outgrow Cerebras. And connecting Cerebras chips together may be slower than connecting other types of chips together. In fact, some competitors like Groq and SambaNova were designed from the outset to distribute large models across many chips, and may outperform Cerebras as models get bigger, depending on various factors.
But also, there is a small model movement, and so maybe the future of ever larger models doesn’t really happen. I don’t know.
From a technical perspective, Cerebras is an improvement over GPUs in a lot of ways, but depending on where AI workloads go, it may or may not be a match for the future of AI.
And then, technical issues aside, Cerebras has two other problems. One is workflow. The world is geared around programming AI models for NVIDIA GPUs. Those kinds of habits are hard to change, and engineers aren’t going to move over unless Cerebras can promise truly massive performance gains over NVIDIA.
The second problem is a business one, which many other people have pointed out: Cerebras has massive customer concentration. That said, I’d also say this is normal for a chip startup. You have to find that first customer, and it’s a good thing that they buy a lot. So I don’t consider this as much of a negative as others who have written about the IPO do.
To me, the issue boils down to how well the design choices Cerebras made map to the future of where AI workloads go. I’m a bit of a skeptic there, because I still believe something is going to come along and entirely upend the transformer model approach. But hopefully this overview is a good entry point for thinking about AI hardware, and Cerebras in particular.
Thanks for reading.
“Cerebras can run very large models faster because the entire model can fit on one chip.”
I would just say “large” not “very large”.
An article about their inference:
https://cerebras.ai/blog/introducing-cerebras-inference-ai-at-instant-speed
says 20 GB models fit on a single WSE for inference. That’s a bit bigger than what you might experiment with locally on a personal computer, but not much bigger.
“Now this creates some issues that impact how you may view Cerebras’ long term opportunities. One is, as models get bigger, some outgrow Cerebras.”
Possibly, but note from the same article that the groundbreaking speed they announced for Llama3.1-70B already involved spanning 4 Wafer Scale Engines.
I’d guess they’ll need 19+ WSEs to support Llama 3.1 405B. (That’s within the 64 WSEs in one of their Condor Galaxy supercomputers: https://cerebras.ai/condor-galaxy/)
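A rough sizing calculation supports that guess. It assumes 16-bit (2-byte) weights and the 44 GB of on-chip SRAM Cerebras lists for the WSE-3, and it ignores KV cache, activations, and any other overhead, so it’s a floor rather than an exact count.

```python
# Back-of-the-envelope: how many WSE-3s does it take just to hold the weights?
# Assumes 16-bit (2-byte) weights and 44 GB of on-chip SRAM per WSE-3,
# and ignores KV cache, activations, and any other overhead.

import math

SRAM_PER_WSE_GB = 44

def wse_needed(params_billions, bytes_per_param=2):
    weights_gb = params_billions * bytes_per_param  # 1B params * 2 bytes ~= 2 GB
    return weights_gb, math.ceil(weights_gb / SRAM_PER_WSE_GB)

for name, size in [("Llama 3.1 70B", 70), ("Llama 3.1 405B", 405)]:
    gb, chips = wse_needed(size)
    print(f"{name}: ~{gb:.0f} GB of weights -> at least {chips} WSE-3s")
# -> 70B needs ~140 GB (4 chips) and 405B needs ~810 GB (19 chips),
#    which lines up with the 4 WSEs reported for 70B and the 19+ guess above.
```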
They don’t make it very clear (from what I’ve seen) where the KV cache is being stored.
It looks like on-chip KV cache requirements for Llama3.1-70B might exceed their limits given 4 WSEs, so perhaps that’s stored off-chip.