
“Cerebras can run very large models faster because the entire model can fit on one chip.”

I would just say “large” not “very large”.

An article about their inference:

https://cerebras.ai/blog/introducing-cerebras-inference-ai-at-instant-speed

says 20 GB models fit on a single WSE for inference. That’s a bit bigger than what you might experiment with locally on a personal computer, but not much bigger.
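
For context, at 16-bit precision (2 bytes per parameter, which is my assumption, not something stated in the article) 20 GB works out to roughly a 10B-parameter model:

# Rough parameter count that fits in 20 GB of on-chip memory,
# assuming fp16/bf16 weights (2 bytes per parameter).
bytes_available = 20e9
params = bytes_available / 2
print(f"{params / 1e9:.0f}B parameters")  # ~10B, i.e. Llama-3-8B-class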

“Now this creates some issues that impact how you may view Cerebras’ long term opportunities. One is, as models get bigger, some outgrow Cerebras.”

Possibly, but note from the same article that the groundbreaking speed they announced for Llama 3.1-70B already involved spanning 4 Wafer Scale Engines.

I’d guess they’ll need 19+ WSEs to support Llama 3.1 405B (that’s within the 64 WSEs in one of their Condor Galaxy supercomputers):

https://cerebras.ai/condor-galaxy/
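
A sketch of where the 19+ figure comes from, assuming roughly 44 GB of on-chip SRAM per WSE-3 and 16-bit weights (both numbers are my assumptions, not something Cerebras has confirmed for this model):

import math

# Rough WSE count needed to hold Llama 3.1 405B weights entirely on chip.
# Assumes 2 bytes per parameter (fp16/bf16), ~44 GB SRAM per WSE-3,
# and ignores KV cache and activations.
weight_bytes = 405e9 * 2               # ~810 GB of weights
sram_per_wse = 44e9                    # approximate WSE-3 on-chip SRAM
print(math.ceil(weight_bytes / sram_per_wse))  # 19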

They don’t make it very clear (from what I’ve seen) where the KV cache is being stored.

It looks like on-chip KV cache requirements for Llama 3.1-70B might exceed their limits given 4 WSEs, so perhaps that’s stored off chip.
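
Back-of-envelope, assuming Llama 3.1-70B’s published config (80 layers, 8 KV heads via GQA, head dim 128) and fp16 cache entries; the per-WSE SRAM figure is again my assumption:

# KV cache per token: layers x KV heads x head dim x 2 (K and V) x 2 bytes (fp16)
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_token = layers * kv_heads * head_dim * 2 * 2    # ~320 KB per token
ctx = 128_000                                             # full context window
print(f"{bytes_per_token * ctx / 1e9:.0f} GB per 128K-token request")  # ~42 GB

# With ~140 GB of fp16 weights spread across 4 WSEs (~176 GB SRAM total at ~44 GB each),
# only ~36 GB remains on chip, so even a single long-context request's cache
# could spill, never mind a batch of concurrent users.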
