Cerebras Inference: AI at Instant Speed (cerebras.ai)
174 points by meetpateltech 67 days ago | 72 comments



>Cerebras is the only platform to enable instant responses at a blistering 450 tokens/sec. All this is achieved using native 16-bit weights for the model, ensuring the highest accuracy responses.

As near as I can tell from the model card[1], the majority of the math for this model is 4096x4096 multiply-accumulates. So there should be roughly 70B/16.8M, or about 4,000 of these per token in the Llama3-70B model.

A 16x16 multiplier is about 9,000 transistors, according to a quick Google search. A 4096x4096 MAC array should thus be about 150 billion transistors, if you include the bias values. There are plenty of transistors on this chip to have many of them operating in parallel.

According to [2], a switching transition in the 7 nm process node is about 0.025 femtojoules (10^-15 joules) per transistor. At a clock rate of 1 GHz, that's about 25 nanowatts/transistor. Scaling that by a 50% activity factor (a 50/50 chance any given gate in the MAC flips each cycle) gets you about 2 kW for each 4096^2 MAC array running at 1 GHz.

There are enough transistors, and enough RAM, on the wafer to fit the entire model. Even if they have only a single 4096^2 MAC array, a clock rate of 1 GHz should result in a total time of about 4 µs/token, or 250,000 tokens/second.

[1] https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md

[2] https://mpedram.com/Papers/7nm-finfet-libraries-tcasII.pdf
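
For anyone who wants to poke at the arithmetic, here is the same back-of-the-envelope estimate as a small Python sketch; every number in it is a figure from this comment (or reference [2]), not a Cerebras spec:

    # Back-of-envelope estimate using the figures above (not Cerebras specs).
    PARAMS              = 70e9        # Llama3-70B weights
    MAC_DIM             = 4096        # 4096x4096 multiply-accumulate array
    TRANSISTORS_PER_MUL = 9_000       # rough size of a 16x16 multiplier
    ENERGY_PER_TOGGLE_J = 0.025e-15   # per switching transition at 7 nm [2]
    ACTIVITY            = 0.5         # assume half the gates flip each cycle
    CLOCK_HZ            = 1e9         # 1 GHz

    macs_per_token   = PARAMS / MAC_DIM**2                  # ~4,200 passes/token
    transistors      = MAC_DIM**2 * TRANSISTORS_PER_MUL     # ~150 billion per array
    power_w          = transistors * ACTIVITY * ENERGY_PER_TOGGLE_J * CLOCK_HZ
    time_per_token_s = macs_per_token / CLOCK_HZ            # one array, one pass/cycle

    print(f"{macs_per_token:,.0f} MAC-array passes per token")
    print(f"~{power_w / 1e3:.1f} kW per 4096^2 MAC array at 1 GHz")
    print(f"~{time_per_token_s * 1e6:.1f} us/token -> {1 / time_per_token_s:,.0f} tokens/s")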


>There are enough transistors, and enough RAM on the wafer to fit the entire model.

Not the entire 70B fp16 model. It'd take about 140 GB of RAM to hold it. Each Cerebras wafer has 44 GB of SRAM, so you need 4 of them chained together to hold the entire model.
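
The sizing as a quick sketch (2 bytes per fp16 weight; a real deployment also needs room for KV cache and activations, which this ignores):

    PARAMS          = 70e9     # Llama3-70B parameters
    BYTES_PER_FP16  = 2
    SRAM_PER_CS3_GB = 44       # on-wafer SRAM per CS-3

    model_gb = PARAMS * BYTES_PER_FP16 / 1e9        # ~140 GB of raw weights
    wafers   = -(-model_gb // SRAM_PER_CS3_GB)      # ceiling division -> 4

    print(f"~{model_gb:.0f} GB of fp16 weights -> {wafers:.0f} CS-3 wafers")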



40 GB was the CS-2; the CS-3 has 44 GB.

https://cerebras.ai/blog/cerebras-cs3


It’s insanely fast.

Here’s an AI voice assistant I built that uses it:

https://cerebras.vercel.app


That was interesting. I asked it to try to say something in another language, and she read it in a thick American accent. No surprise. Then I asked her to sing, and she said something like "asterisk in a robotic singing voice asterisk...", and then later explained that she's just text to speech. Ah, ok, that's about what I expected.

But then I asked her to integrate sin(x) * e^x and got this bizarre answer that started out as speech sounds but then degenerated into chaos. Out of curiosity, why and how did she end up generating samples that sounded rather unlike speech?

Here's a recording: https://youtu.be/wWhxF7ybiAc

FWIW, I can get this behavior pretty consistently if I chat with her a while about her voice capabilities and then go into a math question.


This is pretty amazing. It's fast enough to converse with, and I can interrupt the model.

The underlying model is not voice trained -- she says things like "asterisk one" (reading out point form) -- but this is a great preview for when ChatGPT GAs their Voice Mode.


Fantastic demo. Do you know what the difference is between your stack and the LiveKit demo [1]? It shows your voice as text so you can see when you have to correct it.

Llama3 with ears just dropped (direct voice token input), which should be awesome with Cerebras [2].

[1]: https://kitt.livekit.io [2]: https://homebrew.ltd/blog/llama3-just-got-ears


Nice! What are the other pieces of the stack you're using?


LiveKit, Cartesia, Deepgram, and Vercel


awesome - might try it out


Oh wow, this is insanely good. Are there any model details?


Is batched inference for LLMs memory bound? My understanding is that sufficiently large batched matmuls will be compute bound and flash attention has mostly removed the memory bottleneck in the attention computation. If so, the value proposition here -- as well as with other memorymaxxing startups like Groq -- is primarily on the latency side of things. Though my personal impression is that latency isn't really a huge issue right now, especially for text. Even OpenAI's voice models are (purportedly) able to be served with a latency which is a low multiple of network latency, and I expect there is room for improvement here as this is essentially the first generation of real-time voice LLMs.


Batched inference will increase your overall throughput, but each user will still be seeing the original throughput number. It's not necessarily a memory vs compute issue in the same way training is. It's more a function of the auto-regressive nature of transformer inference, as far as I understand, which presents unique challenges.

If you have an H100 doing 100 tokens/sec and you batch 1000 requests, you might be able to get to 100K tok/sec but each user's request will still be outputting 100 tokens/sec which will make the speed of the response stream the same. So if your output stream speed is slow, batching might not improve user experience, even if you can get a higher chip utilization / "overall" throughput.
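
As a toy illustration of that distinction, with made-up numbers (100 tok/s per stream, batch of 1000, 500-token reply): aggregate throughput scales with batch size, but the wall-clock time each user waits does not.

    STREAM_TOKS_PER_S = 100     # hypothetical per-user decode speed on an H100
    BATCH_SIZE        = 1000
    REPLY_TOKENS      = 500

    aggregate_tps = STREAM_TOKS_PER_S * BATCH_SIZE        # chip-level throughput
    user_wait_s   = REPLY_TOKENS / STREAM_TOKS_PER_S      # unchanged by batching

    print(f"aggregate throughput: {aggregate_tps:,} tok/s")
    print(f"one 500-token reply still takes {user_wait_s:.0f} s per user")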


I want to know what the current power requirements are, as well as the cost for the machine. The last time I looked at one of these it was an absolute beast (although very impressive).


> I want to know what the current power requirements are, as well as the cost for the machine.

"If you care and have to ask it's not for you".

In all seriousness, I've worked with and am familiar with Cerebras, Groq, etc. Let's just say that for nearly all use cases, GPUs still reign supreme in terms of practicality, outside of using their hardware via the cloud.

Groq, for example, has essentially stopped selling their "real" HW directly because the borderline absurd amount of floor space, etc., proved challenging once they hit the market. There's enough demand, and more (recurring) money to be made anyway, hosting services on your own chips.

Similar to the Bitcoin mining ASIC game in the heyday - sure we could sell these or we could just use them to mine, develop next gen, sell previous gen, repeat.


> though somewhat astonishingly, the WSE 2 draws 23 kW of power. To put this in perspective, the most powerful GPUs “only” draw around 450W

https://liquidstack.com/blog/breaking-the-thermal-ceiling-in...


They never made the price public as far as I know but said something like "less than a house". It's going to be something on the order of $300k.


Q: So apparently allowing LLMs to "think" by asking them to walk through and generate preamble tokens before an answer improves quality. With this kind of speedup, would it be practical/effective to achieve better output quality by baking a "thinking" step into every prompt? Say, a few thousand tokens before the actual reply.


> Thus to generate a 100 words a second requires moving the model 100 times per second – requiring vast amounts of memory bandwidth.

It's actually worse for the majority of GPU implementations of large models. The matrices don't fit in shared memory, so the model is loaded many, many times into shared memory (as tiles). Also, unless you are using Hopper's distributed shared memory, CTAs can't even share data with each other.
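
For the plain GPU case the article describes, the ceiling is easy to sketch: at batch size 1 every weight has to cross the memory bus once per generated token, so tokens/s is bounded by bandwidth divided by model size. Illustrative numbers only, assuming two H100-class cards at roughly 3.35 TB/s of HBM bandwidth each:

    MODEL_BYTES = 140e9      # Llama3-70B in fp16
    HBM_BW_BPS  = 3.35e12    # ~3.35 TB/s per H100-class GPU (approximate)
    NUM_GPUS    = 2          # weights split across two 80 GB cards

    # Batch-1 decode: the whole model streams from HBM once per token.
    ceiling_tps = NUM_GPUS * HBM_BW_BPS / MODEL_BYTES
    print(f"bandwidth-bound ceiling: ~{ceiling_tps:.0f} tokens/s per user")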

It would be nice to see a Cerebras solution for pre-training and fine-tuning.


It seems they are doing fine-tuning and soon pre-training with their new cluster technology. They claim to be able to train Llama 3 70B in 1 day (instead of 1 month). Impressive if true.


70B runs on 4x CS-3, estimated at $2-3M each; let's say a total system cost of $10M, drawing ~100 kW of power. They don't mention batch size, so let's start with batch size 1 and see where we get. At 100% utilisation for 3 years that'd be 42 billion tokens, for a cost of $10M capital plus, let's say, ~$0.5M power and cooling, or $250 per million tokens. They're claiming they can sell API access at $0.60/million, so to break even they'd need a batch size of about 420. I don't know how deep their pipeline is, but Llama 3.1 70B has 80 layers with 6 meaningful matmuls per layer, so it's not a crazy multiple of that.

A single A100 processes at 13 t/s/user at batch 32. At $10k, that's 39 billion tokens over 3 years, or about $0.25 per million tokens. If you have batch size 420 you can do it even cheaper.

TL;DR: Cerebras are certainly advertising at a loss-leading price and will only have a viable product if they can get extraordinarily high utilisation of their system at this price. I don’t think they can, so they’re basically screwed selling tokens. Maybe this is to attract attention in the hope of selling hardware to someone willing to pay a premium for very low latency, but I suspect it’s just a means of getting one more round of funding in the hope of reducing costs in the next version.
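
The break-even arithmetic above, spelled out with the same assumed inputs (all of them guesses from this comment, not Cerebras figures):

    SYSTEM_COST_USD   = 10e6       # 4x CS-3, assumed
    POWER_COOLING_USD = 0.5e6      # over 3 years, assumed
    STREAM_TOKS_PER_S = 450        # advertised per-user speed
    SECONDS_3Y        = 3 * 365 * 24 * 3600
    SELL_PRICE_PER_M  = 0.60       # $ per million tokens

    tokens_3y    = STREAM_TOKS_PER_S * SECONDS_3Y              # ~42.6e9 at batch 1
    cost_per_m   = (SYSTEM_COST_USD + POWER_COOLING_USD) / (tokens_3y / 1e6)
    breakeven_bs = cost_per_m / SELL_PRICE_PER_M               # concurrent streams needed

    # Comes out near the ~$250/M tokens and ~420 batch size quoted above
    # (differences are rounding).
    print(f"{tokens_3y / 1e9:.1f}B tokens over 3 years at batch 1")
    print(f"${cost_per_m:.0f} per million tokens at batch 1")
    print(f"break-even batch size: ~{breakeven_bs:.0f}")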


Your estimate of $2-3M per CS-3 is a price, not a cost. It costs about $20K per wafer from TSMC, and the price they charge reflects the NRE of designing the system and taping out the masks more than the additional cost of packaging their wafers into a system.

If this business scales, they can probably afford to lower the price by a factor of 10.


My math (and Google) shows a 300 mm diameter wafer and 300,000,000 transistors/mm^2.

So for $20,000 (in quantity) you get somewhere around 10 trillion transistors?

That's enough for about 50 4096x4096 multiply-accumulate chips. At a nice slow 1 MHz clock rate, each would take about 3.5 watts and give you 16 teraflops of performance. If you stepped up the power and cooling, you could likely get to 350 watts at 100 MHz, and 1.6 petaflops.

50 of those chips, for $20,000 --> $400 each


Batch size by Q4 will be solid double digits (Cerebras employee).


Is that e.g. batch 16/32 for each operation e.g. 16-row matmuls in a pipeline? Or a pipeline of vector-math ops that has 16/32 stages? Is the pipeline also double digits deep?


Ok, that speed's fucking ridiculous, are you kidding me?!?!?! I just tried the chat trial, wtf.


This is where we always assumed the industry was going. Expensive GPUs are great for training, but inference is getting so optimized that it will run on smaller and/or cheaper processors (per token).


Cerebras' Wafer Scale Engine is the opposite of small and cheap.

https://cerebras.ai/product-chip/


That's totally nuts. How do they deal with the silicon warping around disabled cores or dark silicon? How much hard running does it take before the chip gets fatally damaged and needs to be replaced in their system? Word on the street is that H100s fail surprisingly often; this can't be better.


https://www.youtube.com/watch?v=7GV_OdqzmIU&t=1104s

This video from Cerebras perfectly explains how they solve the interconnect problem, and why their approach greatly reduces the risk of Blackwell-type hardware design challenges.


Super informative video!


The way it's mounted, there's unlikely to be warping: https://web.archive.org/web/20230812020202/https://www.youtu...

The cooling, with water-cooling channels going to each row of the wafer, is significantly better than what you'd see on a server platform.


They have a way of bypassing bad cores, and over-provision both logic and memory by 1.5% to account for that. They get 100% yields this way.

https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...


Yes, that's exactly what would cause the problems I'm suggesting.


edit: I should have just checked their website instead of guessing. Apparently WSE has significant fabrication challenges, which makes what Cerebras has accomplished all the more impressive. But it is still surprising that no one else has attempted this in the HPC field.

I had guessed that Cerebras had made some trade-offs in process in order to make it work at scale, but then they aren't actually building these devices at scale (yet).


They're using TSMC 5-nm for WSE-3: https://spectrum.ieee.org/cerebras-chip-cs3


Cerebras is known to use TSMC, so your speculation about boutique fab is incorrect.

https://cerebras.ai/press-release/cerebras-systems-smashes-t...


It shouldn't be surprising. It's hard as fuck, and we don't know if it's worth it yet (there's something to be said for "if your compute dies, you send your remote hands to swap out a high-four-figure component" and not "decommission a high-six-figure node").


I hope I didn't make it sound like it was easy, at least I don't think I said that anywhere. It doesn't really matter how hard something is to do (short of it being trivially proven impossible), it matters whether there's a good enough chance that the payoff exceeds the cost.

And actually there have been attempts to do it; I mentioned in an earlier version of my comment that Gene Amdahl attempted to make wafer-scale integration work decades ago, without success - but also without the clear profitability story of AI to attract the mountains of cash being thrown around today.

What's surprising is not that it is hard, or that it's hard as fuck, but that given the potentially stratospheric rewards for success there have not been more attempts in this direction.


What makes you think you couldn't do as well on something less radical?

IIRC Cerebras' design was originally for HPC workloads, so it may not even be optimized for LLMs.


Is this all mostly a heat spreader efficiency requirement?


I suppose the proper term would be the heat spreader's thermal resistance limit, dictated by the sum of the thermal stresses, or whatever the right term is.


I updated my comment to be clearer. I meant smaller and/or cheaper per token. In this case it's cheaper per token.


Maybe the real trend is that huge parameter counts are curiosities.


"wafer scale" inference using 44GB of straight up SRAM per chip does not exactly sound "smaller and cheaper" to me. Just optimized.


Cheaper on a per token basis.


I accept that. Your original comment left me under the impression that this represented a shift closer to the edge (I still don't think the hardware is all that much smaller), but I'll agree this is cheaper per token under full utilization.


Doubtful. SRAM is not cheap, and this is entirely about SRAM vs HBM.


They list the price in this press release. So either they're taking a big loss or they're doing it cheaper per token.


It wouldn't be the first time a manufacturer ignored capital amortization to post better numbers.


Not even necessarily a loss, let alone a big one, since their comparison includes margin.


Can really fast inference (e.g. 1M tok/sec) make LLMs more intelligent? I am imagining you could run multiple agents and choose or discard outputs using other LLMs simultaneously. Will the output look more like a real thought process, or will it remain just the same?


It is mentioned in the post:

> Traditional LLMs output everything they think immediately, without stopping to consider the best possible answer. New techniques like scaffolding, on the other hand, function like a thoughtful agent who explores different possible solutions before deciding. This “thinking before speaking” approach provides over 10x performance on demanding tasks like code generation, fundamentally boosting the intelligence of AI models without additional training.


Is there a tool that provides functionality like this that you can layer on top of Cerebras's API, given that you are not worried about using 10x-50x more tokens per query?


Many of the results in the 'agent' literature require several agents and many iterations to produce an output. See some examples here [1]. Getting these results in seconds instead of minutes or hours would be incredible - and would help with iteration and experimentation to improve algorithms.

[1] https://langchain-ai.github.io/langgraph/tutorials/multi_age...


Fast inference can substitute for larger models in some circumstances. As you said, you can run multiple times. DeepMind had a detailed look, see https://arxiv.org/abs/2408.03314.


Not just that, but at this sort of speed you could have a network of LLMs all talk and discuss an answer before answering. You could literally have it generate internal thoughts and challenges to itself before responding, via scripting.
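
A minimal sketch of that kind of scripted self-deliberation, assuming an OpenAI-compatible chat endpoint; the base URL, model name, and two-pass structure are illustrative placeholders, not a documented Cerebras setup:

    # Hypothetical "draft, critique, revise" loop over an OpenAI-compatible API.
    from openai import OpenAI

    client = OpenAI(base_url="https://fast-inference.example/v1",  # placeholder endpoint
                    api_key="YOUR_KEY")
    MODEL = "llama3.1-70b"  # placeholder model name

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model=MODEL, messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content

    def deliberate(question: str, rounds: int = 2) -> str:
        # Internal draft the user never sees.
        draft = ask(f"Think step by step, then answer: {question}")
        for _ in range(rounds):
            # Have the model challenge its own draft, then revise it.
            draft = ask(f"Question: {question}\nDraft answer: {draft}\n"
                        "List any flaws in the draft, then give only a corrected final answer.")
        return draft

    print(deliberate("Integrate sin(x) * e^x"))

At hundreds of tokens per second, the hidden passes would cost only a few seconds of wall-clock time, which is the whole appeal.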


The numbers are pretty incredible. Will the competition be able to match them?


Groq is claiming 284 tokens/second on Llama 3.1 70b, so they’re in the same ballpark.

https://groq.com/12-hours-later-groq-is-running-llama-3-inst...


If Groq 2 is 2x faster it will match Cerebras WSE-3.


Their CEO talks about this on the Gradient Dissent podcast [1].

[1]: https://m.youtube.com/watch?v=qNXebAQ6igs


Sure, but can they fit a bigger model? I don't think they can string these together to fit bigger models like Llama 3.1 405B.


https://youtu.be/re4QqXPmfgs?t=956

The CS-3 system is built for single-node domain scaling to 24-trillion-parameter models. I.e., they claim you can reach 24-trillion-parameter models with the same code, without hand-written distributed-training code.


I am not sure why not. From the article: "In the weeks to come we will be adding larger models such as Llama3-405B".


I wrote an article showing how to use the API to build a chatbot: https://api.wandb.ai/links/byyoung3/kv7vn1wt


It's understandable that they're currently focused on inference speed, but features like structured output and prompt caching make it possible to build more capable LLM applications.

Does Cerebras support reliable structured output like the recent OpenAI GPT-4o?


They are running stock Llama 3.x. If the underlying models support structured output, so will they.

For example, I know the latest batch of Mistral models all have json output support.


Structured output is a token picker feature, not (just) a model feature.
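
A toy illustration of the token-picker point, assuming direct access to next-token logits (real implementations use grammar- or schema-constrained decoding; the names here are made up):

    def constrained_pick(logits: dict[str, float], allowed: set[str]) -> str:
        # Mask out every token the schema/grammar forbids at this position,
        # then choose (here: argmax) only among what remains.
        masked = {tok: score for tok, score in logits.items() if tok in allowed}
        return max(masked, key=masked.get)

    # The model may "prefer" prose, but if the grammar only allows '{' here,
    # the picker emits '{' regardless of the model's raw preference.
    logits  = {"Sure": 5.1, ",": 3.3, "{": 1.2, '"': 0.7}
    allowed = {"{"}                              # a JSON object must start with '{'
    print(constrained_pick(logits, allowed))     # -> {

That's why it can be bolted on by the serving layer, largely independent of which model is underneath.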


This question is a bit out of context.

Cerebras is a startup producing innovative AI chips. Their chips are super cool, and I personally believe Cerebras is ahead of the industry and on the right technical path. As a matter of fact, Cerebras started with HPC chips, then pivoted to AI like everyone else.

They are still deep in the trenches, fighting for survival.

Given that, they have very little software prowess compared to AMD (which has a *terrible* software stack for AI GPUs; look at https://github.com/ROCm/rdc, their equivalent to NVIDIA DCGM, which has virtually no maintainers and no users) or NVIDIA (the gold standard of software stacks for AI GPUs). And you are asking about structured output and prompt caching, which are prominently developed by LLM research labs (OpenAI, Anthropic), each of which has far more funding than Cerebras.

In the end, educate yourself, and do not put unrealistic expectations on startups.


As OpenAI themselves admit, the structured output feature in question was developed in the open-source world first, with zero funding.


This point is moot.

The point remains that Cerebras is not in a position to focus on structured output or prompt caching.



