Rooting for Groq. They got an AI chip that can achieve 240 tokens per second for...

pradn · on Sept 11, 2023

Tensor libraries are high-level, so anything below them can be hyper-optimized. This includes the application model (do we still need processes for ML-serving/training tasks?), operating system (how can Linux be improved or bypassed?), and hardware (general purpose computing comes with a ton of cruft - instruction de-coding, caches/cache coherency, compute/memory separation, compute/GPU separation, virtual memory - how many of these thins can be elided, with extra transistors put to better use?). There's so much money in generative AI that we're going to see a bunch of well-funded startups doing this work. It's very exciting to be back at the "Cambrian explosion" of the early mainframe/PC era.

EvgeniyZh · on Sept 12, 2023

240 tokens for 70b requires 16.8 * (bytes per parameter) TB memory bandwidth. So unless it's like 4 bit quantized, it doesn't sound plausible?

In the same spirit, llms are memory-bound, so what possible hardware advantage can chip firm have? Buying faster memory?

mathisfun123 · on Sept 12, 2023

Groq is out of runway and will probably shutter soon.