
Rooting for Groq. They have an AI chip that can achieve 240 tokens per second on Llama-2 70B. They built a compiler that supports PyTorch and an architecture that scales using synchronous operations. They use software-defined memory access, with no hardware L1/L2 caches, and the same goes for networking: it runs directly from the Groq chip in synchronous mode, with its activity planned by the compiler. Really a fresh take.
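
A toy sketch of the idea (purely illustrative; the op names and schedule below are made up, not Groq's actual compiler output or ISA): the compiler decides every data transfer and compute step up front, and the hardware just replays the plan, so there are no cache misses or runtime arbitration to reason about.

    # Toy sketch only (not Groq's real toolchain): "software-defined" memory
    # access means the compiler emits an explicit schedule of data movement
    # and compute, and the chip simply replays it -- no caches involved.
    from dataclasses import dataclass

    @dataclass
    class Op:
        cycle: int        # compiler-assigned issue cycle
        kind: str         # "load" or "matmul"
        detail: str       # e.g. SRAM address or tile name

    # The "compiler" output: a fully static schedule.
    program = [
        Op(0, "load",   "SRAM[0x0000] -> tile_0"),
        Op(1, "load",   "SRAM[0x4000] -> tile_1"),
        Op(2, "matmul", "tile_0"),
        Op(3, "matmul", "tile_1"),
    ]

    def replay(program):
        # Deterministic execution: every operand arrives exactly when planned.
        for op in sorted(program, key=lambda o: o.cycle):
            print(f"cycle {op.cycle:2d}: {op.kind:6s} {op.detail}")

    replay(program)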

https://youtu.be/A3qbcwasEUY?t=571




Tensor libraries are high-level, so anything below them can be hyper-optimized. This includes the application model (do we still need processes for ML serving/training tasks?), the operating system (how can Linux be improved or bypassed?), and the hardware (general-purpose computing comes with a ton of cruft: instruction decoding, caches and cache coherency, compute/memory separation, compute/GPU separation, virtual memory; how many of these things can be elided, with the extra transistors put to better use?). There's so much money in generative AI that we're going to see a bunch of well-funded startups doing this work. It's very exciting to be back at a "Cambrian explosion" like the early mainframe/PC era.


240 tokens/s for a 70B model requires 16.8 × (bytes per parameter) TB/s of memory bandwidth. So unless it's something like 4-bit quantized, it doesn't sound plausible?
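
A quick back-of-the-envelope check of that figure (assuming batch size 1 and that every weight is read once per generated token, ignoring KV-cache traffic):

    # Bandwidth needed to stream all weights once per token.
    params = 70e9          # Llama-2 70B
    tokens_per_s = 240

    for bytes_per_param, label in [(2, "fp16/bf16"), (1, "int8"), (0.5, "4-bit")]:
        bandwidth = params * bytes_per_param * tokens_per_s  # bytes/s
        print(f"{label:>9}: {bandwidth / 1e12:.1f} TB/s")

    # fp16/bf16: 33.6 TB/s
    #      int8: 16.8 TB/s
    #     4-bit:  8.4 TB/s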

In the same spirit, LLM inference is memory-bound, so what hardware advantage can a chip firm really have? Buying faster memory?
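
A rough roofline-style sketch of why single-stream decoding ends up memory-bound (illustrative numbers only, assuming fp16 weights and roughly 2 FLOPs per parameter per token):

    params = 70e9
    bytes_per_param = 2                     # fp16
    flops_per_token = 2 * params            # ~2 FLOPs per parameter
    bytes_per_token = bytes_per_param * params

    arithmetic_intensity = flops_per_token / bytes_per_token   # FLOPs per byte
    print(f"arithmetic intensity ~ {arithmetic_intensity:.0f} FLOP/byte")

    # At ~1 FLOP per byte, a modern accelerator (which offers hundreds of FLOPs
    # per byte of DRAM bandwidth) sits far below its compute roofline, so
    # single-stream throughput is set by memory bandwidth unless you batch.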


Groq is out of runway and will probably shutter soon.



