Rooting for Groq. They have an AI chip that can reach 240 tokens per second on Llama-2 70B. They built a compiler that supports PyTorch, and the architecture scales using synchronous operations. Memory access is software-defined, with no hardware caches (L1, L2, ...), and the same goes for networking: it runs directly from the Groq chip in synchronous mode, with its activity planned by the compiler. Really a fresh take.
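To make "activity planned by the compiler" a bit more concrete, here's a toy sketch of static scheduling. To be clear, this is not Groq's compiler or API: the `Op` type, the `plan` function, the op names, and the cycle counts are all made up for illustration, and it assumes unlimited functional units. The point is only that when every latency is fixed (no cache hits or misses to guess at), start times and total runtime can be computed before anything runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    latency: int        # fixed cycle count, assumed known at compile time
    deps: tuple = ()    # names of ops whose results this op consumes


def plan(ops):
    """Assign each op a start cycle ahead of time (as-soon-as-possible).

    Ops must be listed in dependency order. Hypothetical helper,
    assuming no contention for functional units.
    """
    start, finish = {}, {}
    for op in ops:
        # An op starts as soon as all of its inputs are ready.
        start[op.name] = max((finish[d] for d in op.deps), default=0)
        finish[op.name] = start[op.name] + op.latency
    return start, max(finish.values())


# Made-up program: two loads, a matmul, then a transfer to another chip.
program = [
    Op("load_weights", 4),
    Op("load_activations", 2),
    Op("matmul", 8, ("load_weights", "load_activations")),
    Op("send_to_next_chip", 3, ("matmul",)),
]

schedule, total_cycles = plan(program)
print(schedule)
print(total_cycles)
```

With a hardware cache in the picture, the load latencies would depend on runtime state and a plan like this couldn't be fixed in advance; that's the trade being described.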
Tensor libraries are high-level, so anything below them can be hyper-optimized. This includes the application model (do we still need processes for ML-serving/training tasks?), operating system (how can Linux be improved or bypassed?), and hardware (general purpose computing comes with a ton of cruft: instruction decoding, caches/cache coherency, compute/memory separation, compute/GPU separation, virtual memory. How many of these things can be elided, with the extra transistors put to better use?). There's so much money in generative AI that we're going to see a bunch of well-funded startups doing this work. It's very exciting to be back at the "Cambrian explosion" of the early mainframe/PC era.
https://youtu.be/A3qbcwasEUY?t=571