> RTX 4090 has 2x the TFLOPS compute of the killer AI Skynet used in Terminator 3. The writers back then probably thought 60 TFLOPS was such a ridiculously high sci-fi number

A fact worth noting, but routinely ignored in the popular press, is that these astronomical peak floating-point ratings of modern hardware are only achievable for a small selection of algorithms and problems. In practice, realizable performance is often much worse; efficiency can be as low as 1%.

First, not all algorithms are well suited to the von Neumann architecture. Today, the memory wall is higher than ever: the machine balance (FLOPS vs. load/store) of modern hardware is around 100:1. To maximize floating-point throughput, all data must fit in cache, which requires the algorithm to have a high degree of data reuse via cache blocking. Some algorithms do this especially well, like dense linear algebra (the Top500 LINPACK benchmark). Other algorithms are less compatible with this paradigm; they're going to be slow no matter how good the optimization is. Examples include many iterative physics simulations, sparse matrix code, and graph algorithms (the Top500 HPCG benchmark). In the Top500 list, HPCG is usually about 1% as fast as LINPACK, and the best-optimized simulation codes can perhaps reach 20% of Rpeak.
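As a concrete picture of what cache blocking buys you, here's a minimal sketch for dense matmul in C (the block size BS is just a guess for illustration; real BLAS libraries tune it per architecture and add vectorization and threading on top):

    #include <stddef.h>

    /* Naive matmul: for large N, every word of B is re-fetched from memory
       ~N times, so the kernel is limited by bandwidth, not by the FMA units. */
    void matmul_naive(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++)
                for (size_t k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

    /* Blocked matmul: work on BS x BS tiles that fit in cache, so each
       loaded word is reused ~BS times before being evicted. */
    #define BS 64   /* illustrative guess, not a tuned value */
    void matmul_blocked(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t ii = 0; ii < n; ii += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t jj = 0; jj < n; jj += BS)
                    for (size_t i = ii; i < ii + BS && i < n; i++)
                        for (size_t k = kk; k < kk + BS && k < n; k++) {
                            double a = A[i*n + k];
                            for (size_t j = jj; j < jj + BS && j < n; j++)
                                C[i*n + j] += a * B[k*n + j];
                        }
    }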

This is why both Intel and AMD have started offering special large-cache CPUs, using either on-package HBM or 3D V-Cache, all targeted at HPC. Meanwhile in machine learning, people also made the switch to FP16, BF16 and INT8 largely because of the memory wall. Inference is a relatively cache-friendly problem; many HPC simulations are much worse in this respect.
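To put rough numbers on the memory side of that trade-off (the 7-billion-parameter model size below is purely illustrative):

    #include <stdio.h>

    int main(void)
    {
        /* Weight storage for a hypothetical 7e9-parameter model at different
           element sizes.  Halving the element width halves both the footprint
           and the bandwidth needed to stream the weights on each pass. */
        double params = 7e9;
        printf("FP32:      %.0f GB\n", params * 4 / 1e9);
        printf("FP16/BF16: %.0f GB\n", params * 2 / 1e9);
        printf("INT8:      %.0f GB\n", params * 1 / 1e9);
        return 0;
    }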

Next, even if the algorithm is well suited to cache blocking, peak datasheet performance is usually still unobtainable, because it's typically calculated from peak FMA throughput. That's unrealistic for real problems: you can't do everything as FMAs, so 70% is a more realistic target, and in the worst case you get 50% of peak (disappointing, but not as bad as the memory wall). In contrast to the datasheet peak, the Top500 LINPACK figure Rmax is measured with a real benchmark.




When you measure peak FLOPS, especially "my desktop computer has X FLOPS", you're generally computing N FMA units * f frequency: a theoretical maximum FLOPS figure. This number, as you note, has basically no relation to anything practical: we've long been at the point where our ability to stamp out ALUs greatly exceeds our ability to keep those ALUs fed with useful data.
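Spelled out, the back-of-the-envelope calculation looks like this (core count, FMA ports, vector width and clock below describe a generic AVX2-class desktop chip, not any specific product):

    #include <stdio.h>

    int main(void)
    {
        /* 8 cores x 2 FMA ports x 8 FP32 lanes (256-bit vectors) x 2 flops
           per FMA, at an assumed 4 GHz sustained clock. */
        double cores = 8, fma_ports = 2, lanes = 8, flops_per_fma = 2;
        double hz = 4.0e9;
        double peak = cores * fma_ports * lanes * flops_per_fma * hz;
        printf("theoretical peak: %.0f GFLOPS\n", peak / 1e9);  /* 1024 */
        return 0;
    }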

Top500 measures FLOPS on a different basis. Essentially, see how long it takes to solve an N×N system Ax=b (where N is large enough to stress your entire system), and use a fixed formula to convert N and the runtime into FLOPS. However, this kind of dense linear algebra is an unusually computation-heavy benchmark: you do about n^1.5 FLOPS per n words of data. Most kernels do more like O(n), or maybe as high as O(n lg n), work for O(n) data, which requires much higher memory bandwidth than good LINPACK numbers do.
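For reference, the conversion is the standard LU-factorization operation count; the N and runtime below are made-up numbers purely to show the shape of the formula:

    #include <stdio.h>

    int main(void)
    {
        /* HPL credits (2/3)N^3 + 2N^2 flops for solving a dense N x N
           system Ax=b, and divides by wall-clock time to get Rmax. */
        double N = 50000.0;          /* problem size (assumed) */
        double seconds = 100.0;      /* runtime (assumed) */
        double flops = (2.0/3.0)*N*N*N + 2.0*N*N;
        printf("Rmax ~ %.2f TFLOPS\n", flops / seconds / 1e12);

        /* The data is only n = N^2 words, so the work is ~(2/3) n^1.5
           flops: far more compute per word than typical O(n) kernels. */
        return 0;
    }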

Furthermore, graph or sparse algorithms tend to do really badly because the amount of work you're doing isn't able to hide the memory latency (think one FMA per A[B[i]] access--you might be able to use massive memory bandwidth for the streaming B[i] accesses, but you end up with a massive gather operation for the A[x] accesses, which is extremely painful).
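A sketch of that access pattern (the function and array names here are made up; only the indexing pattern matters):

    #include <stddef.h>

    /* The streaming read of B[i] can use full memory bandwidth, but
       A[B[i]] is a data-dependent gather: each FMA stalls on one
       essentially random, likely uncached, memory access. */
    void gather_fma(size_t n, const double *A, const int *B,
                    const double *x, double *y)
    {
        for (size_t i = 0; i < n; i++)
            y[i] += A[B[i]] * x[i];
    }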


> Meanwhile in machine learning, people also made the switch to FP16, BF16 and INT8 largely because of the memory wall

FP16 doesn't run any faster than mixed precision on Nvidia or any other platform (I have benchmarked GPUs, CPUs and TPUs). For matrix multiplication, computation is still the bottleneck, due to O(N^3) computation vs. O(N^2) memory accesses.
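The arithmetic-intensity argument, made explicit (plain C, counting ~2N^3 flops against ~3N^2 words of matrix traffic):

    #include <stdio.h>

    int main(void)
    {
        /* For an N x N matmul: ~2N^3 flops vs. ~3N^2 words moved (read A,
           read B, write C), so flops-per-word grows linearly with N and
           large matmuls end up compute-bound. */
        for (double N = 256; N <= 8192; N *= 2) {
            double flops = 2.0 * N * N * N;
            double words = 3.0 * N * N;
            printf("N = %5.0f   flops/word = %4.0f\n", N, flops / words);
        }
        return 0;
    }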


With FP16 you can fit twice as many weights in cache, and also fetch twice as many weights from memory with the same bandwidth.

Also, whether compute or memory is the bottleneck depends on the size of the matrix.


The 4090 provides over 80 TFLOPS of bog-standard raw FP32 compute with no tensor cores or other fancy instructions, though even that figure counts each FMA/MAD as two operations.
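That headline figure is just the same lanes-times-clock arithmetic from upthread, applied to the 4090's published specs (treat the clock and the result as approximate):

    #include <stdio.h>

    int main(void)
    {
        /* RTX 4090: 16384 FP32 lanes ("CUDA cores") at roughly a 2.52 GHz
           boost clock, each doing one FMA = 2 flops per cycle. */
        double lanes = 16384, ghz = 2.52, flops_per_cycle = 2;
        double peak = lanes * flops_per_cycle * ghz * 1e9;
        printf("FP32 peak ~ %.1f TFLOPS\n", peak / 1e12);   /* ~82.6 */
        return 0;
    }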



