
>A single modern games console has more RAM than all of the Atari 2600s ever manufactured put together

In other mind-boggling movie lore, an RTX 4090 has twice the TFLOPS computing power of the killer AI Skynet used in Terminator 3 [1].

The writers back then probably thought 60 TFLOPS was such a ridiculously high sci-fi number for the world-ending AI that nothing could possibly come close to it, and 20 years later consumers can have twice that computing power in their home PCs.

It's also a nice reminder of how far technology has progressed in the last few decades, even if the pace has slowed in recent years.

[1] https://youtu.be/_Wlsd9mljiU?t=155




> RTX 4090 has twice the TFLOPS computing power of the killer AI Skynet used in Terminator 3. The writers back then probably thought 60 TFLOPS was such a ridiculously high sci-fi number

Another fact worth noting, but routinely ignored in the popular press, is that these astronomical peak floating-point ratings of modern hardware are only achievable for a small selection of algorithms and problems. In practice, realizable performance is often much worse; efficiency can be as low as 1%.

First, not all algorithms are well suited to the von Neumann architecture. Today, the memory wall is higher than ever. The machine balance (FLOPS vs. load/store) of modern hardware is around 100:1. To maximize floating-point throughput, all data must fit in cache, which requires the algorithm to have a high level of data reuse via cache blocking. Some algorithms do this especially well, such as dense linear algebra (the Top500 LINPACK benchmark). Other algorithms are less compatible with this paradigm and are going to be slow no matter how good the optimization is; examples include many iterative physics simulation problems, sparse matrix code, and graph algorithms (the Top500 HPCG benchmark). In the Top500 list, HPCG is usually about 1% as fast as LINPACK. The best-optimized simulation codes can perhaps reach 20% of Rpeak.
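To put rough numbers on that, here's a toy roofline-style calculation (the peak and bandwidth figures are made up but representative, not any specific chip):

    # Hypothetical machine (illustrative numbers, not any specific chip).
    peak_flops = 60e12    # 60 TFLOP/s peak floating-point rate
    mem_bw     = 1.2e12   # 1.2 TB/s DRAM bandwidth

    # Machine balance: FLOPs the chip can execute per byte (or word) moved from DRAM.
    balance_per_byte = peak_flops / mem_bw    # 50 FLOP/byte
    balance_per_word = balance_per_byte * 8   # 400 FLOP per 8-byte double

    # A kernel's arithmetic intensity (FLOP/byte) caps its achievable rate:
    def attainable(intensity):
        return min(peak_flops, intensity * mem_bw)

    # Sparse matrix-vector product, roughly 0.25 FLOP/byte -> memory bound:
    print(attainable(0.25) / peak_flops)   # ~0.005, i.e. ~0.5% of peak
    # Well-blocked dense matmul, intensity grows with block size -> compute bound:
    print(attainable(100.0) / peak_flops)  # 1.0, i.e. 100% of peak (in theory)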

This is why both Intel and AMD have started offering special large-cache CPUs, using either on-package HBM or 3D V-Cache, all targeted at HPC. Meanwhile in machine learning, people also made the switch to FP16, BF16 and INT8 largely because of the memory wall. Inference is a relatively cache-friendly problem; many HPC simulations are much worse in this respect.

Next, even if the algorithm is well suited to cache blocking, peak datasheet performance is usually still unobtainable, because it's often calculated from the peak FMA throughput. That's unrealistic for real problems - you can't do everything in FMA - so 70% is a more realistic target, and in the worst case you get 50% of the performance (disappointing, but not as bad as the memory wall). In contrast to datasheet peak performance, the LINPACK number Rmax is measured by running a real benchmark.


When you measure peak FLOPS, especially "my desktop computer has X FLOPS", you're generally computing N FMA units x 2 FLOPs each x f frequency - a theoretical maximum. This number, as you note, has basically no relation to anything practical: we've long been at the point where our ability to stamp out ALUs greatly exceeds our ability to keep those units fed with useful data.
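For example (the core count and clock below are just the commonly published RTX 4090 specs, and the result is the usual datasheet-style back-of-envelope number, not a measurement):

    # Theoretical peak = FP32 lanes x clock x 2 FLOPs per FMA (one mul + one add).
    cuda_cores    = 16384      # published shader-core count for the RTX 4090
    boost_clock   = 2.52e9     # ~2.52 GHz advertised boost clock
    flops_per_fma = 2

    peak_fp32 = cuda_cores * boost_clock * flops_per_fma
    print(peak_fp32 / 1e12)    # ~82.6 TFLOP/s, matching the advertised figure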

Top500 measures FLOPS on a different basis. Essentially, you see how long it takes to solve an N×N system Ax=b (where N is large enough to stress your entire machine), and use a standard formula to convert N and the solve time into FLOPS. However, this kind of dense linear algebra is an unusually computation-heavy benchmark: you do about n^1.5 FLOPS per n words of data. Most kernels do more like O(n), or maybe as high as O(n lg n), work for O(n) data, which requires much higher memory bandwidth than a good LINPACK number does.
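For reference, the bookkeeping looks roughly like this (sketch; the operation count is the standard one for LU factorization plus the triangular solves):

    def hpl_gflops(n, seconds):
        # Operation count for solving a dense n x n system Ax=b via LU:
        # ~(2/3) n^3 for the factorization + 2 n^2 for the triangular solves.
        flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
        return flops / seconds / 1e9

    # Example: n = 100,000 solved in 1000 s -> ~667 GFLOP/s sustained.
    print(hpl_gflops(100_000, 1000.0))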

Furthermore, graph and sparse algorithms tend to do really badly, because the amount of work you're doing isn't able to hide the memory latency (think one FMA per A[B[i]] access - you might be able to stream the B[i] reads at full memory bandwidth, but the dependent A[x] accesses become a massive gather operation, which is extremely painful).
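In code, the pattern looks something like this toy sketch (numpy, arrays invented for illustration): the B reads stream nicely, but the dependent A[B[i]] reads turn into a random gather the prefetcher can't help with.

    import numpy as np

    n = 10_000_000
    A = np.random.rand(n)
    B = np.random.randint(0, n, size=n)   # indices in no particular order
    x = np.random.rand(n)

    # One multiply per element, but each A[B[i]] is potentially a cache miss:
    # y[i] = A[B[i]] * x[i]
    y = A[B] * x    # the fancy-indexed A[B] is the expensive gather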


> Meanwhile in machine learning, people also made the switch to FP16, BF16 and INT8 largely because of the memory wall

FP16 doesn't run any faster than mixed precision on Nvidia or any other platform (I have benchmarked GPUs, CPUs and TPUs). For matrix multiplication, computation is still the bottleneck due to N^3 computation vs. N^2 memory access.


With FP16 you can fit twice as many weights in cache, and also fetch twice as many weights from memory

Also, this depends on the size of the matrix.
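Rough sketch of that size dependence (idealized model where each matrix is read or written exactly once, which real kernels only approach with good blocking):

    # Arithmetic intensity of an n x n x n matmul, in FLOPs per byte of DRAM traffic.
    def matmul_intensity(n, bytes_per_element):
        flops = 2 * n**3                             # one multiply + one add per inner step
        bytes_moved = 3 * n**2 * bytes_per_element   # read A and B, write C, each once
        return flops / bytes_moved

    for n in (64, 1024, 8192):
        # FP32 (4 bytes) vs FP16 (2 bytes): halving the element size doubles FLOP/byte,
        # and intensity also grows linearly with n, so big matmuls become compute bound.
        print(n, matmul_intensity(n, 4), matmul_intensity(n, 2))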


The 4090 provides over 80 TFLOPS of bog-standard raw FP32 compute, no tensor cores or any fancy instructions (though the figure does count each FMA as two FLOPs).


An SF book published in the 1950s (I have forgotten the title and author, sigh) featured a then-imagined supercomputer with

- 1M bits of storage

- A mile on a side

- Located in Buffalo, NY, and cooled by Niagara Falls (vacuum tubes, natch)

- Able to surveil every citizen in the nation for suspicious activity

No mention of clock speed, cache-line size, or instruction set. I guess SF writers aren't computer designers :-)


The writers could still turn out to be right. I am not sure we are making good use of all that hardware yet.


The only thing keeping us alive is that Skynet is an Electron app.


>I am not sure we are making good use of all that hardware yet.

Dunno, working out the color of 8 million pixels every 6 ms seems pretty good to me.


True, though I was talking about the AI workloads.


Damn high-level programming languages. Just go back to assembly, that'll fix everything.


Yeah how dare they. ;)

Truth be told though, I believe we are in for some more innovation in this area, especially with the rise of ARM lately. It's always kinda funny how these mega-machines we have still manage to stutter.


> It's always kinda funny how these mega-machines we have still manage to stutter.

I just figured that's the trade-off for general-purpose computing. We can optimize for whatever covers a wide swath of use cases, but we can't optimize for everything, and some goals will remain mutually exclusive. Mind you, I'm no expert; I'm just extrapolating from how differently CPUs and GPUs are optimized these days and historically.


Nah, you are very correct. I just feel that our compilers could still do a better job with general-purpose code, because they are usually completely blind to the systems the compiled code runs on (for example, I/O takes orders of magnitude more time, and that could be used to auto-parallelize code; but I know compiler authors will never auto-spawn threads). I feel this can be improved a lot, but for various (and likely good) reasons our tooling is not as good as it could be.


Good points. I wish I knew more about compilers sometimes, and this is one of those times.


> The writers back then probably thought 60 TFLOPS was such a ridiculously high sci-fi number for the world-ending AI that nothing could possibly come close to it, and 20 years later consumers can have twice that computing power in their home PCs.

If you look at the Top500 supercomputer list of the time [1], they actually nailed it: the #1 system at the time hit a peak of about 40 TFLOPS.

[1] https://www.top500.org/lists/top500/2003/06/


Isn’t it a great reminder that technology has not progressed enough to even take advantage of 60 TFLOPS?


> to even take advantage of 60 TFLOPS

Rendering Electron apps and mining Dogecoin?


In scientific computing, this has become a serious problem. Because of the memory wall, many important algorithms can never take advantage of 60 TFLOPS due to their low arithmetic intensity. The only real solutions are (1) stop using these algorithms, or (2) stop using von Neumann computers (e.g. in-memory computing). The stop-gap solution is HBM or 3D V-Cache.
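To see why, compare the bandwidth a 60 TFLOPS machine would need for a low-intensity kernel against what memory systems actually deliver (illustrative numbers only):

    peak      = 60e12    # FLOP/s
    intensity = 0.25     # FLOP/byte, typical of sparse / stencil code

    # Bandwidth needed to keep the FPUs busy at that intensity:
    required_bw = peak / intensity     # = 2.4e14 B/s = 240 TB/s
    print(required_bw / 1e12)          # vs. roughly a few TB/s of HBM on current hardware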


> In other mind-boggling movie lore, an RTX 4090 has twice the TFLOPS computing power of the killer AI Skynet used in Terminator 3 [1].

That isn't really mind boggling since you are quoting fiction.


>That isn't really mind boggling since you are quoting fiction

Fiction of the past plays an important role in seeing how far tech has progressed: what was once fiction is now a commodity.


How does this opinion explain calling a made-up number "mind boggling"?


What makes you think it's a made-up number? Just because it's featured in a movie doesn't mean the number can't be grounded in the reality of the era. Yes, there are exaggerations, but big-budget movies usually hire technical consultants to aid writers, prop builders and art directors in setting scenes that look realistic, and don't just pull random numbers out of thin air, which could be embarrassing mistakes for tech-savvy moviegoers.

60 TFLOPS is the equivalent of 10,000 PS2s of processing power (the most powerful console at the time), or 2x the NEC Earth Simulator (the most powerful supercomputer at the time), which seems about right for what would be a virus taking over all the compute power of the DoD.

So definitely the writers consulted with some people who knew something about computers to get a figure grounded in reality at the time, and didn't just pull a random number out of thin air - especially since, at the time, even average Joes were hearing about FLOPS as a measure of compute power, advertised in PC and gaming-console specs - so naturally they had to come up with a number that seemed very impressive but was also believable.
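The back-of-envelope math does roughly check out (the peak figures below are the commonly cited ones, so treat them as approximate):

    skynet    = 60e12    # 60 TFLOP/s, the figure quoted in the film
    ps2       = 6.2e9    # PS2 Emotion Engine peak, ~6.2 GFLOP/s
    earth_sim = 35.9e12  # NEC Earth Simulator, ~35.9 TFLOP/s LINPACK (2003)

    print(skynet / ps2)        # ~9700, i.e. roughly 10,000 PS2s
    print(skynet / earth_sim)  # ~1.7, i.e. a bit under 2x the Earth Simulator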


> What makes you think it's a made-up number?

It's a fictional prediction of the future. Even people trying to predict the future get it wrong. People being wrong isn't mind blowing.

> Yes, there are exaggerations, but big-budget movies usually hire technical consultants to aid writers,

Is that what happened here?

> prop builders and art directors in setting scenes that look realistic

That has nothing to do with the script

> don't just pull random numbers out of thin air

Yes they do

> 60 TFLOPS is the equivalent of 10,000 PS2s of processing power

60 TFLOPS was also about where the biggest supercomputer already was in 2003, so this was silly even judged against the present day. That's fine, but it isn't "mind boggling" to base it on fiction.

https://en.wikipedia.org/wiki/TOP500

Why not say that what was the biggest supercomputer in the world 20 years ago is now matched by a home graphics card? That's actually mind-boggling. No need to live your life based off the fiction of someone else.

> which seems about right for what would be a virus taking over all the compute power of the DoD.

Why does that "seem about right"? Again, this is fiction vs. reality. That is a science-fiction scenario that should make no sense to anyone experienced with computers. Why would a virus need a supercomputer?

> So definitely the writers consulted with some people

No, you've moved from "technical consultants exist" to "definitely the writers consulted people". What are you basing this on?

> so naturally they had to come up with a number that seemed very impressive but was also believable.

Which part of the made-up number is mind-blowing again?

Reality is 'mind blowing' enough, there is no need to mix reality and fiction.


It's not fiction that the writers thought 60 TFLOPS would be huge today.


It kinda seems like the writers (writer?) either consulted someone or did the math themselves and calculated where pretty powerful computers would be by now, and that the T-800 was more of a mid-tier model, with higher-tier models (or AIs running in data centers) that individually ran on 4090-level power and above.


So what?

It's a made up number that's supposed to sound fancy. It is for people who don't know much about computers. It's probably just there because people have heard the prefix 'tera', but wouldn't know what 'exa' or any other prefix means.

It doesn't mean anything. Documentation made by people having more pages than a CPU - which was also made by people - is interesting, because those are real things made for specific purposes, not a number pulled out of thin air for fiction.

There is nothing "mind blowing" about an uninformed person just being wrong. Is it "mind blowing" that the original Terminator was supposed to run on a 6502?

In Johnny Mnemonic, 320 GB was supposed to be a lot of data in 2021, when in reality it now costs about the same as lunch for two people.

https://www.imdb.com/title/tt0113481/plotsummary/



