
In other mind-boggling stats: A single modern games console has more RAM than all of the Atari 2600s ever manufactured put together.



>A single modern games console has more RAM than all of the Atari 2600s ever manufactured put together

In other mind-boggling movie lore, an RTX 4090 has twice the TFLOPS of the killer AI Skynet used in Terminator 3[1].

The writers back then probably thought 60 TFLOPS was such a ridiculously high sci-fi number for the world-ending AI that nothing could possibly come close to it, and 20 years later consumers can have twice that computing power in their home PCs.

It's also a nice reminder of how far technology has progressed over the last decades, even if the pace has slowed in recent years.

[1]https://youtu.be/_Wlsd9mljiU?t=155


> RTX 4090 has twice the TFLOPS of the killer AI Skynet used in Terminator 3. The writers back then probably thought 60 TFLOPS was such a ridiculously high sci-fi number

A fact worth noting, but routinely ignored in the popular press, is that these astronomical peak floating-point ratings of modern hardware are only achievable for a small selection of algorithms and problems. In practice, realizable performance is often much worse; efficiency can be as low as 1%.

First, not all algorithms are well suited to the von Neumann architecture. Today, the memory wall is higher than ever. The machine balance (FLOPS vs. load/store) of modern hardware is around 100:1. To maximize floating-point throughput, all data must fit in cache, which requires the algorithm to have a high level of data reuse via cache blocking. Some algorithms do this especially well, like dense linear algebra (the Top500 LINPACK benchmark). Other algorithms are less compatible with this paradigm; they're going to be slow no matter how good the optimization is. Examples include many iterative physics simulation problems, sparse matrix code, and graph algorithms (the Top500 HPCG benchmark). In the Top500 list, HPCG is usually about 1% as fast as LINPACK. The best-optimized simulation code can perhaps reach 20% of Rpeak.
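
To make the machine-balance argument concrete, here's a minimal roofline-model sketch in Python; the peak-throughput and bandwidth figures, and the per-kernel arithmetic intensities, are illustrative assumptions rather than measurements of any particular chip:

    # Roofline model: attainable FLOPS = min(peak compute, bandwidth * arithmetic intensity).
    peak_gflops = 10_000.0      # assumed peak compute: 10 TFLOPS
    bandwidth_gbs = 800.0       # assumed memory bandwidth: 800 GB/s

    # Machine balance in FLOPs per 8-byte word loaded (~100:1 as described above):
    print(f"machine balance ~{peak_gflops / (bandwidth_gbs / 8):.0f} FLOPs per word")

    def attainable_gflops(flops_per_byte):
        return min(peak_gflops, bandwidth_gbs * flops_per_byte)

    for kernel, intensity in [("dense GEMM, cache-blocked", 50.0),
                              ("stencil update",             1.0),
                              ("sparse matrix-vector",       0.25)]:
        g = attainable_gflops(intensity)
        print(f"{kernel:26s}: {g:8.0f} GFLOPS ({100 * g / peak_gflops:.0f}% of peak)")

Under these assumptions the sparse kernel tops out in the low single digits of percent of peak no matter how well it's tuned, which is roughly the LINPACK-vs-HPCG gap described above.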

This is why both Intel and AMD have started offering special large-cache CPUs, using either on-package HBM or 3D V-Cache, all targeted at HPC. Meanwhile, in machine learning, people also made the switch to FP16, BF16 and INT8 largely because of the memory wall. Inference is a relatively cache-friendly problem; many HPC simulations are much worse in this respect.

Next, even if the algorithm is well suited to cache blocking, peak datasheet performance is usually still unobtainable, because it's often calculated from the peak FMA throughput. This is unrealistic in real problems: you can't just do everything in FMA, and 70% is a more realistic target. In the worst case you get 50% of the performance (disappointing, but not as bad as the memory wall). In contrast to datasheet peak performance, the LINPACK figure Rmax is measured by a real benchmark.


When you measure peak FLOPS, especially "my desktop computer has X FLOPS", you're generally computing N FMA units * f frequency: a theoretical maximum. This number, as you note, has basically no relation to anything practical: we've long been at the point where our ability to stamp out ALUs greatly exceeds our ability to keep those units fed with useful data.
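
As a concrete sketch of that arithmetic, here is the back-of-envelope calculation using the commonly quoted RTX 4090 figures (the lane count and boost clock are assumptions taken from the public spec sheet):

    # "Datasheet peak" = FP32 lanes * 2 FLOPs per FMA * clock frequency.
    fp32_lanes = 16_384          # CUDA cores, one FP32 FMA per clock each
    boost_clock_hz = 2.52e9      # advertised boost clock
    flops_per_fma = 2            # a fused multiply-add counts as two FLOPs

    peak_tflops = fp32_lanes * flops_per_fma * boost_clock_hz / 1e12
    print(f"theoretical FP32 peak: {peak_tflops:.1f} TFLOPS")   # ~82.6 TFLOPS

That is where the "over 80 TFLOPS" figure quoted elsewhere in this thread comes from.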

Top500 measures FLOPS on a different basis. Essentially, it sees how long it takes to solve an N×N system Ax=b (where N is large enough to stress your entire machine) and uses a formula to convert N and the wall-clock time into FLOPS. However, this kind of dense linear algebra is an unusually computation-heavy benchmark--you do roughly n^1.5 FLOPS per n words of data. Most kernels do more like O(n), or maybe as high as O(n lg n), work for O(n) data, which requires much higher memory bandwidth than a good LINPACK number does.
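
The standard LINPACK operation count makes that ratio easy to see; the problem size below is just an assumed example:

    # HPL flop count for solving a dense n x n system via LU factorization:
    # ~(2/3)n^3 + 2n^2 FLOPs over n^2 matrix entries, i.e. O(n) FLOPs per word
    # (equivalently ~m^1.5 FLOPs for m words of data).
    n = 100_000                       # assumed problem size
    flops = (2 / 3) * n**3 + 2 * n**2
    words = n * n                     # 8-byte double-precision entries
    print(f"{flops:.2e} FLOPs over {words:.2e} words "
          f"-> ~{flops / words:,.0f} FLOPs per word of data")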

Furthermore, graph and sparse algorithms tend to do really badly because the amount of work you're doing isn't able to hide the memory latency (think one FMA per A[B[i]] access--you might be able to stream the B[i] reads at full memory bandwidth, but you end up with a massive gather operation for the A[x] accesses, which is extremely painful).
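
A tiny NumPy sketch of the access pattern being described (the sizes and data are arbitrary):

    import numpy as np

    n = 10_000_000
    A = np.random.rand(n)
    x = np.random.rand(n)
    B = np.random.randint(0, n, size=n)   # random indices

    streaming = np.dot(A, x)      # sequential reads: prefetcher-friendly, bandwidth-bound
    gathered  = np.dot(A[B], x)   # A[B] is a random gather: latency-bound, cache-hostile

On most hardware the second line is dramatically slower despite doing the same number of FLOPs; almost all of the extra time goes into the A[B] gather.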


> Meanwhile in machine learning, people also made the switch to FP16, BF16 and INT8 largely because of the memory wall

FP16 doesn't work any faster than mixed precision on Nvidia or any other platform (I have benchmarked GPUs, CPUs and TPUs). For matrix multiplication, computation is still the bottleneck due to N^3 computation vs. N^2 memory access.


With FP16 you can fit twice as many weights in cache, and also fetch twice as many weights from memory.

Also, this depends on the size of the matrix.
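
A quick arithmetic-intensity estimate shows why both comments can be right and why matrix size matters; the counts below assume a plain N×N multiply with A and B read once and C written once:

    # Arithmetic intensity of an N x N matrix multiply:
    # ~2*N^3 FLOPs vs ~3*N^2 values moved, so intensity grows linearly with N.
    def flops_per_byte(n, bytes_per_value):
        flops = 2 * n**3
        traffic_bytes = 3 * n**2 * bytes_per_value
        return flops / traffic_bytes

    for n in (256, 1024, 4096):
        print(f"N={n:5d}: {flops_per_byte(n, 4):7.0f} FLOPs/byte in FP32, "
              f"{flops_per_byte(n, 2):7.0f} in FP16")

Large matrices are compute-bound in either precision, which matches the benchmark observation above; halving the element size mainly helps cache footprint and bandwidth, which matters more for small matrices.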


The 4090 provides over 80 TFLOPS of bog-standard raw FP32 compute: no tensor cores, MAD/FMA or any fancy instructions.


An SF book published in the 1950s (I have forgotten title and author, sigh) featured a then-imagined supercomputer with

- 1M bits of storage

- A mile on a side

- Located in Buffalo, NY, and cooled by Niagara Falls (vacuum tubes, natch)

- Able to surveil every citizen in the nation for suspicious activity

No mention of clock speed, cache-line size, or instruction set. I guess SF writers aren't computer designers :-)


The writers could still turn out to be right. I am not sure we are making good use of all that hardware yet.


The only thing keeping us alive is that Skynet is an Electron app.


>I am not sure we are making good use of all that hardware yet.

Dunno, working out the color of 8 million pixels every 6 ms seems pretty good to me.


True, though I was talking about the AI workloads.


Damn high-level programming languages. Just go back to assembly; that'll fix everything.


Yeah how dare they. ;)

Truth be told though, I believe we are in for some more innovation in the area, especially with the advent of ARM lately. It's always kinda funny how these mega-machines we have still manage to stutter.


> It's always kinda funny how these mega-machines we have still manage to stutter.

I just figured that’s the trade-off for general-purpose computing. We can optimize for whatever covers the wide swath of use cases, but we can’t optimize for everything, and some goals will continue to be mutually exclusive. Mind you, I’m no expert; I’m just extrapolating from how differently CPUs and GPUs are optimized these days and historically.


Nah, you are very correct. I just feel that our compilers could still do a better job with general-purpose code, because they are usually completely blind to the systems the compiled code runs on (for example, I/O takes orders of magnitude more time, and that could be used to auto-parallelize code; but I know compiler authors will never auto-spawn threads). I feel this can be improved a lot, but for various (and likely good) reasons our tooling is not as good as it could be.


Good points. I wish I knew more about compilers sometimes, and this is one of those times.


> The writers back then probably thought 60 TFLOPS was such a ridiculously high sci-fi number for the world-ending AI that nothing could possibly come close to it, and 20 years later consumers can have twice that computing power in their home PCs.

If you look at the Top500 supercomputer list of the time [1], they actually nailed it: the #1 system at the time hit a peak of 40 TFLOPS.

[1] https://www.top500.org/lists/top500/2003/06/


Isn’t it a great reminder that technology has not progressed enough to even take advantage of 60 TFLOPS?


> to even take advantage of 60TFLOPS

Rendering Electron apps and mining Dogecoin?


In scientific computing, this has become a serious problem. Because of the memory wall, many important algorithms can never take advantage of 60 TFLOPS due to their low arithmetic intensity. The only solutions are to (1) stop using these algorithms, or (2) stop using von Neumann computers (e.g. switch to in-memory computing). The stop-gap solution is HBM or 3D V-Cache.


> In other mind-boggling movie lore, an RTX 4090 has twice the TFLOPS of the killer AI Skynet used in Terminator 3[1].

That isn't really mind boggling since you are quoting fiction.


>That isn't really mind boggling since you are quoting fiction

Fiction of the past plays an important role in showing how far tech has progressed: what was once fiction is now a commodity.


How does this opinion explain calling a made-up number "mind boggling"?


What makes you think it's a made-up number? Just because it's been featured in a movie doesn't mean the number can't be grounded in the reality of the era. Yes, there are exaggerations, but big-budget movies usually hire technical consultants to aid writers, prop builders and art directors with setting scenes that look realistic, and they don't just pull random numbers out of thin air, which could be an embarrassing mistake for tech-savvy moviegoers.

60 TFLOPS is the equivalent of 10,000 PS2s of processing power, the most powerful console at the time, or 2x the NEC Earth Simulator, the most powerful supercomputer at the time, which seems about right for what would be a virus taking over all the compute power of the DoD.
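
As a rough sanity check on those ratios (using the commonly quoted spec figures: ~6.2 GFLOPS for the PS2's Emotion Engine and the Earth Simulator's measured result on the June 2003 Top500):

    ps2_gflops = 6.2                  # Emotion Engine's advertised peak
    earth_simulator_tflops = 35.86    # Earth Simulator Rmax, June 2003 Top500
    skynet_tflops = 60.0

    print(f"~{skynet_tflops * 1000 / ps2_gflops:,.0f} PS2s")                  # ~9,700
    print(f"~{skynet_tflops / earth_simulator_tflops:.1f}x Earth Simulator")  # ~1.7x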

So definitely the writers consulted with some people who knew something about computers to get a figure grounded in the reality of the time, rather than just pulling a random number out of thin air, especially since at the time even average Joes were hearing about FLOPS as a measure of compute power, advertised in PC and gaming console specs. So naturally they had to come up with a number that seemed very impressive but was also believable.


> What makes you think it's a made-up number?

It's a fictional prediction of the future. Even people trying to predict the future get it wrong. People being wrong isn't mind blowing.

> Yes, there are exaggerations, but big-budget movies usually hire technical consultants to aid writers,

Is that what happened here?

> prop builders and art directors with setting scenes that look realistic

That has nothing to do with the script

> don't just pull random numbers out of thin air

Yes they do

> 60 TFLOPS is the equivalent of 10,000 PS2s of processing power

60 TFLOPS was also about where the biggest supercomputer already was in 2003, so this was silly even using the present. That's fine, but it isn't "mind boggling" to base it on fiction.

https://en.wikipedia.org/wiki/TOP500

Why not say that the biggest supercomputer in the world from 20 years ago is now where a home graphics card is? That's actually mind boggling. No need to live your life based on someone else's fiction.

> which seems about right for what would be a virus taking over all the compute power of the DoD.

Why does that "seem about right"? Again, this is fiction vs. reality. That is a science-fiction scenario that should make no sense to anyone experienced with computers. Why would a virus need a supercomputer?

> So definitely the writers consulted with some people

No, you've moved from 'technical consultants exist' to 'definitely the writers consulted people'. What are you basing this on?

> so naturally they had to come up with a number that seemed very impressive but was also believable.

Which part of the made-up number is mind-blowing again?

Reality is 'mind-blowing' enough; there is no need to mix reality and fiction.


It's not fiction that the writers thought 60 TFLOPS would be huge today.


It kinda seems like the writers (writer?) either consulted with someone or did the math themselves and calculated where pretty powerful computers would be by now, and that the T-800 was more of a mid-tier model, with higher-tier models (or AIs that ran in data centers) that individually ran on 4090 power and above.


So what?

It's a made up number that's supposed to sound fancy. It is for people who don't know much about computers. It's probably just there because people have heard the prefix 'tera', but wouldn't know what 'exa' or any other prefix means.

It doesn't mean anything. Documentation made by people having more pages than a CPU which was also made by people is interesting because these are real things made for specific purposes, not a number pulled out of thin air for fiction.

There is nothing 'mind blowing' about an uninformed person just being wrong. Is it 'mind blowing' that the original Terminator was supposed to run on a 6502?

In Johnny Mnemonic, 320 GB was supposed to be a lot of data in 2021, when it costs about the same as lunch for two people.

https://www.imdb.com/title/tt0113481/plotsummary/


> Anyway, at the time I did these measurements, my 4.2 GHz kaby lake had the fastest single-threaded performance of any machine you could buy but had worse latency than a quick machine from the 70s (roughly 6x worse than an Apple 2), which seems a bit curious. To figure out where the latency comes from, I started measuring keyboard latency because that’s the first part of the pipeline. My plan was to look at the end-to-end pipeline and start at the beginning, ruling out keyboard latency as a real source of latency.

> But it turns out keyboard latency is significant! I was surprised to find that the median keyboard I tested has more latency than the entire end-to-end pipeline of the Apple 2. If this doesn’t immediately strike you as absurd, consider that an Apple 2 has 3500 transistors running at 1MHz and an Atmel employee estimates that the core used in a number of high-end keyboards today has 80k transistors running at 16MHz. That's 20x the transistors running at 16x the clock speed -- keyboards are often more powerful than entire computers from the 70s and 80s! And yet, the median keyboard today adds as much latency as the entire end-to-end pipeline as a fast machine from the 70s.

https://danluu.com/keyboard-latency/


> https://danluu.com/keyboard-latency/

This might be a bit off topic, but it was surprising to see a Logitech K120 have the same latency as a Unicomp Model M or other keyboards that are 5-10x more expensive than it.

No wonder I liked using it for work years ago: as far as membrane keyboards go, it's pretty dependable and decently nice to use, definitely so for its price.


A USB-C charger has much more computing power than the Apollo Moon lander.

https://www.theverge.com/tldr/2020/2/11/21133119/usb-c-anker...


We'll have computronium soon if we carry on like this!


But it is seriously I/O deficient!


The measurement methodology seems a bit odd for the purpose of measuring the difference between old and new computers: if a large fraction of the latency measured is due to key travel, that's latency which is also present in the older computers (AFAICT a buckling spring has a lot more key travel before activation than the scissor-switch keys of the Apple and most laptop keyboards). Surely for the purposes of the comparison you would want to look at switch-activation-to-bus-activity latency.


Why have that kind of resource in a keyboard?

Some keyboards were made with 4-bit processors. I have yet to look one up, and perhaps I should.

Pretty much any 8-bit CPU would be luxurious. And low latency, due to the single task, respectable code density, and rapid state changes for interrupts.


That write-up is fantastic, but it's undated and probably from 2016/2017.


For reference, the Atari 2600 had 128 bytes of RAM, with about 30 million units sold.


I thought I sort of understood how computers work until I saw that.

I really can't figure out how to do a full-screen video game with its state in 128 bytes.


The program and assets are stored in a ROM cartridge, so only mutable data needs RAM.

Actually drawing things on screen depends on two things:

The first is racing the beam. The display output traces across the entire screen at 60 Hz, one scanline at a time. At no point does a complete image exist; instead, you just make sure to apply changes so that the machine can draw what needs to be drawn just before the beam traces that part of the image. You will need to cycle-count so that your program takes exactly the right time to execute every section, because you certainly won't have time for interrupts or other synchronization.

The second is using dedicated hardware, where you store the on-screen location, color and memory address of a sprite, and the hardware draws it for you. There is a very limited number of sprites available, which limits how much can happen on a single line.
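
For a sense of how tight "exactly the right time" is, here's a back-of-envelope cycle budget using the standard NTSC 2600 figures (1.19 MHz 6507, 262 lines per frame at 60 fps):

    cpu_hz = 1.19e6        # 6507 clock
    fps = 60
    scanlines = 262        # NTSC lines per frame, incl. vblank and overscan

    cycles_per_frame = cpu_hz / fps
    cycles_per_line = cycles_per_frame / scanlines
    print(f"~{cycles_per_frame:.0f} CPU cycles per frame, ~{cycles_per_line:.0f} per scanline")

That works out to roughly 76 CPU cycles, only a couple dozen instructions, to set up everything the video chip needs for the next line.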


There was no framebuffer in those consoles [1]. So you pretty much only have to store game state and some auxiliary data in those 128 bytes, which starts sounding a lot easier.

[1] https://en.wikipedia.org/wiki/Television_Interface_Adaptor
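
A quick size check shows why a framebuffer was never on the table; the resolution below is the typical visible 2600 picture and is assumed here just for illustration:

    width, height = 160, 192      # typical visible resolution
    bits_per_pixel = 1            # even a monochrome bitmap

    framebuffer_bytes = width * height * bits_per_pixel // 8
    print(f"{framebuffer_bytes} bytes needed vs. 128 bytes of RAM "
          f"({framebuffer_bytes // 128}x too big)")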


Or, a lot harder, since your code can only draw a line at a time, not work with a whole frame buffer!


Modern games now have programmers deal with drawing a frame a pixel at a time when writing shaders. The GPUs themselves render a tile at a time and not the whole buffer.


Look up 'racing the beam' if you haven't before. The answer is... you can't! It didn't have a frame buffer and lines had to be written to the display one at a time. There was a lot of sprite flicker as many games had more on screen than the console could actually display in one frame.


Pac-Man was horrible.


There is no frame buffer. The graphics are all drawn by manipulating a few registers in the video chip.

Everything is scan lines and cycles. You get a one-bit-per-pixel, 40-pixel-wide background, a couple of single-color, 8-bit-wide sprites, a couple more very narrow sprites (the missiles and the ball), and that is pretty much it. A lot can be done by simply changing a color register at specific times too. Reusing sprites happens regularly as well. (A sprite drawn at the left of the screen can be repositioned to the right to be seen again. That is the "racing the beam" part you may have heard people mention.)

Most of the CPU run time available for each frame is spent generating the display a scan line at a time.

The real game happens during the vertical blanking period.

Almost everything comes from ROM, leaving RAM for game state and the few objects that may need to be dynamic, and even those are in ROM when there is room for all the states.

It is an odd feeling when you run out of room. The phrase "I used every BIT of RAM" is literal! It happened to me once: no more bits, and I had to either leave out a feature or take a speed penalty by packing multiple states into single bytes.
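
To illustrate the packing trade-off mentioned above, here is a Python sketch (not 6502 code; the idea is the same, but on the 2600 the extra shifting and masking costs real cycles):

    # Pack four 2-bit states (values 0-3) into one byte.
    def set_state(packed, slot, value):
        shift = slot * 2
        return (packed & ~(0b11 << shift) & 0xFF) | ((value & 0b11) << shift)

    def get_state(packed, slot):
        return (packed >> (slot * 2)) & 0b11

    byte = 0
    byte = set_state(byte, 2, 3)      # object 2 enters state 3
    assert get_state(byte, 2) == 3
    assert get_state(byte, 0) == 0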


It's basically only the state that's in RAM; the game code is in ROM on the cartridge (you can have up to 4 KB of ROM before you have to rely on bank-switching tricks). Video on the 2600 is weird: there isn't any video memory to speak of, so you basically set up the video chip line by line in code.


Great video [1] on how some clever tricks are used to stay within memory constraints.

[1]: https://www.youtube.com/watch?v=sw0VfmXKq54



Would be more impressive if the 2600 had more than 128 bytes of RAM--that's bytes, not KB.



