Not an intuitive thing, but the data is fascinating. A couple of notes for people who are confused by it:
1) The 'ns' next to the box is a graph legend not a data label (normally that would be in a box labeled legend to distinguish it from graph data)
2) The weird box and rectangle thing on the top is a slider, I didn't notice that until I was looking at the code and said "what slider?"
3) The only changes from 2005 to present are storage and networking speeds.
What item #3 tells you is that any performance gains in the last decade and a half you've experienced have been driven by multi-core, not faster processors. And that means Amdahl's Law is more important than Moore's Law these days.
"Seek + rotational delay halves every 10 years" just isn't true. We got rid of the >7200 drives and are basically sitting still on that metric. And hard drives weren't at 400MB/s in 2012 either. (The disk bandwidth section also cites a 2012 presentation with correct numbers, but something went wrong in translation.)
Yeah, I was wondering where the heck 2ms disk seek times came from. That's not a thing and will never be a thing; disks only spin so fast, and nobody even uses >7200RPM drives any more (because we've all moved to SSDs instead).
They're used, but rare because they got to a weird position in between slower drives that are cheaper to produce and operate and SSDs. E.g. 15K RPM drives were a thing but while you can still buy some models they're typically more expensive than SSDs, so you mostly buy them to replace drives in existing arrays.
I have no opinion on the matter, just pointing out that your comment doesn't follow from the previous one. But I imagine that when they said "nobody" uses faster disks, they were exaggerating a bit. The question is whether the default set of "latency numbers everyone should know" should reference SSDs or uncommon spinning disks.
I would start by going through the Anandtech reviews of the major processor and storage announcements. They generally do a great job of benchmarking and go into pretty good detail. Datasheets from manufacturers would also be useful although not all data is routinely provided.
The interesting part is that it says nothing about performance. Single-core benchmarks have gotten significantly faster over that time period.
If anything, the takeaway is that things like memory/cache access, branch prediction failures, and mutexes have gotten more expensive. They didn't scale while the rest of the CPU sped up!
But even that isn't really true, because it doesn't tell you anything about branch prediction hit/miss rate, memory prefetching, instruction reordering, hyperthreading, etc. For a fair historic comparison you'd have to look at something like the cumulative time the execution units of a core are stalled due to waiting for something.
Honestly, the actual numbers aren't even all that relevant. It isn't really about the time individual operations take, but about the budget you have to avoid a latency penalty.
This is a valid point, and this is an overly long response because it distracts me from watching frightening current events.
There are two ways to look at these sorts of numbers, "CPU performance" and "Systems performance". To give an example from my history:
NetApp was dealing with the Pentium 4 being slower than the Pentium 3 and looking at how that could be. All of the performance numbers said it should be faster. They had an excellent OS group that I was supporting, with top notch engineers and a really great performance analysis team as well, and the results of their work were illuminating!
Doing a lot of storage (and database btw) code means "chasing pointers." That is where you get a pointer, and then follow it to get the structure it points to and then follow a pointer in that structure to still another structure in memory. That results in a lot of memory access.
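To make "chasing pointers" concrete, here's a minimal illustration (not NetApp's code, just the access pattern): every load depends on the previous one, so each hop down the chain can be a fresh cache miss.

/* Pointer chasing: each access depends on the previous load, so every
 * hop down the chain can stall on a cache or memory miss. */
#include <stddef.h>

struct node {
    struct node *next;   /* pointer to the next structure in memory */
    long payload;
};

long walk_chain(const struct node *head) {
    long sum = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->payload;   /* can't start the next hop until n->next arrives */
    return sum;
}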
The Pentium 4 had been "optimized for video streaming" (that was the thing Intel was highlighting about it and benchmarking it with), in part because videos are sequential memory access and just integer computation when decoding. So good sequential performance and good integer performance give you good results when benchmarking video playback.
The other thing they did was they changed the cache line size from 64 bytes to 128 bytes. The reason they did that is interesting too.
We like to think of things a computer does as "operations" and say "this operation takes x seconds, so I can do 1/x operations per second." And that kind of works, except for something I call "channel semantics" (which may not be the official name for it, but it's in queuing theory somewhere :-).
Channel semantics have two performance metrics, one is how much bandwidth (in bytes/second) a channel has, and the other is what is the maximum channel operation rate (COR) in terms of transactions per second. Most engineers before 2005 or so, ran into this with disk drives.
If you look at a serial ATA, aka SATA, drive, it was connected to the computer with a "6 Gb" SATA interface. Serial channels encode both data and control bits into the stream, so the actual data that goes through a 6 gigabit line is at most 600 MB/s when the encoding is 10 bits on the channel for every 8 bits sent (called 8b/10b encoding, for 8 data bits per 10 channel bits, or bauds). That means that the channel bandwidth of a SATA drive is 600 MB per second. But do you get that? It depends.
The other thing about spinning rust is that the data is physically located around the disk; each concentric ring of data is a track, and moving from track to track (seeking) takes time. Further, you have to tell the disk what track and sector you want, so you have to send it some context. So if you take the "average" seek time, say 10 ms, then the channel operation rate (COR) is 1/0.010, or 100 operations per second.
So let's say you're reading 512 byte (1/2 kB) sectors from random places on the disk; then you can read 100 of them per second. But wait, only 100? That would mean you are only transferring 50 kB per second from the disk. What happened to the 600 MB?
Well, as it turns out, your disk is slow when randomly accessed. It can be faster if you access everything sequentially, because 1) the heads don't have to seek as often, and 2) the disk controller can make guesses about what you are going to ask for next. You can also increase the size of your reads (since you have extra bandwidth available), so if you read, say, 4 kB per request, then 100 x 4 kB is 400 kB/second. An 8-fold increase just by changing the request size. Of course the reverse is also true: if you were reading 10 MB per read, at 100 operations per second that would be 1000 MB per second, which is 400 MB more than your available bandwidth on the channel!
So when your channel request rate is faster than the COR and/or the data size requests are greater than the available bandwidth, you are "channel limited" and you won't get any more out of the disk no matter how much faster the source of requests improves its "performance" in terms of requests/second.
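A quick back-of-the-envelope sketch of that arithmetic (the numbers are the ones from the comment above; the little model itself is mine, not anyone's real code):

/* Back-of-the-envelope model of channel semantics: effective throughput
 * is limited by BOTH the operation rate (COR) and the raw bandwidth. */
#include <stdio.h>

int main(void) {
    /* SATA: 6 Gb/s on the wire, 8b/10b encoded -> 600 MB/s of payload. */
    double line_rate_bits = 6e9;
    double bandwidth = line_rate_bits * 8.0 / 10.0 / 8.0;   /* bytes/s  */

    double avg_seek_s = 0.010;              /* 10 ms average seek       */
    double cor = 1.0 / avg_seek_s;          /* ~100 operations/second   */

    double sizes[] = { 512, 4096, 10e6 };   /* bytes per request        */
    for (int i = 0; i < 3; i++) {
        double by_ops = cor * sizes[i];     /* what the COR allows      */
        double eff = by_ops < bandwidth ? by_ops : bandwidth;
        printf("request %8.0f B -> %10.0f B/s (channel limit %.0f B/s)\n",
               sizes[i], eff, bandwidth);
    }
    return 0;
}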
So back to our story.
Cache lines are read in whenever you attempt to access memory that is not already in the cache. Some entry in the cache is "retired" (which means overwritten, or, if it has been modified, written out first and then overwritten) and the new data is read in.
The memory architecture of the P4 has a 64 bit memory bus (72 data bits if you have ECC memory). That means every time you fetch a new cache line, the CPU's memory controller would do two memory requests.
Guess what? The memory bus on a modern CPU is a channel (they are even called "memory channels" in most documentation) and is bound by channel semantics. And while Intel often publishes its "memory bandwidth" number, it rarely publishes its channel operation limits.
The memory controller on the P4 was an improvement over the P3, but it didn't have double the operation rate of the P3 (it was something like 20% faster as I recall, but don't quote me on that). But the micro-architecture of the cache doubled the number of memory transactions for the same workload. This was especially painful on code that was pointer chasing, because the next pointer in the chain shows up in the first 64 bytes, which means the second 64 bytes the cache fetched for you are worthless; you'll never look at them.
As a result, on the same workload, the P3 system was faster than the P4 even though on a spec basis the P4's performance was higher than that of a P3.
After doing the analysis, some very careful code rewriting and non-portable C code, which packed more of the structures' data into the 128 byte "chunks" so that both 64 byte halves had useful data in them, improved the performance enough for that release. It was also that analysis that gave me confidence that recommending Opteron (aka Sledgehammer) from AMD, with its four memory controllers and thus 4x the memory operations per second, was going to vastly outperform anything Intel could offer. (Spoiler alert: it did :-))
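I wasn't in that codebase, but the kind of non-portable packing described here looks roughly like this in GCC-flavored C (all names invented for illustration):

/* Illustrative only: keep the chased pointer and enough hot data in a
 * single 128-byte-aligned block so that BOTH 64-byte halves the P4
 * fetches per line fill contain something you will actually use. */
#include <stdint.h>

struct hot_record {
    /* first 64-byte half (sizes assume a 64-bit build) */
    struct hot_record *next;      /* the pointer being chased           */
    uint64_t           key;
    uint8_t            hot_a[48];
    /* second 64-byte half: data needed right after following the link */
    uint8_t            hot_b[64];
} __attribute__((aligned(128)));  /* non-portable GCC/Clang attribute   */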
Bottom line, there are "performance" numbers and there is system performance, which are related, but not as linearly as Intel certainly would like.
Just for anyone curious, the way computer engineers / processor architects deal with the above is to measure the effective Instructions Per Cycle on a battery of benchmarks, captured traces/samples, synthetic workloads etc.
The P4 was a design that reached higher cycle rates, but at the cost of a lower IPC, such that the net performance was a regression. In particular, the pipeline depth was such that any dependency stalls or flushes eroded IPC severely.
There's a lot that went into this mistake. Having a high GHz number for marketing purposes was part of it, but also in this era Intel Labs had demonstrators of 10 GHz ALUs running, so there was some reason to think this architecture might prevail eventually. In any case, by the time it launched retail, Intel knew they had a lemon, but couldn't change course without building up a new microarchitecture.
Back in the early 2000's several of the P4 architects were quite active on comp.arch, and quite forthright about the mistakes they made. There were lots of interesting and insightful posts, but sadly I think you'd have to do some considerable digging in the deja archives or whatever it's called now to find the subthreads.
If you're interested in this sort of stuff, Hennessy and Patterson's Computer Architecture: A Quantitative Approach is the go to textbook.
Something else to consider is that the data doesn't seem consistent.
For example, the top right label, "Read 1,000,000 bytes sequentially from SSD" shows as 31,000ns when I load the page, but when I move the slider to the left and back to 2020, it shows a different value, 49,000ns.
This happens for about half of the data labels in Chrome, Firefox, and Safari. The initially shown values are never shown again once the slider is moved, even when it is moved back to the starting position.
> What item #3 tells you is that any performance gains in the last decade and a half you've experienced have been driven by multi-core, not faster processors. And that means Amdahl's Law is more important than Moore's Law these days.
Uh or storage and networking? Not sure why you would leave that out, since they're the bottleneck in many programs.
The slowest things are the first things you should optimize
Yeah... SSDs are so much faster than spinning disk it's not even funny.
I literally refuse to run a machine that boots its main OS from spinning disk anymore. The 60 bucks to throw an SSD into it is so incredibly cheap for what you get.
My wife's work gave her a (fairly basic but still fine) thinkpad - except they left the main drive as a fucking 5400rpm hdd. Then acted like assclowns when we repeatedly showed them that the machine is stalling on disk IO, while the rest of the system is doing diddly squat waiting around. I finally got tired of it and we "accidentally" spilled water on it, and somehow just the hdd stopped working (I left out the part where I'd removed it from the laptop first...). Then I just had her expense a new SSD and she no longer hates her work laptop.
Long story short - Storage speeds are incredible compared to what they were when I went to school (when 10k rpm was considered exorbitant)
The living room media/gaming machine at home is an 8 terabyte spinning rust. I didn't bother with a separate SSD boot partition.
It's currently been running for 23 days. Booting takes ~15 seconds even on spinning rust for a reasonable linux distro, so I'm not going to stress about those 15 seconds every couple of months.
               total        used        free      shared  buff/cache   available
Mem:            31Gi       4.6Gi        21Gi       158Mi       5.1Gi        25Gi
Swap:           37Gi       617Mi        36Gi
5.1 gigabytes mostly just file cache. As a result, everything opens essentially instantly. For a bit better experience, I did a:
find ~/minecraft/world -type f -exec cat {} > /dev/null \;
Hah, if you can fit the whole OS plus running applications easily in RAM, and you don't boot often - fine. But you're basically doing the same thing but with extra steps :P
I can get a 60gb ssd for under 20 bucks (I just checked 18.99 on amazon - even I'm a little surprised at how cheap that is)
That 32gb of ram is at least a crisp hundred (which also - fucking mind blowing. First time I ever added a part to a PC, it was to add 256mb of ram, and that was huge deal)
Yes, but the RAM is mostly for the gaming. It means I can easily switch between a snappy desktop orrr burning a bunch of RAM on minecraft worlds and clients or whatever random game is out there. So. A lot more useful than an SSD ;)
Also I'd be concerned about any SSD you can get for that price.
And, frankly, you don't need that much RAM to have a pleasant desktop experience. Most of that is just for the gaming. Just a gig or two of file cache for the hot stuff. Which is what your OS usually does anyway.
The tower my workplace bought me is also spinning rust, and they for some reason put an enormous amount of RAM in it too. Only 5 gigs are actually being used for caching. The OS can't figure out what to do with the rest after having loaded, I suppose, the browser (a few hundred megs), VMWare, OpenOffice etc. Even if I was running something more bloaty than Mate I doubt I could find a way to burn more than another gigabyte :)
I mean... at the risk of asking a loaded question - have you tried an ssd?
Because, I'm also spoiled on RAM so I don't have to pick, but if I haaad to pick - I'd take a machine with 8gb of RAM and an SSD over 32gb of RAM and an HDD almost any day (actually - literally any day except when I'm running our full service stack at work, which needs about 12gb of RAM to be workable)
If I have a bunch of media - sure, throw it on a spinning 2tb drive or something. I'm not saying don't use cheap storage, I'm just saying... try the SSD. It costs about the same as coffee for the week.
Yep. All the laptops in the house are SSD.
I really haven't noticed much of a difference, sorry.
Everything is linux too, but I find it hard to believe that would matter. I mean, NTFS is a notoriously slow file system (Hedgewars, Firefox and Minecraft all had to rearchitect to handle its horrendous handling of many small files), but the reasons for it I think would not change that much with SSD.
The thing is. How would you really notice a difference after first load of an app? SSD, spinning rust, linux, everything is mem caching these days... It's really hard to tell unless you're actually memory constrained.
It's quite likely that if I had to work on an enormous number of new files, I might notice something? But my software projects aren't like that - the games and videos are, but they wouldn't fit on an SSD anyway. That's why I have an 8 terabyte hard drive.
As I think about this, maybe it really is a Windows vs Linux thing. On raw read speeds, SSD is only about 2-3x faster than HDD. This is barely noticeable compared to the speed from RAM, or for almost anything you could reasonably launch. (Videos are huge, but those are basically streamed anyway)
But, random access reads of many small files scattered across the disc, obviously SSD will do much better. Sequential reads (large files, non-fragmented filesystem) it's not so clear.
Maybe it's simply that Linux, as well as being better at file caching and not wasting RAM, is also better at not being fragmented. Perhaps I have my OS filesystems to thank for my indifference.
On that front I guess it's also saving me money by allowing me to enjoy ginormous cheap discs :)
Hmmm another thought. Linux also (typically and traditionally) does a lot more library reuse thanks to the open source ecosystem. So, there's less DLL hell and fragmentation there. That could be another reason for my not noticing.
Most of my system libraries that any random app needs are already loaded and ready for use after boot.
Yeah - you might be onto something there - I suspect fragmentation matters a lot for the spinning disks. If you've got huge disks in there, you probably have mostly sequential reads for just about everything.
I mean - I also run the ginormous cheap discs. I don't think anyone is saying to load your media library all onto an ssd, it just doesn't matter enough. But the OS/applications are a big deal.
And 2-3x faster is low. For my wife's machine, moving from a 5400rpm spinning disk to a relatively cheap Samsung SSD made effective read times about 12x faster. Technically a 5400 rpm drive can do ~75 MB/s, but that mostly ignores seek time and assumes sequential reads. In reality, it was often doing less than 40 MB/s. The SSD caps at about 540 MB/s, but in reality it mostly does around 450.
It's the difference between having MS teams take ~5 seconds to load, vs more than a minute (locking most of the machine for that period as well).
I just launched Microsoft Teams for Linux on this spinning rust work machine here.
It launched a meeting link (letting me into the room) in between 2 and 3 seconds (one-onethousand two-onethousand)
I last launched Teams on this machine a couple of days ago.
For a "fair" comparison (keep in mind, again, the whole point I was making that RAM dominates these days), I then ran:
sync; echo 3 > /proc/sys/vm/drop_caches
And launched Teams again.
It opened in between 8-10 seconds with ALL OS caching of disc dropped.
I think what you're seeing here is Windows being helped along by SSD making it usable...
It reminds me of the change we made in Hedgewars. We had no idea it was taking 20-30 seconds to launch our gui in Windows since it opened instantaneously in Linux even with all caches flushed. We had to add lazy page loading to get reasonable performance on Windows - it was probably all about NTFS being insanely slow (and maybe crap at laying out data reasonably).
*Edit - I updated my time estimate upwards slightly for teams after testing the flush several times and my own counting skills.
The flush was also a good test of launching Firefox, since I did it once in between to update Nightly; it also opened very quickly post-flush. Barely noticeable delay.
I agree about this. SSD leads to impressively fast boot/early load times, and it might help if you have to run workloads that stress both RAM and I/O, but other than that it doesn't really matter if you're running a proper OS. And you can put a swap file/partition on a spinning disk without worrying about endurance or having less space for files. So the performance argument goes both ways, and the extra storage space of spinning rust is very convenient.
> First time I ever added a part to a PC, it was to add 256mb of ram, and that was huge deal
I remember the first memory stick I bought (a measly 4mb of RAM) and even wholesale via a friend who worked for a large PC repair shop, it was still $450 give or take.
Now 30 years later, my most recent memory purchase was eight 32gb ECC RDIMMs for about the same price. So not only did the amount of ram per dollar increase 64,000x, it’s also WAY faster and more resilient.
Linux still works OK on spinny disk (though exponentially worse every year as applications trend towards bloated Electron), but Win10 is nigh unusable on a spinny disk (technically works but very unpleasant). It works quite comfortably on ancient hardware though given a SSD and sufficient RAM.
You're a good husband. I did the same for my wife, and her 2012 core I7 HP Elitebook acted like it was ten years younger.
But I went a step further: the mother-in-law. She had a lower grade I3 laptop from around 2008 that took 20 minutes to boot Windows 10. It now boots in 15 seconds with a stock 512GB SSD.
I am being rewarded by both (but in very different ways).
Lusting after those 10k VelociRaptor drives back in high school... man I have a funny amount of nostalgia for, in hindsight, pretty average hardware once the world moved forward!
(your hard drive story is the story of my life, up to about 10 years ago. I have eliminated all but one hard drive from my house and that one doesn't spin most of the time)
Lately my vendor discussions have centered around how much work you can get done with a machine that has half a terabyte of RAM, 96 cores, and 8 NVMe SSDs (it's a lot). My college box: a 40MB disk, 4MB RAM, one 66MHz CPU.
My wife was a teacher and we had a similar story. They gave her a perfectly fine MacBook with spinning disk and the smallest memory possible (this was about 10 years ago). I upgraded the RAM to max and the disk to SSD and just put the original pieces away. She was happy, and when it came time to turn it in, we just put the original bits back in.
There is some improvement from processors being faster, as more instructions are done at once and more instructions get down towards that 1ns latency that L1 caches provide. You see it happen in real life, but the gains are small.
If you look at single-threaded processor benchmarks, CPUs are still getting faster. The new i9's are 20% faster than the next-fastest chips. It's definitely not doubling every two years but it's still noticeable.
SIMD - compressing has gotten faster, but (assuming OP is correct rather than just missing info) the reference algorithm didn't have room to take advantage of SIMD. The relevant improvements since 2010 or so mostly look like bandwidth improvements not latency, and coincide with increasing ubiquity of SIMD instructions and SIMD-friendly algorithms.
That wouldn't make sense - compression has very cache-friendly access patterns, and would benefit greatly from the observed improvements in memory bandwidth.
That surprises me to hear. I would expect it to jump around in ram a lot. And at higher compression settings, some compressors use a lot of ram, more than will fit in cache.
While the numbers may not be absolutely correct, as stated in other comments, in relative terms the conclusion from the trend is the same.
SSDs were the biggest upgrade in performance in the past 15 years. There are still lots of systems running on old HDDs.
And the chart shows why I have been calling attention to latency in computing for 10+ years. I hope the next stage is to reduce latency from input: from the keyboard or touch screen, through the display input, to the actual display panel itself. On a 120Hz screen, even if you assume zero latency in everything else, you still have a minimum of 8.3ms delay. Which is still quite obvious [1] once you put things in motion. And I think it is so hard that there is enough room for at least another 10 years of hardware improvement.
[1] Applied Sciences Group: High Performance Touch
> What item #3 tells you is that any performance gains in the last decade and a half you've experienced have been driven by multi-core, not faster processors
I think there are other effects too. Things like larger caches, smarter branch predictors, wider cores, etc. My 2021 processor (Apple M1 Pro) benches roughly twice as fast as my 2015 processor (Intel Core i5 2.7Ghz) in both synthetic and real-world single-core workloads.
That's not as big an improvement as we would have seen over that time period in the 90s, but it's a significant improvement none-the-less.
> And that means Amdahl's Law is more important than Moore's Law these days.
100%, we can no longer rely on faster processors to make our code faster, and must instead write code that can take advantage of the hardware's parallelism.
For those interested in learning more about Why Amdahl's Law is Important, my friend wrote an interesting article on this very topic - https://convey.earth/conversation?id=41
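For reference, the formula behind that point: if a fraction $p$ of the work can be parallelized across $N$ cores, the best possible speedup is

$$ S(N) = \frac{1}{(1 - p) + p/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p} $$

so even at $p = 0.95$ you never get past 20x, no matter how many cores you throw at it.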
"And that means Amdahl's Law is more important than Moore's Law these days."
idk, sure seems like we could have 1-2 cores (permanently pegged?) at 5 ghz for UI/UX then ($money / $costPerCores) number of cores for showing off/"performance" by now. But the OEMs haven't gone that way.
We probably see things differently. As I understand it, this is exactly the use case for "big/little" microarchitectures. Take a number of big fast cores that are running full bore, and a bunch of little cores that can do things for them when they get tasked. So far they've been symmetric but with chiplets they needn't be.
Yes, for 'computational' loads. I've read though UI/UX benefits the most from fastest response times. I'm talking about the cores which actually draw the GUI the user sees/uses being optimized for the task at the highest possible rate. Then have a pool of cores for the rest of it.
You are talking about the GPU? Okay, really random tidbit here: when I worked at Intel I was a validation engineer for the 82786, which most people haven't heard of, but it was a graphics chip focused on building responsive, windowed user interfaces by using hardware features to display separate windows (so moving windows moved no actual memory, just updated a couple of registers), to draw the mouse, and to handle character/font processing for faster updates. Intel killed it, but if you find an old "Number9 video card" you might find one to play with. It had an embedded RISC engine that did bitblit and other UI-type things on chip.
EVERYTHING that chip did, could in fact be done with a GPU today. It isn't, for the most part, because window systems evolved to be CPU driven, although a lot of phones these days do the UI in the GPU, not the CPU for this same reason. There is a fun program for HW engineers called "glscopeclient" which basically renders its UI via the GPU.
So I'm wondering if I misread what you said and you are advocating for a different GPU micro-architecture, or perhaps an integrated, more general architecture on the chip that could also do UI, like APUs?
The problem is in the, "Pegged at 5ghz" part. 5Ghz is hard!
Not sure the OEM's have what it takes to do that without a serious rethink, and that is what an M1 basically is.
And say they do...
Amdahl's law is still going to be a thing, because people are unlikely to accept having two cores running balls out whether essentially idling or pushing high priority interaction... they will see it as very poor energy management, and would want tasks moved onto and off those always-fast cores at a minimum to reduce what they see as waste.
If you could have responsive and rich UI/UX 30-40 years ago you don't need anything but a highly efficient core prioritizing the UI. No need to waste energy on a super fast core.
It also tells us that the speed of light has not increased.
(well, speed of signal on a PCB track is roughly 2/3 light and determined by the transmission line geometry and the dielectric constant, but you all knew that)
And the speed of light within glass. We could still get that 1/3 back if we could somehow make ultra long distance hollow cable. We could get that packet round trip from 150ms to 100ms.
But I don't think hollow cables are anywhere close to commercial launch.
At that level of optimization, the problem is not isolated. With Starlink the gain by using lasers is potentially lost with the additional processing that needs to happen when bouncing over multiple satellites.
It wasn't the speed of light, it was the size of atoms that was the issue here. As old-style scaling (the kind used up until about 2003) continued, leakage power was increasing rapidly because charge carriers (electrons / holes) would tunnel through gates (I'm simplifying a bit here, other bad effects were also a factor). It was no longer possible to keep increasing clock frequency while scaling down feature size. Further reduction without exploding the power requirement meant that the clock frequency had to be left the same and transistors needed to change shape.
No, at the speed of light 150 ms is about long enough to send a packet from SF to Amsterdam and then send the reply back the long way around the entire rest of the planet.
Yes, that shows that the time could theoretically be roughly halved without even bothering to transmit in a medium with faster propagation than glass fiber.
An instructive thing here is that a lot of stuff has not improved since ~2004 or so, and working around those things that have not improved (memory latency from ram all the way down to l1 cache really) requires fine control of memory layout and minimizing cache pollution, which is difficult to do with all of our popular garbage collected languages, even harder with languages that don't offer memory layout controls, and jits and interpreters add further difficulty.
To get the most out of modern hardware you need to:
* minimize memory usage/hopping to fully leverage the CPU caches
* control data layout in memory to leverage the good throughput you can get when you access data sequentially
* be able to fully utilize multiple cores without too much overhead and with minimal risk of error
For programs to run faster on new hardware, you need to be able to do at least some of those things.
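A minimal sketch of the first two points, using an invented particle example: if a loop only touches one field, a struct-of-arrays layout keeps every fetched cache line full of useful data, while array-of-structs drags the cold fields along for the ride.

/* Same logical work, very different memory behaviour. */
#include <stddef.h>

struct particle_aos {          /* array-of-structs: hot + cold together */
    float x, y, z;
    float mass;
    char  cold_metadata[48];   /* rarely touched, but shares the lines  */
};

struct particles_soa {         /* struct-of-arrays: hot fields packed   */
    float *x, *y, *z;
    float *mass;
    size_t count;
};

float total_mass_aos(const struct particle_aos *p, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += p[i].mass;      /* strides over the cold bytes too       */
    return sum;
}

float total_mass_soa(const struct particles_soa *p) {
    float sum = 0.0f;
    for (size_t i = 0; i < p->count; i++)
        sum += p->mass[i];     /* sequential, dense, prefetch-friendly  */
    return sum;
}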
It's pretty remarkable that, for efficient data processing, it's super super important to care about memory layout / cache locality in intimate detail, and this will probably be true until something fundamental changes about our computing model.
Yet somehow this is fairly obscure knowledge unless you're into serious game programming or a similar field.
> Yet somehow this is fairly obscure knowledge unless you're into serious game programming or a similar field.
Because the impact of optimizing for the hardware like that is not so important in many applications. Getting the absolute most out of your hardware is very clearly important in game programming, but web apps where the scale being served is not huge (the vast majority)? Not so much. And in this context developer time is more valuable when you can throw hardware at the problem for less.
Traditional game programming you had to run on the hardware people used to play, you are constrained by the client's abilities. Cloud gaming might(?) be changing some of that, but GPUs are super expensive too compared to the rest of the computing hardware. Even in that case the amounts of data you are pushing you need to be efficient within the context of the GPU, my feeling is it's not easily horizontally scaled.
IMO we are only scratching the surface of cloud gaming so far. Right now it's pretty much exclusively lift-and-shift, hosted versions of the same game, in many cases running on consumer GPUs. Cloud gaming allows for the development of cloud-native games that are so resource intensive (potentially architected so that more game state is shared across users) that they would not be possible to implement on consumer hardware. They could also use new types of GPUs that are designed for more multi-tenant use cases. We could even see ASICs developed for individual games!
I think the biggest challenge is that designing these new types of games is going to be extremely hard. Very few people are actually able to design performance intensive applications from the ground up outside of well-scoped paradigms (at least web servers, databases, and desktop games have a lot of prior art and existing tools). Cloud native games have almost no prior art and almost limitless possibilities for how they could be designed and implemented, including as I mentioned even novel hardware.
I've thought about this off and on, and there's certainly interesting possibilities. You can imagine a cloud renderer that does something like a global scatter / photon mapping pass, while each client's session on the front end tier does an independent gather/render. Obviously there's huge problems to making something like this work practically, but just mention it as an example of the sort of more novel directions we should at least consider.
If the "metaverse" ever gets anywhere beyond Make Money Fast and reaches AAA title quality, running the client in "the cloud" may be useful. Mostly because the clients can have more bandwidth to the asset servers. You need more bandwidth to render locally than to render remotely.
The downside is that VR won't work with that much network latency.
TBH I don't think cloud gaming is a long term solution. It might be a medium term solution for people with cheap laptops but eventually the chip in cheap laptops will be able to produce photo realistic graphics and there will be no point going any further than that
Photo realistic graphics ought to be enough for anybody? This seems unlikely, there's so many aspects to graphical immersion that there's still plenty of room for improvement and AAA games will find them. Photo realistic graphics is a rather vague target, it depends on what and how much you're rendering. Then you need to consider that demand grows with supply, with eg. stuff like higher resolutions, even higher refresh rates.
There are diminishing returns. If a laptop could play games at the quality of a top end PC today, would people really want to pay for an external streaming service, deal with latency, etc just so they can get the last 1% of graphical improvements?
We have seen there are so many aspects of computing where once it’s good enough, it’s good enough. Like how onboard DACs got good enough that even the cheap ones are sufficient and the average user would never buy an actual sound card or usb dac. Even though the dedicated one is better, it isn’t that much better.
1) you still need to install and maintain it and there are many trends even professionally that want to avoid that
2) just cause you could get it many may not want it. I could easily see people settle for a nice M1 MBA or M1 iMac and just stream the games if their internet is fine. Heck, wouldn't it be nicer to play some PC games in the living room like you can do with SteamLink?
3) another comment brings a big point that this unlocks a new "type" of game which can be designed in ways that take advantage of more than a single computer's power to do games with massively shared state that couldn't be reliably done before.
I think to counter my own points: 1) I certainly have a beefy desktop anyways, 2) streaming graphics are not even close to local graphics (a huge point), 3) there is absolutely zero way they're gonna stream VR games from a DC to an average residential home within 5 years IMHO.
I think the new macbooks are more a proof that cloud streaming won't be needed. Apple is putting unreal amounts of speed in low power devices. If the M9 Macbook could produce graphics better than the gaming PCs of today, would anyone bother cloud streaming when the built in processing produces a result which is good enough. I'm not sure maintenance really plays much of a part, there is essentially no maintenance of local games since the clients take care of managing it all for you.
Massive shared state might be something which is useful. I have spent some time thinking about it, and the only use case I can think of is highly detailed physics simulations with destructible environments in multiplayer games, where synchronization traditionally becomes a nightmare since minor differences cascade into major changes in the simulation.
But destructible environments and complex physics are a trend which came and went. Even in single player games, where it's easy, they take too much effort to develop and are simply a gimmick to players which adds only a small amount of value. Everything else seems easier to just pass messages around to synchronize state.
> If a laptop could play games at the quality of a top end PC today, would people really want to pay for an external streaming service, deal with latency, etc just so they can get the last 1% of graphical improvements?
Think of it a different direction: if/when cloud rendering AAA graphics is practical, you can get a very low friction Netflix like experience where you just sit down and go.
IMO the service of netflix is the content library and not the fact it's streaming. If the entire show downloaded before playing, it would only be mildly less convenient than streaming it. But I don't think the streaming adds that much convenience to gaming. If your internet is slow enough that downloading the game beforehand is a pain, then streaming is totally out of the question. And gaming is way way less tolerant of network disruption since you can't buffer anything.
Cloud gaming seemingly only helps in the case when you have weak hardware but want to play AAA games. If we could put "good enough" graphics in every device, there would be no need to stream. And I think in 10 years probably every laptop will have built in graphics that are so good that cloud gaming is more trouble than its worth. It might sound unrealistic to say there is a good enough but I think a lot of things have already reached this point. These days screen DPI is largely good enough, sound quality is good enough, device weight/slimness is good enough, etc.
I'd (gently) say you may be generalizing your own behavior too much. I often just have say 45 minutes to kill and will just browse Netflix to find something to start immediately. Having to wait for a download would send me to something else most likely. Since COVID started, one thing I've heard repeatedly from friends with kids is they manage to carve out an hour or such for a gaming session, sit down, and then have to wait through a mandatory update that ends up killing much of their gaming session. Now add to that the popularity of game pass, and the possibility that "cloud console" offers something similar... there's plenty of people that would love that service imo.
Cloud gaming allows for more shared state and computationally intensive games (beyond just graphics). Maybe eventually clients will easily be able to render 4k with tons of shaders but the state they’re rendering could still be computed remotely. In a way that’s kind of what multiplayer games are like already
Differentiable programming is becoming more and more popular. If there were accurate models of memory/cache behavior, we could predict how code changes would change performance due to CPU behavior, and might be able to get programming tools that can make this way more visible to people who don't know about it. But I have my doubts it will be as easy as I make it sound - and I don't think I make it sound all that easy either :)
I'm always disappointed that no one has come up with a more realistic model for asymptotic time complexity comparisons, one using a computation model with asymptotically increasing memory access times.
It's a pretty sad state of affairs when the main way we talk about algorithm performance suggests that traversing a linked list is as fast as traversing an array.
It is physics that is the cause behind the memory latency. As such, the fundamental aspect of this won't go away ever - you can randomly access small amount of data faster than you can do a much larger amount of data. This is because storing information takes up space, and the speed of light is limited.
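A back-of-the-envelope way to state that bound (a rough model, not a precise one): if you need to address $C$ bytes stored at density $\rho$, the farthest bit is a distance on the order of $(C/\rho)^{1/3}$ away, so a speed-of-light round trip gives

$$ t_{\text{access}}(C) \;\gtrsim\; \frac{2}{c}\left(\frac{C}{\rho}\right)^{1/3} $$

(with exponent $1/2$ for a planar chip). Either way, worst-case random access time necessarily grows with the amount of data you need to reach.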
The logical conclusion is that most fields have so much data processing capacity relative to the problem size, they don't need to worry about efficient data processing.
It's interesting that L2 cache has basically been steady at 2MB/core since 2004 as well. It hasn't changed speed in that time, but is still an order of magnitude faster than memory across that whole timeframe. Does this suggest that the memory speed bottleneck means there simply hasn't been a need to increase availability of that faster cache?
Some of these numbers are clearly wrong. Some of the old latency numbers seem somewhat optimistic (e.g. 100 ns main memory ref in 1999), some of the newer ones are pessimistic (e.g. 100 ns main memory ref in 2020). The bandwidth for disks is clearly wrong, as it claims ~1.2 GB/s for a hard drive in 2020. The seek time is also wrong. It crossed 10 ms in 2000 and has reduced to 5 ms in 2010 and is 2 ms for 2020. Seems like linear interpolation to me. It's also unclear what the SSD data is supposed to mean before ~2008 as they were not really a commercial product before then. Also, for 2020 the SSD transfer rate is given as over 20 GB/s. Main memory bandwidth is given as 300+ GB/s.
Cache performance has increased massively. Especially bandwidth, not reflected in a latency chart. Bandwidth and latency are of course related; just transferring a cache line over a PC66 memory bus takes a lot longer than 100 ns. The same transfer on DDR5 takes a nanosecond or so, which leaves almost all of the latency budget for existential latency.
The oldest latency numbers were based on actual hardware Google had on hand at the time. Most of them came from microbenchmarks checked into google3. Occasionally an engineer would announce they had found an old parameter tuned for pentium and cranking it up got another 1-2% performance gain on important metrics.
Many of the newer numbers could be based on existing google hardware; for example, Google deployed SSDs in 2008 (custom designed and manufactured even before then) because hard drive latency wasn't getting any better. That opened up a bunch of new opportunities. I worked on a project that wanted to store a bunch of data, Jeff literally came to me with a code that he said "compressed the data enough to justify storing it all on flash, which should help query latency" (this led to a patent!).
Bigger caches could help, but as a rule of thumb cache hit rate increases only with roughly the square root of cache size, so the benefit diminishes. And the bigger you make a cache, the slower it tends to be, so at some point you could make your system slower by making your cache bigger and slower.
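That rule of thumb is usually written in terms of the miss rate (an empirical approximation, not a law):

$$ \text{miss rate} \;\propto\; \frac{1}{\sqrt{\text{cache size}}} $$

so quadrupling the cache only roughly halves the misses, while the bigger array also tends to add access latency.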
The great circle distance between SF and Amsterdam is about 5,450 miles. A round trip is 10,900 miles. At the speed of light, that's 58.5 milliseconds. So in theory, infinitely efficient computers (and some sort of light-in-a-vacuum wire) and networking equipment could knock 60% off that time, but no more.
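Sanity-checking that figure:

$$ t = \frac{2 \times 5450\ \text{mi} \times 1.609\ \text{km/mi}}{299{,}792\ \text{km/s}} \approx 58.5\ \text{ms} $$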
However, I suspect one big problem is that there are a number of hops between the Netherlands and San Francisco. I suspect there's not a fiber line that goes right through Idaho to Greenland, then Scotland, and then the Netherlands. I'm curious what the actual route is and how many miles are involved.
Okay, so 80ms of light speed travel, or about 120ms assuming today's fiber optic medium, so that only leaves maybe 30ms of potential improvement. Sounds very believable, thanks!
Yeah, but the faster speed of light through vacuum can still be enough to make up for the longer path.
Back-of-the-envelope calculation: at an altitude of 550km and minimum elevation of 25 degrees above the horizon, a single Starlink hop could cover a maximum of 1880km of distance along the earth's surface, with a ground-satellite-ground length of 2068km.
So the total distance traveled through space is about 10% farther, but the signal goes 50% faster than through fiber, which is enough to cut your round-trip time from about 19ms to 14ms (plus any extra latency introduced by routers). That's nothing to sneeze at.
The 3rd parent was suggesting hollow core fiber, which I _think_ is supposed to reduce the distance travelled to near fiber length through band gap effects i.e eliminating the extra distance travelled in regular fiber core due to total internal reflection, hence the lower latency. Light still travels about 2/3 the speed in silica core compared to in a vacuum, so it would make the fastest possible speed 2*10^8 m/s (per meter of actual fiber)
So in order for LEO satellites to compete, the total distance from A to B must be less than 1.5 times the equivalent of hollow core fiber on the ground.
Oh god I have to do trig! So finding the ratio of the horizontal (ground) to vertical (altitude) when the hypotenuse is 1.5 * the horizontal =
(1.5**2-1)**0.5 = 1.118033988749895.
i.e 1:1.11 half-ground:altitude
550 / 1.118033988749895 * 2 = 984km
i.e A single hop Satellite at 550km altitude would beat a straight line (ignoring curvature, bored through the ground) hollow core fiber at 984km (I _think_ :P). Realistically you can probably lower the distance since we don't actually get straight line A-B fiber, but that's still quite a long minimum distance.
Disclaimer: there are too many assumptions and approximations in here, it's just for fun.
[edit]
whoops, hollow core is supposed to be almost speed of light, so actually it's not much of a competition any more in the idealised case.
Direct point to point conduits carrying fiber would reduce latency to a worst case of 21ms, but requires a fiber that doesn't melt at core temps (around 5200C).
No. The deepest we've ever sent anything is a little over 12Km; the crust's minimum thickness is about 40Km, and the diameter of the Earth is about 12700Km.
the code appears to just do a smooth extrapolation from some past value.
it claims that (magnetic) disk seek is 2ms these days. since when did we get sub-4ms average seek time drives?
it also seems to think we're reading 1.115GiB/s off of drives now. transfer rate on even the largest drives hasn't exceeded 300MiB/s or so, last i looked.
("but sigstoat, nvme drives totally are that fast or faster!" yes, and i assume those fall under "SSD" on the page, not "drive".)
The "commodity network" thing is kind of weird. I'd expect that to make a 10x jump when switches went from Fast Ethernet to Gigabit (mid-late 2000s?) and then nothing. I certainly don't feel like they've been smoothly increasing in speed year after year.
Although I'd expect more jumps than Fast Ethernet to Gigabit, it's true the consumer space is kind of stuck on Gigabit (with 2.5 GbE just becoming more common now). Datacenters have had at least 10 GbE for many years.
Datacenter speeds were a different part of the graph.
I thought about this some more and it might be referring to WiFi speeds. Those have been steadily increasing over time with the transition from 802.11b networks of old to modern WiFi 6. Of course WiFi only became a thing in the very late 90s/early 2000s.
It seems to be entirely based on NIC bandwidth, rather than the actual time it takes for information to travel between servers via ethernet or optics within a LAN.
Okay since we're not going to improve the speed of light any time soon, here's my idea for speeding up CA to NL roundtrip: let's straight shot a cable through the center of the earth.
From CA you will end up off the coast of Madagascar, and from the NL somewhere near New Zealand. You do not have to go very deep inside the earth to get straight from CA to NL.
According to this all latencies improved dramatically except for SSD random read (disk seek only improved by 10x as well). Reading 1 million bytes sequentially from SSD improved 1000x and is now only 2-3x slower than a random read and for disk reading 1 million bytes is faster than a seek. Conclusion: avoid random IO where performance matters.
CPU and RAM latencies stopped improving in 2005 but storage and network kept improving.
I doubt that same-facility RTT has been fixed at 500µs for 30 years. In EC2 us-east-1 I see < 100µs same-availability-zone RTT on TCP sockets, and those have a lot of very unoptimized software in the loop.
function getDCRTT() {
// Assume this doesn't change much?
return 500000; // ns
}
I show 180-350µs between various machines on my network, all of which have some fiber between them. devices with only a switch and copper between them somehow perform worse, but this is anecdotal because i'm not running something like smokeping!
Oh, additionally between VMs i'm getting 180µs, so that looks to be my lower bound, for whatever reason. my main switches are very old, so maybe that's why.
Are you measuring that with something like ICMP ping? I think the way to gauge the actual network speed is to look at the all-time minimum RTT on a long-established TCP socket. The Linux kernel maintains this stat for normal TCP connections.
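On Linux you can read that stat with getsockopt and TCP_INFO; a rough sketch below (exactly which fields are exposed depends on your kernel and libc headers, and ss -ti will print the same information, including the minimum RTT on newer kernels):

/* Rough sketch: query the kernel's RTT estimates for a connected TCP
 * socket. Which fields exist depends on your kernel and libc headers. */
#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void print_tcp_rtt(int connected_fd) {
    struct tcp_info info;
    socklen_t len = sizeof(info);
    if (getsockopt(connected_fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0) {
        /* tcpi_rtt / tcpi_rttvar: smoothed RTT estimate and its
         * variance, in microseconds. */
        printf("srtt: %u us (var %u us)\n", info.tcpi_rtt, info.tcpi_rttvar);
    }
}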
I wish there was a version that normalized the times somehow instead of showing absolute times. Most of these numbers will get smaller simply because CPUs get faster, but it would be super helpful to get an accurate read whether the ratios of e.g. l1 to l2 to l3 to main memory lookup latencies changed significantly.
It looks like almost everything is blazing fast now. I'm not sure how long the first X takes though - how long does it take to establish a TCP/IP connection? How long does it take an actual program to start reading from disk?
This isn't particularly helpful/insightful. Considering that, for example, waiting for cache is meaningfully expressed in processor cycles, having that piece of data in ns isn't particularly useful.
Mutex locking perf also is a bit suspect, since probably nobody had multi-core CPUs pre-2003, meaning mutex locks/unlocks were simple memory writes, no shared cache invalidation involved.
The only latency stats that can be meaningfully expressed in ns are the network and disk latencies.
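For a rough conversion between the two units (assuming a fixed clock, which modern CPUs don't really have):

$$ \text{cycles} \approx t_{\text{ns}} \times f_{\text{GHz}}, \qquad \text{e.g. } 100\ \text{ns of DRAM latency} \approx 300\ \text{cycles at } 3\ \text{GHz}. $$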
This could be a really great visualization and really interesting, but I literally had to look at the comments here to figure out how to read it and use it. For example, you can edit the year text box.
I would suggest adding graphs to show how different things have changed over time.
it is interesting that the only major improvement over the last 15+ years seems to have been the invention of SSDs.
I believe that sometime around 2010 we peaked on the best software solution for high performance, low-latency processing of business items when working with the style of computer architecture we have today.
I have been building systems using this kind of technique for a few years now and I still fail to wrap my brain around just how fast you can get 1 thread to go if you are able to get out of its way. I caught myself trying to micro-optimize a data import method the other day and made myself do it the "stupid" way first. Turns out I was definitely wasting my time. Being able to process and put to disk millions of things per second is some kind of superpower.
> made myself to do it the "stupid" way first. Turns out I was definitely wasting my time
I think you put into words something that has been apparent for a while now with all the bloated software out there.
Software has become stupid.
Not because we're not capable of doing it smart; rather it's an accumulation of small instances of 'I'll do it fast and stupid' like that import, which 10-20 years ago would have been a long enough delay for the developer to do something about it.
Then across the multiple dependencies and imports you end up with an overall slow system since every one of those was implemented with a 'stupid but fast enough for this' mindset.
How are people practically taking advantage of the increase in speed of SSDs these days compared to network latencies? It looks like disk caches directly at the edge with hot data would be the fastest way of doing things.
I'm more familiar with the 2001-2006 era where redis-like RAM caches for really hot data made a lot of sense, but with spinning rust as the disk drives, it made more sense to go over the network to a microservice that was effectively a big sharded RAM cache than to go to disk.
Seems like you could push more hot data to the very edge these days and utilize SSDs like a very large RAM cache (and how does that interact with containers)?
I guess the cost there might still be prohibitive if you have a lot of edge servers and consolidation would still be a big price win even if you take the latency hit across the network.
I don't know; I have observed in my workloads (booting, game load, and building programs) that super fast SSDs make almost no difference compared to cheap slow SSDs. But any SSD is miraculous compared to a spinny drive.
Presumably video editing or something might get more of a win but I don’t know.
I'm talking about back in the day when I worked at Amazon with thousands of servers and in the 2001-2006 era we had basically no SSDs and it was all spinny rust (set the slider to 2006 and that is about my mental model). Spinny rust was always shit, so it was all about RAM and network and then the long tail siting on disk.
I'm wondering how datacenter SSDs impact architectural decisions with e.g. 'microservices' by having that new latency layer which is cheaper than RAM and faster than network.
Oh, SSDs have definitely had a big impact. On a cost per iop basis, they've been faster for a while, so any workload that's iop limited is on SSDs these days (assuming it doesn't fit in ram).
Re your prior comment about Redis style caches, it's worth noting Memcached now supports an SSD optimized external store. It identifies values where it'd be beneficial to spill them to SSD and does so, holding metadata in ram such that the object can be served in a single io (size permitting), and that there's no io required at all to delete/expire an item. It can saturate just about any SSD array you can build right now assuming your network can keep up.
It'll take a while, but if you skim the last decade or so of VLDB papers, you can see lots of compelling research into databases that are SSD optimized. It'll take some time for this to coalesce into proven industry solutions but it'll definitely happen.
And even with existing databases, I think a lot of people still underestimate what can be done with a SQL database on a machine with a beefy SSD array. With more modest scale systems, lots of people cargo-cult an architecture built around, say, a Cassandra cluster when something much more straightforward would easily suffice.
In any case, having millions of ios/sec be totally affordable in a server is a huge shift, and I feel like we still have a lot of layers and inertia in the way of fully utilizing that.
Yeah what you're talking about with memcached spilling to the SSD for an extended cache is sort of what I was thinking about.
Bit surprised that isn't already much more common of a pattern.
And yeah, I'm a fan of sqlite whenever it can be used, then postgresql when you need some beefier.
A frightening amount of Amazon ran on pre-chewed-up BDBs that were pushed out to every server that needed them (originally to literally every webserver, then pushed out to every server in a microservice cluster), effectively as caches in front of SQL databases. Using sqlite these days should be much better than that since BDBs were buggy and awful.
If it was an hour earlier I'd save everyone time with a graph instead of trying to get an idea by eyeballing blocks while playing with a slider, but alas it's 1:09am and I really should sleep. I think a simple multi-line plot/graph made using a simple spreadsheet or whatever would be a lot more useful than having to keep in mind different values and moving time forwards manually, so if anyone else feels encouraged...
Writes are much slower than reads, but modern CPUs have a write-buffer for each core so they don't need to wait for the write to propagate back down the caches and into main memory before continuing execution.
Complete laggard questions:
* Did SSD exist in 1993?
* Why use an "approximately equal" sign (as opposed to a real = sign) between e.g. 1000ns and 1 microsecond? (I would understand between binary and decimal multiples of bytes... but decimal fractions of seconds don't work this way)
I don't understand the disk I/O numbers here. As others have pointed out, SSD metrics before ~2008 or so are not meaningful, and IOPS on pre-SSD-era machines were an order of magnitude worse, but this time traveler doesn't seem to show that.
Diminishing returns over the last decade, as expected. It would be interesting to look at the energy consumed by each of these operations across the same time periods.
We have memory mapped files now which look to the program like ordinary RAM but are backed by disk storage. (You still need to call sync to actually write back to disk.)
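A minimal POSIX sketch of that (error handling trimmed; the filename is just an example):

/* Memory-mapped file: the mapping behaves like ordinary memory, and
 * msync() flushes dirty pages back to the file on disk. */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("example.dat", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, 4096);                       /* size the file            */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    strcpy(p, "hello");                        /* looks like plain RAM     */
    msync(p, 4096, MS_SYNC);                   /* force write-back to disk */
    munmap(p, 4096);
    close(fd);
    return 0;
}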
It takes at least one clock cycle to do anything, and clock frequency stopped increasing in the 2003-2005 time frame, mainly because of the horrible effects on power with very small feature size.
The memory number is measuring access time while the network number is measuring average bandwidth. The two values can't be compared even though they are presented using the same unit.
There is plenty I don't know. It's not me reveling in my ignorance.
My point is that programming is an incredibly diverse field and yet people, even people who supposedly should know better, are obsessed with making global laws of programming. I know relative comparisons of speeds that have been useful in my day jobs, but I'd wager that needing to know the details of these numbers, how they've evolved, etc. is a relatively niche area.
Regarding learning, I try to constantly learn. This is driven by two things: (1) need, such as one finds in their day job or to complete some side project; (2) interest. If something hits either need or interest or hopefully both, I learn it.
This was a doc handed to incoming Google engineers (along with Jeff's resume) to help assimilate them. Most if not all of these came up at some point in the development of google.com and in making decisions about (for example) whether to send an intercontinental RPC or look something up locally, or whether to put some part of the web index in RAM or on disk.
Don't read it as "you should have known this", read it as "you should know this" to be a more effective programmer. These sorts of guidelines have been invaluable for making arch decisions throughout my career.
I don’t think it’s important to know the absolute numbers but rather the relative values and rough orders of magnitude.
I can't tell you how many times I've had to explain to developers why their network-attached storage has higher latency than their locally attached NVMe SSD.
The absolute numbers are also important. I can't tell you how many times I've had someone coming from a front-end world tells me 5ms for some trivial task (e.g. sorting a 1000ish element list) is "fast" just because it happened faster than their reaction time.
That's reasonable. My comment is a reaction to the unneeded title. Knowing relative speeds of things as well as ways to characterize and debug them is generally important.