Gallery of Processor Cache Effects (igoro.com)
119 points by arunc on Dec 21, 2014 | 15 comments



Very interesting (but not exactly news).

But I suppose in this "modern world" most people forget how their processors work.


Not sure why you're downvoted; your observation matches what I experience day-to-day. Memory management is magical, network bandwidth is infinite, networks have zero latency, CPU cycles are infinite, caches are infinite, and hard drive transfer rates and seek times are so fast it doesn't matter. Until it does.

Case in point: I just got called in on a project that was on the skids. Eight HP blades with 128 GB each, an HP 3PAR storage unit underneath, network capacity by the gallon, and it still wouldn't perform with a measly 9,000 users per day on a pretty simple website. Load spikes and restarts to keep the system available were the order of the day. It was an existential threat to the company that had ordered the system as a replacement for its previous one.

After reducing the 130 or so VMs they were using down to one and getting rid of the superfluous hardware and interconnect little by little, the system started to work better. Another three weeks of fixes later, we're running comfortably on that single machine, with load averages under 1 over any 24-hour period. The waste and the impedance mismatch between the applications and the hardware were simply painful.

And all this was built by so-called 'experts': one guy calling himself a DBA (I wonder if he realized that you can have indices on more than just the primary key field), a bunch of other gurus, and a remote team of developers. Each of them may have their qualities, but as a collective with crappy management they managed to make a system that could have supported a few million daily users crawl under a few thousand.

Understanding how your processor works is only one small element of that story. Add to it forgetting how your hard drive works and how the computer works as a whole, plus unfamiliarity with the cost of communication and with operating-system-level tuning, and you're closer to the truth.

What bugs me is not that this stuff happens. What bugs me is how often it happens.


What you're describing is business as usual in any enterprise shop...

1. hardware requirements call for an absurd capacity for applications with just a few hundred users

2. developers then proceed to utterly waste the available resources because they seem to be infinite and "nobody cares"

3. the application ends up being slow as molasses and the hardware requirements for the next similar project become even bigger

(There's also a number 4 in here, which is the belief that every high-traffic website has a Google-sized datacenter behind it, which is simply not true. You can serve millions of users with just a few machines.)

This is a never-ending cycle of wastefulness, simply because nobody ever stops to think about how the machines work and which algorithms to apply. The classic argument of developer time vs. hardware costs is a funny one: bad solutions often take just as long to implement as good ones. But the good ones require knowledge that many developers seem to lack, and it's easy to assume that smart == more time.

Take ORMs like Hibernate. Developers use them because it saves time, but then weeks are wasted optimizing the application in the wrong places just to avoid having to think about what's underneath the ORM. I used to be a DBA, and performance improvements of over 1000x were pretty common just from creating a few indexes in five minutes of my time (this was actually fun... "your operation that took 30 seconds now takes 100 ms, you're welcome"). All while the developers combed through their own code, trying to fix the wrong thing.

Memory initialization is another example: I've been in a situation where an application was generating hundreds of MB of new objects per second, which I pointed out might be the cause of its slowness (it was certainly the cause of its eventual crashes, since the GC couldn't keep up). They still wasted days looking at other things before finally understanding that this behavior meant the CPU was working as if it had no cache at all.
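
A concrete illustration (a sketch of my own in C++, standing in for whatever GC'd runtime the application used; all sizes and counts are made up): a bump pointer sweeping through a large arena behaves like a GC nursery under heavy allocation churn, so every "new object" lands in memory that is no longer in cache, while reusing one buffer keeps the working set resident.

  #include <chrono>
  #include <cstdio>
  #include <vector>

  int main() {
      constexpr size_t kObj = 4096;          // one "object" = 4 KB (assumption)
      constexpr size_t kArena = 512u << 20;  // 512 MB arena, far bigger than any cache
      std::vector<char> arena(kArena), reused(kObj, 1);
      size_t bump = 0;
      long sink = 0;
      auto bench = [&](bool churn) {
          auto t0 = std::chrono::steady_clock::now();
          for (int i = 0; i < 100000; ++i) {
              // churn: take the next cold slot; reuse: touch the same hot buffer
              const char* obj = churn ? &arena[bump = (bump + kObj) % kArena]
                                      : reused.data();
              for (size_t b = 0; b < kObj; ++b) sink += obj[b];
          }
          return std::chrono::duration<double>(
                     std::chrono::steady_clock::now() - t0).count();
      };
      std::printf("churn: %.2fs  reuse: %.2fs  (%ld)\n",
                  bench(true), bench(false), sink);
  }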

Oh, and the constant complaints that the machines were slow, when they were systematically under 10% utilization? Locking and I/O latency are also alien concepts to some...


It's wonderful to see posts like this. I've often been very frustrated by exactly the same kinds of issues. But this seems to be much more common than I thought.

One guy periodically ran select max(id) from table; against a table of over 5 GB with no index on the id column. Every run took quite a while and maxed out the disk I/O.

Usually the developers claim that the server doesn't have enough memory, or that it should have more cores, or that we should buy enterprise SSDs. Which is simply insane when the real fix, in this case a single index on the id column, takes minutes.


Was there really so much overhead in communication between the VMs? Or was it a bunch of software doing the wrong thing too?


A mix of both. I should do a longer write-up on this whole thing.


I learned all about how computers work in a couple of semesters at university; it was one of my favorite courses. But it took a long time for the cache-related parts to really click, probably because I've spent most of my time writing code (C++ code, even) that was constrained by I/O and system calls.

Now that I'm doing more CPU-intensive stuff where DRAM is the incredibly slow bottleneck, all those memories of caches and pipelining have been a huge help in understanding articles written from a practical C/C++ perspective.


An article doesn't have to be news to be useful to somebody. Everybody has to start learning somewhere.


I like to combine the graphs into one picture; then more of the processor's structure appears in the picture: https://dl.dropboxusercontent.com/u/4893/mem_lat3.jpg

This graph shows the memory latency for a linked-list walk that strides across a buffer of a given size. I used to use this picture in interviews and ask people to explain as much as they could about the processor from it. It even works for people who know nothing about processor architecture, since I can walk them through what it says and watch how they think and react to new information.
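
For anyone who wants to reproduce a graph like this, here's a minimal sketch of the technique (my code, not the original; buffer sizes, stride, and iteration counts are arbitrary). Each load depends on the previous one, so the average time per hop reflects the latency of whichever cache level the buffer fits in:

  #include <chrono>
  #include <cstdio>
  #include <vector>

  int main() {
      constexpr size_t kStride = 64 / sizeof(void*);  // one 64-byte line per hop (assumption)
      for (size_t bytes = 4096; bytes <= (64u << 20); bytes *= 2) {
          std::vector<void*> buf(bytes / sizeof(void*));
          size_t hops = buf.size() / kStride;
          for (size_t i = 0; i < hops; ++i)  // chain hop i -> hop i+1, wrapping to the start
              buf[(i * kStride) % buf.size()] = &buf[((i + 1) * kStride) % buf.size()];
          void* p = buf[0];
          auto t0 = std::chrono::steady_clock::now();
          for (long i = 0; i < 10000000; ++i)  // dependent loads: no overlap, pure latency
              p = *static_cast<void**>(p);
          double ns = std::chrono::duration<double, std::nano>(
                          std::chrono::steady_clock::now() - t0).count() / 1e7;
          std::printf("%8zu KB: %5.2f ns/load  %p\n", bytes / 1024, ns, p);  // print p so the chase isn't optimized away
      }
  }

One caveat: a sequentially linked chain like this lets a modern hardware prefetcher hide part of the latency, so shuffling the node order gives cleaner plateaus; on a '95-era part like the one in the graph it matters much less.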


What processor is it? Or at least: how old is it?


Old now: it's from '95. That is an early pre-production Pentium Pro running at 100 MHz. The 256 KB L2 cache was really fast relative to memory because it sat on a second die in the same package.


Most of the examples are classic cache effects, but the last one is such a puzzle.

  A++; C++; E++; G++; 	448 ms
  A++; C++; 	        518 ms
How can incrementing 2 variables be slower than incrementing 4 variables?


Perhaps the result is an average rather than the best result?

I wondered about this too, so I tried it on my PC, though I had to make up my own timing code since the author doesn't say what he used. My results were more in line with what I'd expect: both 4-variable cases ran at the same speed, and the 2-variable one was a bit quicker. (All the data for the loop fits into 2 or 3 cache lines, so I don't think there's much chance of measuring any memory effects.)

This was compiled for x64.
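
For reference, this is the kind of harness I mean (a sketch of my own, since the article's timing code isn't shown; the loop count is arbitrary, and the volatile qualifiers stand in for whatever keeps the increments from being optimized away). Taking the best of several runs also answers the average-vs-best question above:

  #include <chrono>
  #include <cstdio>

  volatile int A, B, C, D, E, F, G;  // volatile: keep each increment in the loop

  template <typename Body>
  double best_ms(Body body, int runs = 5) {
      double best = 1e300;
      for (int r = 0; r < runs; ++r) {  // best-of-N rather than an average
          auto t0 = std::chrono::steady_clock::now();
          for (long i = 0; i < 200000000L; ++i) body();
          double ms = std::chrono::duration<double, std::milli>(
                          std::chrono::steady_clock::now() - t0).count();
          if (ms < best) best = ms;
      }
      return best;
  }

  int main() {
      std::printf("A C E G: %.0f ms\n", best_ms([] { A++; C++; E++; G++; }));
      std::printf("A C    : %.0f ms\n", best_ms([] { A++; C++; }));
  }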


This feels like a Nagle-esque[1] issue (maybe delaying actions until a buffer fills, for example?)

[1] http://blogs.msdn.com/b/windowsazurestorage/archive/2010/06/...


Not quite a processor cache effect, but Duff's Device is pretty cool too.
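
For anyone who hasn't seen it: the device interleaves a switch with a do-while, so the switch jumps into the middle of an unrolled copy loop to handle the remainder. A memcpy-style variant (Duff's original wrote to a memory-mapped output register, so it didn't increment `to`):

  #include <cstddef>

  // Duff's Device: the case labels sit inside the do-while, so the initial
  // switch jumps straight to the entry point for count % 8, and the loop
  // then runs fully unrolled, eight copies per pass.
  void duff_copy(char* to, const char* from, std::size_t count) {
      if (count == 0) return;
      std::size_t n = (count + 7) / 8;
      switch (count % 8) {
          case 0: do { *to++ = *from++;
          case 7:      *to++ = *from++;
          case 6:      *to++ = *from++;
          case 5:      *to++ = *from++;
          case 4:      *to++ = *from++;
          case 3:      *to++ = *from++;
          case 2:      *to++ = *from++;
          case 1:      *to++ = *from++;
                  } while (--n > 0);
      }
  }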



