Latency Numbers Every Programmer Should Know (eecs.berkeley.edu)
419 points by ingve on Sept 5, 2018 | 103 comments



Since I came across Jeff Dean's tip a few years ago about back-of-the-envelope calculations and latency numbers (http://highscalability.com/blog/2011/1/26/google-pro-tip-use...), I can't count the number of times that sketching out a design's complexity has saved me from implementing designs that can't possibly work, and eventually resulted in designs that are orders of magnitude faster.

In fact, it's not just about designs that work, or designs that are fast, but getting into the practice of estimating complexity in terms of hardware numbers also makes for safer code, especially where validating user data is concerned.

Just recently, it kept me from introducing what might have been a denial of service in https://github.com/ronomon/mime, and led to discovering a vulnerability in several popular email parsers (https://snyk.io/blog/how-to-crash-an-email-server-with-a-sin...).

I think Martin Thompson summarized it well as "Mechanical Sympathy": https://mechanical-sympathy.blogspot.com/2011/07/why-mechani...
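
For illustration, here is a minimal back-of-the-envelope sketch in C (the workload and the latency figures are my own rough assumptions in the spirit of the chart, not measurements) comparing a design that does one random disk seek per record against one that scans the data sequentially:

    #include <stdio.h>

    /* Rough figures (ns), assumed for the sake of the example; real values vary. */
    #define DISK_SEEK_NS      3000000.0   /* ~3 ms per random seek         */
    #define DISK_READ_1MB_NS 20000000.0   /* ~20 ms to read 1 MB from disk */

    int main(void)
    {
        double records      = 1e6;      /* hypothetical workload: 1M records */
        double record_bytes = 100.0;    /* ~100 bytes per record             */
        double total_mb     = records * record_bytes / 1e6;

        /* Design A: one random seek per record. */
        double design_a_s = records * DISK_SEEK_NS / 1e9;

        /* Design B: one sequential scan over the whole data set. */
        double design_b_s = total_mb * DISK_READ_1MB_NS / 1e9;

        printf("Random-seek design: %.0f s\n", design_a_s);  /* ~3000 s */
        printf("Sequential design:  %.0f s\n", design_b_s);  /* ~2 s    */
        return 0;
    }

Ten minutes of this kind of arithmetic is often enough to rule out a design before any real code gets written.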


Great article, thanks for that (and for evidence #1236576 that I have no clue when it comes to software security).

One rather off-topic observation: April 23 to June 25 is somewhat shorter than the 90-day window you mentioned. ("A few days before the 90-day public disclosure deadline...") What was the reason for that? It doesn't appear to be because those who were going to fix it had already done so - they published their fixes after the public disclosure.

(I'm just curious, not criticizing or anything.)


Thanks!

Regarding the 90-day window, you are spot on. I never realized that until now. I made a mistake with the month: it should have been July 25, not June 25, so it came out after 60 days, not the 90 days I intended.

That's evidence #1236577 that I have no clue at all!


More like "Mechanical Empathy"?


Grace Hopper explaining nanosecond: https://m.youtube.com/watch?v=JEpsKnWZrJ8


Great video. I've never seen her speak before. She is very likable!


Great video. I really like when lecturers find ways to convey abstract concepts without dumbing them down with analogies. She achieved that perfectly by relating time to the speed of light, which is a perfectly sensible conversion.

Edit: I've never seen any of her videos before. Now I did. She had an amazing sense of humor.


I think the coolest thing about this visualization is that it is now "broken".

I remember when I first saw this, there were still visible red and green squares on the top rows. Today those numbers are so small, those squares are missing completely!

(That said whoever owns this page should update the scale of the blocks so it's more useful going forward)


Thank you for linking to your attachment blog post. What's the status of Ronomon? https://ronomon.com/ seems to assume an existing user and gives no more information?


Thanks! Ronomon is in private beta. My email is in my profile if you'd like to chat more.


> (That said whoever owns this page should update the scale of the blocks so it's more useful going forward)

Fwiw the "fork me on github" banner points to:

https://github.com/colin-scott/interactive_latencies


3ms for a disk drive isn't something you should plan for. 3ms assumes a 15k drive with little to no workload. Under load you're likely at 10ms, and under heavy load you may be seeing 20+. SATA drives? 5ms best case; under load with anything but sequential workloads, you might as well just take the afternoon off.


It says 3ms for a "disk seek", not a disk IO. It's definitely misleading, but not totally wrong. An average disk seek time of 3 or 4ms is pretty common. The rotational latency will be another few ms which is how we arrive at your number of 10ms, which is the total time of the IO.


Maybe it's an average and they include the SSDs.

Here's a better list:

https://gist.github.com/eshelman/343a1c46cb3fba142c1afdcdeec...


Great link - it’s much more readable than the OP.

Highly recommended!


That's a pretty good point. Is anyone aware of one of these sets of numbers with error bars / ranges?


With hard drives, the main factors are how random (in terms of random access) the workload is, the read/write distribution, IO size, and queue depth.

Some assorted facts:

Rotational latency -- The most common hard drive models are 7200 RPM. That is 120 rotations per second, or 1 rotation per 8.33 milliseconds. So, all things being equal, the average rotational latency will be half of that (best case: 0, worst case: 8.33 ms).

Seek time -- The most common hard drive models have all their heads on a single actuator (though multi-actuator models are being developed). The actuator has to physically move the heads across the surface of the platters so that the right head is over the bits you want. The worst case seek time is around 8ms on a typical hard drive, for an average of 4ms. The less random your workload is (that is, the smaller the subset of the disk you are touching), the smaller your seek time will be.

IO size -- The amount of data you can read in a single hard drive rotation actually depends on where on the platter it is. If it's on the outer diameter of the drive, it might be e.g. 2MB, but on the inner diameter, only 1MB. For this reason you can actually get significantly higher throughput when reading from the outer diameter (typically the lower LBAs). Once you start doing even larger IOs, you're talking about waiting for multiple rotations for the IO to complete, which means more time.

Read/Write distribution -- Hard drives have buffers on them to absorb writes, and these buffers are flushed asynchronously in a very efficient manner (either when the hard drive is not doing anything, or when it is on the way to another IO and happens to pass over the right part of the disk).

Queue depth -- Hard drives allow you to enqueue multiple concurrent IOs (most commonly from different threads/processes). SATA allows for a maximum of 32 concurrent IOs, whereas NVMe goes all the way up to 65K. Usually with an HDD, though, you'll be on SATA. The number of concurrent IOs is referred to as the queue depth. You can improve your average latency by increasing queue depth, at the cost of peak latency. The reason for this is that hard drives will re-order queued IOs to reduce rotational latency and seek time. For example, imagine you issue 4 concurrent reads to LBAs 0, 10, 2, 5. The drive will likely reorder this to 0, 2, 5, 10 so that it doesn't have to spend as much time seeking. This will reduce the average latency, but at particularly high queue depths an unlucky IO might get deprioritized for 100+ ms. If you care a lot about peak latency, bound your queue depth to 4 or so.
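
To make the arithmetic above concrete, here is a minimal sketch (my own rough model using the figures from this comment, not a vendor formula) that adds rotational latency, average seek time and transfer time for a small random read on a 7200 RPM drive:

    #include <stdio.h>

    int main(void)
    {
        double rpm               = 7200.0;
        double rotation_ms       = 60000.0 / rpm;    /* 8.33 ms per rotation */
        double avg_rotational_ms = rotation_ms / 2;  /* ~4.17 ms on average  */
        double avg_seek_ms       = 4.0;              /* typical average seek */
        double io_kb             = 4.0;              /* 4 KB random read     */
        double mb_per_rotation   = 2.0;              /* outer-diameter track (assumed) */

        double transfer_ms = (io_kb / 1024.0) / mb_per_rotation * rotation_ms;
        double avg_io_ms   = avg_rotational_ms + avg_seek_ms + transfer_ms;

        printf("Estimated average random 4 KB read: %.2f ms (~%.0f IOPS)\n",
               avg_io_ms, 1000.0 / avg_io_ms);   /* ~8 ms, i.e. ~120 IOPS */
        return 0;
    }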


If ping times are your jam, we've collected a few between major cities across the globe: https://wondernetwork.com/pings


Averages are not particularly interesting. CDFs or violin graphs are, since hundreds of requests to load a site can turn your 99th percentile into what the user actually experiences.

Edit: I just noticed that clicking on it shows box plots. Good enough!


Thanks!

We've got all of our data going back years in AWS Athena, we're just waiting to have time to do something fun with it.


Thanks for this tool. It was extremely helpful in trying to find the best locations for data centers across the world. Who knew that presence in New York, London, and Singapore alone is enough to be within 100ms reach of 80% of the world's population.


Cool. I don't think the 3 decimal places are significant.


This is a good example of bad display of data: ten lines, for example, don't get recognized by the mind as having that much greater value than a single line.


It will never change, but another latency to think of with regard to hardware is that electrical signals travel at about 6 in/ns in circuit boards.


This is great for server development, but I'd love to see the numbers for mobile phones nowadays. Rather than year, mobile phone by year class...


I'm surprised how cheap a branch mispredict is. I had thought the relatively long pipelines in modern processors would have made that more painful. It seems weird Intel devotes so much silicon to improving the branch predictor when the penalty is so light.


[ full disclosure: ex-Intel, although I did not work on h/w ]

It is about 14 cycles on modern hardware, which is consistent with the 3ns if you have a very high-clocked machine. I think 3ns is rounding down otherwise.

The penalty is not that light if you have heavily optimized code.

Those 14 cycles will be spent with a completely empty pipeline and the code that follows the miss will be starting from scratch - that is, it will likely take time to spin up to filling the pipeline effectively. That is, you have a deep (~14) and wide (~4) pipeline and you're going to be progressively filling it with new work - maybe some of that work has data dependencies so it will be a while before you're filling most slots.

Under enormously ideal conditions you can issue and retire 4 uops (not quite 1:1 with instructions) per cycle - so 14 cycles is passing up the opportunity to execute 56 uops (again, under ideal conditions). Very few codes will get anywhere near this.

However, that is one of the prime reasons that having a really good branch predictor - and it is very good - is so important. It's not just the time spent waiting, it's the fact that you mostly wipe out your pipeline every time.


Try it for yourself! Create an array with a million random bytes from 0 to 255, then count (in a for loop) with an "if" statement how many of them are less than 128. This branch will be essentially impossible to predict, and the branch predictor will fail roughly 50% of the time.

Then try again, but sort the array before you begin (which will make the branch trivial to predict). Time the difference. I think you'll be surprised.


Looks like the 3x penalty is about right:

  There are 5001834 entries out of 10000000 less than 128
  Execution took 65335182 nanoseconds
  There are 5001834 entries out of 10000000 less than 128
  Execution took 26693952 nanoseconds
Using the following code I hacked together:

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <sys/types.h>
  
  #define DATASIZE 10000000
  #define NSEC_PER_SEC 1000000000
  
  int cmp(const void* a, const void* b)
  {
  	return *(unsigned char*)a - *(unsigned char*)b;
  }
  
  int countsmall(unsigned char* data)
  {
  	struct timespec start;
  	struct timespec end;
  	u_int64_t delta;
  	int lcv;
  	int numsmall;
  
  	numsmall = 0;
  	clock_gettime(CLOCK_MONOTONIC, &start);
  	for ( lcv = 0; lcv < DATASIZE; lcv++ )
  		if ( data[lcv] < 128 )
  			numsmall++;
  	clock_gettime(CLOCK_MONOTONIC, &end);
  
  	delta = (end.tv_sec   * NSEC_PER_SEC + end.tv_nsec) - 
  		(start.tv_sec * NSEC_PER_SEC + start.tv_nsec);
  			
  	printf("There are %d entries out of %d less than 128\n", numsmall, DATASIZE);
  	printf("Exeuction took %ld nanoseconds\n", delta);
  
  	return 0;
  }
  
  int main(int argc, char** argv)
  {
  	static unsigned char data[DATASIZE]; /* static: a 10 MB array would overflow the default stack */
  	int lcv;
  	srandom(time(NULL));
  	
  	for ( lcv = 0; lcv < DATASIZE; lcv++ )
  		data[lcv] = random();
  
  	countsmall(data);
  
  	qsort(data, DATASIZE, sizeof(unsigned char), cmp);
  
  	countsmall(data);
  
  	return 0;
  }


Well, I took the bait. Time to go through an unsorted vs sorted array on an iMac17,1 with a 4GHz Core i7 (not sure which CPU exactly):

* 1MB: 6.52ms vs 2.16ms (3x speedup)

* 1GB: 5.73s vs 2.25s (2.5x speedup)

Tested with this C code, called as "bpredict-time [size] [sort]":

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/time.h>
    
    #define SIZE (1024*1024)
    
    int cmp (const void *val1,
             const void *val2)
    {
        return *(uint8_t*)val1-*(uint8_t*)val2;
    }
        
    int
    main (int argc,
          char **argv)
    {
        uint8_t *array;
        size_t size = SIZE;
        struct timeval start, end;
    
        if (argc > 1) {
            size = atoi(argv[1]);
        }
        
        array = malloc(sizeof(*array) * size);
        if (array == NULL) { return -1;}
    
        for (size_t i=0; i<size; i++) {
            array[i] = random();
        }
    
        if (argc > 2) {
            qsort(array, size, sizeof(*array), cmp);
        }
    
        (void)gettimeofday(&start, NULL);
        int cnt = 0;
        for (size_t i=0; i<size; i++) {
            if (array[i] < 128) { cnt++;}
        }
        (void)gettimeofday(&end, NULL);
    
        
        printf("N ints < 128: %d\n", cnt);
        printf("Time: %f\n", (float)(end.tv_sec-start.tv_sec)\
            +(float)(end.tv_usec-start.tv_usec)/1000000);
        free(array);
        
        return 0;
    }


Why do you use "(void)gettimeofday(&start, NULL);" instead of just "gettimeofday(&start, NULL);"?


To avoid the compiler complaining about the unused return value, by telling it I'm explicitly choosing to ignore the value.

The complaint doesn't show up on the default warning levels. I think only with -W ? (fun fact: -Wall actually has fewer warnings enabled than -W, at least with Clang)

Edit: actually, the warning is enabled by default, but only triggers for functions that have the attribute 'warn_unused_result' set, which is almost none of them, which is why this warning is not well-known. Indeed, in my example my (void) isn't necessary, and I couldn't get the warning to trigger even with -Wpedantic.


A lot of people tested this on Stack Overflow a while back:

https://stackoverflow.com/questions/11227809/why-is-it-faste...


Speculative execution probably plays a large role in that. I think this is more of an average that you can expect to see. Additionally, if eager execution is not at play and a mispredict is triggered when data is finally fetched from main memory, the 5ns (a handful of cycles) might just be the cost of unwinding, meaning the actual cost was 105ns.


It estimates a 10 clock cycle penalty, which at 3GHz is ~3ns. I think that is the correct order of magnitude, but 15-20 clocks is probably more accurate. The penalty is variable though and hard to characterize. For example, recovering from the uop cache shaves 3-4 cycles.


These latencies should include register access (nominally 1 clock cycle). Compilers work very hard to keep things in registers and it's nice to know the relative benefit of register vs L1.


There are a few things in this diagram that always drive me a little crazy. Why is it that 1,000ns ≈ 1μs while 1,000,000ns = 1ms (almost equals vs. regular equals). It kinda makes sense to use almost equals for some of the real values, but they do it for pure values that are just there for conversions... but not with all of them, only with some.

All this being said, this is actually a very useful and well done diagram. But, imo, it would be even better if it used equal signs in a consistent way.


My guess is it's a typo: "10,000ns ~ 10us = [green cube]" is supposed to read "10,000ns = 10us = [green cube]". It's a key, as are the black, red and blue ones; all of the speed measures are approximate.


I thought it was funny not to see the roundtrips increase at all. All the other measurements explode in size the further back you go, and then the roundtrips just stay the same.


It's the speed of light staying constant despite best efforts of physicists and engineers. The distance between California and the Netherlands hasn't changed either.


Good time to dig out some lore:

http://www.ibiblio.org/harris/500milemail.html


Physicists and engineers actually are working on increasing the speed of light for fibers.

https://www.nature.com/articles/nphoton.2013.45?error=cookie...


Technically neutrino signalling could make it faster, but those extra milliseconds are not worth it, even for the HFT folks... but thanks to them we have shorter fiber routes across the atlantic at least.


> neutrino signalling could make it faster

you need a nuclear reactor for transmission and a deep array of detectors buried underground for reception...


And somehow distinguish between the signal and the flood of solar neutrinos.


> thanks to them we have shorter fiber routes across the atlantic at least.

Can you explain how HFT caused a shorter fiber route across the Atlantic? Is this route open to the non-HFT public?

I read Flash Boys and am aware of a custom fiber link between Weehawken, NJ and Chicago, IL, but I thought they were moving to microwave. I thought there were some HFT links (fiber or microwave) within Europe.


There are microwave links across Europe:

https://sniperinmahwah.wordpress.com/2016/01/26/hft-in-the-b...

There are apparently shortwave links across the Atlantic:

https://sniperinmahwah.wordpress.com/2018/05/07/shortwave-tr...

https://sniperinmahwah.wordpress.com/2018/07/16/shortwave-tr...

These should have a lower latency than fibre links although the shortwave link will probably have rather a low data rate.


Microwave is great for latency, but the bandwidth sucks. When you need bandwidth & low latency, you need a more direct fiber link.


I recall a new transatlantic cable constructed a few years ago being advertised as shaving off a few milliseconds. Trading and cloud providers were mentioned as target customers. I don't know whether they also route public internet traffic.


It's actually about 60% the speed of light because there's no vacuum in buried fiber.

But yes the rest of your point is accurate.


Also, presumably light has to travel farther in fiber because it's constantly bouncing off of walls vs. straight line travel?


I was going to dismiss it as insignificant, but if I read this right:

https://en.m.wikipedia.org/wiki/Total_internal_reflection

an angle higher than ~42 degrees seems to indicate roughly a 50% longer path?

That said, I wonder what the average "fractal dimension" of a city-to-city fiber link is. I suspect the wavering path around buildings and along roads dominates the increase in path length vs the theoretical straight (curved along the earth) path?


Don't forget fiber splices for accidental cuts. Engineers have to try their best to match the existing refraction when splicing in the new patch. Tools help, but there's a lot of skill and craftsmanship there.


But a splice should either work, or not work? It is literally welding two glass strands into one?


A splice may or may not work, or something in between. The splice might cause enough attenuation that you go over your optical budget (signal strength and receiver sensitivity) so no link comes up. But it might come up just barely and then you may have some bad frames as a result.


This is true for multimode. Anywhere outside short distances in datacenters you use singlemode, where light does not bounce at an angle. A singlemode fiber acts as a waveguide; the signal travels in a single mode, which is mainly why it's good for longer distances: no modal dispersion.


That is not true. Single mode fiber definitely comes in at an angle and bounces off the cladding. You still have a cone of acceptance at play. You're confusing multiple terms and concepts here.


The point seemed more along the lines of "the ping is a function of the speed of light" to me, so not necessarily inaccurate. Although that does leave room for improvement like introducing a vacuum or making the lines more direct.


I mean, it is 4844 nautical miles as the crow flies, but we could make that 4458 nmi (nautical miles) by taking a direct route through the crust.

I have to say I am disappointed/surprised by how limited those savings are.


Seems like in keeping the thing up to date it hasn't kept pace with the way computing has changed: multicore, NUMA, an extra cache level in the hierarchy, etc.

Lacking:

- L3 cache reference
- major fault
- TLB miss
- syscall overhead
- inter-thread latency, 64b, threads on same package
- inter-thread latency, 64b, across packages
- latency to Ethernet TX, 64b
- latency between 2 machines attached to a local switch, 64-byte transfer

Comments?


With SSDs, don't the critical determinants of latency and bandwidth become the data transfer buses, rather than the storage media themselves?

(especially if one uses an external USB SATA controller, in the vain hope of having storage that is all three of durable, fast and easily-replaced)


Wow, it seems they did a great job in 1990 with the packet roundtrip from CA to the Netherlands; it still lists ~150ms for 2018. I think it's closer to half that or less, depending on which CA this is referring to. Assuming it's not in Tokyo.


The trip is dominated by the speed of light (or more accurately, the speed of electricity, which is close to the speed of light, at about 300,000 kilometers per second).

The circumference of the earth is about 40,000 kilometers. A round trip from CA to the Netherlands and back will have electricity covering roughly that distance. The speed of electricity times 150 milliseconds is 45,000 kilometers.

That means that the number won't change over time.


> (or more accurately, the speed of electricity, which is close to the speed of light, at about 300,000 kilometers per second)

Surely, the dominant part of the path city to city is via fiber, so laser light?

That doesn't mean that's where most of the time goes (but ideally it should) - for latency, things like light-to-signal conversion, buffers, switching, carrier-grade NAT all take their toll.

Would be interesting to see a detailed breakdown of typical data center(... Cell phone, home broadband user) to same, though. Including every "point" in the path from ram, via cache, pci, network card, routers and switches, media converters and up again... For bonus something like USB keyboard touch to glyph on screen over udp or other low latency protocol...


Yes, it's more or less all fiber. And in that case the speed is 2/3 of the speed of light.

Delay will for the most part be because of distance; forwarding delay in routers is a few tens of microseconds, while switches are at the sub-microsecond level. Stuff like this is quick when they have to handle 100Gbit interfaces.
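
As a sanity check, a small sketch of that calculation, assuming ~2/3 c in fiber and a path roughly the length of the great-circle distance (both assumptions; real routes are longer, which together with equipment delay gets you toward the ~150 ms in the chart):

    #include <stdio.h>

    int main(void)
    {
        double c_km_per_ms  = 300.0;       /* speed of light: ~300 km per ms       */
        double fiber_factor = 2.0 / 3.0;   /* propagation speed in fiber vs vacuum */
        double one_way_km   = 9000.0;      /* rough CA <-> Netherlands distance    */

        double rtt_ms = 2.0 * one_way_km / (c_km_per_ms * fiber_factor);
        printf("Theoretical fiber RTT: %.0f ms\n", rtt_ms);   /* ~90 ms */
        return 0;
    }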


(I assume you mean 2/3 of light speed in vacuum - it's 100% the speed of light in acrylic (?))


Won't the majority of the trip be over the fibre optic medium, which would require the speed of light (albeit in a glass medium), rather than the speed of electricity?

This also implies that you could materially change that rtt number if you used a faster medium - e.g via air by way of microwave towers.


> This also implies that you could materially change that rtt number if you used a faster medium - e.g via air by way of microwave towers.

Which is done for high-speed trading networks: https://arstechnica.com/information-technology/2016/11/priva...


Funnily enough, signal speed in fiber and in most copper cables is the same, 2/3 of the speed of light in vacuum.


Yup, speed of light in fiber is roughly 2/3*c.


It might change, if we figure out neutrino-based communications. In which case we reduce that timing to roughly 48ms.


    function getDCRTT() {
        // Assume this doesn't change much?
        return 500000; // ns
    }
:thinking face emoji:

High by 10x in my experience.


It looks like from 2006-2018 the only things that have improved are disk read speeds and the "commodity network" packet speed. Most of this chart is unchanged over that time. Has progress on CPUs, memory, processing speed really stopped over the last 12 years?


It measures latency, not throughput. CPU or memory latency has not improved a lot, but throughput has, through increased core count and instruction-level parallelism. If you look at throughput per watt, the improvement is greater.


I assume the "send 2000 bytes over a commodity network" is measuring throughput rather than latency. If not, I really don't understand what it's describing...


It is literally the length of a 2000*8 bit packet on a serial link, in ns. The number seems way off unless my math is wrong or 200Gb Ethernet counts as commodity today.
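
For reference, a quick sketch of that serialization-delay arithmetic at a few link speeds (my own numbers; the chart doesn't say which speed it assumes). 44 ns for 2000 bytes comes out nearer 400Gb than 200Gb:

    #include <stdio.h>

    int main(void)
    {
        double bits   = 2000.0 * 8.0;                  /* 2000-byte packet      */
        double gbps[] = { 1.0, 10.0, 100.0, 400.0 };   /* link speeds in Gbit/s */

        for (int i = 0; i < 4; i++) {
            double ns = bits / gbps[i];                /* 1 Gbit/s == 1 bit/ns  */
            printf("%5.0f Gbit/s: %8.1f ns on the wire\n", gbps[i], ns);
        }
        return 0;   /* 16000, 1600, 160 and 40 ns respectively */
    }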


Changing the scale using colour but keeping the block the same really makes it hard for me to gain any sense of scale beyond what I already know.


This page needs some work; my iPad doesn't render it properly. The Linux box didn't like it either.


can someone explain this part to me

Send 2000 bytes over network = 44ns
Roundtrip in same datacenter = 500000 ns

Isn't a roundtrip simply a send from A to B and then a send from node B to node A?


The first might be just the time it occupies on the wire (although it'd be a 400G network, that's not quite "commodity" yet IMHO...), vs roundtrip being transfer time, time through switches, time through network stacks, ... to send a packet and reply to it as fast as possible.


not sure what you mean by "on the wire".

But even if this is assumed to be the time to transfer a UDP packet from one network card buffer to another network card buffer directly connected by a cable, this seems extremely low.


"on the wire" as in what you'd see if you'd connect an oscilloscope to the "wire" (which would be fiberoptics at that speed...) and watched how long it took from packet start to end. That could work, but it could also just be an error on the page.


Is there a cloud version, e.g. what we should expect on AWS?


These numbers are based on values from Google's production stack, insofar as they are based on Jeff Dean's post: http://highscalability.com/blog/2011/1/26/google-pro-tip-use...

That being said, these numbers are roughly identical to what you can expect on AWS.


What about your average call to malloc() / new?


Is there a similar chart for bandwidth numbers?


So there's no more room for improvement at the microchip level. Will software become leaner? Is Moore's law reaching a plateau?


If you're a web developer, only one of these is even slightly relevant.


Unless the programmer is programming at a very low level, the listed events occurring are out of his control. CPU caching is mostly transparent in the ISA, disk seeking is scheduled by the kernel or in storage controllers, file buffers are cached in the kernel, application frameworks also provide layers of caching.

For the majority of programmers who want to get their shit done with straightforward code, few dependencies and acceptable performance, this is "interesting to know" but not "should know".


> Unless the programmer is programming at a very low level, the listed events occurring are out of his control

> CPU caching is mostly transparent in the ISA

Nonsense. Even in a relatively high-level language like Java, you can use primitive types like int[] to ensure that certain elements are close to each other in memory. As such, you can have good memory access patterns even in a high-level language like Java or C#.

I'm fairly certain this stuff is important when choosing data structures: Vector vs Linked List, for instance. Linked Lists are harder to cache than Vectors, and this chart helps explain why two traversals with the same big-O complexity can have dramatically different performance characteristics.
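
A hypothetical microbenchmark of that point, in C to match the other code in this thread (the list nodes are deliberately linked in shuffled order to mimic a fragmented heap; freshly allocated, in-order nodes would show a smaller gap):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 10000000

    struct node { int value; struct node *next; };

    static double now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
    }

    int main(void)
    {
        int *a = malloc(N * sizeof *a);
        struct node *nodes = malloc(N * sizeof *nodes);
        size_t *order = malloc(N * sizeof *order);
        if (!a || !nodes || !order) return 1;

        srand(1);
        for (size_t i = 0; i < N; i++) {
            a[i] = rand() % 256;
            nodes[i].value = a[i];
            order[i] = i;
        }
        /* Shuffle so that list traversal hops around memory. */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (size_t i = 0; i + 1 < N; i++)
            nodes[order[i]].next = &nodes[order[i + 1]];
        nodes[order[N - 1]].next = NULL;

        double t0 = now_ms();
        long s1 = 0;
        for (size_t i = 0; i < N; i++)      /* contiguous, prefetch-friendly */
            s1 += a[i];
        double t1 = now_ms();
        long s2 = 0;
        for (struct node *p = &nodes[order[0]]; p; p = p->next)
            s2 += p->value;                 /* dependent loads, cache-hostile */
        double t2 = now_ms();

        printf("array sum %ld in %.1f ms, list sum %ld in %.1f ms\n",
               s1, t1 - t0, s2, t2 - t1);
        free(a); free(nodes); free(order);
        return 0;
    }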

> disk seeking is scheduled by the kernel or in storage controllers, file buffers are cached in the kernel

But you can read a file from beginning to end. Even in a very high-level language like SQL, you can often ensure a high-speed sequential table scan if you write your joins properly. And knowledge of sequential scans can assist you in knowing which indexes to set up for your tables.

Knowing whether you have SSDs vs disks can be helpful when architecting your SQL schemas and queries.


> As such, you can have good memory access patterns even in a high-level language like Java or C#.

CPU and data intensive heavy lifting is rarely done in such programs; it is delegated either to specialised libraries or to some middleware in the form of an RDBMS. Most of these programs spend most of their time waiting for some IO event, so the few microseconds gained from the vector with a few hundred elements are negligible.

> But you can read a file from beginning to end.

That's what most programs actually do most of the time because files are essentially a stream abstraction. Programs that jump around a file would map it into memory, then the CPU and the kernel would do their best to cache the hot regions, even if the access to these regions is temporally or spatially distant.


> CPU and data intensive heavy lifting is rarely done in such programs

That's absurd. They aren't as high-performance as C or C++, but Java and C# both have screamingly fast JIT compilers and plenty of high-performance code is written in them. We're not talking about Prolog here. And memory access patterns absolutely makes a huge difference in performance in these languages.

Sure, you CAN ignore that kind of stuff if you want to, but good programmers don't.


> CPU and data intensive heavy lifting is rarely done in such programs

How could you make such a blanket statement? Is bashing Java the new hipster thing to do in the programming world nowadays?


Other posters have discussed the first part of your post.

But the second...

> That's what most programs actually do most of the time because files are essentially a stream abstraction. Programs that jump around a file would map it into memory, then the CPU and the kernel would do their best to cache the hot regions, even if the access to these regions is temporally or spatially distant.

Just an FYI: you should always use mmap (Linux / POSIX), or File-based Section Objects (Windows). I don't think streams have any performance benefit aside from backwards compatibility, and maybe clarity in some cases.

MMap and the Windows equivalent allow the kernel to share the virtual memory associated with that file across processes. So if the user opens a 2nd or 3rd instance of your program, more of it will already be "hot" in RAM.

Since mmap and section objects only use "virtual memory" (of which we have 48 bits' worth on Intel / AMD x64 platforms), we are at virtually no risk of running out of the ~256TB of virtual address space available to our platforms.
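
A minimal read-only mmap sketch of what's described above, assuming a hypothetical file name (error handling kept short):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);   /* "data.bin" is just an example */
        if (fd < 0) return 1;

        struct stat st;
        if (fstat(fd, &st) < 0) return 1;

        /* Map the whole file; pages are faulted in and cached by the kernel on
           demand, and are shared with any other process mapping the same file. */
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) return 1;

        long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)  /* touch the mapping */
            sum += p[i];
        printf("byte sum: %ld\n", sum);

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }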


Quite a bit of HFT software is written in java.


> Unless the programmer is programming at a very low level

If you write unixy tools that do just one job then you often have to deal with this. For example, rsync and tar put files into sequential mode and perform readaheads, or drop written-behind pages from the page cache.

And it's not just that kind of tool. At $JOB I did a fairly simple optimization to significantly reduce loadtimes (from NFS) in a render farm by importing a 3rd-party library which provided the necessary libc bindings for readaheads. It's only a dozen lines of code but reduces user-perceived latency from minutes to seconds. The PO was quite happy about not having to pay for hundreds of NVMe SSDs.
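
The parent doesn't say which calls the library wrapped; posix_fadvise is the usual POSIX-level way to get this effect, so here is a minimal sketch assuming that (the helper name and buffer size are made up):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hint the kernel before reading a file sequentially, then drop the cached
       pages afterwards so a pass over many files doesn't evict everything else. */
    static int read_sequential(const char *path, char *buf, size_t bufsize)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return -1;

        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  /* larger readahead   */
        posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);    /* start fetching now */

        ssize_t n;
        while ((n = read(fd, buf, bufsize)) > 0)
            ;                                            /* process buf here   */

        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);    /* drop cached pages  */
        close(fd);
        return n < 0 ? -1 : 0;
    }

    int main(int argc, char **argv)
    {
        static char buf[1 << 20];                        /* 1 MB read buffer   */
        if (argc < 2) return 1;
        return read_sequential(argv[1], buf, sizeof buf) == 0 ? 0 : 1;
    }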


In general, for small-scale web development the only things that matter are: 1. Are there not too many queries fired at the database (N+1)? 2. Do the queries perform well?

Once you use SQL or any external service from your application, what else you do often does not matter very much. As long as you keep the big O in the back of your mind, you won't need to optimize for cache lines etc.


Not sure what very low level is, but if you are writing anything with low-latency requirements you need to be cognizant of these things. As someone mentioned, using an array over a list helps the CPU with memory striding. Cache misses are expensive.


You often can control your data access patterns. Array-like access, in order, is often the most cache-friendly.


The visual could help programmers think about the kinds of latencies they have to work through, maybe by highlighting them in an infographic for their teams/partners.


Tufte should be required reading if one is publishing something like this. I assume this is the work of an undergrad because it's horrible.

EDIT: downvote is absurd. it's a no-question horrible visualization. I don't know which is worse, the poor presentation or the lack of credit to the original author (Dean).


> the lack of credit to the original author (Dean)

The linked GitHub repo's description reads "Jeff Dean's latency numbers plotted over time".


yes, the github. 5% of the target audience might read that. trivial to add the credit on the graphical page, and no excuse not to do so.

third fault, which to me should ban this page from the internet: he has dared to put a "2018" text on his page, insinuating that it's new data, or some new insight, as opposed to being 15+ years old.

when the original text is more understandable than your visualization, you dun goofed



