How CPU load averages work, and using them to triage webserver performance (jvns.ca)
75 points by ingve on Feb 8, 2016 | 19 comments



Be aware that load averages are exponentially damped averages, not moving averages over windows of 1, 5, and 15 minutes.
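
To make the difference concrete, here's a minimal Python sketch of the damping (not the kernel's fixed-point code; it assumes the roughly 5-second sampling interval Linux uses). A one-minute burst of 10 runnable tasks never quite shows up as 10, and never quite decays back to 0 afterwards either:

    import math

    SAMPLE_INTERVAL = 5.0   # Linux samples the run queue roughly every 5 seconds

    def damped_load(samples, window):
        """Exponentially damped average: old samples never fully drop out,
        they just get geometrically less weight."""
        decay = math.exp(-SAMPLE_INTERVAL / window)
        load = 0.0
        for n in samples:   # n = runnable (+ uninterruptible) tasks at each tick
            load = load * decay + n * (1.0 - decay)
        return load

    burst = [10] * 12                                   # one minute with 10 tasks
    print(round(damped_load(burst, 60), 2))             # ~6.3, not 10
    print(round(damped_load(burst + [0] * 12, 60), 2))  # ~2.3 after an idle minute, not 0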

More quirks of Linux performance tools are outlined in Brendan Gregg's talk:

http://www.slideshare.net/brendangregg/broken-linux-performa...

https://www.youtube.com/watch?v=9U4jFpsEyYE&t=150


One of the best articles I've read on the subject is this one: http://blog.paralleluniverse.co/2014/02/04/littles-law/
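
For anyone who skips the link: Little's law says the mean number of items in a system equals the arrival rate times the mean time each item spends in the system, and the load average is exactly such a "number in the system". A toy check in Python, using the figures from the submitted post (60 requests/sec, ~100 ms per request) as they're quoted elsewhere in this thread:

    ARRIVAL_RATE = 60.0    # requests per second (figure from the submitted post)
    TIME_IN_SYSTEM = 0.1   # seconds each request spends using or waiting for the CPU

    # Little's law: mean number in the system = arrival rate * mean time in system
    print(ARRIVAL_RATE * TIME_IN_SYSTEM)   # 6.0 -- i.e. a load average of about 6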


Load average measures the length of the queue of processes waiting for the CPU.

Most web applications aren't CPU-intensive; rather, they depend on many other services like the database, S3, ... If a web worker is waiting for another service, it should release the CPU until the service responds.

So it is entirely possible that all workers are waiting for other services, your app is not responding, and the CPU is idle.


> Load average measures the length of the queue of processes waiting for the CPU

I don't think that's quite right.

Load averages measure the size of the queue of processes that are either runnable (using or waiting for the CPU) or in uninterruptible sleep (typically blocked on disk).

If you ever have a webserver/cache whose disks have become the bottleneck, you can often end up with a high load (lots of processes blocked on disk and counted in the load) but low CPU utilisation (everything is waiting on disk, so the CPU is relatively idle).

In my experience, high UNIX load on busy systems is far more often caused by disk I/O than by CPU utilisation.

E.g. if you run "du -s / &" four times, you can watch your UNIX load climb towards 4 within a minute or so, while you still have idle CPU time. This will be more obvious with slow magnetic disks than with a fast SSD.
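
A quick way to tell the two apart on Linux: the load average counts tasks that are runnable (state R) together with tasks in uninterruptible sleep (state D, almost always disk wait). A rough sketch that walks /proc and shows which kind dominates (it counts processes rather than individual kernel threads, so it won't exactly match the instantaneous run-queue figure):

    import glob

    runnable, disk_wait = 0, 0
    for path in glob.glob('/proc/[0-9]*/stat'):
        try:
            with open(path) as f:
                # the field after the closing ')' of the comm is the process state
                state = f.read().rsplit(')', 1)[1].split()[0]
        except (OSError, IndexError):
            continue          # process exited while we were scanning
        if state == 'R':
            runnable += 1
        elif state == 'D':
            disk_wait += 1

    with open('/proc/loadavg') as f:
        one_min = f.read().split()[0]

    print(f"1-min load {one_min}: {runnable} runnable, {disk_wait} in uninterruptible (disk) wait")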


> If a web worker is waiting for another service, it should release the CPU until the service responds.

Emphasis on "should". With your average drivel (Wordpress e.g.) this doesn't happen and the load on the application server scales nicely with the response time of the database server. 10 requests waiting on a locked table for 2 seconds = load of 10.


Are you saying that the PHP mysql driver waits for the results with a busy loop? I have to say I find that hard to believe.


> I have to say I find that hard to believe.

¯\_(ツ)_/¯

Steps to reproduce: Set up one MySQL server. Set up one PHP application server (nginx + PHP-FPM in my case; it shouldn't make a difference with mod_php) on a different machine. Configure WordPress to use said MySQL server. Lock the database in some way (say, a mysqldump of a WordPress installation with millions of records in wp_options), then hit the application server with regular traffic (or ab) and watch as the load reaches pm.max_children. On the application server, not the database server.


If a service waits for another service, then that other service often contributes to CPU load.

Not always, no...


Not when the other service is on another machine, like in the case I described.


Sorry, I was sloppy. I meant that it generates CPU load somewhere (or I/O load, for those OSes where those two are cleanly separated).

If you measure the CPU load on half the CPUs that are involved in serving a user, you're not really measuring anything. Ditto if you're outsourcing (parts of) the backend and not measuring the outsourced service level.


> Sorry, I was sloppy. I meant that it generates CPU load somewhere (or I/O load, for those OSes where those two are cleanly separated).

Not necessarily. In the scenario I outlined to icebraining, "load" is generated (on the app server) by simple busy waits due to lock contention: full table locks taken by mysqldump block the database server for its other clients, so the database server itself sees negligible load (it's only serving one client, after all).

> If you measure the CPU load on half the CPUs that are involved in serving a user, you're not really measuring anything

I'd rather argue that the Unix/Linux CPU load metric is meaningless either way, except in a vague "something's wrong somewhere, probably" way.


IMO, if there's enough busy-waiting to show up in the CPU load, then the busy-waiting itself is a problem, because a CPU busy with such chores is not available for real work. The CPU load metric is not (as you say) very lucid, a bit like the question mark in the "Unix car" joke, but in this particular case it points at a real problem: too much busy-waiting.


Load averages are complex and also differ quite a bit between *nix operating systems. This old post on undeadly is pretty good:

http://undeadly.org/cgi?action=article&sid=20090715034920&mo...


Related, but not the same... There was an article on HN about various tools to evaluate resource usage in the last month or two: top, lsof, netstat, etc. I think it was put out by a big name in web services: Netflix, Amazon, etc (but I'm not sure). I was sure I had it bookmarked, but I haven't been able to find it. Does anyone remember that post or have it readily available?



Isn't the calculation (assuming, for simplicity, one CPU running at 100% purely on application load):

60 requests/sec => each request takes 1/60 of a CPU-second == 16.6 ms of CPU time to process? (This is time-on-CPU and doesn't include time-waiting-for-CPU. I think time-on-CPU is the number you want if you're looking at optimising your codebase.)


She does mention:

> each request was taking 6 / 60 = 0.1s = 100ms of time using-or-waiting-for-the-CPU.

(emphasis mine)

In my original read, I thought her core count was greater than her load, so that would also be her direct time-on-cpu. Now I'm not so sure.

And while time-waiting-for-CPU might not matter for optimizing the codebase, you probably still want to know that your serving processes are waiting for CPU; after all, that is the number your user's browser experiences (of the two, at least, more so than time-on-CPU). Such a result might indicate that a larger machine, or more machines, are required, for example.


Pretty much. If you want to get fancy you can make assumptions about the distribution of request arrival times and use the mean queue length of 6 to estimate the fraction of the time when the queue length drops to zero; you probably come out somewhere around 10% CPU idle time, so each request is taking 15 ms to process rather than 16.6 ms.

But cpu-time-per-request is definitely the number you want to pay attention to. If you cut that by a factor of 2, you won't decrease the load average from 6 to 3; you'll decrease it from 6 to less than 1.
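
To put rough numbers on it, under one common choice for the arrival-time assumption, an M/M/1 queue (Poisson arrivals, a single CPU), the sketch below (illustrative, not taken from the article) solves for the service time that yields a mean of 6 in the system, then shows what a 2x speedup does to it:

    ARRIVAL_RATE = 60.0   # requests per second (the example figure being discussed)

    def mm1_mean_in_system(service_time):
        """Mean number of requests in an M/M/1 system (queued + in service)."""
        rho = ARRIVAL_RATE * service_time          # CPU utilisation
        assert rho < 1, "system is overloaded"
        return rho / (1 - rho)

    # Solve rho / (1 - rho) = 6  =>  rho = 6/7, then service time = rho / arrival rate.
    service_time = (6 / 7) / ARRIVAL_RATE
    print(f"{service_time * 1000:.1f} ms of CPU per request")               # ~14.3 ms
    print(f"{mm1_mean_in_system(service_time):.2f} in system")              # 6.00
    print(f"{mm1_mean_in_system(service_time / 2):.2f} after a 2x speedup") # ~0.75, i.e. load < 1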


Any decent web server logs the response time, so this can easily be verified. I am not sure why such a number isn't tracked in the first place.



