Can you imagine if they loaded 1,000 lbs of lead onto an airplane before takeoff, just to see whether the plane still takes off with its headroom used up? "Oh crap, the plane is falling, let's dump the lead."

OP's "extra reads" is dumb because he could have had normal metrics for memcached load and planned to only support like 70% capacity or somesuch, and when load hit that number, he would immediately increase capacity. Instead he's running with a handicap. It's just useless.




If your system is complex, you don't know what your capacity is, or whether the curve has a sharp knee. Think of disk I/O; if 100K simultaneous users keep your disks 50% busy, how many simultaneous users can you actually support? The answer is not 200K; the answer is "it depends".
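The knee is easy to see even in a toy model. The sketch below is not from the comment; it's just the standard M/M/1 response-time formula, T = S / (1 - utilization), evaluated at a few points, with a 5 ms service time assumed for illustration. It shows why 50% busy does not mean you can take double the load.

    # Why 50% busy does not mean 2x headroom: in an M/M/1 queue, mean
    # response time is service_time / (1 - utilization), so latency is nearly
    # flat at 50% and explodes as the disks approach saturation.
    SERVICE_TIME_MS = 5.0  # assumed average time to serve one I/O

    def mm1_response_time_ms(utilization: float) -> float:
        if not 0.0 <= utilization < 1.0:
            raise ValueError("utilization must be in [0, 1)")
        return SERVICE_TIME_MS / (1.0 - utilization)

    for rho in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99):
        print(f"{rho:.0%} busy -> {mm1_response_time_ms(rho):6.1f} ms per I/O")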


First of all, it's pointless to measure how loaded your system is by 'number of users'. That's not a metric. You look at your IOPS metrics to figure out how loaded it is. Once you've gathered trending data, you can come up with an average IOPS load for a given number of users.
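The arithmetic being described is just a ratio over the trend samples. A quick sketch, where the sample pairs and the 60k IOPS ceiling are invented purely for illustration:

    # Deriving an average per-user IOPS figure from trend samples, as the
    # comment describes. All numbers here are invented for illustration.
    samples = [  # (concurrent_users, observed_iops) pairs from trending data
        (10_000, 4_200),
        (25_000, 10_800),
        (50_000, 21_500),
    ]

    iops_per_user = sum(iops / users for users, iops in samples) / len(samples)
    print(f"~{iops_per_user:.2f} IOPS per concurrent user")

    # Rough extrapolation against an assumed ceiling for the storage array.
    ARRAY_CAPACITY_IOPS = 60_000
    print(f"~{ARRAY_CAPACITY_IOPS / iops_per_user:,.0f} users before saturation")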

Secondly, you should know what your capacity is. Stress testing exists for a reason.


First of all, no, it isn't. At a large enough scale, any metric that goes up with usage is a perfectly fine proxy for many capacity conversations. Every day for a decade we had the same 3% of our user base online at peak doing the same transactions as the day before. Over longer periods the number crept up and the transaction mix changed, but "number of simultaneous users" got me accurate, if not precise, capacity planning from 27 users to 1.5 million.

Stress-testing, shared-nothing and dollar-scalable are platonic ideals, and they're not always achievable. If Dropbox had three infrastructure engineers, they probably weren't able to build proper capacity planning models, and probably couldn't afford to build a full production work-alike for stress testing anyway. (And at some scales, that's literally impossible. Our vendors couldn't physically manufacture enough servers to build a full test environment, cost aside.) I'm sure they did some simulated tests as well, but those won't tell you the whole story.

You're focused on IOPS, but you have no idea if that's what Dropbox's bottlenecks were. (Not to mention: What does IOPS mean on an EBS and S3 infrastructure?) Complex systems fall over in complex ways. You can predict the next bottleneck, but not the one after that; by the time you get there, your fix for the first bottleneck will have changed the dynamics.

It sounds like they did do stress testing, using real-world loads, on a system 100% identical to their production system: they ran continuous, just-in-time stress tests in the Big Lab.


Maybe your application/environment was stable enough that site performance didn't change drastically over a decade. I've seen performance shift dramatically with the same number of users for different reasons: high-post flattened comment threads getting slammed when Michael Jackson died, software upgrades that mysteriously loaded up servers on some requests, and of course feature additions. When a site issue occurs, I don't reach for "how many users are logged in?"; I reach for my sorted list of machine- and service-specific metrics and look for irregularities.

That said, trends in user visits are of course great numbers for capacity planning, because they give you an idea of how much growth to expect in the near future. But they're only a rough multiplier: you need to know how beefy a box to get (by stress testing to determine per-box capacity) and then multiply by the growth factor. In practice it's usually more complicated than that.
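As a back-of-the-envelope illustration of the naive version of this (stress test per-box capacity, then multiply by the growth factor), where every number is invented:

    # Naive capacity plan: per-box capacity from a stress test, current peak
    # from visit trends, expected growth, and a headroom target. Invented numbers.
    import math

    users_per_box = 8_000          # measured by stress testing one box
    current_peak_users = 120_000   # from user-visit trends
    expected_growth = 1.5          # anticipated over the planning horizon
    target_utilization = 0.70      # leave headroom rather than running flat out

    boxes_needed = math.ceil(
        current_peak_users * expected_growth / (users_per_box * target_utilization)
    )
    print(f"plan for {boxes_needed} boxes")  # -> 33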

Stress testing doesn't have to be a formal process in all environments. You might just have a developer with a new chat server and they want to get a benchmark of how many users can join and chat before CPU peaks. An hour or two of coding should provide a workable test on like-hardware, which can then be generalized with tests of other software to give an idea of the capacity when a certain number of users are logged in and performing the same operations. The point isn't to know 100% when you will fall over, but to have at least an idea when you're going to fall over, so you don't have to actually fall over to figure out when and where to scale.
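That "hour or two of coding" might look something like the sketch below: spawn a pile of scripted clients against a test instance and watch server-side CPU while it runs. The host, port, and line protocol here are placeholders for whatever the hypothetical chat server actually speaks.

    # Crude load generator for a hypothetical chat server: N scripted clients
    # each send a burst of messages; watch CPU on the server box (top, pidstat)
    # while this runs to find where it peaks. Host/port/protocol are placeholders.
    import socket
    import threading
    import time

    HOST, PORT = "127.0.0.1", 9999   # like-hardware test instance (assumed)
    NUM_CLIENTS = 500
    MESSAGES_PER_CLIENT = 100

    sent = 0
    lock = threading.Lock()

    def chat_client(client_id: int) -> None:
        global sent
        with socket.create_connection((HOST, PORT), timeout=5) as sock:
            for i in range(MESSAGES_PER_CLIENT):
                sock.sendall(f"user{client_id}: hello #{i}\n".encode())
                with lock:
                    sent += 1

    start = time.time()
    threads = [threading.Thread(target=chat_client, args=(i,)) for i in range(NUM_CLIENTS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.time() - start
    print(f"{sent} messages in {elapsed:.1f}s ({sent / elapsed:.0f} msg/s)")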

I have no problems with very-short-term big lab stress testing. We had the same issue at my last place, and with lots of caution, it worked fine. But jesus christ, if I told my bosses "I think we should run all the servers with extra load until they fall over, then re-evaluate", they'd look at me like I had antlers growing out of my head.


It's actually more like dumping fuel for an emergency landing instead of dumping passengers, if we really insist on using completely irrelevant analogies.


Are planes in the habit of carrying much more fuel than they need, just in case they weigh too much at landing and have to dump it? No. And in the inverse case, we don't put extra load on servers in anticipation of load going up, just so we can shed that extra load and end up where we would have been without it. The idea is moronic. It's literally the same as putting lead in your servers until you have to dump it, just because you can. Real admins have alerts on their servers that tell them when they will need additional capacity.

Incidentally, fuel dump systems were originally added because of an FAA rule covering planes whose maximum takeoff weight exceeds their maximum structural landing weight. Many commercial planes never had this problem, so dump systems were never installed. As a result, most planes just circle until they've burned off enough fuel, or land overweight anyway. You could dump fuel to lessen the chance of an explosion, but only if your plane is equipped with a fuel dump system, and such incidents are so rare it's not even a safety consideration.


lol why are you so mad -- perhaps you couldn't make this out through your tears of rage:

"Why not just plan ahead? Because most of the time, it was a very abrupt failure that we couldn’t detect with monitoring."


lol why are you retarded? What does monitoring have to do with adding extra load onto your servers? If there's a failure, there's a failure. What does adding or removing load have to do with detecting it?

So you have a system, and you have monitoring in place. Let's say the monitors poll every minute, because somebody thought that was a good idea. Suddenly one of your servers goes down. Oh noes! It could be 45 seconds before the monitor notices, which would be horrible.

Since we have doubled the reads on the existing servers, we now no longer have capacity and connections are stacking up. Shit :'( But not to worry! Let's just quickly kill the extra reads - now we have more capacity! Hooray!
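For anyone following along, the mechanism being argued about amounts to something like the sketch below. The article publishes no code, so the flag, the cache stand-in, and the key scheme here are all invented for illustration.

    # Sketch of the "extra reads" idea: every real cache GET is optionally
    # shadowed by a synthetic one, and flipping a single flag sheds that
    # synthetic load to reclaim capacity instantly. Invented names throughout;
    # this is not Dropbox's code.
    import random

    EXTRA_READS_ENABLED = True   # the "dump the lead" switch

    class FakeCache:
        """Stand-in for a real memcached client."""
        def get(self, key):
            return None

    cache = FakeCache()

    def cached_get(key: str):
        value = cache.get(key)   # the read the application actually needs
        if EXTRA_READS_ENABLED:
            # synthetic extra read that can be switched off under pressure
            cache.get(f"shadow:{random.randint(0, 1_000_000)}")
        return value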

Except, if the extra reads weren't happening, they would have already had extra fucking capacity and not had to flip a switch in the first place.

Now you see why I'm mad, bro?


>Can you imagine if they loaded 1000lbs of lead onto an airplane before they took off, just to see if the plane takes off with the headroom filled? "Oh crap, the plane is falling, let's dump the lead."

They actually do this kind of stuff (except for the "let's dump the lead" part) in stress tests, especially on cargo and military planes. And they do similar tests not only in aviation, but in most kinds of engineering.

So maybe misplaced sarcasm?


That's stress and load testing. You don't do that on every single flight. That's the stupid part.



