Scaling lessons learned at Dropbox, part 1 (eranki.tumblr.com)
418 points by eranki on July 12, 2012 | 84 comments



Great post, but this part scares me a bit...

I think a lot of services (even banks!) have serious security problems and seem to be able to weather a small PR storm. So figure it out if it really is important to you (are you worth hacking? do you actually care if you’re hacked? is it worth the engineering or product cost?) before you go and lock down everything.

Just because you can "afford" to be hacked doesn't mean you shouldn't take all the steps necessary to proactively protect your data. In the end, security is not about you; it is about your users. This is exactly the type of attitude that leads to all the massive breaches we have been seeing recently. Sure, your company is "hurt" with bad PR, but your users are the ones who are the real victims. You should consider their risk (especially with something as sensitive as people's files!) before you consider your own company's well-being.



Yeah, that point significantly underestimates the cost of cleaning up once your systems have been penetrated. By the time you notice that one system has been compromised, there is no guarantee that every other system at your company is not also compromised, particularly if so little effort has been put into a robust security architecture. I've seen companies that took the attitude the author does, and they ended up paying for it down the road.

Systems get compromised, it happens. Organizations with weak security architectures can become so compromised that cleanup becomes a nightmare because it is difficult to isolate the threat(s) without serious disruption in services. A strong security architecture is not so much to ensure breaches never happen but to limit the amount of damage likely to occur when breaches do happen.

And yes, this happens even to organizations that think they have nothing worth hacking.


You are absolutely correct; I have consulted with several companies, large and medium sized, who have had this exact thing happen. To quote the article again:

Having internal firewalls between servers that don’t need to talk to each other — again a good idea. But if your service doesn’t actually need this, don’t necessarily do it

I cannot think of any case where "your service doesn't actually need this" justifies "don't necessarily do it". I understand that it costs money to do these things, but setting up a firewall is relatively cheap, and significantly less than the cost of the additional cleanup if a breach is not contained.

Security, in a way, can be compared to insurance. Sure, if you are young and live a healthy lifestyle, you may not see the need to spend $100+ a month on a health insurance policy, and you can save a bunch of money... but if an accident does happen, you can rest assured it will cost you significantly more than if you had just bought the insurance in the first place.

This, in a sense, is the security tradeoff.

I think really smart engineers who are well versed in security know where security needs to be, and yes, it is possible to go overboard, but I think that is the exception rather than the rule. Advising readers that it's OK not to worry too much about security because:

lot of services (even banks!) have serious security problems

is absolutely ridiculous and is horrible advice.


That's an externality; shareholders can sue the board of directors if they find out it is company policy to spend resources on security beyond its potential costs to the company: PR + legal liability + (probability of causing new regulation * the cost of adhering to said regulation).

Sucks, but that's capitalism. However, there are a few states now which allow you to have some charitable clauses in your corporate charter.


The idea of running extra load sounds good in theory, but I can't help thinking it's a bit like setting your watch forward to try to stop being late for things. Eventually you know your watch is 5 minutes fast and start compensating for it. I wonder if this strategy starts to have the same effect: putting fixes off because you know you can pull the extra load before it becomes critical, in the same way you leave for the train a couple of minutes later because you know your watch is actually running fast.


I actually used to purposely set all the various clocks at home ahead by anywhere from 0 to 15 minutes. At first, I could remember which ones were ahead by how much, but soon I started to forget and had to just assume they were showing the right time. It worked great.

After a few years of this, I set them all back to right time and found that I had trained myself to just leave at the right time, with no more trickery needed.


I only have one alarm. If it fails, I am late. I found out that depending on complex systems works against you.

Once I had three wake-up alarms at different points in the bedroom. Didn't work.

Being late is lame. Suffering its consequences is the best teacher one can have.


Were you the college roommate I had who suspended the alarm on a string over the bed so that standing was required to turn it off?


I saw an alarm that had a propeller on top of it. The clock made it fly, and you had to hunt it down (even if it fell behind the furniture) to shut the alarm off.


I've heard of alarms on wheels that run away from you but that takes the cake.


This is just an adaptation of the "margin" system commonly used in systems engineering. Basically, you never spec everything out to perfectly match the load it's expected to bear; always leave some headroom.

For example, if you are designing an aircraft, your first design is never perfect. So when you do the initial design, you do it as if the aircraft has to weigh 70% of what it really will. As errors are corrected in your original design (or features creep in), you slowly eat away at that 30% margin. Hopefully, by the time you finish, you have some left, or the aircraft will never get off the ground.
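A toy illustration of that margin budget, with hypothetical numbers (the 70%/30% split comes from the comment above; the weights are invented):

    max_takeoff_weight = 100_000              # kg the finished aircraft may weigh
    design_budget = 0.7 * max_takeoff_weight  # initial design aims at only 70%

    # Weight added later by corrected errors and feature creep (made-up figures).
    weight_growth = [3_000, 8_000, 5_000]

    final_weight = design_budget + sum(weight_growth)
    margin_left = max_takeoff_weight - final_weight
    print(f"final: {final_weight:.0f} kg, margin left: {margin_left:.0f} kg")
    # -> final: 86000 kg, margin left: 14000 kg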


Can you imagine if they loaded 1000lbs of lead onto an airplane before they took off, just to see if the plane takes off with the headroom filled? "Oh crap, the plane is falling, let's dump the lead."

OP's "extra reads" idea is dumb because he could have had normal metrics for memcached load and planned to only support, like, 70% capacity or some such, and when load hit that number, he would immediately increase capacity. Instead he's running with a handicap. It's just useless.


If your system is complex, you don't know what your capacity is, or whether the curve has a sharp knee. Think of disk I/O; if 100K simultaneous users keep your disks 50% busy, how many simultaneous users can you actually support? The answer is not 200K; the answer is "it depends".
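To make the non-linearity concrete, here's a toy queueing sketch. Assuming, purely for illustration, that each disk behaves like an M/M/1 queue with a 5 ms service time, mean latency scales as service_time / (1 - utilization), so the curve has a sharp knee near saturation:

    service_time = 0.005  # seconds per I/O, a made-up figure

    for utilization in (0.5, 0.7, 0.9, 0.95, 0.99):
        latency = service_time / (1 - utilization)  # M/M/1 mean response time
        print(f"{utilization:.0%} busy -> {latency * 1000:.0f} ms mean latency")

At 50% busy the mean latency is 10 ms; at 99% busy it is 500 ms. Adding users past the 50% point does far more than double the latency.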


First of all, it's pointless to monitor how much of your system is in use by 'number of users'. That's not a metric. You look at your IOPS metrics to figure out how loaded it is. Once you've gathered trending data, you can then come up with an average IOPS load for a given number of users.

Secondly, you should know what your capacity is. Stress testing exists for a reason.


First of all, no, it isn't. At a large enough scale, any metric that goes up with usage is a perfectly fine proxy for many conversations. Every day for a decade we had the same 3% of our user base online at peak doing the same transactions as yesterday. Over longer periods, the number crept up and the transaction mix changed, but "number of simultaneous users" got me accurate, if not precise, capacity planning from 27 to 1.5 million users.

Stress-testing, shared-nothing and dollar-scalable are platonic ideals, and they're not always achievable. If Dropbox had three infrastructure engineers, they probably weren't able to build proper capacity planning models, and probably couldn't afford to build a full production work-alike for stress testing anyway. (And at some scales, that's literally impossible. Our vendors couldn't physically manufacture enough servers to build a full test environment, cost aside.) I'm sure they did some simulated tests as well, but those won't tell you the whole story.

You're focused on IOPS, but you have no idea if that's what Dropbox's bottlenecks were. (Not to mention: What does IOPS mean on an EBS and S3 infrastructure?) Complex systems fall over in complex ways. You can predict the next bottleneck, but not the one after that; by the time you get there, your fix for the first bottleneck will have changed the dynamics.

It sounds like they did do stress testing, using real-world loads, on a system that was 100% similar to their production system. They ran continuous just-in-time stress tests in the Big Lab.


Maybe your application/environment was stable enough that site performance didn't change drastically over a decade. I've seen performance shift dramatically with the same number of users for different reasons (high-post flattened comment threads being slammed when Michael Jackson died, software upgrades which mysteriously load up servers on some requests, and of course feature adds). When a site issue occurs, I don't reach for "how many users are logged in?", I reach for my sorted list of machine and service-specific metrics and look for irregularities.

That being said, trends in user visits are of course great numbers for capacity planning because you have an idea how much growth to expect in the near future. But it's only a vague multiplier; you need to know how beefy a box to get (by stress testing to determine capacity) and then multiply by the growth factor. But it's usually more complicated than this.
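As a sketch of that multiplication, with every number invented:

    import math

    per_box_capacity = 1200   # req/s one box handled in a stress test
    peak_load_today = 8000    # observed production peak, req/s
    growth_factor = 1.5       # expected growth over the planning horizon
    target_utilization = 0.7  # never plan to run boxes flat out

    boxes = math.ceil(peak_load_today * growth_factor
                      / (per_box_capacity * target_utilization))
    print(boxes, "boxes")  # -> 15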

Stress testing doesn't have to be a formal process in all environments. You might just have a developer with a new chat server and they want to get a benchmark of how many users can join and chat before CPU peaks. An hour or two of coding should provide a workable test on like-hardware, which can then be generalized with tests of other software to give an idea of the capacity when a certain number of users are logged in and performing the same operations. The point isn't to know 100% when you will fall over, but to have at least an idea when you're going to fall over, so you don't have to actually fall over to figure out when and where to scale.
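A minimal sketch of that kind of informal benchmark, assuming a hypothetical echo-style chat server listening on localhost:9999:

    import socket
    import threading
    import time

    HOST, PORT = "localhost", 9999  # hypothetical test server

    def client(n_messages):
        with socket.create_connection((HOST, PORT)) as s:
            for _ in range(n_messages):
                s.sendall(b"hello\n")
                s.recv(1024)  # wait for the echo before the next message

    # Ramp up concurrent clients and watch where throughput stops scaling.
    for n_clients in (10, 50, 100, 200):
        threads = [threading.Thread(target=client, args=(100,))
                   for _ in range(n_clients)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        rate = n_clients * 100 / (time.time() - start)
        print(f"{n_clients} clients: {rate:.0f} msg/s")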

I have no problems with very-short-term big lab stress testing. We had the same issue at my last place, and with lots of caution, it worked fine. But jesus christ, if I told my bosses "I think we should run all the servers with extra load until they fall over, then re-evaluate", they'd look at me like I had antlers growing out of my head.


It's actually more like dumping fuel for an emergency landing instead of dumping passengers, if we really insist on using completely irrelevant analogies.


Are planes in the habit of carrying much more fuel than they need, just in case they will weigh too much when landing, in which case they dump the fuel? No. In the inverse example, we don't put extra load on servers in anticipation that load goes up, just so we can dump the extra load and put it back to where it would have been had we not had extra load. The idea is moronic. It is literally the same as putting lead in your servers until you have to dump it, just because you can. Real admins have alerts on their servers to tell them when they will need additional capacity.

Incidentally, fuel dump systems were initially added due to a rule by the FAA that a plane's structural landing weight not be exceeded by its takeoff weight. Many commercial planes never had this problem, so dumping systems were not installed. As a result, most planes just circle until they've burned up enough fuel, or land anyway overweight. You could dump fuel to lessen the chance of explosion, but only if your plane is equipped with a fuel dump system, and such incidents are so rare it's not even a safety consideration.


lol why are you so mad -- perhaps you couldn't make this out through your tears of rage:

"Why not just plan ahead? Because most of the time, it was a very abrupt failure that we couldn’t detect with monitoring."


lol why are you retarded? What does monitoring have to do with adding extra load onto your servers? If there's a failure there's a failure. What does adding or removing load have to do with detecting it?

So you have a system, and you have monitoring in place. Let's say the monitors were set up for 1 minute polls, because somebody thought that was a good idea. Suddenly you find out one of your servers is down. Oh noes! There's 45 seconds until the monitor finds this out, which would be horrible.

Since we have doubled the reads on the existing servers, we now no longer have capacity and connections are stacking up. Shit :'( But not to worry! Let's just quickly kill the extra reads - now we have more capacity! Hooray!

Except, if the extra reads weren't happening, they would have already had extra fucking capacity and not had to flip a switch in the first place.

Now you see why I'm mad, bro?


>Can you imagine if they loaded 1000lbs of lead onto an airplane before they took off, just to see if the plane takes off with the headroom filled? "Oh crap, the plane is falling, let's dump the lead."

They actually do this kind of stuff (except for the "let's dump the lead" part) in stress tests, especially for cargo and military planes. And they do similar tests not only in aviation, but in most kinds of engineering.

So maybe misplaced sarcasm?


That's stress and load testing. You don't do that on every single flight. That's the stupid part.


it sounds good in theory

Don't you mean "it sounds good in practice"? This entire post is about practical experience.

I don't think this is like setting your watch forward 5 mins. I think it's more like RAID. When you get a warning that one of your drives has died, you know you have to get in and replace it.

Depending on how critical the machine is, the cost of getting to the data centre, etc., you might leave now, in the middle of the night, and drive like a bat out of hell, or you might leave it till next week when you'll be in there anyway.

Either way you know your risk just went up a hell of a lot. Depending on how risk averse you are, you will act accordingly.


It's not exactly the same, though. If you're 5 minutes late by your (fast) watch, you're actually on time, without any action. If you're "overloaded" on your servers, you at least have to consciously decrease the extra load or risk real consequences.


I think (I assume) his point was that when you have a complex system that's growing rapidly, it's difficult to predict what the next breaking point will be. Building in extra load gives you a window of opportunity: you discover the breaking point, turn off the extra load, re-engineer, and turn the canary back on so you can find the next breaking point.

In theory, you'd think you can do load-testing and simulations and capacity planning, and find these breaking points ahead of time. In practice, it's not always feasible, and this seems like a simple-enough hack that gets you much of the way there.
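A minimal sketch of the hack (not Dropbox's actual code): duplicate some fraction of real cache reads behind a runtime flag, so that shedding the synthetic load instantly buys back capacity:

    import random

    EXTRA_READ_FRACTION = 1.0  # 1.0 doubles read load; set to 0.0 to shed it

    def cache_get(client, key):
        # `client` is any memcached-style client with a get() method.
        value = client.get(key)
        if random.random() < EXTRA_READ_FRACTION:
            client.get(key)  # duplicate read; result deliberately discarded
        return value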


I wish he'd left the security advice out.

The whole post was excellent, but all the useful points will now be overshadowed by the armchair quarterbacking about security by people who mostly don't understand that ALL security is a compromise, and it is as important to understand and make deliberate decisions about your security as it is to try to make a secure system in the first place.


I thought about it, but honestly, I think it's important to try to fight against all the sanctimony and handwringing that surrounds security. People should feel comfortable talking about security as a tradeoff without diluting the argument with gratuitous qualifications and apologies.


I'm glad he put the security notes in. It is so hard to get true facts about how things are actually done.


Looking again at the post, I think the author was in fact rather careful to not give away anything about security practices at Dropbox when he was there, for obvious reasons.

He keeps many comments at a high level (security/convenience) and refers to a few non-Dropbox examples.


but I really hate ORM’s and this was just a giant nuisance to deal with

I like object relational mapping as a theory (i.e. I have an object of type Author which has 1 or more Books I can loop over), but I hate ActiveRecord implementations. Eventually they just end up implementing almost all of SQL, but in some arcane bullshit syntax or sequence of method calls that you have to spend a bunch of time learning.

I also seriously doubt that anyone has ever written a production system of any reasonable complexity and been able to use the exact same ORM code with absolutely any backend (if you have an example, please correct me on this). This barely even works with something like PDO in PHP, which is a bare-bones abstraction across multiple SQL backends.

When it comes down to it, the benefits of ActiveRecord are all but dead by about the third day of development. The data mapper pattern adopted by SQLAlchemy (et al.) takes all of the shitness of ActiveRecord and adds mind-bending complexity to it.

SQL is easy to learn and very expressive. Why try and abstract it?

I spent years working with an ActiveRecord ORM I wrote myself in my feckless youth and thought that it was the answer to the world's problems. I didn't really understand why it was so terrible until I did a large project in Django and had to use someone else's ORM.

When I really analysed it, there were only three things that I really wanted out of an ORM:

1) Make the task of writing complex join statements a bit less tedious

2) Make the task of writing a sub-set of very basic where clauses slightly less tedious

3) Obviate the need for me to manually detect primary key changes when iterating over a joined result set (for example, looping over a list of Authors and their Books; a sketch of this grouping appears at the end of this comment)

To that end, I wrote this:

https://github.com/iaindooley/PluSQL

It's written in PHP because I like and use PHP, but it's a very simple pattern that I would like to see elaborated upon/taken to other languages, as I think it provides just the bare minimum amount of functionality to give some real productivity gains without creating a steep learning curve, a performance trade-off, or any barrier to just writing out SQL statements if that's the fastest way to solve the problem at hand.
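For point 3 in the list above, here's a minimal Python sketch (PluSQL itself is PHP) of grouping a flattened, joined result set back into one object per parent by watching the primary key change; rows are assumed ordered by author_id:

    from itertools import groupby
    from operator import itemgetter

    # Rows as they might come back from "SELECT ... FROM authors JOIN books",
    # one row per (author, book) pair; the data is made up.
    rows = [
        {"author_id": 1, "author": "Le Guin", "book": "The Dispossessed"},
        {"author_id": 1, "author": "Le Guin", "book": "The Lathe of Heaven"},
        {"author_id": 2, "author": "Banks",   "book": "The Player of Games"},
    ]

    for author_id, group in groupby(rows, key=itemgetter("author_id")):
        group = list(group)
        print(group[0]["author"], "->", [r["book"] for r in group])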


> I also seriously doubt that anyone has ever written a production system of any reasonable complexity and been able to use the exact same ORM code with absolutely any backend (if you have an example please correct me on this).

You're entirely right here, because databases are different. For example (if I remember the details right), "select count(*)..." is O(1) in MySQL's MyISAM engine, which keeps an exact row count, but it's O(log n) or O(n) in Postgres, depending on indices. That's a detail no ORM is going to save you from.

> SQL is easy to learn and very expressive.

Strongly disagree. The reason everyone keeps trying to write ORMs is because 1) SQL is a shitty language and 2) it's not the language that programmers want to use. Write a better frontend language for Postgres, and the ORMs would disappear.

I strongly suspect that would take some of the wind out of the NoSQL crowd. There are certainly NoSQL deployments that would have a hard time on a traditional RDBMS, but there are a lot of other places that use Mongo just because they don't like SQL-the-language, rather than Postgres-the-DB.


No, actually the only reason is "it's not a language programmers want to use".

It is very much non-shitty.

It's just that lots of programmers, especially OO-minded ones, cannot get into its mindset and use it for what it is; they have to put a lame OO abstraction on top.

Functional programmers should fare better in this regard (or Prolog programmers, if they still exist).

If you really want to abstract it, something like LINQ is a better way.


I agree. I see SQL similarly to regular expressions: there's a handful of commands which let you do a lot of stuff.

The hard part in SQL is optimization, which requires really understanding how the underlying database engine optimizes and executes the query.

Optimizing complex queries is no joke. It's one of the reasons NoSQL seems nice at first glance. You can do the optimizations by adding lots of indexes or using application logic. In reality, it's a tradeoff for other problems.


ORMs aren't about saving one the need to know how to write SQL; they are about automating the task of constructing all the redundant SQL needed to marshal data to and from domain objects, as well as all the redundant work involved in the actual marshaling of data from database client libraries into domain objects.

I think you are exaggerating quite a bit when you refer to SQLAlchemy's patterns adding "mind-bending complexity". Object relational mapping is a complex affair to start with. Have you much experience with modern versions of SQLAlchemy directly (and if not, how fair are comments like that)?


All I want from an ORM is to manage caching intelligently. I'll learn some arcane bits to assist the ORM's pathfinding, but I simply can't imagine an ORM's strong suit being writing less tedious queries. Granted, the fewer-keystrokes, less-verbose nature of an ORM query is still a nice benefit.

    There are only two hard things in Computer Science:
    cache invalidation and naming things.

    -- Phil Karlton


The simplification of writing join queries where the primary key relationships are obvious also has the nice side effect that if you change relationships between entities (for example from 1-to-many to many-to-many), the code won't have to be changed.

At any rate, what I'm really saying is that reducing the number of keystrokes spent writing and maintaining joins is the only part of SQL where I see significant gains in productivity through automation of the task.

Most ORMs implement where clauses, from clauses, aggregate functions, grouping, having, etc., i.e. they wind up basically re-implementing SQL and abstracting it so that your previous knowledge of SQL is basically obsoleted, and in order to debug problems or create complex queries you either have to switch entirely to SQL (in which case you lose all query-building functionality) or map in your head the SQL you want to achieve to the arbitrary syntax provided by the ORM software.


I've found SQLAlchemy to be very nice, actually. It first provides very basic abstractions on top of SQL: things like defining tables/columns and querying without having to mess around with strings.

That alone is most of the usefulness of SQLAlchemy, as it lets you write subqueries and joins extremely easily.

On top of that, the (optional) ORM is built as models on top of SQLAlchemy's table/relationship API. These models can be queried almost exactly like the raw tables.
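A small sketch of what that looks like in modern SQLAlchemy (1.4+ style), using only the core expression language and no ORM; the schema and data are invented:

    from sqlalchemy import (Column, ForeignKey, Integer, MetaData, String,
                            Table, create_engine, select)

    engine = create_engine("sqlite://")  # in-memory DB, just for the example
    metadata = MetaData()

    authors = Table("authors", metadata,
                    Column("id", Integer, primary_key=True),
                    Column("name", String(100)))
    books = Table("books", metadata,
                  Column("id", Integer, primary_key=True),
                  Column("author_id", ForeignKey("authors.id")),
                  Column("title", String(200)))
    metadata.create_all(engine)

    with engine.begin() as conn:
        conn.execute(authors.insert().values(id=1, name="Le Guin"))
        conn.execute(books.insert().values(id=1, author_id=1,
                                           title="The Dispossessed"))

    # A join written with the expression language: composable Python objects
    # instead of string concatenation.
    query = (select(authors.c.name, books.c.title)
             .select_from(authors.join(books))
             .where(authors.c.name == "Le Guin"))

    with engine.connect() as conn:
        for name, title in conn.execute(query):
            print(name, title)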


Great advice:

"pick lightweight things that are known to work and see a lot of use outside your company, or else be prepared to become the “primary contributor” to the project."


Fabulous post. Thanks for writing.

One point it misses, though, is to test your backup strategy often. When you scale fast, things break very often, and it's good to be in the practice of restoring from backups every now and then.


Just started reading a book called "High Performance MySQL" and in one of the early pages, the following advice appears:

"It's an excellent idea to run a realistic load simulation on a test server and then literally pull the power plug. The firsthand experience of recovering from a crash is priceless. It saves nasty surprises later."

Same goes for testing network connectivity and failover. I can't tell you how many times I've heard things like "The automatic recovery _should_ have kicked in but..."

Having a recovery procedure and backup strategy is completely different from having actually restored a backup and recovered from a failure.
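One hedged sketch of what "actually restoring" can look like as a routine check (every path, database name, and command here is an assumption):

    import subprocess
    import sys

    DUMP = "/backups/db-latest.sql.gz"   # hypothetical latest backup
    SCRATCH_DB = "scratch_restore_test"  # throwaway database for the check

    # Restore the dump into a scratch database; a non-zero exit status means
    # the backup we thought we had can't actually be loaded.
    result = subprocess.run(f"gunzip -c {DUMP} | mysql {SCRATCH_DB}", shell=True)
    if result.returncode != 0:
        sys.exit("backup restore FAILED; fix this before you need the backup")
    print("backup restored cleanly")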


Reading High Performance MySQL as well. Loving it so far!


Thanks! Good point. We actually repurposed our offsite database recovery to clone slaves off a master (after LVM was no longer performing), so that's a great way to get more testing in.


I noticed that a particular “FUUUCCKKKKKasdjkfnff” wasn’t getting printed where it should have

Why not take the extra half a second to make those random strings meaningful and hidden behind a DEBUG log level?
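A sketch of that middle ground (the message and tag are made up): a meaningful message at DEBUG level that still carries a fixed, unique, greppable token pointing at this exact call site:

    import logging

    logging.basicConfig(level=logging.DEBUG)
    log = logging.getLogger("sync")

    # The [tag:...] token is unique in the codebase, so one grep finds this
    # call site, and the DEBUG level keeps it out of the clean logs.
    log.debug("reconcile pass missed an expected record [tag:a51f38c2]")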


Probably most of their logging _is_ meaningful, but deciding how to professionally phrase each and every log message will eventually lead to decision fatigue.

The point he was making was that over-logging is a good thing: this probably wasn't something the original author thought would be terribly informative, hence the random string. And yet it ended up diagnosing a real-world problem.

In a perfect world, by all means properly write out your messages - but if you're stalling on a log message because you're not sure how to phrase it, you may get concrete benefit from just dropping a FUUUCCKKKKKasdjkfnff and moving on.


So, so true.

When the problem occurs, it's pretty quick for the guy who needs to fix it at 2am to find where it exploded in the code base, while the original developer is (maybe) passed out in a bar somewhere.

Not much else matters. He could have just done :( x 10 and had the same result. The main thing is, it's easily traceable!


Sure, for fatal errors, random (greppable) strings aren't so bad, but the OP made it sound like his FFFFFFFFFFFUUUUUU message was expected all the time rather than an exception. If you're going to print something all the time in normal operation, make it meaningful.


That isn't how I took it. I read it as: there was an error scenario that was clearly happening, which should have resulted in a particular log message. Except it wasn't, which meant that something was racing.


I don't know how many new logging statements you commit to production code every day, but I can't imagine it averages out to more than one or two. If you can't take the time to phrase them both professionally and meaningfully then you're doing yourself and your team a disservice.


Moreover, you have, in your head, the log message that should be written.

At the time of writing the code, you're hopefully thinking through "how could this fail?"

There's your log message.


I dunno. Sometimes there's just a scenario that isn't really easy to fit into a meaningful phrase, and that shouldn't happen too often.

"FUUUCK" is awfully good at conveying the seriousness of the error, and "aslfkhsdf37" ensures that the string is unique, so you can pinpoint it instantly in your gigantic codebase.

The fact is, it kind of works. Something like "missing record (line 38)" doesn't indicate the severity, there might be 10 different "missing record" error strings in your codebase, and somehow in real life, line numbers and filenames never seem to quite match up like they should (transcompilation, async callbacks, and so on.)


I think Dropbox did well enough financially and technically that the team doesn't need pedantic advice on professionalism...


Oh c'mon. Does HN have the capacity to not be critical 24/7? The guy is clearly competent at his job, there is no need to nitpick.

Let he who has never written a frustrated, nonsense, print statement throw the first stone, if you will.


^ For those that wonder why there is even a separate clean log.


Because a half a second here and a half a second there, and soon we're talking weeks.

Plus, the statement is not only meaningful, but also very expressive.


'Even memcached, which is the conceptually simplest of these technologies and used by so many other companies, had some REALLY nasty memory corruption bugs we had to deal with, so I shudder to think about using stuff that’s newer and more complicated'

Does anyone know what memory corruption bugs they are referring to?


For the record, I use sqlalchemy 0.6.6 regularly under fairly heavy load, and have never had a problem with it. Any 'sqlalchemy bugs' are inevitably coding mistakes on my part.


Yeah, I found that bit quite vague. Are they using SQLAlchemy's object layer, but just not the high level query stuff? Or are they using only the low-level query stuff and nothing else?

I'd love to know more about how their system works, if they are indeed not using an ORM.

Every time I've tried to build something without an ORM, I just end up writing my own shitty one accidentally.


We got a dozen or so email list requests for support from people who I know to be from Dropbox in late 2008. At that time, we were at version 0.4.8. That is an extremely old version and the codebase was quite immature at that time - I personally didn't use SQLAlchemy in production until 0.5 (which of course is because Python was hardly used at all in the early 2000's outside of the zope community, so I was still stuck with Java/Perl gigs). However I am still quite skeptical of the claim that it returned the wrong results. You're expected to watch the SQL you're telling it to generate during development. It will always be true that pushing an ORM will not always generate the SQL you want - which is why you have to make sure those queries are how they should be, before pushing to production. The ORM will of course stick to the plan you've given it - it isn't "deciding" anything, and at worst it can only misinterpret your intent - just like any library.

My strategy with SQLAlchemy has always been to under-promote it. If you have lots of big players early-adopting you and hitting all the pointy edges, it can damage your rep. There's a group of major folks out there who will never use my library due to old experiences. Others like Reddit and Yelp have hung on, and apparently Dropbox is still using the core, hooray!

That's why I'm always amazed at how aggressively MongoDB is promoted, when it seems like they're still going through a lot of growing pains. I guess they sort of have to, given that they're a business and all.


Rajiv is awesome, you should listen to him


Says an ex "Product Manager at Dropbox".

Edit: Thanks for the downvotes. My point is, just make it unambiguous to everyone in your comment so we don't have to click through your profile. Context matters. e.g.:

"I was Product Manager at Dropbox and worked with Rajiv (the OP). He's awesome, you should listen to him."

Much better.


I've found working with people to be a reasonable way of finding out whether their opinions are worth listening to...


Even more reason to listen to him.


which means his opinion counts at least 100 times more than yours does.


What he's saying is that ivankirigin should have said that himself. I don't know that he has any credibility to his statement and wasn't going to give it any merit until akent made me realize that ivankirigin had first-hand experience.


I believe that the section on "The security-convenience tradeoff" is fundamentally flawed.

A username and password represent a pair. Neither one has meaning in terms of authentication without the other.

Take the example where I have forgotten my username (JohnGB) but try what I think it is (say, JohnB), and enter the correct password for my actual username. The system would then tell me that my username is fine but that my password isn't. From then on, I would be trying to reset the password for a different user, as the system has already told me that my username was correct.

Please, for the sake of sane UX, don't do this!
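A minimal sketch of the behavior being argued for here: one ambiguous error for both failure cases. The names and the toy SHA-256 "hash" are illustrative only; a real system should use a proper password hash such as bcrypt:

    import hashlib

    class AuthError(Exception):
        pass

    # Toy user store: username -> SHA-256 of the password (illustration only).
    USERS = {"JohnGB": hashlib.sha256(b"hunter2").hexdigest()}

    def authenticate(username, password):
        stored = USERS.get(username)
        supplied = hashlib.sha256(password.encode()).hexdigest()
        # One error for both wrong-username and wrong-password, so the system
        # never confirms that a particular username exists.
        if stored is None or stored != supplied:
            raise AuthError("Incorrect username or password.")
        return username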


No way, sir. Saying 'you entered the wrong password' in that case is not any more confusing than the ambiguous error that says 'you got one of them wrong, but I'm not gonna tell you which.' Most password reset systems are keyed to your email address anyway.


A topic usually left out of scaling discussions is: how much can one predict? Or is it mostly trial and error? Is it mostly about good "reactive" engineering, or would it have benefited from good mathematical modeling?


> I noticed that a particular “FUUUCCKKKKKasdjkfnff” wasn’t getting printed where it should have

:)

I've never seen a shorter description of real-world software development. That's it in a nutshell!


Great article! Small nitpick from someone who just tried this on his server logs :)

  * on my machine xargs -I implies -L1, so you can drop that
  * use gnuplot -p or the graphic will disappear immediately after rendering


I agree, good article.

A sort -n is also required before the uniq, since server logs record the time of the request but are printed when the response completes, so they're not necessarily in increasing order.


There's a talk about Dropbox scaling at http://www.stanford.edu/class/ee380/winter-schedule-20112012... .


Great article. Rajiv made it easy to understand the conceptual framework. The lesson is: always strive to be robust. Test your failure points deliberately. Applicable to more than just server scaling.


Nice, love the idea of running with extra load to predict breaking points.


I'm surprised that Dropbox actually uses S3 internally to store data. All along I had assumed, wrongly, that Dropbox had built their own distributed storage cluster.


Can you explain the nginx/HAproxy config a little more?


HAproxy is great at exactly one thing: load balancing. It's better than nginx for that one use, because it's more flexible, has better controls for flapping, is smarter about queuing, gives you cool stats pages, etc.

Nginx is great for...pretty much everything else.


Agree. I see a lot of startups putting HAproxy behind nginx for load balancing, but I've never figured out why they wouldn't just stick with nginx. Does anyone have an example of how the configuration looks on GitHub?


There are features that HAproxy supports out of the box, e.g. sticky-session load balancing and HTTP 1.1 to upstreams (newer versions of nginx also support that, btw); on the other hand, people use nginx for SSL termination.


Any chance you could explain this a bit more or post a link to a blog? This is very interesting to me and I don't know much about it.


    MySQL has a huge network of support and we were 
    pretty sure if we had a problem, Google, Yahoo, 
    or Facebook would have to deal with it and patch 
    it before we did. :)
I am fairly certain Google is running its own (patched) version that's fairly different than the off-the-shelf MySQL.


You mean using the Google Mysql5patches[1]?

[1] http://code.google.com/p/google-mysql-tools/wiki/Mysql5Patch...


And hopefully they're pushing important stuff upstream; it wouldn't make sense not to leverage the community.


Running with extra load seems inefficient in terms of energy consumption. Would it be possible to achieve the same thing by inserting delays or something that can be turned off?


All security is a balancing act, which is the point he is making. There is always a tradeoff.



