Scale Fail (part 1) (lwn.net)
178 points by ableal on May 19, 2011 | 37 comments



The bit about metrics really rang true.

At one company, we had a situation where two of the three webservers were out of service for various reasons. I knew from past experience that even a single web server should run at a load below 1 with all our traffic; the others were just there in case one went down. (Like they did!) In the middle of that, the one working server suddenly started spiking to a load of 40 and misbehaving.

There was a really smart new guy there who was usually right. But this time, he said there was no way one server could handle our traffic and that was the problem. He refused to listen to reason, and eventually I gave up and started working on the problem alone. I finally found the code that was going crazy and fixed it... And voila. 0.4 load.

The point is that people get things in their head that 'have' to be true, but they aren't necessarily. He assumed that the existence of 3 webservers meant we had that much load, but it was just for redundancy. If I had had metrics to prove my case, it would have gone a lot smoother... But instead I just had my experience.


I think having numbers to prove your case is generally a good way to approach things. If you can get objective numbers that actually demonstrate what you are talking about, you can swing most people your way.

If not, they are probably not the best of people to work with.


I want to reiterate that the guy was really good at his job. One of the best people I've ever worked with. He was generally pretty easy to get along with. But my lack of hard data meant I couldn't sway him on that problem, no matter how right I was.

That company had a history of not having metrics to work with. Every new sysadmin or DB admin came in and asked for metrics, and we'd shrug and laugh nervously. (I mean, there was a reason why we hired so many new ones!)


Oh, no, heard you very loud and clear. Was just reinforcing your anecdote by saying that it's applicable to a lot more than just scaling apps. :)


If you don't have numbers, you're not doing engineering, you're practicing druidism.


This is a phenomenal article. After wading through bullshit "X HTTP server beats Y HTTP server at serving 40 bytes of static content!!!" articles this is a breath of fresh air. This is clearly written by someone working in the real world.


LWN is pretty great, I'm quite happy I bought a subscription. Very few other publications will have long articles with diagrams explaining a kernel memory corruption bug a day after it happens.


Honestly, I'd consider getting a subscription just for this:

"When I was at Amazon, we used a squid reverse proxy ..."

"Dan, you were an ad sales manager at Amazon."


It continues to surprise me how much IT/sysadmin work is done by heuristics rather than actual measurement.

Applications slow? -> Add RAM

Database is slow? -> must be network/load

Service failures? -> reboot

It becomes very problematic in large IT organizations. Teams will play hot potato with issues, each with its own excuses: desktop support blames the DB team, the DB team blames the server team, then they all blame the network. All the while, no one is actually measuring anything.


It is very frustrating when one group gathers statistics and measures everything it can, and the others do not.

I administer servers. Our group tries to measure what it can and keeps historical data. Then we get a call from the application owners:

'Users are complaining about slow performance.'

Who?

'A few. I didn't keep the email. It's just slow, look into it, will you?'

So what can I say but 'Server x did this at such and such a time, and measure y is normal, but I don't think that's your problem because of reason z'.

In other words, yes, I finger point because I _know_ it's not my problem.

And as often as not, 'a few' users are one or two people whose problem went away when they restarted their PC.


Surprisingly, almost no one cares about performance. When was the last time you saw a web framework or a database that routinely runs performance benchmarks at each iteration? In fact, I don't know of any. I was very impressed with the continuous measurements that PyPy does.
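
The cheapest version of what PyPy does is just a timing harness run in CI and appended per commit. A minimal sketch in Python, with the benchmark body and output file both invented for illustration:

    import json, subprocess, time

    def bench():
        # Invented stand-in for a framework's hot path.
        t0 = time.perf_counter()
        sum(i * i for i in range(1_000_000))
        return time.perf_counter() - t0

    # Record the best of five runs against the current commit.
    commit = subprocess.check_output(
        ['git', 'rev-parse', '--short', 'HEAD'], text=True).strip()
    best = min(bench() for _ in range(5))  # best-of-N damps noise
    with open('bench_history.jsonl', 'a') as f:
        f.write(json.dumps({'commit': commit, 'seconds': best}) + '\n')

Plot seconds against commit order and regressions stand out at each iteration, which is all this asks for.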


Can you name a webframework or database?

There are many benchmarks available (I'm not familiar with Python):

http://codeigniter.com/user_guide/libraries/benchmark.html

http://codeigniter.com/user_guide/general/profiling.html



The site is great, but ironically a bit slow.


Comments > the article:

So to avoid these ends, let's avoid these beginnings: avoid multi-threading. Use single-threaded programs, which are easier to design, write, and debug than their shared-memory counterparts. Instead, use multiple processes to extract concurrency from the hardware. Choose a communication medium that works just as well on a single machine as it does between machines, and make sure the individual processes comprising the system are blind to the difference. This way, deployment becomes flexible and scaling becomes simpler. Because communication between processes by necessity has to be explicitly designed and specified, modularity almost happens by itself.
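
One concrete way to read that advice, sketched in Python: the standard library's multiprocessing.connection speaks the same protocol over localhost or across machines, so the worker below genuinely cannot tell where the coordinator lives. Host, port, and authkey are all invented for illustration:

    import sys
    from multiprocessing.connection import Listener, Client

    ADDRESS = ('0.0.0.0', 6000)  # hypothetical port; works across machines too
    AUTHKEY = b'change-me'

    def coordinator():
        # Hands out work items over a socket, blind to where the worker runs.
        with Listener(ADDRESS, authkey=AUTHKEY) as server:
            with server.accept() as conn:
                for item in range(10):
                    conn.send(item)
                conn.send(None)  # sentinel: no more work

    def worker(host='127.0.0.1'):
        # Identical code whether the coordinator is local or remote.
        with Client((host, 6000), authkey=AUTHKEY) as conn:
            while (item := conn.recv()) is not None:
                print('processed', item * item)

    if __name__ == '__main__':
        coordinator() if 'serve' in sys.argv else worker()

Run one process with 'serve' and a worker anywhere else; deployment stays flexible exactly as the comment promises.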


"Use single-threaded programs, which are easier to design, write, and debug than their shared-memory counterparts."

Not really. Well, it's only true in simple cases. Let's say I have a 16-core box and I want to crunch some data using all the cores. What's easier: Clojure with a single shared-memory data structure that collects the stats, or a multi-process system where we not only have to manage multiple application processes, we also have to use something like Redis to hold the shared state?
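
For what it's worth, for this specific case the multi-process version needn't involve Redis at all. A minimal sketch (assuming the work is CPU-bound and the per-chunk function is picklable) that uses all 16 cores but merges results into one ordinary in-memory structure in the parent:

    from collections import Counter
    from multiprocessing import Pool

    def crunch(chunk):
        # Placeholder CPU-bound work: tally even/odd values in a chunk.
        c = Counter()
        for n in chunk:
            c['even' if n % 2 == 0 else 'odd'] += 1
        return c

    if __name__ == '__main__':
        data = list(range(1_000_000))
        chunks = [data[i::16] for i in range(16)]  # one slice per core
        stats = Counter()
        with Pool(processes=16) as pool:
            for partial in pool.map(crunch, chunks):
                stats.update(partial)  # merged in the parent process
        print(stats)

Shared state is only needed while work is in flight; here each worker returns its partial result and the parent folds them together.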


Here is a link to the commenter's full comment: http://lwn.net/Articles/441867/

He's precisely considering your 16-core case. Now think about what happens when your dataset grows and you need more cores than you can reasonably fit in a single machine.


It's not when, it's if, and often you know it won't happen. Being a good engineer requires understanding when to use which model. Acting as if there aren't a lot of cases where single-process, shared-memory concurrency is the only good choice is wrong.


What I do appreciate about this article is that it doesn't just give anecdotes, but actually quantifies and explains the problems. In fact it even explains why not to use anecdotes!

Seems like most of the tips here could be applied to all sorts of problems, not just scalability, and I think that's probably what the author was going for: to show that scalability isn't some special problem that needs special solutions, but that if you think hard about it and use data to back up your findings, it's just like any other problem you want to solve.

Can't wait for part 2.


Part 2 is already available - see http://lwn.net/Articles/443775/


How does everyone here gather and analyze their metrics? What do you have always deployed and what do you use when shit hits the fan?

[Edit for typo]


<rant> The "standard" ways are all very outdated, ugly, unscalable, and brain dead in implementation. nagios, cacti, munin, ganglia, ... -- all crap. </rant>

People end up writing their own [1]. They rarely open-source their custom monitoring infrastructure, and when a private monitoring system does get open-sourced, you see it has complex dependencies. The complexity of monitoring blocks wide-scale deployment, so people stick with 15-year-old, simple, dumb solutions.

I'm working on making a new distributed monitoring/alerting/trending/stats framework/service, but it's slow going. One weekend per month of free time doesn't exactly yield the mindset to get into hardcore systems hacking flows [2].

[1]: http://www.slideshare.net/kevinweil/rainbird-realtime-analyt...

[2]: Will develop next-gen monitoring for food.


I'm getting the feeling that with all the unique server setups in use, monitoring and metrics systems are going to be just as unique and specific.

There are some interesting process monitoring projects out there like god, monit and bluepill, as well as ec2/cloud specific stuff from ylastic, rightscale and librato silverline. Have you ever used any of those tools?

Fitting all these together for my setup is trial and error, but it really does force me to think hard about my tools and assumptions even before I get hard data.


I hack on the aforementioned Silverline at http://librato.com, and we provide system-level monitoring at the process/application granularity, as a service. (We also have a bunch of features around active workload management controls, but that's out of scope here.) It actually works on any server running one of the supported versions of Linux, not just EC2. The benefits of going with a service-based offering are the same as in any other vertical: you don't need to install and manage your own software/hardware for monitoring.

Here's an example of the visualizations we provide into what's going on in your server instances: http://support.silverline.librato.com/kb/monitoring-tags/app...


Sounds like Zabbix, Pandora FMS, Osmius, NetXMS, and AccelOps are the ones that match your requirements.

Within each, if you search for templates or cookbooks or config scripts, you'll find ways of configuring it easily enough.

https://secure.wikimedia.org/wikipedia/en/wiki/Comparison_of...


Almost.

They all suffer from inflexible data models (how many are using SQL and rrdtool in that matrix?), death at scale (what happens when you go from 10 to 500 to 3,000 to 10,000 servers? across three data centers? with transient Xen servers?), lack of UI design, and thin community involvement (spread across everything in that massive comparison grid).

That's not even considering broken models for alerting (a server dies at 3am -- should it page you? no, because you have 200 of the same servers in the same role; the load balancer will compensate), historical logging, trending, and event aggregation/dedup.
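
A toy sketch of the alerting model being argued for, with all names and numbers invented: page a human only when enough of a redundant role is down to overwhelm the load balancer, and let the 3am single-host death wait for morning:

    # Hypothetical alert policy: suppress pages for redundant roles.
    ROLE_SIZE = {'web': 200, 'db-master': 1}  # invented inventory
    PAGE_THRESHOLD = 0.25                     # page above 25% of a role down

    def should_page(role, hosts_down):
        total = ROLE_SIZE[role]
        if total == 1:
            return True                       # no redundancy: always page
        return hosts_down / total > PAGE_THRESHOLD

    print(should_page('web', 1))         # False: the LB absorbs one dead box
    print(should_page('web', 60))        # True: a quarter of the fleet is gone
    print(should_page('db-master', 1))   # True: the only master died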

It's a big problem, but making flexible tools from the ground up with sensible defaults can go a long way towards helping everyone.

We can fix this. We can make the redis of monitoring.


I have to laugh at you pointing out 'redis'; Redis cannot scale at this time. Clustering is planned for sometime mid-year, but it'll be some time before it has more features. Maybe you meant MongoDB?

Alerting is quite flexible from what I've read, to the point that alerts are quite customizable. I agree that a server dying at 3 am is not as important, but it should still be a valid alert that makes an API call to the host to start a new server (not sure if that's possible; alerts seem to be shell based).

Here's what your offering needs to top, given what I'm considering lately: http://www.zabbix.com/features.php

I'd love more competition, but as even you point out, community involvement won't be as strong because there's a lot of competition. Soon that will include your offering.

Disclaimer: I started researching server monitoring a few weeks ago and considering Zabbix since last week.

Edit: The one issue I find is the lack of web transaction monitoring like New Relic has: http://newrelic.com/features/performance-analytics

You can see it in action with average response time: http://blog.tstmedia.com/news_article/show/86942

As far as I know, no open network monitoring service offers it.


That's the kind of thinking Josh calls out in the article. Redis is great in most of the use cases for replicated Mongo if you're gearing the rest of your architecture to use it properly.



I tried out statsd + graphite and it's flexible, but it requires a ton of dependencies (python, django, apache, cairo, sqlite, and node.js) and needs a lot of configuration across a bunch of apps to get up and running. It might not be too bad if you have a django app, but there was very little overlap with my stack so I decided against it.
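
The statsd half of that stack is the easy part, at least: its wire protocol is plain text over UDP, so emitting metrics from an app needs no dependencies at all. A minimal sketch, assuming a statsd daemon listening on its default port 8125:

    import socket, time

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(name):
        # statsd counter: "<name>:<value>|c"
        sock.sendto(f'{name}:1|c'.encode(), ('127.0.0.1', 8125))

    def timing(name, ms):
        # statsd timer: "<name>:<value>|ms"
        sock.sendto(f'{name}:{ms}|ms'.encode(), ('127.0.0.1', 8125))

    start = time.monotonic()
    # ... handle a request ...
    incr('web.requests')
    timing('web.response_time', int((time.monotonic() - start) * 1000))

It's the graphite/cairo rendering side that drags in the heavy dependencies, which is what makes the client-side-rendering idea below appealing.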

It would be nice if the whole thing could be rolled up into a single config file and one node.js app that forked two processes for web UI and stats collection. And cairo could be replaced with javascript+canvas or svg on the client side.


> Single-threading is the enemy of scalability.

Multithreaded programming is hard. Everyone already knows about Node.js, but if you're on the JVM I suggest you check out Akka (http://akka.io); it makes concurrency much easier.


Is it hard? Always? If it is so hard, then why is it still around? How would you go about parallelizing even mildly CPU-heavy workloads on node.js?

Node.js and the like are awesome, but only if you have no blocking calls and your CPU usage is tiny. Otherwise you get the performance of a single-threaded server. You know, because that's what it is.


> Is it hard?

Yes, multithreaded programming is difficult (for most people, like me). Hence the popularity of stuff like Node.js, and also why people highlight Erlang's actor model.

> How would you go about parallelizing even mildly CPU heavy work loads on node.js?

I'm not sure I understand your question, since (I believe) node.js has an event-based concurrency model, not a thread-based one (at least for its users).


You run multiple processes. Duh.
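
In Python terms, the shape of that answer is to keep the server process single-threaded and ship CPU-heavy calls to a pool of worker processes. A minimal sketch using the standard library:

    from concurrent.futures import ProcessPoolExecutor

    def heavy(n):
        # Stand-in for CPU-bound work that would stall a single-threaded server.
        return sum(i * i for i in range(n))

    if __name__ == '__main__':
        # One worker per core by default; the submitting process stays free
        # to keep servicing I/O while results arrive.
        with ProcessPoolExecutor() as pool:
            futures = [pool.submit(heavy, 2_000_000) for _ in range(8)]
            print([f.result() for f in futures])

Node has its own version of this via child processes; the principle is the same: concurrency for I/O in one thread, parallelism for CPU in separate processes.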


Part 2 has been released, for those who have a subscription: https://lwn.net/Articles/443775/


The talk is very amusing, especially the use of Jason Fried in one of the slides.

http://www.youtube.com/watch?v=nPG4sK_glls


Well written article and funny, too. Nice job.



