At 1 company, we had a situation where 2 of the 3 webservers were out of service for various reasons. I knew from past experience that even a single web server should have a load less than 1 with all our traffic. The others were just there in case they went down. (Like they did!) In the middle of that, the 1 working server suddenly starts spiking load to 40 and misbehaving.
There was a really smart new guy there that was usually right. But this time, he said there was no way 1 server could handle our traffic and that was the problem. He refused to listen to reason and eventually I gave up and started working on the problem alone. I eventually found the code that was going crazy and fixed it... And voila. 0.4 load.
The point is that people get things in their head that 'have' to be true, but they aren't necessarily. He assumed that the existence of 3 webservers meant we had that much load, but it was just for redundancy. If I had had metrics to prove my case, it would have gone a lot smoother... But instead I just had my experience.
I think having numbers to prove your case is generally a good way to approach things. If you can get objective numbers that really show and prove what you are talking about, you can swing most people your way.
If not, they are probably not the best of people to work with.
I want to reiterate that the guy was really good at his job. One of the best people I've ever worked with. He was generally pretty easy to get along with. But my lack of hard data meant I couldn't sway him on that problem, no matter how right I was.
That company had a history of not having metrics to work with. Every new sysadmin or db admin came in and asked for metrics and we'd shrug and laugh nervously. (I mean, there was a reason why we hired so many new ones!)
This is a phenomenal article. After wading through bullshit "X HTTP server beats Y HTTP server at serving 40 bytes of static content!!!" articles this is a breath of fresh air. This is clearly written by someone working in the real world.
LWN is pretty great, I'm quite happy I bought a subscription. Very few other publications will have long articles with diagrams explaining a kernel memory corruption bug a day after it happens.
It continues to surprise me how much IT/sysadmin work is done by heuristic rather than actual measurements.
Applications slow? -> Add RAM
Database is slow? -> must be network/load
Service failures? -> reboot
It becomes very problematic in large IT organizations. Teams will play hot potato with issues, and all use excuses. Desktop support will blame the DB team, DB will blame the server team, then they all blame the network. All the while no one is actually measuring anything.
Surprisingly, almost no one cares about performance. When was the last time you saw a web framework or a database that routinely runs performance benchmarks at each iteration? In fact, I don't know of any. I was very impressed with the continuous measurements that PyPy does.
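To make that concrete, a per-iteration benchmark doesn't have to be elaborate. Here's a minimal sketch of the kind of check a CI job could run on every commit; parse_log and the 50-microsecond budget are made-up stand-ins for whatever you actually care about:

    # Hypothetical per-commit performance check: fail the build if the
    # hot function gets slower than an agreed budget.
    import timeit

    def parse_log(line):
        return dict(field.split("=", 1) for field in line.split())

    runs = timeit.repeat(lambda: parse_log("a=1 b=2 c=3"), number=10000, repeat=5)
    best = min(runs) / 10000                     # best-of-five, per call
    assert best < 50e-6, "parse_log regressed: %.1f us per call" % (best * 1e6)

It won't catch everything, but it turns "it feels slower" into a number you can argue about.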
So to avoid these ends, let's avoid these beginnings: avoid multi-threading. Use single-threaded programs, which are easier to design, write, and debug than their shared-memory counterparts. Instead, use multiple processes to extract concurrency from the hardware. Choose a communication medium that works just as well on a single machine as it does between machines, and make sure the individual processes comprising the system are blind to the difference. This way, deployment becomes flexible and scaling becomes simpler. Because communication between processes by necessity has to be explicitly designed and specified, modularity almost happens by itself.
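To make the "blind to the difference" part concrete, here's a minimal sketch of that style in Python (the port, hostnames, and the trivial line-based protocol are all made up): a single-threaded worker process speaking newline-delimited text over TCP, where the only thing that changes between local and remote deployment is the address string the client is handed.

    # worker.py: a single-threaded process serving requests over TCP.
    # It neither knows nor cares whether its clients are local or remote.
    import socketserver

    class Upcase(socketserver.StreamRequestHandler):
        def handle(self):
            for line in self.rfile:              # one request per line
                self.wfile.write(line.upper())   # reply in kind

    socketserver.TCPServer(("0.0.0.0", 9000), Upcase).serve_forever()

    # client.py: the only "deployment decision" is the peer address.
    import socket

    peer = ("localhost", 9000)                   # or ("worker7.example.com", 9000)
    s = socket.create_connection(peer)
    s.sendall(b"hello\n")
    print(s.makefile().readline())               # -> HELLO

Run ten workers on one box or one worker on each of ten boxes; the code is identical either way.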
"Use single-threaded programs, which are easier to design, write, and debug than their shared-memory counterparts."
Not really. Well, only true in simple cases. Let's say I have a 16-core box and I want to crunch some data using all the cores. What's easier: Clojure with a single shared-memory data structure that collects the stats, or a multi-process system where we not only have to manage multiple application processes, we also have to use something like redis to hold the shared state?
He's precisely considering your 16-core case. Now think about what happens when your dataset grows and you need more cores than you can reasonably fit in a single machine.
It's not when, it's if, and often you know it won't happen. Being a good engineer means understanding when to use which model. Acting as if there aren't a lot of cases where single-process, shared-memory concurrency is the only good choice is wrong.
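For what it's worth, the 16-core example above also has a middle ground that needs neither threads nor redis, as long as the stats can be computed per chunk and merged at the end. A rough Python sketch, with crunch_chunk standing in for the real work:

    # Multi-process crunching without shared state: each worker returns
    # partial stats and the parent merges them. No redis, no locks.
    from collections import Counter
    from multiprocessing import Pool

    def crunch_chunk(chunk):
        # Hypothetical per-chunk work; here, just tally even/odd values.
        return Counter("even" if x % 2 == 0 else "odd" for x in chunk)

    if __name__ == "__main__":
        chunks = [range(i, i + 10_000) for i in range(0, 1_000_000, 10_000)]
        with Pool(16) as pool:                   # one worker per core
            stats = sum(pool.map(crunch_chunk, chunks), Counter())
        print(stats)

Whether that beats a shared-memory design depends on how chatty the workers need to be, which is really the crux of the disagreement.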
What I do appreciate about this article is that it doesn't just give anecdotes, but actually quantifies and explains the problems. In fact it even explains why not to use anecdotes!
Seems like most of the tips here could be applied to all sorts of problems, not just scalability, and I think that's probably what the author was going for: to show that scalability isn't some special problem that needs special solutions, but that if you think hard about it, and use data to back up your findings, it's just like any other problem you want to solve.
<rant>
The "standard" ways are all very outdated, ugly, unscalable, and brain dead in implementation. nagios, cacti, munin, ganglia, ... -- all crap.
</rant>
People end up writing their own [1]. They rarely open source their custom monitoring infrastructure. Sometimes a private monitoring system gets open sourced, but then you see it has complex dependencies. The complexity of monitoring blocks wide-scale deployment. People stick with 15 year old, simple, dumb, solutions.
I'm working on making a new distributed monitoring/alerting/trending/stats framework/service, but it's slow going. One weekend per month of free time doesn't exactly yield the mindset to get into hardcore systems hacking flows [2].
I'm getting the feeling that with all the unique server setups in use, monitoring and metrics systems are going to be just as unique and specific.
There are some interesting process monitoring projects out there like god, monit and bluepill, as well as ec2/cloud specific stuff from ylastic, rightscale and librato silverline. Have you ever used any of those tools?
Fitting all these together for my setup is trial and error, but it really does force me to think hard about my tools and assumptions even before I get hard data.
I hack on the aforementioned Silverline at http://librato.com, and we provide system-level monitoring at the process/application granularity as-a-Service. (We also have a bunch of features around active workload management controls, but that's out of scope here). It actually works on any server running one of the supported versions of Linux, not just EC2. Benefits of going with a service-based offering are the same as in any other vertical, you don't need to install and manage your own software/hardware for monitoring.
They all suffer from inflexible data models (how many are using SQL and rrdtool in that matrix?), death at scale (what happens when you go from 10 to 500 to 3000 to 10000 servers? across three data centers? and transient xen servers?), lack of UI design, and community involvement (because of that massive comparison grid).
That's not even considering broken models for alerting (a server dies at 3am -- should it page you? no, because you have 200 of the same servers in the same role. the load balancer will compensate.), historical logging, trending, and event aggregation/dedup.
It's a big problem, but making flexible tools from the ground up with sensible defaults can go a long way towards helping everyone.
We can fix this. We can make the redis of monitoring.
I have to laugh at you pointing out 'redis'; redis cannot scale at this time. Clusters are planned for sometime mid-year, but it'll be some time before it has more features. Maybe you meant MongoDB?
Alerting is quite flexible from what I've read, to the point that alerts are quite customisable. I agree that a server dying at 3 am is not as important, but it should still be a valid alert to make an API call to the host to start a new server (not sure if that's possible; alerts seem to be shell based).
I'd love more competition, but as even you point out, community involvement won't be as strong because there's already a lot of competition. Including your offering, soon.
Disclaimer: I started researching server monitoring a few weeks ago and considering Zabbix since last week.
That's the kind of thinking Josh calls out in the article. Redis is great in most of the use cases for replicated Mongo if you're gearing the rest of your architecture to use it properly.
I tried out statsd + graphite and it's flexible, but it requires a ton of dependencies (python, django, apache, cairo, sqlite, and node.js) and needs a lot of configuration across a bunch of apps to get up and running. It might not be too bad if you have a django app, but there was very little overlap with my stack so I decided against it.
It would be nice if the whole thing could be rolled up into a single config file and one node.js app that forked two processes for web UI and stats collection. And cairo could be replaced with javascript+canvas or svg on the client side.
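Just to sketch the shape of that (in Python rather than node.js, and with made-up ports and names): one parent script that forks a UDP collector for statsd-style counters plus a tiny HTTP endpoint exposing them as JSON. It's a toy, but it shows how small the core loop could be:

    # One script, two processes: a statsd-style UDP collector and a
    # minimal HTTP view of the counters. Ports and names are invented.
    import json
    import socket
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from multiprocessing import Manager, Process

    def collect(counters, port=8125):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("0.0.0.0", port))
        while True:
            data, _ = sock.recvfrom(1024)        # e.g. b"signups:1|c"
            name, rest = data.decode().split(":", 1)
            counters[name] = counters.get(name, 0) + float(rest.split("|", 1)[0])

    def serve(counters, port=8080):
        class Stats(BaseHTTPRequestHandler):
            def do_GET(self):
                body = json.dumps(counters.copy()).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
        HTTPServer(("0.0.0.0", port), Stats).serve_forever()

    if __name__ == "__main__":
        manager = Manager()
        counters = manager.dict()                # shared between the processes
        Process(target=collect, args=(counters,)).start()
        serve(counters)

The graphing side is the hard part, of course, but the collection and exposure end really can be this small.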
Multi-threaded programming is hard. Everyone already knows about Node.js, but if you're on the JVM I suggest you check out Akka (http://akka.io) - it makes concurrency much easier.
Is it hard? Always? If it is so hard, then why is it around? How would you go about parallelizing even mildly CPU-heavy workloads on node.js?
Node.js and the like are awesome, but only if you have no blocking calls and your CPU usage is tiny. Otherwise you get the performance of a single threaded server. You know, because that's what it is.
Yes multithread programming is difficult (for most people - like me). Hence the popularity of stuff like node.js, and also why people highlight Erlang's Actor model.
> How would you go about parallelizing even mildly CPU-heavy workloads on node.js?
I'm not sure I understand your question, since (I believe) node.js has an event-based concurrency model, not a thread-based one (at least for its users).
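For what it's worth, the usual answer in any single-threaded event loop (node.js included) is to push CPU-heavy work out to worker processes and keep the loop free for I/O. A rough Python analog of that pattern, with crunch() as a made-up stand-in for the real work:

    # Keep the event loop responsive by shipping CPU-bound work to a
    # pool of worker processes; only the results come back to the loop.
    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def crunch(n):
        return sum(i * i for i in range(n))      # stand-in for heavy CPU work

    async def main():
        loop = asyncio.get_running_loop()
        with ProcessPoolExecutor() as pool:
            jobs = [loop.run_in_executor(pool, crunch, 5_000_000) for _ in range(4)]
            print(await asyncio.gather(*jobs))   # loop stays free meanwhile

    if __name__ == "__main__":
        asyncio.run(main())

In node the equivalent would be forking child processes; either way the event loop itself never does the crunching.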