Don’t scale: 99.999% uptime is for Wal-Mart (37signals.com)
84 points by cfontes on Feb 24, 2011 | 54 comments



I wish I could delete the first two words from this article's title. Uptime's related to scalability, but it's not the same thing - scalability encompasses more than that.

In my experience, you can get away with lower uptime but your users will crucify you for poor performance. Site down for a half hour here and there? Fine. Site responds slowly and your clients' data takes a while to update? Big, big problem.


Strongly agreed, I'm not sure why the post conflated these two very different issues. Scaling has almost nothing to do with five 9s uptime, with the exception that if you can't scale, once you reach a certain number of users your uptime will be closer to 0% than 100%.


Agreed. Having said that, there is a correlation for sites that go down when they are hit by a spike in traffic.

The danger, of course, is that when something happens to cause a spike (news mention, post goes viral, etc), that's exactly when you can't go down.


Relatedly, one piece of advice that 37Signals had in their Getting Real book that really helped me is that you can delay building many systems past launch.

For example, BCC has a substantial amount of functionality in the back-end interface so that I can handle common support tasks. AR has virtually nothing -- a single page which lists customer email addresses, trial statuses, and upcoming subscription renewal dates. I could have spent 2 weeks on building out a decent amount of functionality for CS and more advanced statistical navel gazing, but a) I might not pick the right stuff and b) it would mean that the release of the next feature that actually sells software would be on 3/15 instead of 3/1.

BCC has organically grown its backend over the years, as I get so frustrated with fixing the same issue manually that I make a one-button way to do it.


Well said. I launched S3stat as a paid service without any way of processing credit cards. Since I offered a 30 Day Free Trial, that gave me at least 3 weeks in which to build it (and some good incentive to do so).

FairTutor will probably go live with no way to review teachers. Same reason.


The people at my office who actually use Highrise and have to deal with 37s' frequent bouts of downtime would beg to differ.


Exactly. I don't see how a company that sells business-critical (I assume) software can so proudly proclaim that they don't care all that much about downtime. He does say 'we're still here', so yeah probably for them it doesn't matter. They're mostly great at marketing anyway. But for any service, to fail on me 5 minutes before my presentation starts and I just want to look up the customer's name and position again, and then finding out that you can't access your data - that'd be enough to dump them the next day for me.


> that'd be enough to dump them the next day for me

You can't dump them if they're down!


I think your credit card company might beg to differ on that one...


As long as they're paying 37s, 37s is happy, and they're evidently happy enough with the service to keep paying.


I'm assuming that, given the article is from '05, attitudes there have changed a bit (one would hope).


98% or 99% uptime isn't good enough for the people you know? Or are you suggesting that their uptime is significantly lower than that?


1% of a year is 3.65242199 days.

So on the extreme end 37signals could be down for over 3 days in a single bout and still have 99% uptime. 3 days downtime is serious.

However, this case is unlikely.

Also unlikely is a scenario where they are down for the same time each day---14.4 minutes @ 99% uptime.

It's most likely that they're down for a few minutes here and there across the day during working hours, which makes for an incredibly frustrating experience. You can never just click a button and know your form will be submitted.
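The figures in this comment can be reproduced with a few lines of arithmetic. This is only a minimal sketch: the helper name `downtime_budget` is made up for illustration, and it assumes a flat 365-day year rather than the 365.242199 days used above.

```python
# Downtime allowance implied by an uptime percentage.
# Assumption: a flat 365-day year (the comment above uses 365.242199).
MINUTES_PER_DAY = 24 * 60
DAYS_PER_YEAR = 365


def downtime_budget(uptime_percent: float) -> dict:
    """Return the allowed downtime per day (minutes) and per year (days)."""
    down_fraction = 1 - uptime_percent / 100
    return {
        "minutes_per_day": down_fraction * MINUTES_PER_DAY,
        "days_per_year": down_fraction * DAYS_PER_YEAR,
    }


if __name__ == "__main__":
    for pct in (98.0, 99.0, 99.9):
        budget = downtime_budget(pct)
        print(
            f"{pct}% uptime: {budget['minutes_per_day']:.1f} min/day, "
            f"{budget['days_per_year']:.2f} days/year"
        )
```

The same total can be spent as one long outage or as many short ones, which is the distinction the comment is drawing; the percentage alone does not say which.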


Yes, but there are some more nuances. First, I'll bet a fair amount of downtime is scheduled, and would be scheduled for low usage periods. Second, most services I've ever used with an SLA for n nines haven't actually lived up to that SLA. Instead they offer you a prorated charge for that month or something similar.

What people promise in an SLA and what they deliver are very different things. I'd be very interested in some stats about historic uptimes for similar sites.

Also, how much more are you willing to pay for an extra 9 of uptime?


98% uptime is equivalent to being down for an entire week every year, or half an hour every day.

The article contrasted Wal-Mart (massive e-commerce operation) with things like Delicious/Technorati (casual services, usually free to use). It conveniently glossed over the middle ground, where 37 Signals' applications and many others live.

FWIW, I did look into using some of 37 Signals' services for my small businesses. They failed my assessment on several counts and we found other solutions. But judging by their blog posts etc, they seem to be proud of several of the things that caused me to look elsewhere. As long as their marketing is good enough to maintain their business success regardless of any technical merit, I suppose I can't argue with their commercial position and I assume they accept that people like me will never use them.


Which other solutions? I'm currently looking for tools - right now a mac mini running osx server seems the way to go. But it's missing some stuff, what have you found?


> But it's missing some stuff, what have you found?

The bottom line is, we didn't.

Most of these business admin services seem to offer little more than a very simple database, a pretty UI, and a monthly bill. We write software for a living ourselves, so producing an in-house bug tracker or CRM system or project dashboard takes a matter of hours, maybe a few days if we add more power to it later. It's only worth outsourcing the tools if the relatively small amount of time it saves us outweighs the many downsides: time lost to set up and maintain an external system, monthly cash cost to use it, having to fit our process around their tool instead of the other way around, interoperability issues if tools have related purposes but data is locked in, relying on external infrastructure for access, etc. etc.

That is only going to happen if the outsourced service offers a silky smooth set-up process, a tool that can be readily adjusted to our exact needs and still present a streamlined UI to match, immediate and convenient access to all the underlying data, and very low pricing, among other things. That's a pretty tall order, and frankly, none of the services we looked at seemed to be anywhere close.

We do use external software to run things like mail and web servers, but it's standard stuff running on our own boxes.

We are also looking at external services for tasks that are much more difficult/expensive to implement ourselves, such as payment processing and off-site back-up.

But for basic business admin, things like bug tracking, project planning, CRM and the like, we concluded that it wasn't even worth looking at the freebie software projects that do it. In the time it would take to properly evaluate a few of them and determine whether they could meet our needs, plus the time to customise them to do so, we could have just written exactly what we wanted from scratch.


Sounds like they are trying to fluff their reliability reputation.

I develop a web application that is used by schools and just can't entertain the notion of anything other than 100% uptime. I take the reliability of my product very very seriously. If one of my customers had a fire at their school and couldn't access our system for registers - that would be us and them up the proverbial creek without a paddle.

I've built up a company (over 7 years now) with a very good reputation for reliability and uptime. Don't assume that just because something is web based it doesn't require 100% uptime.


Are you sure your web application "just can't entertain the notion of anything other than 100% uptime"? That sounds like a vacuous promise to me. Even telecoms switches are designed with something like 99.9999999% (9 nines) of availability; that's ~30 milliseconds of downtime a year.

I'm not criticizing you or your product, but if reliability is critical, it's made explicit with a realistic SLA.


I didn't say we provide 100% uptime. We just don't entertain the idea that we may have to settle for less .. simply, we pro-actively monitor our servers and have enough redundancy in place to keep problems under OUR control.

We don't pretend, promise or have an SLA that offers 100% uptime.


Keep in mind that the typical school has substantially less than 40% uptime...


Not sure where you are based, but here in the UK schools are rapidly gaining better uptimes. The biggest problem we see is misconfigured proxies playing havoc with https connections.

I couldn't say what the average uptime is, but if a school network goes down we normally get an immediate call asking us if there is a problem our end because they use our system so heavily. So at a guess .. at least 98-99% for the majority of secondary schools, but don't quote me!


They're only actually occupied for about 30% of hours in the week during the school year, and the hours are usually fairly predictable, which gives one some advantages. For example, if you had a scheduled maintenance window for 5 hours a week every week starting at 2 AM on Sunday, that would blow 98% uptime but no actual user would ever know. (This has saved my bacon a few times, as 2 of the 3 outages of my software were outside of peak hours, so I could count inconvenienced customers on one hand.)
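To make the maintenance-window example concrete, here is a rough sketch comparing raw uptime with the uptime users would actually see. The function names are hypothetical, and it assumes the window never overlaps occupied hours, which will not hold if people log in from home at night.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def raw_uptime_percent(maintenance_hours_per_week: float) -> float:
    """Uptime measured against every hour of the week."""
    return 100 * (1 - maintenance_hours_per_week / HOURS_PER_WEEK)


def perceived_uptime_percent(occupied_hours: float, overlap_hours: float) -> float:
    """Uptime measured only against hours when users are present."""
    return 100 * (1 - overlap_hours / occupied_hours)


# About 30% of the week is occupied, per the comment above.
occupied = 0.30 * HOURS_PER_WEEK

if __name__ == "__main__":
    # A 5-hour weekly window costs roughly 3% of raw uptime...
    print(f"raw: {raw_uptime_percent(5):.2f}%")
    # ...but none of the occupied hours, under the no-overlap assumption.
    print(f"perceived: {perceived_uptime_percent(occupied, 0):.2f}%")
```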


We used to find the holidays a massive advantage. But schools take some of their holidays on different weeks in different parts of the UK so as we expanded we effectively lost these maintenance windows.

Also teachers/students/parents access the system from home so we have to keep it running 24/7.


Just because students are out doesn't mean the school isn't occupied! I used to work as a network admin at a boarding school, and there would usually be at least one "off-site" staff member working past midnight each day (and of course, on-site staff 24/7, although thankfully most of them sleep at night), as well as staff coming in at the weekend.

Computers have become mission critical in schools now - lessons, paperwork, planning, organisation, telephones, intercoms, security cameras†, locks‡ were all computer-dependent. To date that school network is the most advanced, highest-uptime network I've ever worked on: two server rooms (network continues at almost full capacity with the loss of either) in separate locations and a backup server in a third location. Unplanned outages were basically non-existent, and the loss of any service was very rare and involved lots of people shouting at us :(

† Internal only, external were still on tape

‡ I don't mean all the doors open if the network goes down, but the electronic locks were controlled over the network


At one of the colleges I attended, they shut down the school's course registration/student management system from midnight to about 7 am every day. It was incredibly frustrating at times.


100% uptime is as fanciful a notion as risk-free investment.


The point I was trying to make is that if I had that kind of public attitude to uptime, I'd be out of business pretty much overnight.

Yeah, 100% uptime is a crazy notion; but something always to be strived for nonetheless.


To be strived for, but if you sign an SLA, nobody offering you 100% would be telling the truth.


An SLA should specify penalties for missing the target. You can offer 100% uptime as the SLA, knowing that you'll only hit x%, but if the cost of compensating for 100-x% is less than the value of offering the SLA at 100% it's worth doing.

An SLA only ever guarantees commercial remedies, not service availability.
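The trade-off described here can be sketched as a simple comparison. Everything below is an assumption for illustration: the prorated-credit model, the function names, and the numbers are hypothetical, and a real SLA would define its own credit schedule.

```python
def expected_monthly_credit(monthly_fee: float, actual_uptime_percent: float) -> float:
    """Prorated credit: refund the share of the fee matching the missed uptime.

    Assumes the SLA target is 100%, so every missed percent is credited.
    """
    missed_fraction = (100 - actual_uptime_percent) / 100
    return monthly_fee * missed_fraction


def sla_is_worth_it(
    monthly_fee: float, actual_uptime_percent: float, premium_for_sla: float
) -> bool:
    """True when the expected credit costs less than what the SLA earns."""
    return expected_monthly_credit(monthly_fee, actual_uptime_percent) < premium_for_sla


if __name__ == "__main__":
    # Hypothetical: a 1000/month customer, 99.5% delivered, and the
    # "100% SLA" is assumed to be worth an extra 50/month in revenue.
    print(expected_monthly_credit(1000, 99.5))
    print(sla_is_worth_it(1000, 99.5, 50))
```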


I've been seeing it offered more and more in recent years for colo ip transit. Just do a quick search for "sla 100% uptime network".

Apart from one of the latest colo providers we use I've witnessed downtime with every single one that offered ip transit with a 100% SLA. That's why we use at a minimum 3 geographically distant providers. So far this approach has not let us down.
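A back-of-the-envelope sketch of why several providers help: if failures were fully independent, the chance of all of them being down at once shrinks geometrically. The independence assumption and the 99% per-provider figure are hypothetical; geographic separation makes independence more plausible but does not guarantee it.

```python
def combined_availability(per_provider: float, providers: int) -> float:
    """Probability that at least one provider is up.

    Assumes each provider fails independently with the same availability,
    expressed as a fraction (0.99 means 99%).
    """
    return 1 - (1 - per_provider) ** providers


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} provider(s): {combined_availability(0.99, n) * 100:.4f}%")
```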


That's because they stick in the fine print "once a month scheduled half hour outages don't count, also anything under 5 minutes doesn't count".


There should be a law against that.

Reminds me of the "unlimited mobile phone internet" plans here in the UK which in the fine print resolve to "fair use policy of 1GB a month" .. why can't companies just call a spade a spade.


> why can't companies just call a spade a spade.

Slightly off-topic, but in case there's genuine curiosity there: AIUI, most of the advertising standards rulings about "fair use" policies related to Internet and particularly mobile bandwidth were made quite some time ago, when expectations were very different. It is possible that this was not entirely coincidental, and was motivated by the legal teams at major service providers who saw the current situation coming.


Mah boiiiii, 100% uptime is what all true coders strive for.


I work in IT for a very large school district, and I have this argument all the time. Aiming for 4 9s or more is just a waste of money. Sure having the gradebook or scheduling app down sucks, but it doesn't matter at a school.

Can the teacher still teach if the systems go down? Yes.


I've worked in education for over 10 years, 7 of which were on site at one of England's largest secondary schools whilst consulting for schools all over the country.

I've had first hand experience dealing with dozens of LA (what you call districts) technical teams that take this kind of attitude and frankly I find it appalling.

I remember when schools were trying to bring in electronic registration, and the LA network teams just didn't give a damn about the schools' needs. Instead they made their own conclusions as to what the school needed, and to them, having a reliable network so the school could move into the 21st century wasn't one of them.

Schools are becoming more and more engaged in technology, and downtime and poorly maintained systems set back teachers and the work they do. Teachers need to be able to rely on the technology, simple as that, no excuses.

You're not expected to reach 100%, but to admit defeat and argue for that defeat is poor form.


What is your actual uptime?


Good question.

We have had 3 incidents in the last 7+ years. The first 2 occurred when we only had servers in one site.

1, http://www.risktec.co.uk/news/business-continuity-%E2%80%93-... (lasted days!!)

2, A thunderstorm knocked out the power to our datacenter, and the backup UPS and generators both failed to engage. (lasted a day, we also took our business elsewhere)

After the 2nd outage we decided to take up additional colo space elsewhere in the country. This has worked for years and any failures in our live datacenter have quickly been switched over to a hot backup datacenter.

We have had around a half dozen minor interruptions caused by server component failures (PSU / hard drive). This is because it takes a few minutes to detect the failure and switch over to a backup.

3, The last major failure was with our DNS hosting. They decided to scratch our main zone from their records because of a billing error. Luckily for us it happened on the Friday before a week's school holiday. Needless to say we took our business elsewhere.

I'm sure we have had other instances of downtime I can't remember, but that's pretty much us in a nutshell. No idea what our uptime on paper would be, funnily.


That's not too bad. But does that all add up to more than 9 hours a year of downtime? If so, you're already doing worse than 99.9% uptime.

I would also bet your uptime is worse than you think if you're not actively measuring it. However, I'm sure it is "good enough". I doubt it is worth spending an order of magnitude more money to get better than 3 9's for your business, which is more or less the point of the article imo.

I find the percent availability -> hours of downtime chart on wikipedia to be a really helpful reference for these types of discussions: http://en.wikipedia.org/wiki/High_availability


I also develop a web application used by schools. I have no problems taking the system down in the middle of the night for 30 minutes. Schools are quite possibly the most forgiving customers. Their revenues come from the government, so if your app goes down, they are still going to make payroll.


98% uptime is down roughly:

* 1 minute every hour, or

* 3h20 every week, or

* 1 week every year

I know about hyperbole for making a point, but does Basecamp really total anything like a week's downtime per year? If so, why? I'm pretty sure I've never had anything like that bad a number and equally that I wouldn't be happy using a service that did.

The general thrust is right: high reliability is expensive and you need to look at cost/benefit not chest-beating. But let's be honest about what we're actually aiming at.


This article is from 2005. It's quite likely that Basecamp's downtime was around a week per year. At that time, Rails was expected to have ~400 restarts a day[1] ;)

[1] http://www.loudthinking.com/posts/31-myth-2-rails-is-expecte...


Regular maintenance is enough to account for that.

It's really easy to hit 98% uptime (2% downtime). Just take your eyes off of your homegrown solution for a day.


As others have pointed out, this post is from 2005. In the past 6 years, the cost of scalability has dropped sharply, and shooting for three 9's should be the minimum for most sites. It doesn't cost thousands to go from 98% to 99% any more, and to 99.9% is still pretty cheap.

Sure, five and six 9's do get expensive, and that will depend on your cost of downtime (i.e., lost sales, etc.).


Indeed this article doesn't talk about scaling but about uptime... But although the topic is still open for discussion in 2011, I don't think an article written in 2005 should be posted in Hacker _News_.


This doesn't take into account startups who have SLAs because they have a B2B product. We have both B2C and B2B customers and as a result we can't be down for our business customers or we have to credit them. Honestly, 99.9% uptime is not hard to manage. Pick the right colo facility with a history of good uptime. Have more than 1 machine, and have them on redundant power supplies (on separate PDUs). Voila, unless you screw up deploys, you have 99.9% uptime. This doesn't take a huge amount of money.


Alistair Cockburn wrote an awesome book for small teams based on the "Crystal Clear Method". It has some great info.

http://www.amazon.com/exec/obidos/ASIN/0201699478/ref=ase_al...


The criticality of your average “Web 2.0” application is one with loss of comfort as the result of something going wrong.

Which is also why your average Web 2.0 application can't charge very much: without it, your comfort level slightly drops. No big deal.


It's amazing how much the economics of up time have changed since this article was written because of services like AWS.

While it may not be technically or fiscally trivial yet, it's far easier and cheaper than it's ever been, and far more so than in 2005.


> To go from 98% to 99% can cost thousands of dollars. To go from 99% to 99.9% tens of thousands more.

Somewhere in there is the pickle problem:

You have 1000kg of pickles in your basement. Now, pickles are mostly water. In fact, your pickles are 99% water, the rest is cellulose. Cellulose has negligible mass, so we can say that all the mass comes from the water. You leave your pickles in the basement for a year, and when you come back, they've dried out a certain amount, so they're now 98% water. What's their new mass?
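One way to work the puzzle, as a sketch only: it assumes the non-water 1% is what stays constant while water evaporates, which is the usual reading even though the setup calls that mass negligible. The function name is made up for illustration.

```python
def mass_after_drying(
    initial_mass_kg: float,
    initial_water_fraction: float,
    final_water_fraction: float,
) -> float:
    """Total mass once the water share drops, holding dry mass constant."""
    # The solids do not evaporate, so their mass is fixed.
    dry_mass_kg = initial_mass_kg * (1 - initial_water_fraction)
    # After drying, that same dry mass makes up a larger share of the total.
    return dry_mass_kg / (1 - final_water_fraction)


if __name__ == "__main__":
    print(f"{mass_after_drying(1000, 0.99, 0.98):.1f} kg")
```

Read against the quoted line, one interpretation of the analogy is that the interesting quantity is the small remainder: moving from 98% to 99% uptime means halving the downtime, not improving by one percent.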


Off topic, a sentence that really stood out for me:

"Now what if Delicious, Feedster, or Technorati goes down for 30 minutes?"

The article was written just six years ago, and these were the examples of popular sites that came to the author's mind. Gives you some idea on how transient this field is.


WoW goes down for nearly a full day every single Tuesday, and they seem to do OK.

I agree with the post, but I don't like that it encourages settling for less than the best.


The encouragement of frugality and pragmatism relating to spending to upgrade business systems is at odds with the author's publicized, flamboyant spending on frivolous ultra-luxuries. Apparently, purchasing $900,000 sports cars and houses in Italy is a higher priority for David than his customers always having access to what they're paying for.

That's his decision obviously, but if I was a 37 Signals customer ever inconvenienced by problems with their infrastructure, I'd think of this article.



