Rackspace Goes Down. Again. Takes The Internet With It. Again. (techcrunch.com)
48 points by vaksel on Dec 18, 2009 | hide | past | favorite | 35 comments



I can't believe how TechCrunch is turning into a tech-tabloid.

Yes, there are issues (perhaps sometimes serious ones) that need reporting - but the way TC does it isn't quite the quality I'm expecting.


I agree with you, but I'd bet those are TechCrunch's most popular posts. In fact, the author (MG Siegler) was essentially hired because he's good at writing those types of posts.

People like trashy journalism. It's like the Tiger Woods thing. I don't mind the networks reporting on it but the fact that it's been the top item on "legitimate" news sites should tell you how modern day journalism works.


What quality are you expecting?

I sincerely challenge anyone to find a story/post/article from TC full of technical details that whets your technical palate. Chances are you won't find any.

It's really a shame that TC is considered a tech site. If any other site had as much BS posted to the HN front page every day, it would have been flagged and banned at a moment's notice. Yet TC survives and thrives on HN simply because of Arrington's connections.

I don't think I would be incorrect to say that the overwhelming majority of the HN crowd absolutely hates TC articles and their style of "journalism", yet for the mere fact that you can't "downvote" anything, stupid shit like this clutters the front page.


> I can't believe how TechCrunch is turning into a tech-tabloid.

Isn't that all it's ever been anyway? Was there some kind of wow-thats-awesome version of TechCrunch that existed back before I knew about it?


Amazon just started looking a whole lot more attractive. They may not have such fanatical support, but they do know how to fix their stuff.


Well, Rackspace knows how to fix their stuff too, which is why it's up again after the problem appeared. ;-)


Attention Internet start-ups: operations is your core competency. You can't just expect to push your application code to "the cloud" and have somebody else make it scalable and fault tolerant.

It's fine to use a managed hosting service provider if you're just getting started and paying month-to-month based on credit cards and can't afford networking gear (hardware itself can be leased). However, it shocks me to see venture funded, post-series A start-ups exclusively relying on others to do their operations (including systems administration).

The problem with outsourcing your operations to somebody else is that you're outsourcing it to somebody who has zero knowledge of your application and is also responsible for, at the very least, dozens of other customers. Essentially, rather than developing your own vertical technology team, you're relying on a horizontal technology team whose resources multiple other companies (including your competitors) are fighting for.

That's exactly how poorly run big companies function (multiple engineering teams competing for the attention of monolithic operations, SCM, release, QA, etc. teams). Well-run Internet giants, however, function much differently. If you look at Google's job openings, you'll note that they don't hire IT/systems administrators (aside from data-center technicians and corporate IT). Instead, they have Site Reliability Engineers (SREs) assigned to specific properties. The SREs are actual engineers who are able to write and debug code and deeply understand the application-specific stack they operate, rather than treating it as a black box.

Sure, there are great business and technical reasons (edit: it said financial, which I felt was an unfair strawman) to use a managed/"cloud" provider (EC2 and Rackspace Cloud are very attractive due to the ability to add/remove nodes as traffic goes up and down, as well as to provision machines ad hoc for analytics tasks/MapReduce) -- but even then, you're not off the hook for operations. It's still your responsibility to ensure that, at the very least, there's a "hot standby" disaster recovery site. Yes, it's not easy - but running a successful business isn't supposed to be.
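The "hot standby" idea above can be sketched as a small watchdog: probe the primary site, and only flip traffic to the standby after several consecutive failures, so one dropped probe doesn't flap DNS. This is a minimal illustration under made-up assumptions (the URL and the threshold are hypothetical), not anyone's actual failover tooling.

```python
# Sketch of a primary-site watchdog for a hot-standby DR setup.
# PRIMARY_URL and FAILURE_THRESHOLD are hypothetical values.
import urllib.request

PRIMARY_URL = "http://www.example.com/"  # hypothetical primary site
FAILURE_THRESHOLD = 3  # consecutive failed probes before failing over

def probe(url, timeout=5):
    """Return True if the site answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def should_fail_over(probe_results, threshold=FAILURE_THRESHOLD):
    """Fail over only after `threshold` consecutive failures,
    so a single dropped probe doesn't trigger a DNS switch."""
    if len(probe_results) < threshold:
        return False
    return not any(probe_results[-threshold:])
```

In a real deployment the decision function would drive a DNS update (or a load-balancer reconfiguration) rather than just returning a boolean, and the watchdog itself would need to run outside the primary data center.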


Comparing Techcrunch to Google is a bit lopsided, don't you think? Operations for most websites is like vehicle maintenance for a pizza delivery company: essential for success, but hardly their "core competency". This is especially true if the website is something relatively standard like a blog or news site. It takes quite a bit of scale and customization before it makes sense to hire full-time dedicated sysadmins.


If you don't need a full time operations engineer, you should become competent as a part-time one. Both Google and Techcrunch share the fact that they have paying advertisers who expect their site to be up and delivering their content.


> Operations for most websites is like vehicle maintenance for a pizza delivery company: essential for success, but hardly their "core competency".

Very interesting analogy.


Would you fly on an airline that treated their aircraft maintenance the same way? So why would you expect advertisers to run their ads on sites that don't have a person who understands the application carrying a pager?


This is good advice up to a point; everyone should do some basic things to get the proper level of availability for their needs. Referencing a blog post I made about this during the BitBucket issue a couple months back: http://www.bretpiatt.com/blog/2009/10/03/availability-is-a-f...

I'm not a big fan of DIY, and this isn't because I work for a service provider (Rackspace). I fundamentally don't believe businesses should spend resources with business-specific knowledge on commodity tasks. A number of providers are capable of setting up a multi-DC configuration for customers, and a number of consultancies are capable of setting up a multi-DC/multi-provider configuration. The provider or consultant can give clear instructions and/or support through the commodity layers. Businesses will ultimately need to know the specific details of their application and how to troubleshoot it, unless they're paying the provider/support team to do so.


You yourself are at risk of hitting the exact same problem... who do you think is better at fixing it: a startup guy who is no expert, or the 100 people at Rackspace who know what they are doing?


If your start-up has nobody who is competent at operations (which really is a lot more engineering than IT/systems administration), it deserves to fail. Incidentally, I can say from personal knowledge that the "big successes" amongst Internet companies (i.e. those that have achieved product/market fit in competitive, multi-billion dollar markets) have had a very strong operations engineer amongst either the founders or the first five employees.

If you use Rackspace to host your machine, that's an acceptable way to save money on hardware and networking gear (while you're still in the early stages).

However, you can't and shouldn't expect Rackspace's engineers to do your operations for you. It is your responsibility to understand how to tune your operating system, application server, database, et al. specifically for your application.

If you expect others to turn your application into an Internet service, you're analogous to a "business guy" on Craigslist posting about his "next big idea" that he wants somebody else to build.

The big point I meant to make is that if your site is down because your hosting provider is, it's still your fault (for not having a DR site).


A hacker who dabbles in IT will never be as competent as an expert.

And bringing in an IT founder early on is a waste of equity, since you're giving away the farm when the guy has nothing to do. Plus, all that tuning only matters when you're bringing in a ton of traffic; until then it doesn't matter whether your server is 94% efficient or 97%. Worst case, you throw another server into the mix.


The best software engineers are also strong systems administrators, and it's commonly understood that strong systems people must be strong software engineers. An operations department at a serious Internet company wouldn't even invite somebody who can't write production-quality code to an in-person interview (they wouldn't pass the phone screen).

Strong operations people would also resent being called "IT". IT sets up corporate desktops/mail/calendaring, the VPN and mounts machines in a rack. Indeed, this could be outsourced (mail to Google Apps, rack mounting to Rackspace, OpenVPN in lieu of an expensive Cisco product).

Production operations involves something entirely different and requires an entirely different skillset. Think being able to create application-specific automated provisioning, configuration, deployment, software load balancing, monitoring, fault tolerance and high availability. Monitoring is especially tricky: nearly all the enterprise and open-source NMS products, for example, "get it wrong" (e.g. Hyperic using a single database to store events and only allowing you to poll every few seconds). Monitoring an application (as opposed to a machine) requires writing custom code (by definition); otherwise you simply can't capture the essential vital statistics (and not just whether or not something is listening on port 80).
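The point about application-level monitoring can be sketched as follows: instead of asking "does port 80 answer?", pull vital statistics from the application itself and alert on application-specific thresholds. The /health endpoint, its JSON schema, and the limits below are all hypothetical.

```python
# Sketch of an application-level health check (as opposed to a port check).
# The /health payload schema and thresholds are made-up examples.
import json

# Example payload a hypothetical /health endpoint might return.
SAMPLE = '{"queue_depth": 120, "db_replication_lag_s": 4.2, "error_rate": 0.003}'

THRESHOLDS = {
    "queue_depth": 1000,         # alert if work is piling up
    "db_replication_lag_s": 30,  # alert if the standby falls behind
    "error_rate": 0.01,          # alert on elevated error rates
}

def check_vitals(payload, thresholds=THRESHOLDS):
    """Return the list of metrics that breach their thresholds."""
    vitals = json.loads(payload)
    return [name for name, limit in thresholds.items()
            if vitals.get(name, 0) > limit]
```

The interesting part is that the metrics themselves (queue depth, replication lag, error rate) only exist inside the application, which is why this kind of check has to be custom code rather than a generic port probe.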

I hate doing an appeal to authority, but I should add I've done operations at a very large scale (10,000+ machines) in a big name Internet company, as well as on a smaller scale at startups. I've since moved on to software engineering (since that's what I love doing the most), but it bugs me to see misconceptions about the nature of running an Internet service. Unfortunately these misconceptions run deep and often disadvantage entire application stacks by making them unfriendly to operations (e.g. the flaws of JMX).


TechCrunch ought to get a little green light image in their sidebar that goes red and displays the text "Rackspace is down" every time it is.


Yeah, and followed by a GoogleWave of Erlang threads on HN. Please.


Linode.


Linode had several hours of downtime back in October: http://www.linode.com/forums/archive/o_t/t_4778/host_reboots...

Not to knock on Linode - I use them - but as earlier commenters have stated, downtime is a fact of life, and no host is immune.


It's amusing that Techcrunch considers themselves and a handful of other startups "the internet".

This is why we architect provider independent solutions. Keep your TTLs low. Replicate your site to another provider, and be prepared to hit the switch when your primary provider inevitably fails. All datacenters lose power, all networks get borked, and all storage systems fail. It's a fact of life.


Yeah the snotty attitude advanced in the article is pretty offensive. If you run something on the internet and rely on it for your dinner, you don't really have an excuse for single points of failure. Do your job.

It's exceptionally easy to engineer around these issues if you don't just outsource all of your thinking to a provider.


MG Siegler, the author, isn't exactly the most technically astute person around, which isn't inherently bad except that he doesn't seem to recognize his own limitations around such matters.

MG has a bad tendency of breathlessly reporting and misreporting on issues that are far outside of his area of expertise. One of my favorite examples happened a month ago when MG misdiagnosed CSS files not loading on Twitter as 'BRAND NEW FEATURES THAT ARE UGLY, OMG!' (http://www.techcrunch.com/2009/11/17/twitter-just-ui-puked-o...).


> It's exceptionally easy to engineer around these issues if you don't just outsource all of your thinking to a provider.

Can we stop blaming internet companies for outsourcing certain responsibilities? A decent hosting provider should ensure there isn't a single point of failure in their service. You don't ask a random company to set up their own accountancy department either: there are specialized companies to which you can outsource that. They bear the responsibility for faulty reports, as Rackspace bears the responsibility for not having a decent failover setup.

As a specialist of some sort, you have better things to do than to think about what is needed for a proper failover setup. Especially because you will still miss certain subtle details, because you are not an expert in handling a datacentre failure. You should hire experts to set that up for you and a hosting provider has those experts.

Honest to God, what's with this attitude of 'if something on the internet fails, it's your own fault, because we are all smart enough to take care of all those contingencies'?

1: we aren't; failover handling is prone to subtle bugs. What tptacek keeps saying about security also holds for contingency handling.

2: it's not smart to spend your time bothering with intricate, but ultimately mundane, issues that can be outsourced.


Not to mention how trivial it is to replicate a blog.


I don't know, I sort of side with TechCrunch on this one. Yes, you can replicate a site (some more easily than others, and a blog should be relatively easy), but that's still no excuse for datacenter downtime, and Rackspace has had a remarkable amount of it.

Any datacenter downtime is serious business -- at a good datacenter, it's a small-ish incident every few years. Rackspace, on the other hand, seems to have a large incident every few months, meaning it's probably best to take your servers elsewhere.


I used to think that Data Center Downtime was serious business, but I've come around to believing that it shouldn't be. If your data center going down is a problem, then you don't really have a very robust failover plan.

If, instead, you _plan_ for the Data Center to go down, and treat a data center failure as a trivial issue, then your DR plans start to become significantly more robust.

In particular, the companies I admire are the ones that routinely swap their DR and production facilities - and when a production data center goes down, they don't even bother to wait: they just light up the DR center and are back in business.

Most of the SaaS Financial Hosting companies (Oracle Financials) that I've talked with will provide you with that feature.


We have a backup. Guess where it was.

Now guess where it will be tonight.


Pardon my ignorance, but how do you hit the switch? Or, rather, what is the switch? DNS is slow to propagate, and if you have some kind of server for your "switch" then isn't that still a single point of failure?


Great question. In the common case, the solution is to set low TTLs (time-to-live values: they tell caching resolvers how long to keep the record) on the DNS records, so they expire quickly.
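For illustration, a low-TTL record in a BIND-style zone file might look like this (the name, the addresses, and the 60-second value are made up):

```
; Low TTL (60s) on the record you plan to flip during failover.
www   60   IN   A   203.0.113.10    ; primary site
; during failover, replace with the standby's address, e.g.:
; www 60   IN   A   198.51.100.20   ; standby site
```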

You can also run a pair of "front end" servers at all times, and use magic in the back-end to replicate the DB between sites. Depending on the type of site in question this can be really easy or really hard.

Another (generally commercial hardware) solution is to have one DNS server receive the query. That server waits a few milliseconds before responding while it sends a copy of the query to the other DNS server. The two have different answers to the same question (dns-east will return the IP for www-east; likewise for dns-west and www-west). They send their answers at the same time, and the closer server "wins" by reaching the recipient first. This is called GSLB, Global Server Load Balancing.


But your GSLB requires your DNS server to be up, which it won't be if the datacenter blacks out. Ditto if you're running a "pair of 'front end' servers".

Shouldn't DNS records have a reserve address too? Then the client could decide, if a failure had occurred, to use the alternate address.


Are low TTLs effective in practice? It sounds like a good idea, but what percentage of users are behind resolvers that just ignore the TTL and cache records well beyond what they should?


With a 60 second TTL I see most traffic drop off within 5 minutes, and almost all of it gone within 30 minutes. There's a few stragglers who seem to still be there even an hour after the switch, but by and large DNS TTL does work.


Holiday party. Someone spilled egg nog on a surge protector.


This seems like a good candidate for http://sadtrombone.com



