A gentle reminder that RAID doesn't make offsite backups (journalspace.com)
67 points by there on Jan 2, 2009 | 50 comments



It is astonishing that a site with years' worth of user-generated content didn't have any sort of backup plan. Or just years' worth of data in general!

RAID isn't a backup solution, period.


And no wayback machine either, ouch:

http://web.archive.org/web/*/http://journalspace.com

"We're sorry, access to http://journalspace.com has been blocked by the site owner via robots.txt."


Their robots.txt shows they allowed Google (at an extreme 3-minutes-between-fetches crawl-delay), but no other search engines, nor the Internet Archive.

Otherwise, a tool like 'Warrick' might get back a significant portion of the content -- especially any that was old and well-inlinked. As it stands, maybe they can scrape a little from Google's cache.

Warrick: http://warrick.cs.odu.edu/


The warning about RAID is well-founded, but I'm finding it hard to imagine what sort of backup plan could defeat deliberate sabotage by a former insider. I keep coming up with ideas and immediately thinking of problems with them. You could visually examine a substantial number of backed-up files to confirm they contain genuine and correct data... but then wouldn't it be possible, with enough effort, to create data that looked superficially valid but was in fact, on much closer inspection, mixed up beyond usefulness?

RAID isn't a guard against filesystem bugs, carelessness, or extreme physical server damage, but it seems to me this is more of a warning against bad blood.


If they had been backing their stuff up to Amazon S3, or just dragging a USB disk into the datacenter every month, they would have a copy of the data. And more than likely, if they used the S3 option, they'd have a copy current from the night before.
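
A minimal sketch of the nightly S3 option, assuming s3cmd is already configured (the bucket, paths, and directory layout below are made up):

    #!/bin/sh
    # Hypothetical nightly job: archive the site data and last night's DB dump, push it offsite to S3
    DATE=$(date +%Y%m%d)
    tar czf /var/backups/js-$DATE.tar.gz /var/www/journalspace /var/backups/db-dump.sql
    s3cmd put /var/backups/js-$DATE.tar.gz s3://example-backup-bucket/js-$DATE.tar.gz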

Somebody has to eat the broccoli and do the shit work of making sure that the data is backed up; if you can't do that or you hand off the job to the sort of person who is going to sabotage your company because they don't feel appreciated... you aren't a startup, you're a hobby.


The suspected culprit was the chief IT person at the company, and I think we're being led to believe he was both clever and motivated. USB disks would likely contain garbage data N days before the culmination of the sabotage, where N is whatever it takes to make sure the full rotation schedule gets ruined.

On the other hand, it probably isn't a coincidence that this happened to someone who had no backups at all.


At a high level the way to defeat this kind of attack is to require more people to be involved. Large conspiracies most often fail.

In this case that could mean having 3 USB disks, each with their own 'owner'. The owner of each disk would be responsible for making the backups and storing them in a place only they can access.

This is also a lesson in how to let someone go. They say it's possible that a former employee may have done the deed. When letting someone go on bad (and even good) terms, you need to have IT disable all of their accounts as soon as possible, preferably while they are in the process of being fired.


Of course, if you can't trust the (creator/maintainer of the) code on a machine, there's nothing you can do in code on the machine to make it secure.

That said, if this kind of problem is a valid concern for an installation, the simplest solution is external validation of backups. For example, one might hire someone not in control of the original system (and not in contact with them) to do restorations and provide certification that they worked according to a test plan. This is the foundation for a lot of Sarbanes-Oxley technical audit consulting.

Audited backups would, of course, require a fairly elaborate disaster recovery plan, which sounds out of scope for this company.

I don't want to pass judgement, but if a pivotal IT person was let go and sabotage was discovered, at least a basic pass through the systems, taking backups as you go, doesn't seem out of line.


If you regularly refresh your dev/qa environment from production backups, you get the verification "for free".
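
Something like this, as a rough sketch (the database, table, and host names are all placeholders):

    #!/bin/sh
    # Restore last night's production dump into the QA database, then run a trivial sanity check
    gunzip -c /var/backups/js-latest.sql.gz | mysql -h qa-db journalspace_qa
    ROWS=$(mysql -N -h qa-db -e 'SELECT COUNT(*) FROM entries' journalspace_qa)
    [ "$ROWS" -gt 0 ] || echo "restore produced an empty entries table" | mail -s 'backup check failed' ops@example.com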


Sure, it's possible. No backup plan is fool-proof. But if you set the hurdle high, you can make sure only a determined fool will exploit it.


Not even a dev server or QA server with some version of the database? Not even a one-off manual backup from ages ago? Nothing?

Pretty amazing.


Come to think of it I find this pretty much impossible to believe. Unless the data set was many, many terabytes of data, surely there would be some sort of copy of the database somewhere? Even the schema with some reference data...


That's entirely possible, but if the data is years old or simply reference data, what's the point? They just lost years worth of peoples' blogs (or whatever). That isn't really something that can be recovered from. Would you go back? Would you recommend it to a friend?


If I owned JournalSpace and only had a schema left, I'd want to give it another go, done properly. You still have the name, pagerank, domain name etc. Sure, your name is (rightly) dirt for a while but I'd still want to give it a go...


I just find it amazing that there was never a situation where someone said "I'd like to test this change on some live data before rolling it out" or "I'm about to make a change/upgrade, I better run a backup first"


Or on-site backups, for that matter.


I am absolutely certain, having been in a similar situation, that the sysadmins kept saying "we need this" and the accountants kept saying "no". I am also certain that the business will blame the sysadmins for not "selling" it to them.


Sure, I've been in similar situations too... but if it means buying a USB hard drive on the office supply budget and running manual backups to that, so be it.


"Real men don't take backups, but they cry a lot."

That said, it's really hard to do backups of lots of data.


If you can fit all of your data on one large drive, then it's not hard to do a backup.


That's not true. Transferring a large amount of data in a consistent manner can be difficult. Dumping your database without taking down your live site (or seriously impacting performance) can also be difficult without a dedicated backup slave. And the same can hold for any storage solution where you are short on I/O, no matter whether the data fits on a drive or not.
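
That said, even without a slave, MySQL/InnoDB can at least give you a consistent dump without locking the live tables (the I/O cost is still real). A sketch, with a placeholder database name:

    # --single-transaction takes a consistent snapshot without locking the live tables;
    # --quick streams rows instead of buffering whole tables in memory
    mysqldump --single-transaction --quick journalspace | gzip > /var/backups/js.sql.gz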


I dunno, dude. This doesn't seem like a site that requires extremely high availability.

And anyway, if they can deal with rebuilding a RAID after a failed drive, then (by definition) the site can deal with copying all the data off the drive. Heck, you could periodically yank a drive out of the RAID and replace it with a fresh one, and you'd have a backup.


Just yanking RAID drives isn't a way to guarantee DB consistency, especially if it's RAID5.

Database backups almost universally have to be made by the database system. This is no excuse for lacking a backup system; backing up databases is a solved problem.


Seems pretty clear that it's RAID 1 (mirroring). And isn't that exactly what RAID guarantees in a mirroring setup? That the two drives will have the same bits?


Even RAID 1 doesn't guarantee this'll work. There are many ways that simply disconnecting the secondary drive can cause problems:

1) The app may be in the middle of a series of DB commands that all need to complete with success before the DB is consistent at the app layer.

2) The DB is in the process of writing out some table rows and hasn't finished.

3) The DB has written some temporary locks to portions of the database that need to be released.

4) The OS hasn't committed writes from the DB to disk.

5) The disk hasn't committed writes from the OS to the platter yet.

Your best bet of this working is to cleanly shut down your app, then cleanly shut down your DB and run an fs sync. After all that it might be ok to yank the drive.

Just yanking a RAID1 drive may work sometimes, but I wouldn't count on it. Especially when every DB system I know of has some sort of backup/dump mechanism. As someone else mentioned, RAID is great for providing high availability, but it does not provide disaster recovery.
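
Concretely, something like this, assuming Linux md software RAID (the service and device names are placeholders):

    /etc/init.d/apache2 stop       # stop the app so no new writes arrive
    /etc/init.d/mysql stop         # clean DB shutdown flushes its own buffers
    sync                           # flush OS buffers to disk
    mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1   # detach one mirror half to carry offsite
    mdadm /dev/md0 --add /dev/sdc1                       # resync onto a fresh replacement disk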


With any sane storage subsystem you can un-mirror and re-silver hot. Lots of people used to back up Oracle like this (using a 3-way mirror), so you'd only be in hot backup mode for a minute. This was before RMAN.


That's what you get for shouting at those disks. Oh, different story. Sorry. :)


A reminder also to keep a local copy of your blog (or whatever) that, for all you know, is hosted like this.
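
Even a crude periodic mirror does the job, e.g. (the URL is a placeholder):

    # pull a static copy of your hosted blog into ~/blog-backup
    wget --mirror --no-parent --convert-links -P ~/blog-backup http://yourblog.example.com/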


It's a good thing that stuff like this doesn't happen too often, or people would be more wary of trusting all their data to web services.

I mean, we assume Google has a really amazing backup strategy for Gmail. We hope they do, because, jesus, I have a lot of irreplaceable information in there. But we don't actually know.

Hmmm... does Gmail have an "export all" feature...?


imap


pop3 + fetchmail + crontab makes me feel a whole lot better
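
Roughly this, assuming POP access is turned on in the Gmail settings (the address and password are obviously placeholders):

    # ~/.fetchmailrc (chmod 600) -- pull mail over POP3/SSL, keep copies on the server
    poll pop.gmail.com proto pop3 user "you@gmail.com" password "secret" ssl keep mda "/usr/bin/procmail -d %T"

    # crontab entry: fetch quietly every night at 3am
    0 3 * * * /usr/bin/fetchmail -s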


Sadly it doesn't work for chat... I use a lot of Gmail chat, and there is no known way to download a chat archive or anything like that.


Speaking of off-site backups, I have recently discovered http://www.jungledisk.com/homeserver/index.aspx

It backs up your stuff to S3, and it can work on your Mac/PC, or on Windows Home Server. Haven't tried it yet, but it looks pretty neat.

Also, the new NAS from HP has built-in S3 backup: http://www.engadget.com/2008/12/29/hp-mediasmart-server-ex48...


I agree. JungleDisk is an easy to use program that allows you to back up on S3, and it works with GNU/Linux as well. For $20, you get unlimited free updates, unlimited installations tied to the same S3 account, and all three versions (Mac, Windows, Linux). Amazon S3 is pricey, but this program looks like a great deal. Other recommended services are Dropbox and CrashPlan.


I love Dropbox and I am a recent convert to it, thanks to someone remarking about it in a HN thread a few weeks ago. I now have my critical (encrypted) accounts & passwords file in Dropbox on computers both at home and work. I like that it's cross-platform too. I haven't heard about CrashPlan before, so I am going to Google it now. Thanks.


I can't believe that they didn't have any backups. It's not hard to do.

My operation is way smaller, and I have a cron script that runs once a week to dump the database, zip it, and then transfer it to a backup server.
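
Something along these lines, as a sketch (the dump script and backup host are placeholders; the %'s have to be escaped in a crontab entry):

    # weekly cron entry (Sunday, 4am): run the dump script, then copy the result to the backup server
    0 4 * * 0  /usr/local/bin/dump-db.sh && scp /var/backups/db-latest.sql.gz backup-host:backups/db-$(date +\%Y\%m\%d).sql.gz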

This is one of the first things that I set up. It really should be the first thing any company with user-generated content does.


After building the app, of course. Without the app, there's no data to back up. Point taken, though.


You'd have a backup strategy in place for your source control system, right? Even if it's just tarring it and scp'ing it somewhere in a cron job.
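
Even the crude version is a one-liner (the repository path and host are made up):

    # run nightly from cron: stream a tarball of the repo to another machine over ssh
    tar czf - /var/svn | ssh backup-host "cat > /srv/backups/svn-$(date +%Y%m%d).tar.gz"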


I'd say they deserved to go out of business.


I hate to be judgmental. We are humans, we all make mistakes. But it's my understanding that they were operating for several years and hosting the data for many people. Carrying on without a backup plan and hoping for the best was very irresponsible and I agree with you that they, sadly, deserved to go out of business.


The only good thing to come out of this is that everyone involved (sysadmins and management) now is (or should be) a devout believer in making routine backups. It's a hard way to learn a lesson, but they're not the first nor will they be the last to learn the hard way.


RAID 1 mirroring is pretty much only used to prevent downtime. So if your hard drive physically screws up, you still have all the boot sectors intact on the other drive, so even a restart wouldn't kill you. I only use mirroring on disks with spinning platters; for solid state I just ignore mirroring. Solid state can go bad, but not as easily as magnetic metal spinning 7200 times a minute.


That should say "dramatically reduce downtime". A small number times a small number is still not zero, and it is possible, though highly unlikely, to suffer a second drive failure while operating in degraded mode. My understanding is that's not what happened in this case, though.


It's actually very common for both drives to fail at once. Examples:

- case fans fail, everything overheats

- room AC fails, everything overheats

- power supply goes haywire, toasts drive electronics

- box falls off the shelf, drives crash

- fire or smoke damages drives

- roof leaks, drips onto drives

- one drive fails but nobody notices for a month until the next drive fails

- burglars steal box

No, not all these have happened to me.


Just make sure you've accounted for any finite write limitations built into your solid-state drive. Sure, it's not likely that you'll reach the limit, but if you're successful and relying on a random sector distribution you could get there. And that could be just as problematic as a bad platter.

http://en.wikipedia.org/wiki/Flash_memory#Limitations


...'gentle'?


There's mention of a staff, but I can't imagine how they survived based on the traffic estimates I've seen for their site (14,000 visitors a month per Slashdot). Does anyone know more about their business model?


doesn't the "automatically copied to both drives" mean they're running a Raid 1?

The problem seems obscure...what kind of software bug overwrites all data on disk?


    dd if=/dev/urandom of=/dev/sda1   # urandom won't block waiting for entropy the way /dev/random does


Especially one that only contains two drives...



