Ma.gnolia.com crashes hard -- no backups? (gnolia.com)
95 points by eli on Jan 30, 2009 | 77 comments



This is why it's so important to take some simple steps to ensure that you don't bork your site. For my production sites, I usually have the following setup:

* Replicated database with one slave in a separate data center purely for disaster recovery (the separate data center isn't a performance thing). Heartbeat for the slave(s) in the same data center so that it stays up.

* Nightly offsite backups. This used to be rsync, but recently I've been thinking of using tarsnap for the ease of it (and S3 puts the data in several data centers).

* Files stored either in a MogileFS setup or S3 in multiple data centers.

It doesn't have instant failover to another data center, but the offsite DB slave should mean no data loss beyond a second or two. Ma.gnolia is a decent-sized site; maybe they did have decent infrastructure and it will only be a little while before they've gotten everything back. Of course, after the fiasco with the blog site that used RAID as their backup system, I've started to think that many people don't take data as seriously as it needs to be taken.
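
For the curious, the MySQL side of that offsite slave is roughly the sketch below -- hostnames, credentials, and the log coordinates are placeholders, and it assumes binary logging is already enabled on the master:

    # on the master: a consistent seed dump that records the binlog position
    mysqldump --all-databases --single-transaction --master-data=2 | gzip > seed.sql.gz

    # on the offsite slave: load the seed, then point it at the master
    # (the real MASTER_LOG_FILE/POS values are in the comment that
    # --master-data=2 wrote near the top of the dump)
    gunzip < seed.sql.gz | mysql -u root
    mysql -u root -e "CHANGE MASTER TO
        MASTER_HOST='db-master.example.com', MASTER_USER='repl',
        MASTER_PASSWORD='********',
        MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=4;
      START SLAVE;"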


Reminds me of the Couchsurfing fiasco back in 2006.

http://www.techcrunch.com/2006/06/29/couchsurfing-deletes-it...

http://www.couchsurfing.com/crash_page.html

Couchsurfing had a backup process in place, but it failed due to human error. It's highly likely that Ma.gnolia had backups of some sort. Before we all start berating Ma.gnolia (and website owners in general) for not backing up their data, we need to know more about the real reason for this failure.

Couchsurfing was able to come back from the dead because of limited competition. In this case, unfortunately, it will be easier for users to look elsewhere.


Reminds me of LeafyHost.

(Summary: Small hosting company advertising multiple backups; their single server crashed and they had no backups at all. Long story involving terrible customer service ensues).


You're right that we probably shouldn't jump to conclusions, but it doesn't look good.

And human error shouldn't really be possible if your backups are automated and you periodically do test restores.


I like the idea of the offsite slave. I was also reading about another idea, forcing a slave delay to recover from oopses/corruption:

http://www.rustyrazorblade.com/2008/05/07/mysql-time-delayed...
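
For reference, newer MySQL versions (5.6+) have this built in as MASTER_DELAY; at the time you'd use a tool like Maatkit's mk-slave-delay to get the same effect. A sketch of the built-in form, with the one-hour delay purely as an example:

    # run against the slave you want to lag behind the master
    mysql -u root -e "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY = 3600; START SLAVE;"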


I do something like that for my outgoing email ;)


"take some simple steps to ensure that you don't bork your site"

unborked[.com] is currently available.

A site dedicated to stopping all of the borking going on out there.


I think some cheap, simple SQL dumps from the db, burned to an external HD and taken home, would've been a good first step -- your data would be dated, but still restorable, without the need for off-site, replicated, heartbeat-monitored databases.

IMHO you really need an expert database admin (which I am not) to get all of this up and keep it going; a caretaker whose only job in life is the maintenance of any and all persisted data.

On another note, backups are really useless unless you have actually tested the restores.


What does 'burning to an external hard drive' do?

Is this some new tech that I'm unfamiliar with? ;)


I'd really like to store incrementals of my MySQL at S3. Anyone using something they really like? Or should I just break down and do full dumps each night?


Tarsnap.

Tarsnap lets you say: tarsnap -c -f backup01302009 mysql_dir/

And you can just adjust the date each day. It gives you the luxury of a full dump (anytime you want to restore, just reference backup01302009), but it only actually stores the deltas (making sure not to duplicate data that might be in backup01292009 or backup01282009 and so on). Tarsnap stores the data to S3 so that it's replicated in multiple data centers.
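
If you put it in cron the date juggling disappears; a minimal sketch, assuming your dumps land in /var/backups/mysql (and, per the point raised below, that you archive mysqldump output rather than the live data directory):

    # crontab entry: one dated archive per night; tarsnap stores only the deltas
    # (% must be escaped in crontab lines)
    30 3 * * * tarsnap -c -f "mysql-$(date +\%m\%d\%Y)" /var/backups/mysql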

It costs a little more than S3 at 30 cents per GB, but it's metered out so that if you only use 1MB of storage, you'll only be charged 0.03 cents for that storage. You could try creating your own way of doing incrementals, but I doubt you'd get it as efficient as Colin (the math genius behind Tarsnap) and so I doubt you'd get it cheaper. Plus, this way you don't have to deal with it.

And remember, it's hard to fill up a database.* As the Django Book notes: "LJWorld.com's database - including over half a million newspaper articles dating back to 1989 - is under 2GB." So, if they were using Tarsnap, they might be storing 5 or 10GB tops at a whopping $1.50-$3 per month plus whatever the transfer of their deltas was for the month. Oh, and tarsnap compresses the data too. So, maybe they'd be paying $1 or something lower.

* Clearly, if you hit it big time, you might not want to continue paying for tarsnap. However, if you become the next big thing, you can hire someone to deal with it for you.


This doesn't work. You can't just copy the files and expect them to be in a sane or consistent state.

You either need to a) use InnoDB hotbackup or b) use a slave, stop the slave, run the backup, and restart the slave to catch up.

At delicious we used B, plus a hot spare master, plus many slaves.

Additionally, every time a user modified his account, it would go on the queue for individual backup; the account itself (and alone) would be snapped to a file (Perl Storable, IIRC), which only got regenerated when the account changed, so we weren't re-dumping users that were inactive. A little bit of history allowed us to respond to things like "oh my god, all my bookmarks are gone" and various other issues (which were usually due to API-based idiocy of some sort or another).


Using a slave isn't foolproof either. If someone were to run a malicious command, it would get replicated, and could get backed up before being caught.


I didn't say that. Read what I wrote.

You use the slave so you can shutdown the database and get a consistent file snapshot. Then you do offline backup.
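
A rough sketch of that sequence on the slave -- the init script path and data directory are whatever your distro uses:

    #!/bin/sh
    # pause replication, shut the slave's mysqld down cleanly,
    # copy the now-quiescent data directory, then bring it back up;
    # the slave reconnects and catches up on its own
    mysql -u root -e "STOP SLAVE;"
    mysqladmin -u root shutdown
    tar czf /var/backups/mysql-files-$(date +%Y%m%d).tar.gz /var/lib/mysql
    /etc/init.d/mysql start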


Yeah, it's true. I was a little simplistic. I usually use A, but I'm not dealing with the amount of data that delicious is.


Whenever Tarsnap is mentioned, I have to mention Duplicity which does the same thing, but is Free Software.

I use this for my personal backups, as well as backups of our work svn (fsfs) and git repositories. I use it against S3, and have found it incredibly reliable.

As a bonus, it encrypts everything but still does incremental backups. It's a really nice piece of software, and you don't have to pay anyone to use it.
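
For anyone who wants to try it, the basic shape is something like this (the bucket name and paths are made up, and duplicity needs boto plus your AWS keys and GnuPG passphrase in the environment):

    export AWS_ACCESS_KEY_ID=...        # placeholders: your S3 credentials
    export AWS_SECRET_ACCESS_KEY=...
    export PASSPHRASE=...               # used to encrypt the archives

    # incremental, encrypted backup of the dump directory to S3
    duplicity /var/backups/mysql s3+http://my-backup-bucket/mysql

    # and the part people forget to test: restoring it somewhere else
    duplicity s3+http://my-backup-bucket/mysql /tmp/restore-test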


...Duplicity which does the same thing...

Duplicity is not the same thing as tarsnap. Duplicity uses a full plus incrementals model compared to tarsnap's snapshot model, so with duplicity you're either going to be stuck paying to store extra versions you don't want or be stuck paying for multiple full backups. Moreover, tarsnap is considerably more secure than duplicity.

Before I started working on tarsnap, I considered using duplicity; but it simply didn't measure up.


How is tarsnap considerably more secure?


Some problems with duplicity off the top of my head -- I'm sure there are others (there always are):

1. Duplicity uses GnuPG. GnuPG has a long history of security flaws, up to and including arbitrary code execution. Yes, these specific bugs have been fixed; but the poor history doesn't inspire much confidence.

2. Duplicity uses librsync, which follows rsync's lead by making rather dubious use of hashes. In his thesis, Tridge touts the fact that 'a failed transfer is equivalent to "cracking" MD4' as a reason to trust rsync; but now that we know how weak MD4 is, it's possible to create files which rsync -- and thus Duplicity -- will never manage to back up properly.

3. When you try to restore a backup, the storage system you're using can give you your most recent backup... or it can decide to give you any previous backup you stored. Duplicity won't notice.

4. If you try to use the --sign-key option without also using the --encrypt-key option, duplicity will silently ignore --sign-key, leaving your archives unsigned. Based on comments in the duplicity source code, this seems to be intentional... but this doesn't seem to be documented anywhere, and it seems to me that this is an incredibly dumb thing to do.


EBS does deltas too. Is anyone else using it? I like the ability to mount a volume or clone a volume almost instantly and mount it on another machine.


EBS does deltas, but there are a few caveats. The most important being that you need to be using EC2. For many, $72/mo plus bandwidth might be a bit much for what they're doing if it can work on a 512MB Xen instance for under $40 with a few hundred gigs of transfer included.

Beyond that, drive snapshots aren't the easiest things to do. I know that RightScale tells their customers to freeze the drive so that no changes can occur until the backup is complete. With S3 performance around 20 MB/sec, backing up 1GB would take around a minute. That's not bad, and since it only does deltas it's unlikely you're going to have a huge amount to back up at any given time, but it isn't exactly good either. With file-level backup, you can do a mysqldump and then just back up that file. Eh, maybe I'm just preferring the devil I know in this situation.

It's a little more complex to set up (doing file-level backups), but if you're going the volume route, you need to make sure you don't leave the drive in an inconsistent state.

All that said, EBS is awesome. If it fits what you're looking for, then go for it!


This is not totally accurate. EBS snapshots are basically instantaneous; it's just the copy to S3 that takes time, and Amazon performs this in the background. We use XFS on our EBS volumes (running MySQL 5 InnoDB) and then have a little Perl script (http://ec2-snapshot-xfs-mysql.notlong.com/) that does FLUSH TABLES WITH READ LOCK -> xfs_freeze -> snapshot -> xfs_freeze -u (unfreeze) -> UNLOCK TABLES. The whole process takes a fraction of a second, and it also logs where in the binlog the snapshot was made (handy since we create new slaves based off snapshots, and it reduces how much data we shuttle around).
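
For those who don't want to read the Perl, the core of it is roughly the following -- the mount point and volume ID are placeholders, ec2-create-snapshot assumes the EC2 API tools are configured, and it leans on the mysql client's \! shell escape so the read lock is held in one session until UNLOCK TABLES:

    mysql -u root <<'EOF'
    FLUSH TABLES WITH READ LOCK;
    \! xfs_freeze -f /ebs
    \! ec2-create-snapshot vol-01234567
    \! xfs_freeze -u /ebs
    UNLOCK TABLES;
    EOF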

We snapshot a slave every 10 minutes and the master once a night (just in case something totally weird happens to the slave and the sync isn't right). This is a multi-gig DB and we've had no problems.

Here is a link to a full tutorial about running MySQL on EC2 with EBS: http://developer.amazonwebservices.com/connect/entry.jspa?ex...

I wanted to also point out that a live slave is NOT a backup scheme. If someone hacks your database and runs DROP ALL FROM PRODUCTION_DATABASE you've now got a perfect copy of nothing.


Depends on the data and your budget, I guess.

Disk space is cheap -- I do a full dump of the database nightly (and a separate system dumps a few key tables every 15 minutes).


Then you can still lose a day's data. Why not switch and ship the MySQL binary log every X minutes? The last backup plus all the logs gives you much better recoverability.
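
A minimal version of that, assuming log-bin=mysql-bin in my.cnf and the default data directory (the newest log is still being written, so its copy is only complete as of the next run):

    #!/bin/sh
    # rotate the current binlog so the closed ones are safe to copy,
    # then ship them offsite; run from cron every few minutes
    mysqladmin -u root flush-logs
    rsync -az /var/lib/mysql/mysql-bin.[0-9]* backup@offsite.example.com:/backups/binlogs/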


Gut-wrenching, for sure. Seeing your whole model explode -- it's just not worth it.

If you don't have a backup plan -- close your browser, stop surfing Y!Hacker, and go write a shell script to dump your database and rsync to anywhere else on earth.
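
It really can be that small; a sketch, with the host and paths obviously yours to change:

    #!/bin/sh
    # nightly: dump everything (--single-transaction keeps InnoDB consistent),
    # compress it, and push it somewhere far away
    mysqldump --all-databases --single-transaction | gzip > /var/backups/nightly.sql.gz
    rsync -az /var/backups/nightly.sql.gz backup@far-away.example.com:/backups/$(hostname)/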


...and then do a test restore!


YES! This comment should be modded up to the top.

We talk to a lot of customers about their backups every day. I'd say more than half of the backup failure stories we hear come from failing to adequately test that data can actually be restored.

Some have been comically tragic. Like creating an offsite backup using encryption keys that are only archived as part of said backup... :(


"creating an offsite backup using encryption keys that are only archived as part of said backup..."

Ugh, that is awful.


A former colleague was attempting to replace a failed hard drive in a server. So he installed a fresh OS, put the tape in the drive, installed the backup software and watched as the backup software started to reformat the tape...


Amen to that! You should see the confused looks I get when doing technical due diligence somewhere and they get the question: "When did you last try to restore a backup?"

Then, once they get the point, they want to leave the room ASAP...


An easy way to have a space efficient, perpetual backup of a database:

Take a dump of the database every N units of time.

Compress the dump file using gzip with the --rsyncable option. This increases size by 1% but makes it efficiently diff/patchable.

Use a binary delta tool (librsync's rdiff, or xdelta) to make a patch going back to the previous compressed dump -- plain diff can't patch binary files. Keep the patch and discard the previous dump if you like. You can now apply the patches in succession to go back however far you like.

Finally, and most importantly, use par2 to store parity along with your backup files to protect against silent bitrot.

Note: --rsyncable is in newer gzip versions. It's too new to be included in OS X Leopard.
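
One round of the scheme looks roughly like this; file names are illustrative, and it uses librsync's rdiff for the binary delta since plain diff/patch won't handle compressed files:

    #!/bin/sh
    # today's full dump, compressed so small DB changes stay small
    mysqldump --all-databases | gzip --rsyncable > dump-today.sql.gz

    # reverse delta: lets you rebuild yesterday's dump from today's
    rdiff signature dump-today.sql.gz today.sig
    rdiff delta today.sig dump-yesterday.sql.gz go-back-one-day.delta
    rm dump-yesterday.sql.gz            # optional: the delta replaces it

    # parity blocks so silent corruption of the archives can be repaired
    par2 create -r10 dump-today.par2 dump-today.sql.gz go-back-one-day.delta

Walking back is then "rdiff patch dump-today.sql.gz go-back-one-day.delta dump-yesterday.sql.gz", applied delta by delta for however far back you kept them.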


Wouldn't these two solutions be even nicer?

* Back up the binlog of the DB (at least it's called the binlog in MySQL)

or

* Make a diff of the uncompressed dumps and zip that. This diff would even be human-readable.
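
The second option is about as simple as it gets; a sketch, again with made-up file names:

    # human-readable patch between two plain-text dumps, then compressed
    zcat dump-yesterday.sql.gz > yesterday.sql
    zcat dump-today.sql.gz > today.sql
    diff -u yesterday.sql today.sql | gzip > yesterday-to-today.patch.gz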


It's easy to say, "What? No backups... how stupid can you be?" But, without knowing the particulars, I wouldn't necessarily jump to that conclusion. It's very easy to have regular automated backups fail in some way.

Unfortunately, it's not enough just to have backups. You have to actually verify that they're correct and up-to-date. Verification is easy when your database is small (for example). You can just load it in your development environment occasionally. But, how do you verify your backups if you have hundreds of gigs or even terabytes of data?

As an example, I've seen cases where backups were successful every night... but they were being run against a slave DB and replication had failed. The result: excellent backups of weeks-old data.


A backup is only a backup when you have restored from it successfully. Until then, it's no better than garbage.

A disk RAID is only a RAID when you have tried pulling a disk and had the array successfully rebuild (while running in production!).

I once had a system administrator who didn't dare pull a RAID disk while in production. After that, he wasn't a sysadmin in my eyes; he was a lottery gambler.

Being paranoid doesn't mean nobody is following you.


That's nasty. Maybe you could automatically inject test data into a fake account and run test queries against the backup. That won't verify you're getting everything, but at least it confirms you got something recent.
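
A canary row covers the "something recent" half nicely, assuming a small table (backup_canary here is hypothetical) created just for this:

    #!/bin/sh
    # before each backup, stamp a canary row in production...
    mysql -u root production -e \
      "REPLACE INTO backup_canary (id, stamped_at) VALUES (1, NOW());"

    # ...then, after restoring the backup elsewhere, check the stamp is fresh
    mysql -u root restored_copy -e \
      "SELECT stamped_at FROM backup_canary WHERE id = 1;"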


Verifying backups is almost always many times more difficult than setting up the backups themselves. Checking that replication was in fact working turned out to be fairly easy to automate (although, short of checksumming all the records in all the tables you can't really be 100% sure everything is okay).

That's the thing with backups, they can fail or become corrupt in many ways. If you don't use them for something on a regular basis or have some regular verification process you may not know what you have until it's too late.

And, of course, I've also seen situations where subtle data corruption in the master database leads to weeks of subtly corrupt backups. By the time the corruption was discovered we faced the choice of rolling back several weeks to the last good backup or fixing the corruption. In the end we had to do a kind of merge -- it was a real pain.


If I just have one MySQL machine and take daily database dumps, what would be a good way of testing that my backups are OK?


The best situation for me has been one of my personal projects. It's a gaming site that I use every day. The database dump is less than 100MB gzipped, so I load it in my development environment every week or so. That way, as I develop, I'm verifying (to some extent) the quality of the backup.

As a baseline, you should at least restore from your backups occasionally.
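
Even the baseline can be a couple of lines into a scratch schema (names are placeholders, and it assumes a per-database dump -- an --all-databases dump carries its own CREATE DATABASE statements):

    # load last night's dump into a throwaway schema and poke around
    mysqladmin -u root create restore_test
    gunzip < nightly.sql.gz | mysql -u root restore_test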

It helps that I'm familiar with the data -- so after restoring the backup I expect to see games, messages, forum posts, etc that I've just seen in production.

I do some more thorough automated tests on the backups less frequently. App specific things, like replay games and verify results, etc. This process is more about assuring backward compatibility with the code though.

I think verification should ultimately be somewhat app specific. That said, I'm sure you can find tools to help with verification.


Perhaps creating a virtual machine (Xen, VMWare, whatever) on your development box and using that to test restoring your backups?


The trouble with convincing humans to back up is that it's important, but not urgent. For any given day of procrastination, the likelihood of consequences is small.

Our motto is "If it's not backed up, it's down." (Applies doubly so for us as a backup company.) For sufficiently seasoned administrators, non-redundant storage should cause sleeping difficulties.


"Early on the West-coast morning of Friday, January 31st" -- poor guys must be frazzled. Jan 31 is a Saturday.


Or else this is a dire warning from the future! There's still time to save the data; this accident won't happen until tomorrow!


The backup strategy, in my experience, is the first or second thing you need to be thinking about when you set up a server, service or anything IT-related.

The first thing I ask when my boss tells me to set up something new: how are we going to back that up?


It's a classic dilemma -- most people don't care about backups until immediately after it's too late.


I'm planning on burning time building a full failure solution: records snapshotted at least daily, and any single node/service can die entirely with an exact, tested manual recovery checklist or automatic failover option in place for each permutation.

This runs counter to the more cavalier "release early, polish later" advice I keep seeing. Maybe I am doubly freaked out because the things I'm storing are not easily recovered or re-imported by the users themselves or any kind of algorithm/redux.


It also runs counter to "do the simplest thing that could possibly work" and "KISS" and "YAGNI".

Doesn't mean it's a bad idea though. But if it's that good, make sure you announce it and market it as a significant feature...


It'll be a time sink. Why not do it as dirty as possible now and go back and tune it as time goes on?


See the original submission, that's why not :-)

I want to not only have a backup scheme but also make sure it's restore-tested. Maybe I wasn't totally clear: I'm not planning on beautiful failover everywhere at the beginning (though I am planning failover for the DB at least), just a tested (even if manual) restore procedure for each situation.


I doubt Google has backups, takes a copy home on a CD, has a massive third-party database with dedicated certified people, or backs up to Amazon S3.

I suspect their file storage system is simply good enough to replicate data intelligently across many machines/sites like a giant RAID array.

I also think that it's a truly massive benefit they have over other companies, particularly small companies, and that it's rarely discussed as such.


Magnolia ain't exactly indexing every page on the internet though, ya know?

I agree with you that that is one of Google's key strengths, though. PageRank is neat, but the real edge is that the site is fast and the infrastructure is based on commodity hardware, so it's cheap.


The real edge is not that it's based on commodity hardware, but that it's based on scalable low-management software (I imagine all this, and skimmed their Google File System paper once. I don't work for Google).

It wouldn't matter if they were blades or Sun UltraWhatevers, the real benefit is that if one dies, nobody need panic - plug another in and "the system" will rebuild it. If a rack dies, plug another in and the system will rebuild it. If a datacenter dies, the others will cover for it.

No backup tapes, no manual tweaking each new build.


As has been said a million times before, hardware is cheap. What you really need to create is a system that can utilize the cheapest Celeron servers you have with the flick of a switch.

Adding a few more servers is much cheaper than downtime.


If YC crashes and nulls my karma, there's gonna be a lawsuit!


His note says it's January 31st... What will get fixed first? The date, or the data? I'm betting on data.


I don't know about you, but it's January 31st for me... :P (although I do believe it was just a typo)


Tell us what the future is like! ;-)


Very hot down here.

Just had 3 days in a row over 43C (109F), and today's pretty warm as well (my house still hasn't cooled down, for sure).


Backing up is like testing: everyone knows they should do it, but very few people actually do. That said, I'm sorry to hear about their troubles. They are good people genuinely trying their best.


What gets me is how people even do normal development without some sort of backup -- even an informal one.

Maybe I'm just lucky enough to work with relatively small datasets, but our QA server contains a complete copy of the production database and is at most a week old. So even if we didn't run any other backups (which we do), we could restore from that server and be back up and running within hours.


From reading the notice, I get the feeling that they don't have any backups! Not very reassuring at all.


Well, I finally did it. I stopped what I was doing and worked out my backup strategy, which includes (finally) getting a script put together. Now I need to adjust it to email me copies, and I'm also contemplating using S3 to store copies. Maybe also box.net, depending on a few things.


They have a page/section at GetSatisfaction, but it doesn't look like it will be updated anytime soon (though there are 3 employees set to answer questions): http://getsatisfaction.com/magnolia


... glad I'm on del.icio.us, but going to export all my bookmarks now anyways.


Let's say you had an insane amount of bookmarks (or something precious like it). Would you pay to have DVDs burned and sent to you?

It seems like a convenient re-assurance for users to get DVD backups mailed to them every so often -- but it would be a total pain to support this as a small business...


A burned DVD full of bookmarks is, indeed, an insane amount of data.

At about, say, 2K per bookmark, that's a couple million bookmarks. If one bookmarks a site every 60 seconds, 24x7 that's about 4 years of work. A slightly more reasonable person, bookmarking every 60 seconds for only 8 hours a day, would have to be bookmarking websites continuously since about 1996 to fill up a DVD.


I understand, I am more responding to the preciousness and non-replaceability of it, the actual application for this I have in mind is not for bookmarks :-)


I don't know if I'd pay for just my list of bookmarks (I don't have that many), but I might pay for a monthly DVD archive of the site's content for all of my bookmarks, plus a searchable index. I hate when I can't find a bookmark, or a bookmark has gone offline.


For example: http://www.smugmug.com/help/backups

But how do I offer this when we're in colocation or slicehost etc.? Can't be driving to the colocation center every time someone wants a DVD...


I have an insane amount of bookmarks. I can still back them up in a few seconds via wget.
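
(For del.icio.us that's just the v1 API over basic auth -- something like the sketch below, with your own credentials; flag spellings vary a little between wget versions.)

    # grab every bookmark as one XML file; the API is rate-limited, so run it sparingly
    wget --user=YOUR_USERNAME --password=YOUR_PASSWORD \
         -O delicious-backup-$(date +%Y%m%d).xml \
         https://api.del.icio.us/v1/posts/all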

Of course photos and whatnot are much harder.


I am of course talking about something much harder. With big file sizes, the likelihood that users will regularly (much less in an automated way) download their own backups decreases.

I broke this out... http://news.ycombinator.com/item?id=458921


Does anybody use this site? I can never, ever remember the URL, and that bothers me a lot.

The way I see it, there are too many social bookmarking sites out there. If there's now one less, I wouldn't mind.


I wish Twitter allowed comments on tweets. Companies (like Magnolia) use it for important announcements / status updates, and yet there's no way to see feedback from the community.


That's what search.twitter.com is for.



Yikes: "I feel like I've lost a piece of me. This is scary."


Hope Twitter have backups...


Sounds like the same thing as JournalSpace: http://www.techcrunch.com/2009/01/03/journalspace-drama-all-...


We worked out our backup strategy for our time tracking system (http://letsfreckle.com/) while in beta. While we're still small, we do full db dumps to off-site, once an hour.

This does not require a lot of sophistication. And yes, we might have some downtime while we restore in the event of a catastrophe, but we do have the data.

I can't even imagine the nerve of shipping a product to the public that doesn't have even something this rudimentary in place.




