AWS Data Pipeline (amazon.com)
78 points by ing33k on Dec 21, 2012 | 25 comments



AWS is slowly becoming the Oracle of our generation, in the sense that they have found a way to lock startups and large companies into a software/services ecosystem that is really really hard to stop using once you get started.

You start with regular open-source instances, but that's just the hook. Once you have EC2, it's really easy to get started with AWS 'magic' services like ElastiCache and RDS. It's easier than setting up a memcache cluster or mysql, right? But once you get comfortable with those services, it's just so easy to keep going down that road and making your software reliant on proprietary services like SimpleDB, S3 and AWS Data Pipeline. And then you wake up at some point and find that you're 100% dependent on AWS.

By that point, if you're lucky your monthly AWS bill gets you an invite to speak at the next AWS conference. :-) You might even get a personal customer support rep that calls you when your servers go down.

A website/service cannot by definition be HA if it's reliant on one service or infrastructure provider. AWS has so many proprietary parts now that you really need to be careful which ones to use so that you don't wake up one day and realize that you're completely dependent on AWS.

I'd stay away from this with a 30-foot pole, but if we really did need to use it, I would only use the features that I felt comfortable building internally at some future point if we chose to move off of AWS.

It's important to keep your software stack as flexible and open as possible, and for risk-management you should plan on using (or at least having the option of using) multiple vendors and service providers.


The thing is, though, even though it's in Amazon's interests to create dependence on AWS, it's also in their customers' interests to use those services.

When you double down on a rich platform you can get enormous advantages. Avoiding the inner platform is a biggie; not paying portability tax is another.

The urge to be independent of any vendor, any platform etc is attractive to us as engineers. But it comes at a high price too.


"A website/service cannot by definition be HA if it's reliant on one service or infrastructure provider." you seem to be conflating highly available with a diverse supply chain. A lot of highly available systems are "locked" in to one provider, whether it's broadcom/citrix/intel/etc.


Does all Citrix hardware for a region go down simultaneously?


Actually, yeah. How about worldwide nxos crashes due to the leap second bug? Or the various poison bgp updates that've made the rounds? Or overrunning an ospf domain? Anyways, my point, if you read the sentence after my quote, was that there's a distinction between sole source provider and "ha". Multi source supply is due diligence. But it's not a prerequisite for or solution to high availability systems.
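
To put rough numbers on that distinction, here is a minimal sketch. The availability figures and both function names are made up for illustration, and it assumes failures are either fully independent or fully shared, which real outages rarely are.

```python
def parallel_availability(per_provider, n):
    """Availability of n redundant providers that fail independently."""
    return 1 - (1 - per_provider) ** n


def with_common_mode(per_provider, n, common_mode_outage):
    """Same as above, but a shared fault (one bug that hits every
    provider at once) takes the whole system down regardless of n."""
    return (1 - common_mode_outage) * parallel_availability(per_provider, n)


if __name__ == "__main__":
    # Hypothetical figures: each provider is up 99% of the time,
    # and a shared fault is active 0.1% of the time.
    print(parallel_availability(0.99, 1))
    print(parallel_availability(0.99, 2))
    print(with_common_mode(0.99, 2, 0.001))
```

Adding a second source helps with the independent part, but the common-mode term caps the result no matter how many suppliers you add.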


A whole lot of glue-job VMs just became unnecessary.


Just this week I was looking for a better solution that would back up my RDS database to S3. I'm currently using mysqldump, but the RDS instance has grown extremely large, so it has become unwieldy. Hopefully this will help with that.


It might not be appropriate for you, but a good way to handle MySQL backups is to maintain a mirror. This has the added benefit of being available as a fail-over and as a secondary instance where you can run reports or test long-running queries on current data without the risk of taking prod down.


> It might not be appropriate for you, but a good way to handle MySQL backups is to maintain a mirror.

    DELETE * FROM business_critical_data; WHERE obsolete = true;
You were saying? :D


You can run your daily/hourly backups on the mirror and not impact performance on your main database.


Right. But the grandparent comment was suggestive of the possibility that he or she wanted the mirror to fulfil multiple roles, including being the backup.


The mirror is to run the backups on, not the backup itself.


CHANGE MASTER TO MASTER_DELAY = 3600;

Your slave will always be an hour behind the master (MASTER_DELAY is in seconds). You were saying?


If this is your tactic, I'd think the binary log would be the winning way to roll back a bad delete.
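
Roughly what I mean, as a toy model rather than real MySQL tooling: restore the last dump, then replay the logged changes and stop just before the bad statement. The event tuple format and `replay_until` are invented for this sketch; with real binlogs, mysqlbinlog's `--stop-datetime` option plays the role of `stop_before`.

```python
from datetime import datetime


def replay_until(snapshot, events, stop_before):
    """Apply logged events to a copy of snapshot, in order, stopping
    before the first event whose timestamp is >= stop_before.

    events is a time-ordered list of (timestamp, op, key, value) tuples,
    where op is "set" or "delete". The snapshot itself is not modified.
    """
    state = dict(snapshot)
    for ts, op, key, value in events:
        if ts >= stop_before:
            break
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state


if __name__ == "__main__":
    dump = {"a": 1, "b": 2}  # state at the time of the last dump
    log = [
        (datetime(2012, 12, 21, 9, 0), "set", "c", 3),
        (datetime(2012, 12, 21, 10, 0), "delete", "a", None),  # the bad delete
    ]
    print(replay_until(dump, log, datetime(2012, 12, 21, 10, 0)))
```

Note that stopping at the bad statement also drops any good writes that came after it; skipping just that one event would need extra handling.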


Well, with a functional mirror to run manual queries against, you wouldn't be running the risk of running such a query on your prod DB anyway. But yeah, you still need a dump.


There's lots of disaster reports that start with "shouldn't" and "normally we wouldn't" :)

http://www.taobackup.com/


Thanks, I've already tried that. My main issue is EBS performance when writing the dump file to disk. The backups themselves don't impact database performance much, but writing up to 20 Gigs of a dump file to an EBS disk on a nightly basis is extremely slow. Maybe this Data Pipeline service will help bypass that.


Have you tried piping the dump directly to a compression tool? We use pbzip2 for our dumps which works great if you have some CPU power to spare. The largest was around 8 Gigs uncompressed, but the total size of the plain text dumps is over 20 Gigs. Hardly an issue for EBS that way. Did kill the DB for a few minutes several times before using cstream to cap the dump bw.
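
A minimal sketch of the same idea in Python, in case it helps: compress the dump as it streams so the uncompressed text never touches disk, with an optional rate cap. This swaps in the single-threaded stdlib `bz2` module for pbzip2 and a simple sleep-based throttle for cstream; `compress_stream` and its parameters are names made up for this example.

```python
import bz2
import io
import time


def compress_stream(src, dst, chunk_size=1 << 20, max_bytes_per_sec=None):
    """Copy src to dst through bz2, optionally capping the read rate.

    src and dst are binary file-like objects. Returns the number of
    uncompressed bytes read from src.
    """
    compressor = bz2.BZ2Compressor()
    total = 0
    start = time.monotonic()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        dst.write(compressor.compress(chunk))
        if max_bytes_per_sec:
            # Sleep until the average read rate is back under the cap.
            delay = start + total / max_bytes_per_sec - time.monotonic()
            if delay > 0:
                time.sleep(delay)
    dst.write(compressor.flush())
    return total


if __name__ == "__main__":
    # Stand-in for a dump stream; not real mysqldump output.
    fake_dump = io.BytesIO(b"INSERT INTO t VALUES (1);\n" * 1000)
    out = io.BytesIO()
    compress_stream(fake_dump, out)
    assert bz2.decompress(out.getvalue()) == fake_dump.getvalue()
```

For a real dump, `src` would be the stdout pipe of a mysqldump subprocess (for example via `subprocess.Popen(..., stdout=subprocess.PIPE).stdout`) and `dst` a file opened in binary write mode.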


Have you set up Provisioned IOPS for the volume?


It's a mainframe in the cloud.


Dear AWS, hire a designer. Thanks.


The AWS Management Console was recently redesigned with Bootstrap.


ETL-as-a-Service


You shouldn't really be trusting Amazon with your datawarehouse or paying that much for the storage, but from a technical convenience standpoint AWS is probably the best solution for some of the horrid little inept kinds of organizations that I have encountered.


Totally. I know I create lots of business value when I spend a day dicking around with mysqldump and rsync and inotify and scp and hdfs. Who would want to use this kind of janitorial service when they could do it themselves?



