Hacker News
Amazon Redshift is 10x faster and cheaper than Hadoop and Hive (slideshare.net)
138 points by fujibee on Feb 19, 2013 | 42 comments



Disclaimer: I'm a committer on the Apache Hive project.

A couple points in no particular order:

* EMR Hive is a closed-source fork of the upstream Apache Hive code base. The EMR docs imply that the latest version of EMR Hive is based on Apache Hive 0.8.1 (which was released more than a year ago), which means EMR users aren't benefiting from the performance improvements that appeared in the 0.9 and 0.10 releases.

* It is implied (though not explicitly stated) that the Hive queries were run against gzip compressed TSV files stored in S3, while Redshift was allowed to spend 17 hours converting the same data to its own optimized internal format. Hive supports an optimized columnar format too (RCFile). Why wasn't that used in this performance comparison?


> Why wasn't that used in this performance comparison?

Because then the stupid headline wouldn't be so sensationalist, would it?

// I have no dog in this fight, but hate twisted claims


Great feedback. I was also skeptical of using EMR Hive, given how far behind it is in versions. Also, Redshift can do the analytics part very well, but I don't think it can do the exploration part that Hadoop/Hive are so good at (though maybe I'm wrong).


Wow. A year ago a solution architect promised me they would catch up to mainline to get a bunch of critical bugfixes.


Comparing a column-oriented RDBMS with parallel query execution against Hadoop is a joke in the first place. Hadoop is extremely slow; that's nothing new. This is not an apples-to-apples comparison whatsoever.

How does it compare against Greenplum or Aster or Vertica and is it more cost-effective? Those are important questions.


Comparing Redshift against Hadoop+Hive is reasonable. As you pointed out... the technologies are very different. However, there is a large overlap in use cases.


I strongly disagree. I am (or was) using both Hadoop and HBase, and they are useful for very different purposes (huge amounts of unstructured data, possibly with difficult-to-predict use cases, versus structured data). Also note that Hive is just a layer over Hadoop with DB-like syntax; it doesn't make Hadoop a DB. It is still running MR jobs beneath it.


Especially given that Redshift, Greenplum and Aster directly build on or incorporate technology from PostgreSQL.


This is true for Vertica as well.


So Redshift took 155 seconds + 17 hours (17 * 3600 = 61,200 seconds) = 61,355 seconds total

vs 1491 Hadoop

Looks to me like Hadoop is about 40 times faster...


That's like saying a bicycle is faster than a car because I can buy a bike in 10 minutes while it may take me a couple of hours to get through the car's paperwork.

If you do enough queries, Redshift will come out faster (assuming the numbers are correct).
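Using the figures from the slides, a rough break-even sketch (this assumes every subsequent query repeats the benchmark's single-query times, which is a simplification):

```python
# Break-even point between Redshift and Hive, using the benchmark's numbers:
#   - Redshift: one-time 17-hour import, then ~155 s per query
#   - Hive: no import step, ~1491 s per query
import math

import_secs = 17 * 3600      # one-time Redshift data load (61,200 s)
redshift_query = 155         # seconds per query on Redshift
hive_query = 1491            # seconds per query on Hive

# Redshift's total beats Hive's total once:
#   import_secs + n * redshift_query < n * hive_query
break_even = math.ceil(import_secs / (hive_query - redshift_query))
print(break_even)  # → 46 queries before Redshift comes out ahead
```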


If you do enough queries, you should spend the time to use RCFile for Hive, in which case Redshift won't come out _that_ much faster. The point is that the 17 hours is not negligible.


That's a fair point, since customers who need a data warehouse aren't just going to upload data once; they're probably going to upload frequently.


You're missing my point and resorting to sarcasm - very nice </sarcasm>. My point is not that Hive is the better choice because everyone is going to reload their data frequently. My point is that if you want a fair benchmark, you shouldn't use an obviously slow data format for Hive. They spent time importing the data into a format optimized for Redshift, but they took a very naive approach for Hive. I'm sure Redshift will still be faster, but not 10 times faster.


That's assuming Hive doesn't have its own special format the data could be converted to in order to improve performance, right?


There's a motive here. hapyrus.com is pushing themselves as Redshift consultants. Oh, you're using Hadoop? Redshift is better and cheaper, and you can pay us to help you use it.


I haven't tried Redshift before, but coming from a MR/Hadoop/Hive background, this seems to me like quite a sensational claim. I'd be very keen to hear others' thoughts on how widely these kinds of gains would apply to big-data processing.

As Carl Sagan said:

"Extraordinary claims require extraordinary evidence"

http://en.wikipedia.org/wiki/Carl_Sagan


Hive is not particularly fast in and of itself; it just has horizontal scaling and a SQL-ish front-end. Looking at AWS RedShift's homepage[1] (emphasis added):

> Amazon Redshift delivers fast query and I/O performance for virtually any size dataset by using columnar storage technology and parallelizing and distributing queries across multiple nodes.

Column-store databases[2] can be screamingly fast for analytics operations compared to row-oriented RDBMSs or other DB types (à la assorted NoSQL). See Kdb[3] or MonetDB[4] for examples of specific implementations. I'd fully expect a competent column store designed for horizontal scaling to obliterate Hive for a wide range of problems.
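A toy sketch of why columnar layouts help analytic scans (illustrative Python only; real column stores add compression, vectorized execution, and so on):

```python
# An aggregate over one column only needs that column's data in a
# column store, rather than materializing every field of every row.
rows = [{"user": i, "country": "US", "revenue": i * 0.5} for i in range(1000)]

# Row-oriented: each whole row is read to extract a single field.
row_total = sum(r["revenue"] for r in rows)

# Column-oriented: the same data stored as one contiguous array per
# column; the aggregate scans just the "revenue" array.
columns = {"user": [r["user"] for r in rows],
           "revenue": [r["revenue"] for r in rows]}
col_total = sum(columns["revenue"])

assert row_total == col_total  # same answer, far less data touched per column
```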

The usual big-data caveat: you need to pay attention to the fit of your tools against your problem and your data. I don't expect RedShift to be any different. Still, it's pretty exciting to see a new analysis DB tech cropping up like this. And doubly interesting to see this coming from Amazon.

[1] https://aws.amazon.com/redshift/

[2] https://en.wikipedia.org/wiki/Column-oriented_DBMS

[3a] http://kx.com/kdb-plus.php

[3b] https://en.wikipedia.org/wiki/K_%28programming_language%29#K...

[4] http://www.monetdb.org/Home


SAP HANA has a column store and a row store, and does both OLAP (analytics) and OLTP.

There is a lot of new DB tech, Redshift doesn't seem particularly competitive at the moment unless you only need to use it a portion of the time, where Amazon excels.


Given the legendary performance issues of Hadoop I am not really surprised.

Hadoop is heavily horizontally scalable, but that's about it.


1.2 TB really is not very much data in the context of "Big Data". The supposed advantage of Hadoop is that it can scale horizontally with linear performance.


They note on slide 9 that this is only biggish data -- but sometimes, that's what you need to work with.


I wish the post had gone into depth on _why_ Redshift was significantly faster, but I'm betting it uses in-memory joins (hence the size limitations it mentions), whereas Hive joins are just MapReduce jobs that keep only minimal subsets of data in memory at any given point. The upside is that the Hive/MapReduce strategy isn't limited by physical memory.
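A minimal sketch of the in-memory strategy, with made-up data (an in-memory hash join builds the small side in RAM and streams the large side past it; a MapReduce join instead shuffles both sides by key through disk, which is why it scales past RAM but runs far slower):

```python
# Hash join: build phase loads the smaller relation into a dict,
# probe phase streams the larger relation and looks up matches.
users = [(1, "alice"), (2, "bob")]                       # small side
events = [(1, "click"), (1, "buy"), (2, "click")]        # large side

build = {uid: name for uid, name in users}               # build (fits in RAM)
joined = [(uid, build[uid], action)                      # probe (streamed)
          for uid, action in events if uid in build]
print(joined)  # → [(1, 'alice', 'click'), (1, 'alice', 'buy'), (2, 'bob', 'click')]
```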

Of course, if your data set can fit in memory, then Redshift or similar technologies are probably a better choice than Hive. But it's important to remember that the performance gains here come as the result of a tradeoff.


It was significantly faster because, as was mentioned above, the graph ignores the 17 HOURS it took for Redshift to import the data.

The comparison is a complete and utter joke.


I wonder how it compares to BigQuery: https://developers.google.com/bigquery/docs/pricing


Worth noting: this presentation was made by Hapyrus, a Hadoop-specialized startup from 500 Startups. They know quite a bit about running Hadoop. Following the results of their tests, they are now adding Redshift support to their services.


... and want to sell their Redshift services starting with a bang.


They should compare Redshift with Hadoop + Impala, or HBase with Phoenix from Salesforce. Comparing with Hadoop + Hive is not a fair comparison.


I am still new to large data, but isn't a solution like Redshift similar to Google's BigQuery in that it only works with data that has a schema? How might one use Redshift with a DB that's originally in Mongo?


You won't fit that much data into Mongo anyway, so does it matter?


People have been apparently storing 3TB of data in MongoDB.

So I guess it does matter.


Impose a schema.


Is there an easy way to go back and forth between data with schema and data without? I'd love the benefits of this for queries, but for the production side of things, imposing a schema would be costly.
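One common approach (sketched below with made-up field names) is an ETL step that projects each schemaless document onto a fixed column list before loading: missing fields become NULLs, and fields outside the chosen schema are dropped (or could be kept in a catch-all JSON column):

```python
# Flattening Mongo-style documents into fixed columns for a
# schema-ful warehouse. Field names here are hypothetical.
docs = [
    {"_id": 1, "name": "alice", "age": 30},
    {"_id": 2, "name": "bob"},                       # no "age" → NULL
    {"_id": 3, "name": "carol", "age": 25, "x": 9},  # extra field dropped
]

SCHEMA = ["_id", "name", "age"]
rows = [tuple(d.get(col) for col in SCHEMA) for d in docs]
print(rows)  # → [(1, 'alice', 30), (2, 'bob', None), (3, 'carol', 25)]
```

Going back the other way is easier, since the schema is known: each row can be re-expanded into a document, optionally omitting NULL fields.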


Indeed, this comparison seems fishy.

Nevertheless, I'll take a moment to predict that articles like this will only become more and more frequent over time. Hadoop has entered its "enterprisey" stage, with massively complex, cumbersome code, arcane performance tuning, and a bullshit consulting business built around it (complete with books and "certificates")...

The more agile competitors will be snapping at its flanks (and ankles), sometimes without merit, and sometimes with.


I'd be interested in seeing a comparison between Redshift and SAP HANA, but a more fair comparison than this one by someone who isn't partisan.


Slides 2 & 6: one query every 30 minutes.

It turns out usage based billing can be cheaper if you don't use a resource.


I'm willing to bet that's a not-uncommon scenario for a lot of organizations, however. If you're doing continuous querying of large amounts of data, then it's probably worth building your own hadoop cluster (physically or via Amazon), but a lot of people are just going to accumulate data and then make queries against it. Lots of 'active users per day', 'traffic by hour', 'purchases by popularity', etc. only get run to create data for the CEO every morning, or by the marketing manager every afternoon, that sort of thing.


I don't understand that part of the slides. Redshift isn't billed per-query, it's billed by instance-hour.


We wrote the blog post about this benchmark. http://www.hapyrus.com/blog/posts/behind-amazon-redshift-is-...


Seems like stupid marketing shit


Are they benchmarking hash joins on Hadoop and Redshift?


Anyone interested in SQL queries on Hadoop should be checking out Cloudera Impala. It's open source.



