Disclaimer: I'm a committer on the Apache Hive project.
A couple points in no particular order:
* EMR Hive is a closed-source fork of the upstream Apache Hive code base. The EMR docs imply that the latest version of EMR Hive is based on Apache Hive 0.8.1 (which was released more than a year ago), which means EMR users aren't benefiting from the performance improvements that appeared in the 0.9 and 0.10 releases.
* It is implied (though not explicitly stated) that the Hive queries were run against gzip compressed TSV files stored in S3, while Redshift was allowed to spend 17 hours converting the same data to its own optimized internal format. Hive supports an optimized columnar format too (RCFile). Why wasn't that used in this performance comparison?
Great feedback. I was also skeptical of using EMR Hive because it is so far behind in versions. Also, Redshift can do the analytics part very well, but I don't think it can do the exploration part that Hadoop/Hive are so good at (though maybe I'm wrong).
Comparing a column-oriented RDBMS with parallel query execution against Hadoop is a joke in the first place. Hadoop is extremely slow. That's nothing new. This is not an apples-to-apples comparison whatsoever.
How does it compare against Greenplum or Aster or Vertica and is it more cost-effective? Those are important questions.
Comparing Redshift against Hadoop+Hive is reasonable. As you pointed out... the technologies are very different. However, there is a large overlap in use cases.
I strongly disagree. I am (or was) using both Hadoop and HBase and they are useful for very different purposes (huge amounts of nonstructured data, possibly with difficult-to-predict use cases versus structured data).
Also note that Hive is just a layer over Hadoop with DB-like syntax, it doesn't make Hadoop a DB. It is still running MR queries beneath it.
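To make the "it's still MapReduce underneath" point concrete, here is a toy, single-process Python sketch of roughly what a simple GROUP BY query turns into. The data and query are hypothetical, and real Hive plans one or more distributed MR jobs, but the map/shuffle/reduce shape is the same:

```python
from collections import defaultdict

# Toy sketch of the map/shuffle/reduce that a query like
#   SELECT page, COUNT(*) FROM hits GROUP BY page
# conceptually compiles to. (Hypothetical data; real Hive runs this
# as distributed MapReduce jobs, not in one process.)

hits = [("home",), ("about",), ("home",), ("home",)]

# Map phase: emit (key, 1) pairs
mapped = [(page, 1) for (page,) in hits]

# Shuffle phase: group values by key (done over the network in real MR)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's values
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'home': 3, 'about': 1}
```

Every HiveQL query pays the job-launch and shuffle overhead of this machinery, which is a big part of why per-query latency is high even on small data.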
That's like saying a bicycle is faster than a car because I can buy a bike in 10 minutes while it may take me a couple of hours to get through the car's paperwork.
If you do enough queries, redshift will come out faster (assuming the numbers are correct).
If you do enough queries, you should spend the time to use RCFile for Hive, in which case Redshift won't come out _that_ much faster. The point is the 17 hours is not negligible.
That is a good case, since customers who typically need a data warehouse aren't just going to upload data once... they are probably going to upload frequently.
You're missing my point and resorting to sarcasm - very nice </sarcasm>. My point is not that Hive is the better choice because everyone is going to reload their data frequently. My point is that if you want a fair benchmark, don't use an obviously slow data format for Hive. They spent time importing data optimized for RedShift, but they took a very naive approach for Hive. I'm sure RedShift will still be faster, but not 10 times faster.
There's a motive here. hapyrus.com is pushing themselves as Redshift consultants. Oh you're using Hadoop? Redshift is better, cheaper and you can pay us to help you use it.
I haven't tried Redshift before, but coming from a MR/Hadoop/Hive background, this seems to me like quite a sensational claim. I'd be very keen to hear others' thoughts on how widely these kinds of gains would apply for big data processing.
Hive is not particularly fast in and of itself; it just has horizontal scaling and a SQL-ish front-end. Looking at AWS RedShift's homepage[1] (emphasis added):
> Amazon Redshift delivers fast query and I/O performance for virtually any size dataset by using columnar storage technology and parallelizing and distributing queries across multiple nodes.
Column-store databases[2] can be screamingly fast for analytics operations compared to row-oriented RDBMS or other DB types (a la assorted NoSQL). See Kdb[3] or MonetDB[4] for examples of specific implementations. I'd fully expect a competent column store designed for horizontal scaling to obliterate Hive for a wide range of problems.
The usual big-data caveat: you need to pay attention to the fit of your tools against your problem and your data. I don't expect RedShift to be any different. Still, it's pretty exciting to see a new analysis DB tech cropping up like this. And doubly interesting to see this coming from Amazon.
SAP HANA has a column store, and a row store, and does OLAP (Analytics) and OLTP.
There is a lot of new DB tech, Redshift doesn't seem particularly competitive at the moment unless you only need to use it a portion of the time, where Amazon excels.
1.2 TB really is not very much data in the context of "Big Data". The supposed advantage of Hadoop is that it can scale horizontally with linear performance.
I wish the post had gone into depth on _why_ Redshift was significantly faster, but I'm betting it uses in-memory joins (hence the size limitations it mentions), whereas Hive joins are just MapReduce jobs that keep only minimal subsets of data in memory at any given point. The upside is the Hive/MapReduce strategy isn't limited by physical memory.
Of course, if your data set can fit in memory, then Redshift or similar technologies are probably a better choice than Hive. But it's important to remember that the performance gains here come as the result of a tradeoff.
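The join tradeoff described above can be sketched in a few lines of Python. This is a hypothetical in-memory hash join, the kind of strategy an MPP warehouse can use when the build side fits in RAM; MapReduce-style joins instead shuffle both inputs by key and stream them, which is slower but not bounded by memory:

```python
from collections import defaultdict

# Hedged sketch of an in-memory hash join (hypothetical tables).
users = [(1, "alice"), (2, "bob")]                  # (user_id, name)
events = [(1, "click"), (1, "buy"), (2, "click")]   # (user_id, action)

# Build: hash the smaller relation on the join key
index = defaultdict(list)
for user_id, name in users:
    index[user_id].append(name)

# Probe: stream the larger relation, looking up matches in O(1)
joined = [(name, action)
          for user_id, action in events
          for name in index[user_id]]
print(joined)  # [('alice', 'click'), ('alice', 'buy'), ('bob', 'click')]
```

The build side must fit in memory, which is consistent with the size limitations the post mentions; a shuffle-based MR join has no such limit but pays for it in disk and network I/O.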
Worth noting this presentation was made by Hapyrus, a Hadoop specialized startup from 500startups. They know quite a bit about running Hadoop. Following the results of their tests they are now adding Redshift support to their services.
I am still new to large data, but isn't a solution like Redshift similar to Google's BigQuery in that it only works with data that has a schema? How might one use Redshift with a DB that's originally in Mongo?
Is there an easy way to go back and forth between data with schema and data without? I'd love the benefits of this for queries, but for the production side of things, imposing a schema would be costly.
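One common way to bridge schemaless documents and a schema'd warehouse is to infer the union of fields across documents and emit tabular rows with NULLs for anything missing. A hedged Python sketch (the document shape and field names here are hypothetical):

```python
# Sketch: flatten Mongo-style documents into rows against an
# inferred schema. Missing fields become None (SQL NULL on load).

docs = [
    {"_id": 1, "name": "alice", "age": 30},
    {"_id": 2, "name": "bob", "city": "NYC"},
]

# Infer a flat schema: union of keys, in a stable order
schema = sorted({key for doc in docs for key in doc})

# Project each document onto the schema
rows = [tuple(doc.get(col) for col in schema) for doc in docs]
print(schema)  # ['_id', 'age', 'city', 'name']
print(rows)    # [(1, 30, None, 'alice'), (2, None, 'NYC', 'bob')]
```

Nested documents and schema drift make the real problem harder (you'd typically re-derive the schema periodically and ALTER the target table), but this is the basic shape of a Mongo-to-warehouse export step.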
Nevertheless, I'll take a moment to predict that articles like this will only become more and more frequent over time. Hadoop has entered its "enterprisey" stage, with massively complex, cumbersome code, arcane performance tuning, bullshit consulting businesses built around it (complete with books and "certificates")...
The more agile competitors will be snapping at its flanks (and ankles), sometimes without merit, and sometimes with.
I'm willing to bet that's a not-uncommon scenario for a lot of organizations, however. If you're doing continuous querying of large amounts of data, then it's probably worth building your own hadoop cluster (physically or via Amazon), but a lot of people are just going to accumulate data and then make queries against it. Lots of 'active users per day', 'traffic by hour', 'purchases by popularity', etc. only get run to create data for the CEO every morning, or by the marketing manager every afternoon, that sort of thing.