Disclaimer: I'm a committer on the Apache Hive project.
A couple points in no particular order:
* EMR Hive is a closed-source fork of the upstream Apache Hive code base. The EMR docs imply that the latest version of EMR Hive is based on Apache Hive 0.8.1 (which was released more than a year ago), which means EMR users aren't benefiting from the performance improvements that appeared in the 0.9 and 0.10 releases.
* It is implied (though not explicitly stated) that the Hive queries were run against gzip compressed TSV files stored in S3, while Redshift was allowed to spend 17 hours converting the same data to its own optimized internal format. Hive supports an optimized columnar format too (RCFile). Why wasn't that used in this performance comparison?
Great feedback. I was also skeptical of using EMR Hive because it is so far behind in versions. Also, Redshift can do the analytics part very well, but I don't think it can do the exploration part that Hadoop/Hive are so good at (though maybe I'm wrong).
Comparing a column-oriented RDBMS with parallel query execution against Hadoop is a joke in the first place. Hadoop is extremely slow. That's nothing new. This is not an apples-to-apples comparison whatsoever.
How does it compare against Greenplum or Aster or Vertica and is it more cost-effective? Those are important questions.
Comparing Redshift against Hadoop+Hive is reasonable. As you pointed out... the technologies are very different. However, there is a large overlap in use cases.
I strongly disagree. I am (or was) using both Hadoop and HBase and they are useful for very different purposes (huge amounts of nonstructured data, possibly with difficult-to-predict use cases versus structured data).
Also note that Hive is just a layer over Hadoop with DB-like syntax, it doesn't make Hadoop a DB. It is still running MR queries beneath it.
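To make the "it's still MapReduce underneath" point concrete, here is a toy, single-process Python sketch of roughly what a simple GROUP BY query turns into. The data and query are hypothetical, and real Hive plans one or more distributed MR jobs, but the map/shuffle/reduce shape is the same:

```python
from collections import defaultdict

# Toy sketch of the map/shuffle/reduce that a query like
#   SELECT page, COUNT(*) FROM hits GROUP BY page
# conceptually compiles to. (Hypothetical data; real Hive runs this
# as distributed MapReduce jobs, not in one process.)

hits = [("home",), ("about",), ("home",), ("home",)]

# Map phase: emit (key, 1) pairs
mapped = [(page, 1) for (page,) in hits]

# Shuffle phase: group values by key (done over the network in real MR)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's values
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'home': 3, 'about': 1}
```

Every HiveQL query pays the job-launch and shuffle overhead of this machinery, which is a big part of why per-query latency is high even on small data.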
That's like saying a bicycle is faster than a car because I can buy a bike in 10 minutes while it may take me a couple of hours to get through the car's paperwork.
If you do enough queries, redshift will come out faster (assuming the numbers are correct).
If you do enough queries, you should spend the time to use RCFile for Hive, in which case Redshift won't come out _that_ much faster. The point is the 17 hours is not negligible.
That is a good case, since customers who typically need a data warehouse aren't just going to upload data once... they are probably going to upload frequently.
You're missing my point and resorting to sarcasm - very nice </sarcasm>. My point is not that Hive is the better choice because everyone is going to reload their data frequently. My point is that if you want a fair benchmark, don't use an obviously slow data format for Hive. They spent time importing data optimized for RedShift, but they took a very naive approach for Hive. I'm sure RedShift will still be faster, but not 10 times faster.
There's a motive here. hapyrus.com is pushing themselves as Redshift consultants. Oh you're using Hadoop? Redshift is better, cheaper and you can pay us to help you use it.
I haven't tried Redshift before, but coming from a MR/Hadoop/Hive background, this seems to me like quite a sensational claim. I'd be very keen to hear others' thoughts on how widely these kinds of gains would apply for big data processing.
Hive is not particularly fast in and of itself; it just has horizontal scaling and a SQL-ish front-end. Looking at AWS RedShift's homepage[1] (emphasis added):
> Amazon Redshift delivers fast query and I/O performance for virtually any size dataset by using columnar storage technology and parallelizing and distributing queries across multiple nodes.
Column-store databases[2] can be screamingly fast for analytics operations compared to row-oriented RDBMS or other DB types (a la assorted NoSQL). See Kdb[3] or MonetDB[4] for examples of specific implementations. I'd fully expect a competent column store designed for horizontal scaling to obliterate Hive for a wide range of problems.
The usual big-data caveat: you need to pay attention to the fit of your tools against your problem and your data. I don't expect RedShift to be any different. Still, it's pretty exciting to see a new analysis DB tech cropping up like this. And doubly interesting to see this coming from Amazon.
SAP HANA has a column store, and a row store, and does OLAP (Analytics) and OLTP.
There is a lot of new DB tech, Redshift doesn't seem particularly competitive at the moment unless you only need to use it a portion of the time, where Amazon excels.
1.2 TB really is not very much data in the context of "Big Data". The supposed advantage of Hadoop is that it can scale horizontally with linear performance.
I wish the post had gone into depth on _why_ Redshift was significantly faster, but I'm betting it uses in-memory joins (hence the size limitations it mentions), whereas Hive joins are just MapReduce jobs that keep only minimal subsets of data in memory at any given point. The upside is the Hive/MapReduce strategy isn't limited by physical memory.
Of course, if your data set can fit in memory, then Redshift or similar technologies are probably a better choice than Hive. But it's important to remember that the performance gains here come as the result of a tradeoff.
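The join tradeoff described above can be sketched in a few lines of Python. This is a hypothetical in-memory hash join, the kind of strategy an MPP warehouse can use when the build side fits in RAM; MapReduce-style joins instead shuffle both inputs by key and stream them, which is slower but not bounded by memory:

```python
from collections import defaultdict

# Hedged sketch of an in-memory hash join (hypothetical tables).
users = [(1, "alice"), (2, "bob")]                  # (user_id, name)
events = [(1, "click"), (1, "buy"), (2, "click")]   # (user_id, action)

# Build: hash the smaller relation on the join key
index = defaultdict(list)
for user_id, name in users:
    index[user_id].append(name)

# Probe: stream the larger relation, looking up matches in O(1)
joined = [(name, action)
          for user_id, action in events
          for name in index[user_id]]
print(joined)  # [('alice', 'click'), ('alice', 'buy'), ('bob', 'click')]
```

The build side must fit in memory, which is consistent with the size limitations the post mentions; a shuffle-based MR join has no such limit but pays for it in disk and network I/O.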
Worth noting this presentation was made by Hapyrus, a Hadoop specialized startup from 500startups. They know quite a bit about running Hadoop. Following the results of their tests they are now adding Redshift support to their services.
I am still new to large data, but isn't a solution like Redshift similar to Google's BigQuery in that it only works with data that has a schema? How might one use Redshift with a DB that's originally in Mongo?
Is there an easy way to go back and forth between data with schema and data without? I'd love the benefits of this for queries, but for the production side of things, imposing a schema would be costly.
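One common way to bridge schemaless documents and a schema'd warehouse is to infer the union of fields across documents and emit tabular rows with NULLs for anything missing. A hedged Python sketch (the document shape and field names here are hypothetical):

```python
# Sketch: flatten Mongo-style documents into rows against an
# inferred schema. Missing fields become None (SQL NULL on load).

docs = [
    {"_id": 1, "name": "alice", "age": 30},
    {"_id": 2, "name": "bob", "city": "NYC"},
]

# Infer a flat schema: union of keys, in a stable order
schema = sorted({key for doc in docs for key in doc})

# Project each document onto the schema
rows = [tuple(doc.get(col) for col in schema) for doc in docs]
print(schema)  # ['_id', 'age', 'city', 'name']
print(rows)    # [(1, 30, None, 'alice'), (2, None, 'NYC', 'bob')]
```

Nested documents and schema drift make the real problem harder (you'd typically re-derive the schema periodically and ALTER the target table), but this is the basic shape of a Mongo-to-warehouse export step.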
Nevertheless, I'll take a moment to predict that articles like this will only become more and more frequent over time. Hadoop has entered its "enterprisey" stage, with massively complex, cumbersome code, arcane performance tuning, bullshit consulting businesses built around it (complete with books and "certificates")...
The more agile competitors will be snapping at its flanks (and ankles), sometimes without merit, and sometimes with.
I'm willing to bet that's a not-uncommon scenario for a lot of organizations, however. If you're doing continuous querying of large amounts of data, then it's probably worth building your own hadoop cluster (physically or via Amazon), but a lot of people are just going to accumulate data and then make queries against it. Lots of 'active users per day', 'traffic by hour', 'purchases by popularity', etc. only get run to create data for the CEO every morning, or by the marketing manager every afternoon, that sort of thing.