Why the days are numbered for Hadoop as we know it

monstrado · on July 7, 2012

Just because Google is no longer using something doesn't mean it's no longer useful for other companies. Most companies will never need to do the same level of computation that Google needs to do.

The fact is, Hadoop is becoming more and more easy to setup, with distributions like Cloudera & Hortonworks allowing users to set up clusters with minimal upfront knowledge of Hadoop and how it works. The article uses HBase and Hive together in the same sentence which is kind of weird, Hive and HBase are completely different. Hive was made by Facebook to allow users to map schemas to a large set of files so their employees could run MapReduce jobs without Java knowledge, but instead SQL (HSQL)...basically a way to bridge the gap between Hadoop and non-java / programmers. HBase is a different beast all together and has no relation to MapReduce...the only thing it has in common with Hadoop is that it uses HDFS as its underlying FS (but doesn't have to). It's a NoSQL database that is modeled after BigTable (K -> V) and is pretty complicated...more so than just Hadoop/MR.

I think Hadoop will be around for quite a while, especially since it's becoming almost trivial to deploy. I see a lot of companies repurposing older servers into hadoop clusters and they are VERY happy with its performance. I rarely see where Hadoop is not providing enough to the user, but rather the user is not providing enough for Hadoop.

strlen · on July 8, 2012

Few things:

* Existence of tools beyond Map/Reduce in use at Google, does not imply that Map/Reduce's "days are numbered."

Map/Reduce is still enormously useful for many tasks even when other approaches (BSP, traditional distributed RDBMS techniques like Dremel) are available.

* Hadoop is not restricted to Map/Reduce. HDFS, cluster management, and more can be used and are used by other applications.

I am not too heavily involved with the query-processing side of Hadoop, but as far as as I understand the long term idea is that Map/Reduce will become just another application (of many) running on top of Hadoop's cluster management and storage infrastructure. See http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/had...

(Disclosure: I contribute to HDFS and HBase)

joshu · on July 8, 2012

Google talks very selectively about the technologies they use an how they work. Pretending you understand what google is doing based on this output is an enormous mistake. This article is vague speculation wrapped in the clothing of something more. It isn't.

yaroslavvb · on July 8, 2012

Other technologies to watch:

1. GraphLab2. Unlike Pregel's Bulk Synchronous Parallel Model, GraphLab2 it allows non-synchronous updates, which is more efficient for approximate quantities. For instance, for AltaVista's web graph, most nodes only need to be updated couple of times, while some nodes need more than 60 updates.

2. Flume: it's an abstraction on top of MapReduce, you program as if your data is contained in Java-like containers, and it turns your program into series of regular MapReduces

3. ScalOps (http://cs.markusweimer.com/pub/2012-DataEng.pdf): that's a higher level abstraction prototyped in Yahoo Research, might get resurrected in Microsoft.

4. AllReduce

AaronBBrown · on July 8, 2012

Flume is only related to MapReduce in that it is able to write to HDFS. All Flume is is a transport mechanism to send log-like data from one place to another. It can be made fancy by adding decorators to manipulate the data along the way, but at its core, it just moves bytes around. Unfortunately, at this stage, it's highly unreliable with its fault tolerant design causing more problems than it solves. Hopefully FlumeNG will improve upon this. I speak as someone who runs Flume in production and has to deal with it constantly failing on me, usually silently.

It's a great technology, but it just isn't there yet.

yaroslavvb · on July 8, 2012

Oops, looks-like Flume is Google-only name, open-source implementation is called Crunch -- https://issues.apache.org/jira/browse/MAPREDUCE-1849

anxrn · on July 7, 2012

Hadoop is popular not (only?) because of some PR-driven mania around big data. It is popular because it is an organic, evolving open source project that solves hard but common problems faced by a lot of companies in a variety of industries rather cheaply; despite being somewhat susceptible to the common pitfalls of design-by-committee projects.

I fail to see how the existence of other tools that solve different and/or narrower problems takes anything away from the success of Hadoop, let alone spells its demise.

benbjohnson · on July 7, 2012

It seems like Hadoop will move into areas where it shines -- namely custom bulk processing of medium to large files. Big data is an umbrella for a lot of different areas (click stream processing, financial analysis, image & video processing). When it comes to processing large datasets you really need tools designed for each particular type of data to obtain the performance you need.

Hadoop was a nice general purpose tool but niche-specific tools will eventually supersede it for many types of processing.

AaronBBrown · on July 7, 2012

Can HN just ban gigaom articles? Have they ever written a single technically interesting or accurate article in their entire existence? It's all just spin.

big_data · on July 7, 2012

The title sounds a bit ominous for an article that really only attempts to show how Hadoop was a stepping stone to a more refined set of tools at Google.

dumm · on July 7, 2012

Pregel: map-reduce in 15 lines of code = Google tries to teach its Java/C++ loving staff functional programming.

Once again, very old ideas being published as some amazing new discovery from Google, mesmerizing geeks, tech companies. and wannabe tech companies everywhere. The fact they are trying to sell it as a service shows they are behind the curve. Using buzz phrases like "time to value" and "time to insight".

How about the time it takes to get programmers to stop using iterative, loop bases programming and braindead IDE's?

The author of the Pregel article talks about graphs and vertexes. "Everything is a graph." "Think in terms of vertexes." No, everything is a list. That's a very old idea. You must think in terms multi-dimensional lists and vectors. The old new thing.

Processing trillions of rows in minutes. This is old hat for many folks in the financial world.

Iterative programming is ingrained. And Google is a victim of this as much as anyone else.

You can show a CS grad how to generate highly efficient C replete with goto's using a high level language like Scheme, they will see the performance benefit, and yet they will still go back to using some crippled "expressive" language, because that's what they are used to. They want to write algorithms that no one needs and programs that no one will ever use. Users want stuff that is FAST. But a lot of programming is not for users, it's to entertain programmers who are doing it. Sadly, they are not entertained by functional programming and short programs of a few lines. They want to write 1000's of lines of code. FAIL.

Give me someone who's mind has not been poisoned with the idea of loops and the du jour scripting languages, preferably someone who has not majored in Computer Science, and I can make them 100x as productive as today's average and even above average programmers.

People will be stuck on Hadoop for a long time. Just as people are stuck on C++, Perl, Java, Python and other verbose iterative languages.

wumpushunter · on July 7, 2012

Okay, I'll bite. I'm ready to be 100x more productive—where do I begin?

rjurney · on July 8, 2012

Try Spark: http://www.spark-project.org/

chrys · on July 9, 2012

whoa! this sounds awful lot like a superpower. Without any sarcasm: I'd love to learn what you've to teach and become ~100x more productive than an average programmer. Just so you know: I have done a CS degree but I hardly went to classes and all I know is a little bit of C++ and little bit of Java. So, my brain is totally uncorrupted. So, please teach me so I can be 100x more productive than an average programmer.

alayne · on July 8, 2012

Functional programming has had over 50 years to catch on. It failed to become the dominant paradigm (for good reason in my opinion -- it is writable, but not readable).

realitygrill · on July 14, 2012

I've started on this. Want to mentor me?

rjurney · on July 8, 2012

Here's the reality: Every sliver of data is going to land on HDFS as the most trusted and authoritative resource or 'record of truth.' It is the most cost-effective highly available storage mechanism available. Batch computing is here to stay, and Hadoop MapReduce will be a big part of it.

One might bet against Hadoop MapReduce, but betting against the Hadoop Filesystem as cheap storage built on commodity hardware that can serve large data in a highly available fashion is... misguided. Nothing else scales to 10,000 nodes per cluster and provides data locality (processor near disk spindles) so that data is accessible, or is even close.

Many systems will sit in front of Hadoop to do things other than batch computing, and many new types of distributed compute systems will sit on top of Hadoop and Zookeeper. Hadoop is here to stay.

MapReduce is too low level, and systems like Pig and Hive will continue to grow, improve and be the standard interface to working with Hadoop.

grammr · on July 7, 2012

tl;dr: MapReduce is bad at a lot of things that it was never designed for.

wmf · on July 7, 2012

I have a feeling any new open source analytics tools will be rolled into Hadoop to preserve the value of the brand. I don't know what the future will look like, but it will probably be called Hadoop.

benbjohnson · on July 8, 2012

Not necessarily. Not everyone wants to build on top of a large pre-existing code base or be locked into limitations because of design choices within Hadoop or HDFS. And some people just don't want to write Java.

I'm writing a behavioral database in C and I didn't add it to the Hadoop umbrella specifically for those reasons. For example, my database's query language uses LLVM for compilation and optimization. That's something I couldn't do if I was locked into the JVM.

CurtMonash · on July 8, 2012

If we read this as an argument "And therefore MRv2/YARN will be important", it's not crazy. The Hadoop project itself is opening up to break its dependency on pure MapReduce. First out of the gate -- the Apache Hama folks, who I believe have gotten their own Hacker News attention by (somewhat ironically) attacking Hadoop.

zitterbewegung · on July 7, 2012

This is silly. Just because google offers these things doesn't mean having your own isn't an advantage. The other thing is there are mapreduce like systems where you can start to break the rules.

The days aren't numbered. More and more companies are going to keep on contributing to Hadoop and Hadoop like projects. One can imagine that the days are numbered for Google. The moat will start to become less and less. Google is running from Hadoop not the other way around.

I mean look at this http://opencloudconsortium.org/ Eventually everyone is going to figure out that you can not just copy google but you can out engineer google. Google can't hire every good engineer.

vineet · on July 7, 2012

This point is not really relevant to the article. The article is calling for new non-MapReduce-based architectures that leverage the Hadoop core (as opposed to the entire Hadoop stack).

danielhlockard · on July 7, 2012

There are certain kinds of data -- (financial, etc) -- that you are going to always want to run on your own locked down cluster, not uploading to google.

sanxiyn · on July 7, 2012

Good point, but not relevant for the article.