The Dark Side of Hadoop (backtype.com)
132 points by nadahalli on April 22, 2011 | 41 comments



A few months ago I got my Hadoop cluster stabilized (64 cores with 64 TB of HDFS space). It wasn't a pleasant experience. The most memorable frustrations of the setup:

Searching for help online results in solutions for ancient versions or incomplete wiki pages ("we'll finish this soon!" from three years ago).

If Apple is at one extreme of the user-friendliness spectrum, Hadoop is at the polar opposite: the error conditions and error messages can be downright hostile. The naming conventions are wonky too: namenode and secondary namenode, but the secondary isn't a backup, it's a copy. And don't get me started on tasktracker versus jobtracker (primarily because I can never remember the difference).

Restarting a tracker doesn't make it vanish from the namenode's view, so you have to restart the namenode too (at least in my CDH3 setup).

Everything is held together with duct tape shell scripts.

On the plus side, I got everything Hadoop-related managed in Puppet. All I need to do for a cluster upgrade is load a new CDH repo, reboot the cluster, and make sure nothing is borked.

If I didn't have to deal with isomorphic SQL<->hadoop queries, I'd start over using http://discoproject.org/


Disco looks incredible. Are you aware of any big production uses of it outside of Nokia?


I work at Nokia and I'm not aware of any big production uses of it inside Nokia. Most of the teams in my part of the company use Hadoop.

Oh well.


Actually the largest MapReduce cluster at Nokia runs Disco. Feel free to contact me at ville.h.tuulos at nokia dot com if you want to give it a try.


It seems to be a very active project:

https://github.com/tuulos/disco/commits/master


I think the word "job" comes from the old mainframe days. Each program was encapsulated in a "job". There was even a language on some systems (aptly named "Job Control Language", or JCL) to specify the different aspects of the job.


I don't think the OP was saying he doesn't know what "job" means; he was saying that they use two terms, "task" and "job", that mean the same thing, so he doesn't know what the difference between the trackers is.


Take a look at Hive if you need SQL semantics on Hadoop.
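
For anyone who hasn't seen it, this is roughly what that looks like from Java over JDBC. A minimal sketch only: the driver class, the port, and the "hits" table are assumptions that depend on your Hive version and schema.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  // Rough sketch of querying Hive over JDBC. The driver class, URL/port and
  // the "hits" table are assumptions; adjust for your Hive version and schema.
  public class HiveQuery {
      public static void main(String[] args) throws Exception {
          Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
          Connection con = DriverManager.getConnection(
                  "jdbc:hive://localhost:10000/default", "", "");
          Statement stmt = con.createStatement();

          // Hive compiles this SQL-like query into one or more MapReduce jobs.
          ResultSet rs = stmt.executeQuery(
                  "SELECT page, COUNT(1) FROM hits GROUP BY page");
          while (rs.next()) {
              System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
          con.close();
      }
  }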


BackType has its own language for querying: https://github.com/nathanmarz/cascalog


I have to disagree completely with your example of horrible naming conventions regarding the jobtracker and tasktracker. The names are very descriptive and convey enough detail to know what their purpose is.

http://hadoop.apache.org/mapreduce/docs/current/mapred_tutor...

"The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster-node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master."

If your problem is remembering the difference, perhaps you have not spent enough time understanding the tools you are using.
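
The classic word-count example from that tutorial makes the relationship concrete: you submit one job, and the framework fans it out into map and reduce tasks that the tasktrackers run. Sketch below uses the old org.apache.hadoop.mapred API (newer releases use org.apache.hadoop.mapreduce instead).

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  // One *job* is submitted to the jobtracker; it gets split into map and
  // reduce *tasks*, which are what the tasktrackers on the worker nodes run.
  public class WordCount {

      public static class Map extends MapReduceBase
              implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> out, Reporter reporter)
                  throws IOException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  out.collect(word, ONE);   // each call runs inside a map task
              }
          }
      }

      public static class Reduce extends MapReduceBase
              implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> out, Reporter reporter)
                  throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                  sum += values.next().get();
              }
              out.collect(key, new IntWritable(sum));
          }
      }

      public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          conf.setMapperClass(Map.class);
          conf.setReducerClass(Reduce.class);
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          JobClient.runJob(conf);  // blocks until the jobtracker reports all tasks done
      }
  }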


I think OP is saying he understands the difference in what they do, but can't keep the names straight. I felt the same way for the first few months of using hadoop.

I eventually got that, in real life, a task is just a sub-piece of a job. But the naming is not immediately evocative of the relationship, particularly since "job" is pretty overloaded in computing. Far better would be "job" and "subjob".


It's good to have critical reviews. But perhaps it could have been better if the tone were more respectful towards a free and open source project.

The approach of Hadoop is not my cup of tea, but I praise them for giving away, for free, a working product that solves such a hard problem.

Playing Hadoop's side, where are the test cases, the patches or bug reports? Or even some missing documentation blurbs, like you mention.


We've become spoiled.

Open source has different levels, ranging from "made in a basement and released to the world" (one maintainer, patches welcome) to "thrown over the fence" (you can see the code, but nobody cares about patches) to "we're an open source corporation" (Apache; patches welcome if you follow steps 1 to 22).

Hadoop falls under Apache governance, which means JIRA, contributor license agreements, and spending a week of your life trying to get your point across. It's much easier to complain like the spoiled developers we've become.


"It's much easier to complain like the spoiled developers we've become."

By your own comments, that's not really fair. The Apache project has become a disaster area of bureaucracy, crappy code and crappier documentation. You have to spend weeks in the muck to get to the point where you can even properly understand the deficiencies of any given Apache subproject, and by that point, you're so bound up in the project that it's hard to move away. From the build system of the code to the structure of the community, not a single thing about Apache is what I would characterize as nimble or lightweight. Can you blame people for feeling helpless?

It's gotten to the point where I'll look anywhere for a competing implementation of an idea before using Apache code to do something. It's just not worth the pain.


I feel exactly the same way about Apache projects, and even now it's very rare to meet a fellow traveler.

Years ago I had the same emperor has no clothes realization that the ASF was an enormous BureaucraticBullshitFactoryFactory — that it was seemingly founded just to host enterprise middleware crap while trading on the name of httpd (without really developing it of course).

It took quite a while before I realized that the Apache Software Foundation was just the new IBM — and not just figuratively, if you actually looked at the active membership of the ASF, it was chock full of IBM employees contributing on company time.


Agreed, alecco. We should all strive to be kinder and excellent to each other, even when being critical.


Well, to be fair, it's easy to fall into that behavior. I often catch myself before hitting send/post, but sometimes the filter isn't good or fast enough. This usually happens on caffeinated bad days working on frustrating bugs. It's likely the writer was experiencing that.


> This means that the Hadoop process which was using 1GB will temporarily "use" 2GB when it shells out.

This is exactly not what happens: the new process copies only the parent's address mappings (now marked read-only), which represent vastly less than 1GB of physical memory.

I think only a single 4KB page or two will ultimately be copied, representing the chunk of the calling thread's stack used to prepare the new child before it finally calls execve() or similar.
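
For context, the "shelling out" under discussion is just the JVM doing something like the sketch below. On Linux this is typically implemented as a fork() (copy-on-write) followed by an exec() in the child; the command here is only a placeholder.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;

  // Illustrative only: the kind of "shelling out" being discussed. On Linux
  // the JVM typically implements this as a fork() (copy-on-write) followed by
  // an exec() in the child. The command is just a placeholder.
  public class ShellOut {
      public static void main(String[] args) throws Exception {
          ProcessBuilder pb = new ProcessBuilder("df", "-h");
          pb.redirectErrorStream(true);        // merge stderr into stdout
          Process p = pb.start();              // fork() + exec() happen here

          BufferedReader r = new BufferedReader(
                  new InputStreamReader(p.getInputStream()));
          String line;
          while ((line = r.readLine()) != null) {
              System.out.println(line);
          }
          System.out.println("exit code: " + p.waitFor());
      }
  }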


As far as I understand it (I could be wrong as I've pieced this together from multiple sources), forking still reserves the same amount of memory that the parent process is using without actually copying the data (since it could very well use all that memory eventually).

The wrench is that Linux has a feature called memory overcommit (which I haven't been able to decipher completely). Supposedly it causes forking to not actually reserve that much space, but by default it's in a "heuristic" mode so it may or may not take effect.

These are the best resources I could find on what happens when you fork:

http://developers.sun.com/solaris/articles/subprocess/subpro...

http://lxr.linux.no/linux/Documentation/vm/overcommit-accoun...


I'm not anywhere near being a Java expert but I've seen similar behavior with a non-Hadoop workload.

In my case I had several virtual machines hosting Java VMs with 2+ GB heaps running an app that liked to fork and run external programs for short periods of time. If the entire heap size was, say, 3GB and it forked twice, Linux would act like it needed 9GB. The Linux overcommit heuristic regularly got things wrong and wanted RAM that it would never actually use. This usually resulted in the JVM failing in new and interesting ways.

The workaround is to allocate a crapload of swap (mainly, more than heap size times a guesstimate of the number of concurrent forks). It will never actually USE the swap, but having it there seems to keep the overcommit heuristic happy.

Yay for cargo cult server tuning. I've never figured out how to get the kernel to not be so pessimistic and I can't modify the Java code in question so... eh.


I think there's some confusion here between the copy-on-write system that makes fork() so fast under Unix and Linux's memory overcommitting, which allows malloc() to return success for a memory allocation that will in fact only be backed by physical memory as necessary (e.g. you can allocate 32GB using malloc, but as long as you don't write anything there, the kernel isn't really allocating that much memory).

The latter causes much trouble because people tend to assume that, say, a database can allocate a certain amount of memory and be sure that it will never run out of memory as long as it does not explicitly allocate more. This is not true for Linux at all, and a process can be terminated at any time (if I am not mistaken, with SIGKILL and an OOM-killer message sent to the system log) if the system is running out of memory.

I suspect the diagnostics in the article may be wrong, and there is something else causing the problem there. My polite guess would be something with the JVM (you are all having problems with the JVM... hmmm...). Merely forking a process is a tiny operation under Linux and I don't believe memory overcommitting is relevant at all here. And in any case you can simply turn off overcommits with a sysctl to verify the theory.
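
If anyone wants to check that, the current policy is readable straight out of procfs (0 = heuristic, 1 = always overcommit, 2 = strict accounting); changing it requires root, e.g. sysctl vm.overcommit_memory=2. A throwaway sketch:

  import java.io.BufferedReader;
  import java.io.FileReader;

  // Prints the kernel's current overcommit settings so you can verify the
  // theory above. 0 = heuristic (default), 1 = always overcommit, 2 = strict.
  public class OvercommitCheck {
      public static void main(String[] args) throws Exception {
          String[] files = {
              "/proc/sys/vm/overcommit_memory",
              "/proc/sys/vm/overcommit_ratio"   // only used in mode 2
          };
          for (String f : files) {
              BufferedReader r = new BufferedReader(new FileReader(f));
              System.out.println(f + " = " + r.readLine());
              r.close();
          }
      }
  }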


We got bitten by the same bug shelling out from the JVM. We ended up running socat in a separate process and bouncing all our external processes off it instead. But it's insane that you even have to think about this kind of thing.
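
Roughly the shape of it, as a hypothetical sketch (the port and the one-command-per-line protocol here are made up for illustration, not the actual setup): the big JVM never forks; it hands the command to a small, long-lived helper over a local socket and reads the output back.

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import java.io.PrintWriter;
  import java.net.Socket;

  // Hypothetical sketch: instead of forking from the big JVM, send the command
  // line to a small long-lived helper (socat, or anything listening on a local
  // port) and read the output back. Port and protocol are made up.
  public class RemoteExec {
      public static String run(String command) throws Exception {
          Socket s = new Socket("127.0.0.1", 9999);
          try {
              PrintWriter out = new PrintWriter(s.getOutputStream(), true);
              out.println(command);                 // the helper forks, not us
              BufferedReader in = new BufferedReader(
                      new InputStreamReader(s.getInputStream()));
              StringBuilder result = new StringBuilder();
              String line;
              while ((line = in.readLine()) != null) {
                  result.append(line).append('\n');
              }
              return result.toString();
          } finally {
              s.close();
          }
      }
  }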


Still, it does make you step back and wonder why they've put together Hadoop nodes and intentionally run them without swap. The resulting OOM is not surprising.


I would be interested to hear why Riak, Disco, etc. aren't viable alternatives. I've seen very few good comparisons of the various options (the Mozilla data blog being the only one that comes to mind).


"Risking" a downvote for a basically "me-too" reply, I would also be very interested in such a comparison or even a simple blog post of the experience (good or bad) with alternatives, especially disco.


We've seen all of these issues, but the primary causes appear to be either problems with the configuration of the cluster (which you typically don't see until you're doing serious work, or working with an overstressed cluster at scale) or problems with the underlying code quality of the jobs (e.g., processes hanging because tasks don't terminate, usually from an inability to handle unexpected input gracefully: infinite loops or non-terminating operations). If you're working with big data, particularly data that isn't the cleanest, you'll start to see some of these issues and others arise. A defensive-mapper sketch follows below.
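
To make the unexpected-input point concrete, the usual fix is defensive record handling inside the task, something along these lines (illustrative only; the tab-separated id/count format and the counter names are made up):

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  // Illustrative only: skip and count malformed records instead of throwing
  // (which fails the task) or looping on bad input. The id<TAB>count record
  // format and the counter names are made up for the example.
  public class DefensiveMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, LongWritable> {

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, LongWritable> out, Reporter reporter)
              throws IOException {
          String[] fields = line.toString().split("\t");
          if (fields.length != 2) {
              reporter.incrCounter("DefensiveMapper", "MALFORMED_RECORDS", 1);
              return;                               // skip, don't crash the task
          }
          try {
              out.collect(new Text(fields[0]),
                          new LongWritable(Long.parseLong(fields[1])));
          } catch (NumberFormatException e) {
              reporter.incrCounter("DefensiveMapper", "BAD_NUMBERS", 1);
          }
          reporter.progress();   // keeps the tasktracker from timing the task out
      }
  }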

Despite those issues, the most remarkable thing about Hadoop is the out-of-the-box resilience to get the work done. The strategy of no side-effects and a write-only approach (with failed tasks discarding work in progress) ensures predictable results--even if the time it takes to get those results can't be guaranteed.

The documentation isn't the greatest, and it's very confusing sorting out the sedimentary nature of the APIs and configuration (knowing which APIs and config match up across various versions such as 0.15, 0.17, 0.20.2, 0.21, etc., not to mention various distributions from Cloudera, Apache and Yahoo branches), but things are starting to finally converge. You're probably better off starting with one of the later, curated releases (such as the recent Cloudera distribution) where some work has been done to cherry pick features and patches from the main branches.


I've been using Hadoop in production for about 3 years on a small 48 node cluster. I've seen many issues, but none of the ones mentioned in the blog post.

My general theory is that if it's an important tool for your business, you need at least one person to be an expert on it. The alternative is to pay Cloudera a significant amount per node for support. Another possible alternative is http://www.MapR.com/; they are in beta and claim to be API-compatible with Hadoop, but they are not free.


Memory: it's Java. Java is great, but it is a memory hog. That doesn't matter so much these days (well, it does, but what are you going to do... there isn't much OSS competition for Hadoop). The documentation is horrible though. I'm not sure if this is a 'new' OSS trend, but now that I'm working a lot with Rails and gems I notice that on that side of OSS it's pretty normal to produce no documentation, or completely horrible documentation; "use the force, read the source" and that kind of hilarious tripe. The Apache projects related to Hadoop (Hadoop, Pig, HBase) all suffer from this; you are hard-pressed to find anything remotely helpful that isn't incredibly outdated. At least for Rails you can find tons of examples (no docs though) of how to achieve things; for Hadoop/HBase everything is outdated or non-functional and requires you to jump into tons of code to get stuff done.

Again: there is not much competition for the tasks you would accomplish with Hadoop (and HBase) at the scale it has been tested (by Yahoo, StumbleUpon and many others).


I never got the Hadoop craze. It performs terribly on memory and CPU, which is odd for something that prides itself on being HPC-oriented.


Alternatives?

Edit: real alternatives, I mean; something OSS (free is not so important, but no lock-in), able to manage huge amounts of data, actively developed, with active projects built on top.


If you want general distributed process management, check out Mesos: http://www.mesosproject.org/

The guys working on it are over at Twitter nowadays.


Shouldn't Hadoop users help to document it and contribute to it? Why isn't this happening?


A lot of users of Hadoop are companies with serious competition. If you took the time to figure all of this out, would you go post a howto for your competitors? No, might as well let them expend their own time and money to do what you did.

While that's not exactly the open source spirit, it is how plenty of people think and also happens to be the default, easiest thing to do. It would be nice if people could share to benefit themselves and each other.


I'd like to know what distribution they are running, and if they've tried others.


We used to use CDH3b2, but in the past month we switched to EMR. These and other issues existed in both distributions.


Yeah, Hadoop is a bear to get working. One of the benefits of working at a more established company like Quantcast, Google, or presumably Facebook is that there are internal infrastructure teams to smooth all that over. I pretty much get to type submitJob and it more or less just works...


I'm at a big company that uses Hadoop, and someone else maintains the infrastructure. But the problem we have is that Hadoop is a very leaky abstraction, and it's very easy for someone who doesn't know the implementation details to break the grid. It's relatively easy to exhaust the memory of the jobtracker, we constantly have problems with mapper starvation because someone's seemingly innocent job is causing issues, etc., etc.

My experience is that it doesn't just work, even when someone else maintains the infrastructure.


It would probably work better in a large company. We spent a couple weeks evaluating Hadoop in our small startup and came to the conclusion that we couldn't afford to have such an infrastructure team, and I imagine most startups considering Hadoop are in a similar boat.


We use hadoop at foursquare and it was set up and maintained by 1 person. It was a real bear of a job but he got it done.


It's definitely doable. It's just a lot of work.


  Hadoop is so old school...
  Even Google is moving away from MapReduce.
  Prepare for new shiny things,
  which will be better, faster and cheaper to operate!



