
Can you give an example of a job that would be difficult or impossible to perform efficiently with Cascading, but where Pangool gives an advantage over raw MapReduce?



Hi avibryant, according to our initial benchmark (http://pangool.net/benchmark.html), secondary sorting in Cascading is slow (http://bit.ly/wTKOxo), showing a 243% performance overhead compared to an efficient implementation in MapReduce. The MapReduce implementation takes a lot of lines (http://bit.ly/yYGnGe), whereas Pangool's implementation is quite simple (http://bit.ly/x9U7Yj). A common application of secondary sort is calculating moving averages, for instance.
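To make the moving-average use case concrete, here is a minimal plain-Java sketch of what a secondary sort buys you. It is not Pangool, Cascading, or Hadoop code: the `Reading` type, the `movingAverages` method, and the window size are hypothetical names chosen for illustration. The in-memory sort by (key, timestamp) stands in for what the framework's secondary sort would deliver to a reducer, after which the averages fall out of a single pass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SecondarySortSketch {

    /** Hypothetical record: a grouping key, a timestamp to order by, and a value. */
    static final class Reading {
        final String key;
        final long timestamp;
        final double value;

        Reading(String key, long timestamp, double value) {
            this.key = key;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Orders readings by (key, timestamp), the order a secondary sort would
     * hand to a reducer, then emits a trailing moving average per reading.
     */
    static List<String> movingAverages(List<Reading> input, int window) {
        List<Reading> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing((Reading r) -> r.key)
                .thenComparingLong(r -> r.timestamp));

        List<String> out = new ArrayList<>();
        int groupStart = 0; // index of the first reading for the current key
        double sum = 0.0;   // running sum of the values inside the window
        for (int i = 0; i < sorted.size(); i++) {
            Reading r = sorted.get(i);
            if (i > 0 && !r.key.equals(sorted.get(i - 1).key)) {
                // New key: start a fresh group and a fresh window.
                groupStart = i;
                sum = 0.0;
            }
            sum += r.value;
            if (i - window >= groupStart) {
                // Drop the value that just slid out of the window.
                sum -= sorted.get(i - window).value;
            }
            int windowStart = Math.max(groupStart, i - window + 1);
            int count = i - windowStart + 1;
            out.add(r.key + "@" + r.timestamp + "=" + (sum / count));
        }
        return out;
    }
}
```

Because the values arrive already ordered within each key, the pass only needs a running sum and never buffers a whole group, which is the reason secondary sort matters for this kind of job.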


Ok, so Cascading has a slow implementation of secondary sort, but is there any reason you believe that couldn't be improved? I don't think you're really comparing architectures there, just how well optimized particular implementations are.

I'm asking because in my experience the extra level of abstraction provided by Cascading, Crunch, etc. is a huge advantage, and if you're making a conscious choice to operate at a lower level, you'd better be getting something significant in return; it's not clear to me yet what that is.


Pangool is not an alternative to Cascading. For example, at this point Pangool does not help you manage workflows. If you are starting a MapReduce application, it is probably best to start with a higher-level abstraction: Cascading, Hive, Pig, etc.

But if you are thinking about learning Hadoop through the standard Hadoop API, or if for some particular reason you need to use that API in your project, we recommend using Pangool instead.

Or, if you are considering implementing another abstraction on top of Hadoop, building it on Pangool would probably also be a good idea.

In fact, we believe the default Hadoop API should look like Pangool.


You are doing regex matching in the Cascading code, but splitting on a character in the Pangool code. The latter is obviously much faster. I don't know whether that's the reason for the difference you observe, but it certainly can't hurt to fix that and make the user-supplied code more comparable.


Indeed, that regex was problematic because it had a bug itself. We replaced that line with RegexSplitter and updated the benchmark page. Please shout if you notice something else wrong. Thanks.


Just to clarify, Java's split() function uses a regexp for the split as well. The code of String.split() is:

return Pattern.compile(regex).split(this, limit);

The benchmark seems fair to me.
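For anyone who wants to check the two splitting styles side by side, here is a small sketch. The class name, method names, and the tab delimiter are assumptions for illustration and are not taken from the benchmark code. Both paths go through `java.util.regex` semantics; whether `String.split()` actually compiles a `Pattern` on every call may depend on the JDK version, so treat this as a way to compare results, not as a timing claim.

```java
import java.util.regex.Pattern;

class SplitSketch {

    // Compiled once and reused for every record. The tab delimiter is an assumption.
    private static final Pattern TAB = Pattern.compile("\t");

    /** Splits via String.split(), which takes its argument as a regex. */
    static String[] splitPerCall(String line) {
        return line.split("\t");
    }

    /** Splits via a precompiled Pattern, avoiding any per-call compilation. */
    static String[] splitPrecompiled(String line) {
        return TAB.split(line);
    }
}
```

Both methods return the same tokens for the same input, so swapping one for the other in a benchmark changes only the cost of the split, not the records produced.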



