Key/value is dead. Long live tuples: Pangool for Hadoop (datasalt.com)
79 points by ivanprado on March 7, 2012 | 28 comments



I just popped in to say that I'm tired of the "X is dead" linkbait headlines. They demonstrate a myopic view of the world. Visual Basic and COBOL are still around.


Dead is relative. Dead usually means "dead to me".


Or in the case of VB, "better off dead".


So, what you're saying is "'X is dead' is considered harmful"?



And for a slightly more modern take on the tuple space, check out Java Spaces [1] or Gigaspaces [2]. There's still plenty of active research on the topic too [3] (disclaimer: I did my PhD thesis on distributed tuple spaces).

I've long contended that a tuple space is basically a generalised key-value store (see the sketch below the links), so it's nice to see projects like this one crop up.

[1] http://java.net/projects/jini/

[2] http://www.gigaspaces.com/

[3] http://eprints.utas.edu.au/9996/
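
To make the analogy concrete, here's a rough JavaSpaces-style sketch (field names illustrative; space lookup and exception handling elided). A read whose template wildcards everything but the key behaves like a key-value get, and richer templates give you associative matching for free:

    import net.jini.core.entry.Entry;
    import net.jini.core.lease.Lease;
    import net.jini.space.JavaSpace;

    // An entry is just a tuple with public fields.
    public class Pair implements Entry {
        public String key;
        public String value;
        public Pair() {}                        // no-arg ctor required by Jini
        public Pair(String k, String v) { key = k; value = v; }
    }

    // Given an obtained JavaSpace 'space':
    // PUT is a write...
    space.write(new Pair("user:42", "alice"), null, Lease.FOREVER);
    // ...and GET is a read whose template has null (wildcard) fields
    // everywhere except the key.
    Pair result = (Pair) space.read(new Pair("user:42", null), null, 1000);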


I'm wondering: what's the need for this when we already have Apache Pig, etc.?


Apache Pig is a very different beast from this project, from what I can tell reading the documentation for Pangool. While they both operate on tuples and work at a higher level than pure Hadoop, they accomplish their goals quite differently. Pig uses its own language called Pig Latin (http://pig.apache.org/docs/r0.9.2/basic.html), which is then compiled down into code that interfaces with the Hadoop library. Pangool is much closer to Hadoop, in that you are writing Java. Looking at one of their examples (http://pangool.net/introduction.html), I get the sense that the Pangool developers aim to make Hadoop easier to use, while Pig aims to make data analysis easier.

These goals are greatly divergent. In Pig, Java code is written to create new functions that can be used for analysis--i.e. Java is written in support of Pig Latin. Pangool focuses instead on extending Hadoop by making the Java code easier to write. This means Pig could potentially be implemented on top of Pangool, if Pangool were to satisfy the requirements for the task. (Not that I am suggesting Pig actually be rewritten--it just might be possible, depending on the technical requirements.)
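
To illustrate the "Java in support of Pig Latin" point, a minimal Pig UDF looks something like this (a sketch along the lines of the Pig documentation's examples):

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Callable from Pig Latin after REGISTER myudfs.jar, e.g.:
    //   B = FOREACH A GENERATE myudfs.Upper(name);
    public class Upper extends EvalFunc<String> {
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) return null;
            return ((String) input.get(0)).toUpperCase();
        }
    }

With Pangool, by contrast, the whole job is the Java you write.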

Having used Hadoop in the past, I would be more inclined to use Pangool. Parts of Hadoop are poorly written--especially the reliance on singletons--and anything that makes it easier to write code that runs on a Hadoop cluster is a desirable goal in my eyes. I look forward to seeing how this project shapes up.


Hi, I'm one of the developers of Pangool. The idea of Pangool is not to be yet another higher-level API on top of Hadoop, but rather to propose a replacement for the low-level Hadoop Java MapReduce API. Pangool has the same performance and flexibility as the Java MapReduce API, while making several things a lot easier and more convenient. There is no tradeoff, just advantages. There will be cases where you'd want to use Pig or Cascading. There will be other cases where you'd want the flexibility and efficiency of MapReduce. It is for those cases that we conceived Pangool. Nowadays only very advanced Hadoop users can write efficiently performing MapReduce jobs. Pangool hides all the advanced boilerplate code needed to write highly efficient MapReduce jobs, making things like secondary sorting or reduce-side joins extremely easy.
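
To give a sense of what that boilerplate looks like, here is an outline of a plain-MapReduce secondary sort (a sketch with illustrative class names; this is exactly the kind of code Pangool makes unnecessary):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    // 1. A composite key: the field we group by plus the field we sort by.
    public class CompositeKey implements WritableComparable<CompositeKey> {
        public Text group = new Text();
        public LongWritable order = new LongWritable();

        public void write(DataOutput out) throws IOException {
            group.write(out); order.write(out);
        }
        public void readFields(DataInput in) throws IOException {
            group.readFields(in); order.readFields(in);
        }
        public int compareTo(CompositeKey o) {   // the full sort order
            int c = group.compareTo(o.group);
            return c != 0 ? c : order.compareTo(o.order);
        }
    }

    // 2. Partition on the grouping field only, so a group never spans
    //    reducers.
    class GroupPartitioner extends Partitioner<CompositeKey, Text> {
        public int getPartition(CompositeKey k, Text v, int numPartitions) {
            return (k.group.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // 3. A grouping comparator that ignores the sort field, so one reduce()
    //    call sees the whole group, with values in secondary-sort order.
    class GroupComparator extends WritableComparator {
        protected GroupComparator() { super(CompositeKey.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
            return ((CompositeKey) a).group.compareTo(((CompositeKey) b).group);
        }
    }

    // 4. Wiring:
    //   job.setPartitionerClass(GroupPartitioner.class);
    //   job.setGroupingComparatorClass(GroupComparator.class);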


> There is no tradeoff, just advantages.

Though I don't have deep expertise in Hadoop, I find this claim highly suspect. High-level APIs achieve user-friendliness by making decisions/assumptions about the way a lower-level API will be used. I would be very surprised if there was no use case for which your API does impose a trade-off vs. the low-level Hadoop API.

I feel much more confident using a high-level API if its author is up-front about what assumptions it's making. If the claim is that there is no trade-off vs. the low-level API, I generally conclude that the author doesn't understand the problem space well enough to know what those trade-offs are.

I could be wrong, but this is my bias/experience.


Hi haberman, I'm one of the developers of Pangool. Let me try to clarify why we stated that. I understand it may sound aggressive.

Pangool is based on an extension of the MapReduce model that we propose, called "Tuple MapReduce". This is explained in detail in this post: http://www.datasalt.com/2012/02/tuple-mapreduce-beyond-the-c...

What this means is that in Pangool, if you work with 2-field tuples, you can do exactly what you do now with Java MapReduce - that includes custom RawComparators and arbitrary business logic at any point in the MapReduce chain (Mapper, Combiner, Reducer). Using n-field tuples together with Pangool's group & sort by and reduce-side join APIs just means less, easier code at no loss of performance or flexibility.
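
For instance, the group & sort by declarations look roughly like this (paraphrasing the examples on pangool.net; treat exact method names as approximate):

    // Declare an n-field tuple schema, then group and secondary-sort
    // declaratively; Pangool derives the partitioner and comparators.
    Schema schema = new Schema("visits",
        Fields.parse("user:string, timestamp:long, url:string"));

    TupleMRBuilder builder = new TupleMRBuilder(conf);
    builder.addIntermediateSchema(schema);
    builder.setGroupByFields("user");              // the "natural key"
    builder.setOrderBy(new OrderBy()
        .add("user", Order.ASC)
        .add("timestamp", Order.ASC));             // the secondary sort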

Note that Pangool is still a MapReduce API, so it doesn't add any level of abstraction.

We designed Pangool with the aim of offering it as a replacement for the current MapReduce API. Therefore we are not labelling it as a "higher-level API" but as a comparable low-level API.

On the other hand we are also benchmarking Pangool to show it doesn't impose a performance overhead: http://pangool.net/benchmark.html


It sounds like you are implementing an in-memory data structure (Tuple) and serialization of that data structure on top of the raw strings provided by the Hadoop API. While I can believe that the overall overhead of this would be small in many cases, you would observe it most severely in cases where your data was natively key/value pairs of very short strings, or where you had lots of tuples with very short payloads. Do any of your performance tests cover this case? I would expect Pangool to display more than negligible CPU and memory overhead in this case.

Also, since the data model is more complicated and provides more features, it takes more code and a more complex implementation. This could be significant if you were trying to port the model to another language or implementation, or were trying to formally reason about the code or mathematical model, etc.

I'm not saying it's not cool; I actually think it's a good and powerful abstraction -- I just object to the characterization of "all features and no tradeoffs".


The tradeoff, then, is that if someone's current problem maps exactly to the current API, then your API is more complex than needed.


Pangool actually seems like a generalization of Hadoop. This doesn't necessarily make it more complex. If a problem maps exactly to the Hadoop API, then it should also map exactly to the Pangool API by setting m=2 (in the extended map reduce model described at http://www.datasalt.com/2012/02/tuple-mapreduce-beyond-the-c...).


I agree with your first sentence, but disagree with the second. That you can find an exact mapping does not prevent the underlying API from being more complex than what you need. That you had to realize "Oh, m=2" is more complexity.

I'm not arguing this is a terrible thing. In fact, I think this is an acceptable level of additional complexity for the power it buys you. But if we're going to make an honest evaluation of the trade-offs, I think we must mention this.

It may be relevant to the discussion to point out that I work on a tuple-based streaming system. Product: http://www-01.ibm.com/software/data/infosphere/streams/ Academic: http://dl.acm.org/citation.cfm?id=1890754.1890761, http://dl.acm.org/citation.cfm?id=1645953.1646061


Can you give an example of a job that would be difficult or impossible to perform efficiently with Cascading, but Pangool gives an advantage over raw MapReduce?


Hi avibryant. According to our initial benchmark (http://pangool.net/benchmark.html), secondary sorting in Cascading is slow (http://bit.ly/wTKOxo), showing a 243% performance overhead compared to an efficient implementation in MapReduce. The MapReduce implementation takes many lines of code (http://bit.ly/yYGnGe), whereas Pangool's is quite simple (http://bit.ly/x9U7Yj). A common application of secondary sort is calculating moving averages, for instance.
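
To make that concrete: with prices secondary-sorted by date within each symbol, the reducer is a single sliding-window pass (an illustrative sketch, not the benchmark code):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // 'sorted' holds {date, price} pairs, already date-ordered by the
    // framework's secondary sort; n is the window size.
    static void movingAverage(List<double[]> sorted, int n) {
        Deque<Double> window = new ArrayDeque<Double>();
        double sum = 0;
        for (double[] point : sorted) {
            window.addLast(point[1]);
            sum += point[1];
            if (window.size() > n) sum -= window.removeFirst();
            if (window.size() == n)
                System.out.printf("%.0f -> %.2f%n", point[0], sum / n);
        }
    }

Without the secondary sort, the reducer would first have to buffer and sort every value for the group in memory.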


Ok, so Cascading has a slow implementation of secondary sort, but is there any reason you believe that couldn't be improved? I don't think you're really comparing architectures there, just how well optimized particular implementations are.

I'm asking because in my experience the extra level of abstraction provided by Cascading, Crunch etc is a huge advantage, and if you're making a conscious choice to operate at a lower level, you better be getting something significant in return; it's not clear to me yet what that is.


Pangool is not an alternative to Cascading. For example, at this point Pangool does not help you manage workflows. If you are starting a MapReduce application, the best option is probably to start with higher-level abstractions: Cascading, Hive, Pig, etc.

But if you are thinking about learning Hadoop through the standard Hadoop API, or if for some particular reason you need to use it for your project, we recommend using Pangool instead.

Or if you are considering implementing another abstraction on top of Hadoop, using Pangool for it would probably also be a good idea.

In fact, what we believe is that the default Hadoop API should look like Pangool.


You are doing regex matching in the Cascading code, but splitting on a character in the Pangool code. The latter is obviously much faster. I don't know that that's the reason for the difference you observe, but it certainly can't hurt to fix that and make the user-supplied code more comparable.


Indeed, that regex was problematic because it had a bug itself. We replaced that line with RegexSplitter and updated the benchmark page. Please shout if you notice anything else wrong. Thanks.


Just to clarify: Java's String.split() function uses a regexp for the split as well. The code of String.split() is:

    return Pattern.compile(regex).split(this, limit);

The benchmark seems fair to me.


So it sounds like this slots in like so, in order of abstraction:

HIVE -> Pig -> Pangool -> Cascading -> MapReduce

Nice addition!


Hi rjurney. I would say Hive, Pig, and Cascading are on the higher-level API side, and Pangool and MapReduce on the low-level side. Pangool is a MapReduce API that aims to make MapReduce simpler. We explain this better in our FAQ: http://pangool.net/faq.html


HIVE -> Pig -> Cascading -> Pangool -> MapReduce ?


Tuples remind me of RDBMSs.


Exactly. A tuple is exactly the same as a relation.


A set of tuples (of like kind) is a relation.



