It sounds like you are implementing an in-memory data structure (Tuple) and serialization of that data structure on top of the raw strings provided by the Hadoop API. While I can believe that the overall overhead of this would be small in many cases, you would observe it most severely in cases where your data was natively key/value pairs of very short strings, or where you had lots of tuples with very short payloads. Do any of your performance tests cover this case? I would expect Pangool to display more than negligible CPU and memory overhead in this case.
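To make the concern concrete, here is a minimal sketch (not Pangool's actual wire format, which I haven't inspected) of the kind of framing a schema-based tuple layer typically adds on top of raw key/value bytes: a field count plus a length prefix per field. For very short payloads, that fixed framing can dominate the serialized size.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical tuple serializer for illustration only: writes a field
// count, then a length prefix before each field's bytes. Real tuple
// libraries differ in detail, but some per-tuple framing is typical.
public class TupleOverhead {
    static byte[] serializeTuple(List<String> fields) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(fields.size());          // field count: 4 bytes
        for (String f : fields) {
            byte[] b = f.getBytes(StandardCharsets.UTF_8);
            out.writeInt(b.length);           // per-field length: 4 bytes
            out.write(b);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The pathological case: a key/value pair of two 2-byte strings.
        byte[] serialized = serializeTuple(List.of("ab", "cd"));
        int payload = 4;                      // "ab" + "cd"
        int framing = serialized.length - payload;
        System.out.println("serialized=" + serialized.length
                + " payload=" + payload + " framing=" + framing);
    }
}
```

With this (assumed) layout, 4 bytes of payload carry 12 bytes of framing, so the tuple is 4x the raw data. Longer payloads amortize the fixed cost away, which is exactly why the short-string case is the one worth benchmarking.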

Also, since the data model is more complicated and provides more features, it takes more code and a more complex implementation. This could be significant if you were trying to port the model to another language or implementation, or were trying to formally reason about the code or its mathematical model, etc.

I'm not saying it's not cool; I actually think it's a good and powerful abstraction -- I just object to the characterization of "all features and no tradeoffs".
