
That's a great idea.


You are right that the proper tool should be used for each particular problem, and the Hadoop world is harder than single-machine systems (like pandas). So you shouldn't use Hadoop if you can do the job with simpler systems.

But I have something to add. Hadoop doesn't just introduce new techniques for distributed storage and computation. It also proposes a methodological change in the way a data project is approached.

I'm not talking only about running some analytics over the data, but about building an entire data-driven system. A good example would be building a vertical search engine, for example for classified ads around the world. You can try to build the system using just a database and some workers dealing with the data. But soon you'll find a lot of problems managing the system.

Hadoop provides all the storage and processing power you want (it is a matter of money). So what if you build your system in a way where you always recompute everything from the raw input data? That can be seen as something stupid: why do that if you can run the system with fewer resources?

The answer is that with this approach you can:

- Be tolerant to human failure. If somebody introduces a bug in the system, you just have to fix the code and relaunch the computation. That is not possible with stateful systems, like those based on doing updates over a database.

- Be very agile in developing evolutions. Changing the whole system is not traumatic, as you just have to change the code and relaunch the process with the new code, without much impact on the system. That is not something simple in database-backed systems.
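To make that concrete, here is a minimal, hypothetical sketch of the idea (the paths, job name and data are made up): the derived view is never updated in place; it is deleted and rebuilt entirely from the immutable raw data on every run.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RebuildView {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path rawInput = new Path("/data/raw/ads");      // immutable, append-only raw data
        Path view = new Path("/data/views/ads-index");  // derived view, always rebuilt

        // Throw away the previous version of the view...
        FileSystem.get(conf).delete(view, true);

        // ...and recompute it from scratch from the raw input. If a bug corrupted
        // the view, fixing the code and re-running this job repairs everything.
        Job job = new Job(conf, "rebuild-ads-view");
        job.setJarByClass(RebuildView.class);
        // Identity mapper/reducer here; a real system would plug in the logic
        // that actually builds the view from the raw records.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, rawInput);
        FileOutputFormat.setOutputPath(job, view);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }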

The following page shows how a vertical search engine would be built using Hadoop and what its advantages would be: http://www.datasalt.com/2011/10/scalable-vertical-search-eng...


> That is not possible with stateful systems, like those based in doing updates over a database.

Could you elaborate? This sounds like the NoSQL argument that relational databases are "not agile", which usually means relational databases complain that you have records that won't logically fit the changes you just made.


Three things differentiate Splout SQL from using Sqoop to export to an existing SQL database:

1) Scalability: relational databases rarely scale, or are too expensive for big volumes of data. They don't work well with Hadoop.

2) Update isolation: in Splout SQL, database updating never affects serving queries, as it is performed in a Hadoop cluster.

3) Atomicity: datasets are deployed atomically in Splout SQL. That avoids the inconsistency problems that arise in an RDBMS when updating existing databases.


Hi, I'm one of the developers of Pangool. The idea of Pangool is not to be yet another higher-level API on top of Hadoop, but rather to propose a replacement for the low-level Hadoop Java MapReduce API. Pangool has the same performance and flexibility as the Java MapReduce API, while making several things a lot easier and more convenient. There is no tradeoff, just advantages. There will be cases where you'd want to use Pig or Cascading. There will be other cases where you'd want the flexibility and efficiency of MapReduce. For those cases we conceived Pangool. Nowadays only very advanced Hadoop users can write efficiently-performing MapReduce jobs. Pangool hides all the advanced boilerplate code needed for writing highly efficient MapReduce jobs, making things like secondary sorting or reduce-side joins extremely easy.
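As an example of that boilerplate, this is roughly what plain Hadoop asks from you just to group by one field while sorting by another (a hand-written composite key plus a custom partitioner and grouping comparator; the class and field names are only illustrative, this is not Pangool code):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    // 1) A composite key carrying both the "group by" field and the "sort by" field.
    public class UserDateKey implements WritableComparable<UserDateKey> {
      public String user;
      public long date;

      public void write(DataOutput out) throws IOException {
        out.writeUTF(user);
        out.writeLong(date);
      }

      public void readFields(DataInput in) throws IOException {
        user = in.readUTF();
        date = in.readLong();
      }

      // Full ordering: by user first, then by date. This is what sorts values
      // within each group (the secondary sort).
      public int compareTo(UserDateKey o) {
        int c = user.compareTo(o.user);
        if (c != 0) return c;
        return date < o.date ? -1 : (date == o.date ? 0 : 1);
      }

      // 2) Partition only on the group field, so every date of a user
      // reaches the same reducer.
      public static class GroupPartitioner extends Partitioner<UserDateKey, Object> {
        public int getPartition(UserDateKey key, Object value, int numPartitions) {
          return (key.user.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      // 3) Group only on the group field, so a single reduce() call sees all
      // the dates of a user, already sorted by the compareTo() above.
      public static class GroupComparator extends WritableComparator {
        public GroupComparator() { super(UserDateKey.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
          return ((UserDateKey) a).user.compareTo(((UserDateKey) b).user);
        }
      }
    }

And the driver still has to wire all of this with setPartitionerClass(), setSortComparatorClass() and setGroupingComparatorClass(). With a declarative group-by / sort-by that whole layer disappears.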


> There is no tradeoff, just advantages.

Though I don't have deep expertise in Hadoop, I find this claim highly suspect. High-level APIs achieve user-friendliness by making decisions/assumptions about the way a lower-level API will be used. I would be very surprised if there was no use case for which your API does impose a trade-off vs. the low-level Hadoop API.

I feel much more confident using a high-level API if its author is up-front about what assumptions it's making. If the claim is that there is no trade-off vs. the low-level API, I generally conclude that the author doesn't understand the problem space well enough to know what those trade-offs are.

I could be wrong, but this is my bias/experience.


Hi haberman, I'm one of the developers of Pangool. Let me try to clarify why we stated that. I understand it may sound aggressive.

Pangool is based on an extension of the MapReduce model we suggest and call "Tuple MapReduce". This is explained in detail in this post: http://www.datasalt.com/2012/02/tuple-mapreduce-beyond-the-c...

What this means is that in Pangool, if you worked with 2-sized tuples, you would be able to do exactly what you do now with Java MapReduce - that includes custom RawComparators and arbitrary business logic in any place of the MapReduce chain (Mapper, Combiner, Reducer). Using n-sized tuples together with Pangool's group & sort by and reduce-side join APIs will only mean less, easier code, with no loss of performance or flexibility.

Realize that Pangool is still a MapReduce API so it doesn't add any level of abstraction.

We designed Pangool with the aim of offering it as a replacement for the current MapReduce API. Therefore we are not labelling it as a "higher-level API" but as a comparable low-level API.

On the other hand we are also benchmarking Pangool to show it doesn't impose a performance overhead: http://pangool.net/benchmark.html


It sounds like you are implementing an in-memory data structure (Tuple) and serialization of that data structure on top of the raw strings provided by the Hadoop API. While I can believe that the overall overhead of this would be small in many cases, you would observe it most severely in cases where your data was natively key/value pairs of very short strings, or where you had lots of tuples with very short payloads. Do any of your performance tests cover this case? I would expect Pangool to display more than negligible CPU and memory overhead in this case.
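For instance, a toy measurement of the effect I'm describing might look like this (ToyTuple here is a hypothetical stand-in, not Pangool's actual Tuple class): write millions of very short key/value records as raw Text and again through a generic tuple wrapper, and compare the time and bytes.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;

    public class TinyRecordOverhead {
      static class ToyTuple {              // stand-in for a generic tuple wrapper
        final String[] fields;
        ToyTuple(String... fields) { this.fields = fields; }
        void write(DataOutputStream out) throws IOException {
          out.writeInt(fields.length);     // per-record arity/schema overhead
          for (String f : fields) out.writeUTF(f);
        }
      }

      public static void main(String[] args) throws IOException {
        int n = 5000000;

        // Case 1: plain key/value of very short strings, written as raw Text.
        ByteArrayOutputStream rawBytes = new ByteArrayOutputStream();
        DataOutputStream raw = new DataOutputStream(rawBytes);
        Text key = new Text(), value = new Text();
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
          key.set("k" + (i % 1000));
          value.set("1");
          key.write(raw);
          value.write(raw);
        }
        long rawNs = System.nanoTime() - t0;

        // Case 2: the same data wrapped in a toy tuple per record.
        ByteArrayOutputStream tupleBytes = new ByteArrayOutputStream();
        DataOutputStream tup = new DataOutputStream(tupleBytes);
        t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
          new ToyTuple("k" + (i % 1000), "1").write(tup);
        }
        long tupleNs = System.nanoTime() - t0;

        System.out.printf("raw: %d ms (%d bytes), tuple: %d ms (%d bytes)%n",
            rawNs / 1000000, rawBytes.size(), tupleNs / 1000000, tupleBytes.size());
      }
    }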

Also, since the data model is more complicated and provides more features, it takes more code and a more complex implementation. This could be significant if you were trying to port the model to another language or implementation, or were trying to formally reason about the code or mathematical model, etc.

I'm not saying it's not cool; I actually think it's a good and powerful abstraction -- I just object to the characterization of "all features and no tradeoffs".


The tradeoff, then, is that if someone's current problem maps exactly to the current API, then your API is more complex than needed.


Pangool actually seems like a generalization of Hadoop. This doesn't necessarily make it more complex. If a problem maps exactly to the Hadoop API, then it should also map exactly to the Pangool API by setting m=2 (in the extended map reduce model described at http://www.datasalt.com/2012/02/tuple-mapreduce-beyond-the-c...).


I agree with your first sentence, but disagree with the second. That you can find an exact mapping does not prevent the underlying API from being more complex than what you need. That you had to realize "Oh, m=2" is more complexity.

I'm not arguing this is a terrible thing. In fact, I think this is an acceptable level of additional complexity for the power it buys you. But if we're going to make an honest evaluation of the trade-offs, I think we must mention this.

It may be relevant to the discussion to point out that I work on a tuple-based streaming system. Product: http://www-01.ibm.com/software/data/infosphere/streams/ Academic: http://dl.acm.org/citation.cfm?id=1890754.1890761, http://dl.acm.org/citation.cfm?id=1645953.1646061


Can you give an example of a job that would be difficult or impossible to perform efficiently with Cascading, but Pangool gives an advantage over raw MapReduce?


Hi avibryant, according to our initial benchmark (http://pangool.net/benchmark.html), secondary sorting in Cascading is slow (http://bit.ly/wTKOxo), showing a 243% performance overhead compared to an efficient implementation in MapReduce. The implementation in plain MapReduce takes a lot of lines of code (http://bit.ly/yYGnGe) whereas Pangool's implementation is quite simple (http://bit.ly/x9U7Yj). A common application of secondary sort is calculating moving averages, for instance.
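To illustrate that use case (with hypothetical names, and simplified - in the real job the reducer key would be the composite group/sort key): once the framework delivers each group's values already ordered by date, the moving average is a single streaming pass with a small sliding window, which is exactly what secondary sort buys you.

    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical reducer: one call per user/item, and the framework (via
    // secondary sort) already delivers that group's values ordered by date.
    public class MovingAverageReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

      private static final int WINDOW = 7;  // e.g. a 7-day moving average

      @Override
      protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
          throws IOException, InterruptedException {
        Deque<Double> window = new ArrayDeque<Double>();
        double sum = 0;
        for (DoubleWritable v : values) {   // already in date order: one streaming pass
          window.addLast(v.get());
          sum += v.get();
          if (window.size() > WINDOW) {
            sum -= window.removeFirst();
          }
          ctx.write(key, new DoubleWritable(sum / window.size()));
        }
      }
    }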


Ok, so Cascading has a slow implementation of secondary sort, but is there any reason you believe that couldn't be improved? I don't think you're really comparing architectures there, just how well optimized particular implementations are.

I'm asking because in my experience the extra level of abstraction provided by Cascading, Crunch etc is a huge advantage, and if you're making a conscious choice to operate at a lower level, you better be getting something significant in return; it's not clear to me yet what that is.


Pangool is not an alternative to Cascading. For example, at this point, Pangool does not help you manage workflows. If you are starting a MapReduce application, the best option is probably to start with higher-level abstractions: Cascading, Hive, Pig, etc.

But if you are thinking about learning Hadoop through the standard Hadoop API, or if you need to use it for your project for some particular reason, we recommend using Pangool instead.

Or if you are considering implementing another abstraction on top of Hadoop, using Pangool for it would probably also be a good idea.

In fact, what we believe is that the default Hadoop API should look like Pangool.


You are doing regex matching in the Cascading code, but splitting on a character in the pangool code. The latter is obviously much faster. I don't know that that's the reason for the difference you observe, but it certainly can't hurt to fix that and make the user-supplied code more comparable.


Indeed, that regex was problematic because it had a bug itself. We replaced that line with RegexSplitter and updated the benchmark page. Please shout if you notice something else wrong. Thanks.


Just to clarify, the Java split() function uses a regexp for the split as well. The code of String.split() is:

return Pattern.compile(regex).split(this, limit);

The benchmark seems fair to me.
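If somebody wants to double-check, a quick test outside the benchmark (hypothetical, just a sanity check) is to split the same line a few million times with String.split, with a precompiled Pattern, and with a plain indexOf-based split, and compare the timings. Depending on the JDK, String.split may special-case a single-character separator, so measuring is the only way to be sure.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    public class SplitCheck {
      static final Pattern TAB = Pattern.compile("\t");

      // Splits on a single character without touching the regex engine.
      static List<String> manualSplit(String line, char sep) {
        List<String> parts = new ArrayList<String>();
        int start = 0, idx;
        while ((idx = line.indexOf(sep, start)) >= 0) {
          parts.add(line.substring(start, idx));
          start = idx + 1;
        }
        parts.add(line.substring(start));
        return parts;
      }

      public static void main(String[] args) {
        String line = "user123\t2012-02-20\t42";
        int n = 2000000;
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) line.split("\t");        // pattern handled per call
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) TAB.split(line);          // pattern compiled once
        long t2 = System.nanoTime();
        for (int i = 0; i < n; i++) manualSplit(line, '\t');  // no regex engine at all
        long t3 = System.nanoTime();
        System.out.printf("String.split: %d ms, Pattern.split: %d ms, manual: %d ms%n",
            (t1 - t0) / 1000000, (t2 - t1) / 1000000, (t3 - t2) / 1000000);
      }
    }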


