What do people use this sort of thing for? The example is neat, but it's trivial and would take 4 lines of non-mapreduce code. What sorts of problems do people have that are big enough to need mapreduce? It seems to me that problems big enough to need it are going to be big enough to bother with hadoop or maybe rolling something custom so you can control the details. Is that not true?
I spun this out of some work I was doing for an NLP-related thesis. Large-scale text processing where there's already a shared file system is a perfect fit for mincemeat.py.
I basically wanted something that was much, much easier to develop for, set up, and run than Hadoop. I've heard many academic researchers complain that although they had an algorithm that would fit neatly into MapReduce, they didn't want to bother setting up Hadoop and importing all of their data for a process that would only get run a few times (and that was already coded in Python).
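For anyone who hasn't tried it: a whole job is roughly this small. This is essentially the word-count example from the mincemeat.py README, written from memory, so the exact details may be slightly off:

    import mincemeat

    # Any dict-like object works as the data source; here, a few lines of text.
    data = ["Humpty Dumpty sat on a wall",
            "Humpty Dumpty had a great fall"]

    def mapfn(k, v):
        for w in v.split():
            yield w, 1

    def reducefn(k, vs):
        return sum(vs)

    s = mincemeat.Server()
    s.datasource = dict(enumerate(data))
    s.mapfn = mapfn
    s.reducefn = reducefn
    results = s.run_server(password="changeme")
    print(results)

Workers are then just "python mincemeat.py -p changeme <server>" on whatever machines you have lying around, if I remember the invocation correctly.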
That makes sense. I hadn't considered academia. I was thinking that if there is a business need for mapreduce, then the solution is probably going to become long-term infrastructure, in which case the cost of setting up hadoop can be amortized. But it makes sense to me that researchers would have a higher proportion of one-off problems, and in that situation, a simpler python solution can be a big win. Thanks.
The only downside for me is that not only do you have to find a massively parallel problem to work on, but the per-item computation also needs to take much longer than the network latency. With network latency being in the ms range, the algorithm solving the problem needs to be really slow to benefit.
Not necessarily. It can also just be a vast amount of data, in which case bandwidth (which is generally pretty good), not latency, is your limiting factor with mapreduce.
Also, you only need one part of your infrastructure to require mapreduce's parallelism in order to argue for using it across the board. If you have simpler problems to solve, you may as well solve them with mapreduce since you'll be thinking in that computation model anyway, and you can more easily use the results later in a computation that may require mapreduce.
Your problem still has to have the property that loading and processing the dataset locally is slower than sending it over the network and getting the results back, though...
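As a rough back-of-envelope (all the numbers here are made up, just to show the shape of the trade-off):

    # Is farming the work out worth it? All numbers are illustrative assumptions.
    n_items = 1000000        # records to process
    t_compute = 0.005        # seconds of CPU per record
    t_overhead = 0.001       # per-record cost of serializing + shipping over the network
    workers = 20

    local = n_items * t_compute                                          # ~5000 s
    distributed = n_items * t_overhead + n_items * t_compute / workers   # ~1250 s

    print("local: %.0f s, distributed: %.0f s" % (local, distributed))

The win only shows up because t_compute dwarfs t_overhead; shrink the per-record work by 10x and the distributed version loses.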
Maybe that's just what's going on? He rolled his own for whatever reason and decided to release it. Python is used pretty heavily in science, having taken the place of Perl as the language of choice. Doing science can produce massive data sets that often need heavy, complex processing.
I'm a short walk from a university with a Linux cluster for crunching astronomical data, for example. Lots of signal analysis goes on there.
That's a very valid point. Hadoop does a lot more than handle the basic logic of MapReduce (which, btw, is not terribly complicated). Fault tolerance, preemptive duplication of tasks in case of unexpected slowness, etc. are all part of the ugliness that Hadoop takes care of on large clusters.
That said, I can see this being useful for prototyping MapReduce code quickly in Python without going through the hassle of setting up Hadoop and coding in Java.
Mincemeat.py actually does do fault tolerance and preemptive duplication. I'm trying to get it to handle all of the MapReduce logic (and reliability, security, etc.) without the extra burdens of setting up a distributed file system (the machines I've been running it on have a hefty NFS share, which is good enough for most purposes), preallocation of the machines that will be part of the cluster (my university uses Condor, which spawns processes randomly across a large cluster), etc.
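For anyone curious what "preemptive duplication" means in practice, the idea is roughly this (just an illustrative sketch of the concept, not mincemeat.py's actual code):

    import time

    def next_task(pending, in_flight, typical_runtime, factor=3.0):
        """Hand an idle worker a fresh task, or speculatively re-issue a straggler.

        pending: list of tasks not yet assigned
        in_flight: dict mapping task -> time it was assigned
        """
        if pending:
            return pending.pop()
        now = time.time()
        for task, started in in_flight.items():
            if now - started > factor * typical_runtime:
                return task   # duplicate a task that's taking suspiciously long
        return None           # nothing useful to hand out right now

Whichever copy of a duplicated task finishes first wins; the other result just gets thrown away.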
In that case, kudos! One of my pet gripes about Hadoop is that it really used to be unusable without HDFS (don't know if that's changed). If you truly manage to develop it to a stable release with all the features of Hadoop, I definitely see it making some inroads in scientific computation.
Thanks for sharing this, I will definitely be taking a look at it for a big bootstrap (Monte Carlo) simulation I do. Have you used the parallel extensions in IPython? I have used IPython with great success for the task farming needed in my embarrassingly-parallel context, where the same algorithmic steps need to be applied to hundreds or thousands of data sets. I'll be interested to use your approach and consider the pros and cons, but I like how simple it appears to use.
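In case it's useful for comparison, the IPython pattern I mean is roughly this (the module has moved between versions, so treat the import as approximate, and one_replicate is just a stand-in for the real estimation step):

    from IPython.parallel import Client   # in newer releases: from ipyparallel import Client

    def one_replicate(seed):
        # stand-in for the real resample-and-estimate step of the bootstrap
        import random
        random.seed(seed)
        return sum(random.random() for _ in range(1000)) / 1000.0

    rc = Client()                          # assumes an ipcluster is already running
    view = rc.load_balanced_view()
    results = view.map_sync(one_replicate, range(10000))

The load-balanced view is what does the task farming: each replicate goes to whichever engine is free next.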
I worry that in your example, vs is passed in to reducefn as an array. A generator would be more memory-efficient (though from the example it isn't possible to differentiate and I haven't looked at the source).
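To spell out the concern (I haven't looked at the source either): the same reducefn works on both, the only difference is peak memory.

    from itertools import repeat

    def reducefn(k, vs):
        return sum(vs)          # identical code whether vs is a list or a lazy iterable

    n = 10 ** 7
    print(reducefn("word", [1] * n))       # a list: ~10M ints resident in memory at once
    print(reducefn("word", repeat(1, n)))  # a lazy iterator: values produced on demand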