Collaborative Map-Reduce in the browser (igvita.com)
59 points by igrigorik on March 3, 2009 | 15 comments



"how hard would it be to assemble a million people to contribute a fraction of their compute time?"

The BOINC project has done it; they've seen over a million computers. And that's despite an installation barrier, which is different from what you're suggesting (their software is robust and easy to install, but you still have to do it).

One thing BOINC and the BOINC projects do well is establish non-monetary incentives, whether it be competitions, fancy graphs, etc. That's something to solve; I'm not sure enlisting just your social network (manually, with a URL) is going to cut it if you want thousands of participants (unless you are particularly "influential", I guess).

Or maybe this is something a legion of mechanical turkers would be interested in?


In 2001, when I was traveling in Asia, I installed SETI@home (now a BOINC project) on every computer I could in internet cafés.

Google Gears is also a pretty good way to do this yourself, since it gives you a DB and a background process. Full apps on the client do change the security model, though, as with Google Native Client: http://code.google.com/contests/nativeclient-security/
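
A hedged sketch of what that could look like, written from memory of the Gears beta APIs and untested: run the map step in a WorkerPool background worker and checkpoint results in the local database. The table name, job id, and map function are invented for illustration.

  // Assumes gears_init.js has been loaded so google.gears.factory exists.
  var db = google.gears.factory.create('beta.database');
  db.open('mapreduce-scratch');
  db.execute('create table if not exists partials (job_id text, result text)');

  var pool = google.gears.factory.create('beta.workerpool');
  pool.onmessage = function(text, sender) {
    // a worker finished a chunk; persist it so a page reload doesn't lose the work
    db.execute('insert into partials values (?, ?)', ['job-42', text]);
  };

  var workerId = pool.createWorker(
    'var wp = google.gears.workerPool;' +
    'wp.onmessage = function(text, sender) {' +
    '  var nums = text.split(",");' +
    '  var out = [];' +
    '  for (var i = 0; i < nums.length; i++) out.push(nums[i] * nums[i]);' +  // stand-in map step
    '  wp.sendMessage(out.join(","), sender);' +
    '};'
  );
  pool.sendMessage('1,2,3,4', workerId);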

Nice post. Oh, and I like the rainbow butts iconography. :)


Another thing BOINC does is make sure the work is backgrounded and interruptible. Given the propensity for heavy JavaScript to lock up browsers (other than process-per-tab browsers like Chrome), I wonder how well this could be executed in terms of not ticking people off.


I knew I'd seen something like this before...

http://www.pluraprocessing.com

Launched on HN (where else) a few months ago:

http://news.ycombinator.com/item?id=347359


Here are some more "business ideas" for your enjoyment:

- Buy tons of those fancy interactive visual advertisements, embed the worker into them, and run map-reduce jobs in the browsers of unsuspecting users.

- Run some of the analytic/batch processing for a popular social network on your customers' CPUs.

- Have a popular site? Sell your audience's CPU cycles the way you'd sell impressions via AdSense.


"Google's server farm is rumored to be over six digits (and growing fast), which is an astounding number of machines, but how hard would it be to assemble a million people to contribute a fraction of their compute time?" Maybe Google could put an optional thingy into Chrome so that users' computers can become part of its server farm?


Gears and Native Client seem aimed at exactly this.

http://code.google.com/contests/nativeclient-security/


I realize the author wasn't proposing that something like this could be a business, but humor me:

I had this idea a few years back, with a business model that paid publishers for CPU cycles gathered from a JavaScript or Flash widget. We hoped to then sell the service to data-intensive industries. We decided it wasn't feasible.

You need to weigh the CPU cycles gained from this scheme against the bandwidth and CPU cycles lost to the hundreds of web, queue, and data servers needed to run it. IMO the model is unlikely to pay off once you consider things like network latency and trade-offs such as job size (bigger jobs are better) vs. job completion probability (smaller jobs are better).
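
To make that trade-off concrete, here's a toy model with invented numbers: if visitors leave at random with some mean session length, the expected useful work per job is the job length times the probability the visitor sticks around to finish it, and that product peaks when jobs are about as long as the mean session.

  // Toy model, all numbers assumed: visitors abandon the page at random
  // (exponential with a 60s mean), so expected useful work per job is
  // jobSeconds * P(visitor stays that long).
  var meanSession = 60;  // seconds, assumed
  function expectedUsefulWork(jobSeconds) {
    var pComplete = Math.exp(-jobSeconds / meanSession);
    return jobSeconds * pComplete;
  }
  var sizes = [5, 15, 30, 60, 120, 300];
  for (var i = 0; i < sizes.length; i++) {
    // peaks at jobSeconds == meanSession, then falls off quickly
    console.log(sizes[i] + 's job -> ' + expectedUsefulWork(sizes[i]).toFixed(1) + 's expected');
  }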

Even if the potential for viability were there, it isn't clear that there is a market for something like this. Large-scale computing challenges obviously exist, and a lot of people are making money with solutions like cloud computing, but these problems typically involve proprietary data sets, using proprietary or industry-standard (good ol' apps like MySQL) software. Chopping up your sensitive data and sending it en masse to the public to be processed in JavaScript instead of C++ doesn't exactly fit client needs.


I had actually tried to do this exact thing. Problem A is that you can't execute very much in a client's browser at any one time. Okay, so you make the jobs smaller and fetch more. Problem B is that by the time you've pulled the data from disk, shipped it to the browser and back, written it back to disk, and done the same cycle for the reduce, it ends up being cheaper to stream 64MB HDFS blocks around EC2.

That doesn't even get in to verifying results from an untrusted client.

Here's my half-implemented proof of concept from a while back that runs on AppEngine: http://github.com/markchadwick/emarer/tree/master

Slightly different implementation, but the same idea (which I think is a very cool idea!).


You can extend the amount of processing time available on the client by storing intermediate results in window.name (which gives you up to 2MB of semi-persistent storage).
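
A minimal sketch of that trick, assuming the partial results can be flattened to a delimited string (window.name survives navigation within the same tab, but it only holds strings):

  // window.name persists across page loads in the same tab, so partial
  // map results can be checkpointed there (string-only, roughly 2MB).
  function saveCheckpoint(partialResults) {
    window.name = 'mr:' + partialResults.join('|');
  }
  function loadCheckpoint() {
    if (window.name.indexOf('mr:') === 0) {
      return window.name.substring(3).split('|');
    }
    return [];  // nothing checkpointed yet
  }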


Yep, great points. The HTTP workflow definitely adds a lot of overhead, but I think it is still usable for a whole class of applications. For example, anytime you can saturate the CPU, or the outgoing bandwidth (use the client's machine as a spider), this works really well.
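
For reference, the HTTP workflow being discussed might look roughly like this; the endpoints, job format, and map step below are all invented for illustration.

  // Hypothetical worker loop: pull a job over HTTP, run a map step,
  // post the result back, repeat.
  function fetchJob(callback) {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', '/job', true);
    xhr.onreadystatechange = function() {
      if (xhr.readyState === 4 && xhr.status === 200) callback(xhr.responseText);
    };
    xhr.send(null);
  }

  function postResult(result, done) {
    var xhr = new XMLHttpRequest();
    xhr.open('POST', '/result', true);
    xhr.onreadystatechange = function() {
      if (xhr.readyState === 4) done();
    };
    xhr.send(result);
  }

  function workLoop() {
    fetchJob(function(job) {
      var values = job.split(',');  // e.g. "1,2,3"
      var mapped = [];
      for (var i = 0; i < values.length; i++) {
        mapped.push(Number(values[i]) * Number(values[i]));  // stand-in map step
      }
      postResult(mapped.join(','), function() {
        setTimeout(workLoop, 0);  // yield so the UI stays responsive between jobs
      });
    });
  }

  workLoop();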

As far as centralized servers and data storage go... I didn't cover this in the post, but I've been thinking a lot about using BitTorrent to address it, and I think it's totally feasible. All you need is several thousand seed servers and you'll have a worldwide file system / job-tracking queue. ;)


Have fun making sure clients don't send you invalid data. You'll have to have some sort of voting system where several clients compute the same piece, and make sure they all match up. Even then, you can't be 100% sure of the results.
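
A sketch of what that server-side check could look like (hypothetical: each work unit goes out to several clients and is accepted only when a quorum of the answers match exactly):

  // Hypothetical server-side check: hand each work unit to several clients and
  // accept a result only when at least `quorum` of the submissions agree exactly.
  function acceptResult(submissions, quorum) {
    var counts = {};
    for (var i = 0; i < submissions.length; i++) {
      var r = submissions[i];
      counts[r] = (counts[r] || 0) + 1;
      if (counts[r] >= quorum) return r;  // consensus reached
    }
    return null;  // no quorum yet: keep reissuing the work unit
  }

  // e.g. the same chunk computed by three clients, requiring two to match:
  // acceptResult(['42', '42', '17'], 2)  ->  '42'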


How is this related to map-reduce besides the method names? It has a single point of failure (the server), the nodes have no logic to split the job further, and on top of that it runs on a painfully slow JavaScript engine...


My current work might be related to this field: we created Firefox add-ons and let the browsers work for us.


Sorry, but this is basically grid computing with a slightly different client. As pointed out many times before, most interesting problems right now are IO bound. It turns out that data locality is the most important thing in processing extremely large datasets. That is the key insight in the map-reduce paper and the linchpin to the success or failure of all the distributed map-reduce frameworks that have sprung from it.

Most startups and small-scale companies that would see value in leveraging a system like this simply don't have the right processing profile to make it worth their while. I'm sure if you graphed CPU time per byte of data you'd find a sweet spot where a service like this would speed jobs up rather than slow them down.
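
One way to make that sweet spot concrete, with entirely made-up numbers: distributing a byte to a browser only pays when the CPU time spent on it exceeds the time spent shipping it there and back.

  // Toy break-even check, all numbers assumed: shipping work to a browser
  // only wins when compute time per byte exceeds transfer time per byte.
  var clientThroughput = 100 * 1024;  // bytes/sec to a typical client (assumed)
  function worthDistributing(cpuSecondsPerByte) {
    // payload out plus a result of comparable size back
    var transferSecondsPerByte = 2 / clientThroughput;
    return cpuSecondsPerByte > transferSecondsPerByte;
  }
  // Log parsing fails this test by orders of magnitude; protein folding passes easily.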

As it happens, most companies with a high CPU-time-per-byte ratio are either financial firms or pharma, most of whom not only have their own infrastructure but would rather close up shop than see their proprietary code out in the wild for competitors to analyze.

And there are already plenty of clients out there for running Fourier transforms on possible SETI signals.



