R3, a map-reduce engine with Python and Redis (heynemann.github.com)
76 points by mricordeau on July 26, 2012 | 38 comments



Do the map-reduce results get put back into redis? I always worry about OOM problems when I'm putting a somewhat unbounded set of things into redis.


You can configure redis to limit how much memory is used.

maxmemory 104857600

Of course, that still might not be the result you want.
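For reference, 104857600 bytes is 100 MB. A minimal sketch of setting the same limits at runtime with redis-py (assuming a local redis instance); the maxmemory-policy setting decides what redis does once the cap is hit:

  import redis

  r = redis.Redis(host="localhost", port=6379)

  # same effect as the redis.conf directive, applied at runtime
  r.config_set("maxmemory", 104857600)            # cap at 100 MB
  r.config_set("maxmemory-policy", "noeviction")  # or allkeys-lru, volatile-lru, ...

  print(r.config_get("maxmemory"))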


They do get put into redis.

Maybe we should have a different storage strategy if the data is too big? File storage? I just meant for it to be simple.

If you are going to use redis for storage then you'll need to fine-tune it to the processing you are doing (we have).
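To make the "results live in redis" point concrete, here's a generic sketch with redis-py. This is not r³'s actual API, just an illustration of map output and reduce results both landing in redis keys (the map:* and result:* key names are made up for the example):

  import redis

  r = redis.Redis(decode_responses=True)

  def map_words(doc_id, text):
      # emit (word, 1) pairs into per-word redis lists
      for word in text.split():
          r.rpush("map:" + word, 1)

  def reduce_word(word):
      # fold one word's list into a single count, stored back in redis
      counts = r.lrange("map:" + word, 0, -1)
      r.hset("result:wordcount", word, sum(int(c) for c in counts))

  map_words("doc1", "to be or not to be")
  for key in r.keys("map:*"):
      reduce_word(key.split(":", 1)[1])

  print(r.hgetall("result:wordcount"))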


Neat; but it seems to be missing copyright notices and an explicit license, which means no one can actually use it or redistribute it with their application.


Likely an oversight. Submit a pull request with a BSD-like license file.


The license is in the README now. It's MIT licensed.


Can you horizontally scale the redis backend, or does it support only one instance?

Why restrict yourself to sequential reducers when you can parallelize with partitions and sorting?


We do horizontally scale redis as a farm. I'll try to get more details on how we do it as I'm not the one responsible.

We thought of parallel reducers and it does make a lot of sense. The reason they are sequential is to get a first release out so we can juggle ideas with people. If you care to contribute we'd love it. Even if you just create an issue.


Anyone have some insight into situations where running map-reduce on redis makes more sense than more traditional software like hadoop?


Hadoop is a bloated pile of elephant poo. Any and all alternatives are welcome. Disco (http://discoproject.org/) is popular in some parts of the mapreducesphere.


Using disco here, very happy with it.


Mind sharing how long you've been using it and how it compares to hadoop in your opinion? I'm very interested in hearing about your experience.


Same here, I'm really interested in hearing about Disco and potential benefits/costs vs. Hadoop.


The reason I wrote r³ is because I was a little overwhelmed by how complex disco is to administer and scale.

r³ was designed from the ground up to adhere to HTTP. That means it's pretty easy to scale using our old and well-proven techniques: caching and load-balancing.


I hear the above comment about Hadoop a lot. Can you explain why?


I'd love to, but it would take about an hour to run through everything.

Here's a short version: There's a collective ecosystem problem of fragmented applications, not-quite-right command line utilities, web interfaces that look like they were designed in 1995, noisy log files people actually have to read constantly, and cross-coupled dependencies that make keeping a cluster live for production use a full-time job.

There's the programming problem that nobody actually writes hadoop mapreduce code because it's impossibly complicated. Everybody uses hive and pig and half a dozen other tools that compile to pre-templated java classes (which knocks off 5% to 30% of the performance you could get by hand).

It hasn't grown because it's so amazing, performant, and company-saving. It grows because people jumped on a fad wagon and then got stuck with a few hundred TB in HDFS. The lack of a competing project with equal mindshare and battle-testedness doesn't foster any competition. It's the mysql of distributed processing systems. It works (mostly), but it breaks (in a few dozen known ways), so people keep adding features and building on top of it.


seiji pretty much nails it. Hadoop seems to have come out of a weird culture. It is a distributed system with a single point of failure (the name node) because its designers insisted on avoiding Paxos (distributed systems are too hard, so we'll just make a broken-by-design protocol instead). Another example is that a lot of the database code built on top of Hadoop is designed around one Java hashmap per row, which really limits performance.

There are all sorts of oddities and you can mostly work around them but it is...exhausting, and I spend a lot of time thinking "surely there must be a better way".


> surely there must be a better way

http://www.spark-project.org/


Wait, so Zookeeper (= distributed consensus thingie that I think implements the Paxos algorithm) is a Hadoop project but not actually used in Hadoop mapreduce?


That's correct. I believe they are using it in some new "high availability" stuff coming down the road.


Thanks for your insightful comments! I appreciate that you took the time to back up your opinion by distilling your thoughts into something quickly digestible.

Have you heard of any other projects outside of disco that are more performant than hadoop when used for similar applications?


I'd also just like to say: NameNode = single point of failure.

I worked on a contract for a large, very well-known social networking company a while back that refused to consider Hadoop because of this.


I <3 disco. It is so well designed and easy to run. Truly, love it.


If you have a lot of data and network IO is a big issue, you'll want to use something like hadoop (or disco), because they come with an integrated distributed file system and they preserve data locality.

If you don't have that much data, MR on redis is fine.


I can think of one case where a redis dictionary is used to represent a tree, and reductions are needed over a subtree. Calculations on river networks are like this. You might want to use redis instead of a cPickled dictionary, and you might not want the overhead of a full Hadoop.
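A small sketch of that pattern with redis-py (the children:* / flow key layout is made up for the example, nothing r³-specific): the tree lives in redis sets plus a hash, and the reduction is a depth-first sum over a subtree.

  import redis

  r = redis.Redis(decode_responses=True)

  # toy river network: a collects from b and c, b collects from d
  r.sadd("children:a", "b", "c")
  r.sadd("children:b", "d")
  for node, flow in {"a": 1, "b": 2, "c": 3, "d": 4}.items():
      r.hset("flow", node, flow)

  def subtree_sum(node):
      # reduce over the subtree rooted at `node`
      total = int(r.hget("flow", node))
      for child in r.smembers("children:" + node):
          total += subtree_sum(child)
      return total

  print(subtree_sum("a"))  # 1 + 2 + 3 + 4 = 10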


On redis 2.6 you can use Lua, so reductions over lists can be done directly on the server.
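For example, a sketch with redis-py (requires redis 2.6+): a Lua script sums a list entirely server-side via EVAL, so only the final number crosses the network.

  import redis

  r = redis.Redis()
  r.rpush("values", 1, 2, 3, 4, 5)

  # the LRANGE and the loop both run inside redis
  sum_list = """
  local total = 0
  for _, v in ipairs(redis.call('LRANGE', KEYS[1], 0, -1)) do
    total = total + tonumber(v)
  end
  return total
  """

  print(r.eval(sum_list, 1, "values"))  # -> 15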


"Getting one up in your system is beyond the scope of this document."

- 67 characters

brew install redis

redis-server

- 31 characters


I actually like it when people focus on the case at hand and don't pollute their manuals with such things. If you don't know how, it's information you can easily obtain elsewhere, and you probably have some homework to do anyway.


Unfortunately, not all systems are OSX.


apt-get install redis


It's actually in the redis-server package on Ubuntu.


wget

tar -xzvf

make

sudo make install


This is pretty interesting. I have a related project (plug: hadoopy.com). The way I went about this (in an experimental branch) is to use Celery running on Redis.
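For the curious, the Celery-on-Redis shape can be roughly this (a sketch with hypothetical tasks, not hadoopy's actual code): redis serves as broker and result backend, and the map and reduce steps become Celery tasks fanned out to workers.

  from celery import Celery

  # redis as both the message broker and the result backend
  app = Celery("mr", broker="redis://localhost:6379/0",
               backend="redis://localhost:6379/1")

  @app.task
  def map_words(text):
      return [(w, 1) for w in text.split()]

  @app.task
  def reduce_counts(pairs):
      counts = {}
      for word, n in pairs:
          counts[word] = counts.get(word, 0) + n
      return counts

Start a worker with "celery -A mr worker" (assuming the module is named mr.py), then chain the two tasks: (map_words.s("to be or not to be") | reduce_counts.s()).delay().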


Can multiple users run tasks simultaneously? Can they set task priorities?


Yes and No.

We use tornado for the stream (the task processor). That means a single stream instance only runs one user's task at a time.

That said, the stream is just an HTTP application, which means you can scale it as easily as you would any web app.


This looks like an interesting project.

Is there something like this for php?


Python isn't a hard language to learn. It's probably easier to learn Python than to port this to PHP.


I agree, but one of the next features we'll implement is letting you write stream processors, mappers, and reducers in any language you want. Stay tuned!



