All the open-source engines I've tried scale poorly. Since our input data already contains >200m interactions, I suspect recommendify would struggle (from my quick reading it looks like the co-concurrence matrix is stored in Redis, i.e. in memory).
The approach I'm leaning towards at the moment is collating the article->IPs and IP->articles tables in leveldb and then distributing read-only copies to each worker. Everything else can easily be partitioned by article id.
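To make the two-table idea concrete, here is a minimal sketch in Python. Plain in-memory dicts stand in for the leveldb tables, and the names `build_indexes`, `partition` and `co_downloaded` are made up for illustration. It assumes integer article ids and a simple modulo sharding scheme, neither of which is settled above.

```python
from collections import defaultdict


def build_indexes(log):
    """Collate (ip, article_id) download events into the two lookup tables.

    Dicts stand in for the leveldb tables here (an assumption for the sketch).
    """
    article_to_ips = defaultdict(set)
    ip_to_articles = defaultdict(set)
    for ip, article in log:
        article_to_ips[article].add(ip)
        ip_to_articles[ip].add(article)
    return article_to_ips, ip_to_articles


def partition(article_ids, n_workers):
    """Assign integer article ids to workers by id modulo n_workers."""
    shards = [[] for _ in range(n_workers)]
    for article in sorted(article_ids):
        shards[article % n_workers].append(article)
    return shards


def co_downloaded(article, article_to_ips, ip_to_articles):
    """Articles that share at least one downloading IP with `article`.

    Needs only read access to both tables, so each worker can run it
    against its own read-only copy.
    """
    related = set()
    for ip in article_to_ips[article]:
        related |= ip_to_articles[ip]
    related.discard(article)
    return related


log = [("1.1.1.1", 1), ("1.1.1.1", 2), ("2.2.2.2", 2), ("2.2.2.2", 3)]
a2i, i2a = build_indexes(log)
print(co_downloaded(2, a2i, i2a))
print(partition(a2i.keys(), 2))
```

Each worker only ever reads the tables, which is why shipping read-only copies works: the per-article work in a shard never has to coordinate with other shards.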
I can't tell from the README - is your data fairly wide? I'm playing with using Postgres's new K-Nearest-Neighbor support to calculate similarity on 20D cubes, but I suspect my approach won't work well for an arbitrary number of columns (i.e. users x products) unless you first do some sort of PCA or SVD to narrow it down, and it isn't optimized for binary ratings at all. I started writing it up here: http://parapoetica.wordpress.com/2012/02/15/feature-space-si...
Around 200m download logs, 2m articles, some million IP addresses. I suspect that interest in research papers is inherently high dimensional and dimensionality reduction would probably damage the results.
I don't have much hardware to throw at it either. I just started looking at randomized algorithms - trying to produce a random walk on the download graph that links articles with probability proportional to some measure of similarity (probably cosine similarity or Jaccard index).
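For reference, a rough Python sketch of the two similarity measures on sets of downloading IPs, plus one possible walk on the download graph. The walk is only an illustration and not a worked-out design: it hops article -> random IP -> random article, so its transition probabilities are a degree-normalised co-download count rather than exactly cosine or Jaccard. The function names and the `n_samples` and `seed` parameters are made up for the example.

```python
import math
import random


def jaccard(a, b):
    """Jaccard index of two sets: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cosine(a, b):
    """Cosine similarity of two sets viewed as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def walk_step(article, article_to_ips, ip_to_articles, rng):
    """One two-hop step: pick a random downloader, then one of its articles.

    Sorting on every step is slow; it is only here to keep the sketch
    deterministic for a given seed.
    """
    ip = rng.choice(sorted(article_to_ips[article]))
    return rng.choice(sorted(ip_to_articles[ip]))


def estimate_neighbours(article, article_to_ips, ip_to_articles,
                        n_samples=1000, seed=0):
    """Count how often the walk lands on each other article."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_samples):
        nxt = walk_step(article, article_to_ips, ip_to_articles, rng)
        if nxt != article:
            counts[nxt] = counts.get(nxt, 0) + 1
    return counts


article_to_ips = {"a": {"x"}, "b": {"x", "y"}, "c": {"y"}}
ip_to_articles = {"x": {"a", "b"}, "y": {"b", "c"}}
print(jaccard(article_to_ips["a"], article_to_ips["b"]))
print(estimate_neighbours("a", article_to_ips, ip_to_articles, n_samples=200))
```

The appeal of sampling is that the cost per article depends on the number of samples rather than on the number of co-downloaded articles, which matters when there isn't much hardware to go around.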
Paul, how does C fit into the picture? I don't see anything in the source for compiling native extensions, and the implementations that src/recommendify.c calls look incomplete?