Using your laptop to compute PageRank for millions of webpages (michaelnielsen.org)
50 points by helwr on May 27, 2010 | 6 comments



From skimming, this appears to be the algorithm given in the original PageRank paper from 12 years ago. You don't need hundreds of megabytes of RAM to do it, though; put yourself in the mindset of a 1960s COBOL programmer, and you can do the matrix-vector multiplication in a few KB of memory with the following approach. The adjacency matrix is a set of triples (src, dst, weight), indicating that there is a link from URL src to URL dst with the given weight, where the weights in each row sum to s. You're computing an occupancy vector consisting of pairs (url, weight) whose weights total 1.

1. Sort the adjacency matrix by src and the occupancy vector by url.

2. Iterate through the current occupancy vector and the adjacency matrix in parallel. For each triple (a, b, c) and pair (a, d), emit a partial-sum pair (b, d*c).

3. The partial-sum pairs will be randomly ordered. Sort them by URL.

4. Sum the partial-sum elements for the same URL, and add in the contribution of sD + tE, where sD redistributes the rank of dangling pages and tE is the uniform teleportation term. (I'm leaving out how to compute sD at this point, but it's not that hard.) This is your new occupancy vector.

5. Go back to step 2 for a few more iterations.

This approach parallelizes nicely with MapReduce, and doesn't require anything but sequential data access; you could do it with four tape drives and a microcontroller. I know I've seen it in some paper somewhere, but I can't remember where.
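
For concreteness, here is a rough Python sketch of steps 1-5 on a toy three-page graph. The graph, the variable names, and the fixed iteration count are mine; the in-memory lists stand in for the sorted on-disk files, so every step is just a linear scan or a sort:

    damping = 0.85                      # s in the notation above; t = 1 - s
    pages = ["a", "b", "c"]

    # Adjacency matrix as (src, dst, weight) triples; the weights for each
    # src sum to s.
    edges = [
        ("a", "b", damping / 2), ("a", "c", damping / 2),
        ("b", "c", damping),
        ("c", "a", damping),
    ]

    # Occupancy vector as (url, weight) pairs summing to 1.
    rank = {p: 1.0 / len(pages) for p in pages}

    edges.sort(key=lambda t: t[0])      # step 1: sort triples by src
    for _ in range(20):                 # step 5: repeat for a few iterations
        # Step 2: scan the triples in src order, emitting partial sums.
        # (On disk this is a merge of two src-sorted streams, not a dict
        # lookup.)
        partial = [(dst, rank[src] * w) for src, dst, w in edges]
        partial.sort(key=lambda t: t[0])    # step 3: sort partial sums by url
        # Step 4: combine partial sums and add the tE teleportation term.
        # (The dangling-page correction sD is omitted; this toy graph has
        # no dangling pages.)
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for url, w in partial:
            new_rank[url] += w
        rank = new_rank

    print(rank)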


Every programmer should, for a while at least, be forced to work on a machine with no more than a megabyte of RAM and a very slow hard disk.

Nothing like a constrained environment to teach you about efficiency in programming.


While that is true, this is not the most efficient way to solve the problem if you have a lot of RAM. My point is merely that your laptop can compute PageRank for, roughly, any link graph that will fit on its hard disk, or that it can access at a reasonable speed over the network.

(At 30 links per web page and 5 bytes per link, you need about 160 bytes per web page to represent this link graph. That would allow you to store a 6-billion-page graph on a 1TB notebook hard disk. Unfortunately, making 8 passes over that terabyte will probably take more than a day.)
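
Spelling out the arithmetic in Python (the ~80 MB/s sequential read rate is my guess for a 2010-era notebook drive, not a figure from the article):

    bytes_per_page = 160                           # 30 links * 5 bytes, plus a little overhead
    graph_bytes = 6e9 * bytes_per_page             # ~0.96 TB for 6 billion pages
    read_rate = 80e6                               # ~80 MB/s sequential read, an assumption
    hours_for_8_passes = 8 * graph_bytes / read_rate / 3600
    print(graph_bytes / 1e12, hours_for_8_passes)  # ~0.96 TB, ~27 hours: just over a day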


By the way, this blog seems to be a gold mine of good thinking about how to reimagine academic publishing.


{Related} I wonder how search engines are using the 'nofollow' attribute these days.

I know it's supposed to mean that PR does not accrue to the destination, but they would have to be mad to ignore such a huge source of data.


Just because PR doesn't flow through a nofollow link doesn't mean that the SEs aren't following those links to see what's on the other side...



