Exploration of the Google PageRank Algorithm (googledrive.com)
115 points by bflbfl on May 2, 2013 | 25 comments



This is way too complicated!

How about an easy explanation, one that a middle- or high-schooler who has just learned to solve three equations in three variables and has seen some probability can understand?

Let's say you have three websites: Reddit, Yahoo, and Google. Let's say Reddit links to Yahoo and Google, and Yahoo links to Google, and Google links to Reddit.

If you start on a random page and click randomly, what happens? You'll land on Reddit with probability x, Yahoo with probability y, and Google with probability z. Notice that if you continue clicking randomly, x, y, and z will stop changing after a while.

Okay, now stop clicking. How likely are you to have landed on Google? (i.e., what's z?)

- You're on Reddit with probability x, and 1/2 of the links there will take you to Google.

- You're on Yahoo with probability y, and 1/1 of the links there will take you to Google.

- You're on Google with probability z, and 0/1 of the links there will take you to Google.

Therefore, z = x(1/2) + y(1/1) + z(0/1)

Similarly, x = x(0/2) + y(0/1) + z(1/1)

Similarly, y = x(1/2) + y(0/1) + z(0/1)

The second and third equations give x = z and y = x/2. Since x + y + z = 1, we get: x = 2/5, y = 1/5, z = 2/5

What does that mean? It means Reddit and Google are twice as "important" as Yahoo, since you're twice as likely to land on them.

There, we just learned how PageRank works with zero linear algebra business.
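(If you'd rather check that with a computer than with algebra, here's a minimal Python sketch of the random clicking, using the toy graph above — after enough clicks the visit fractions settle near 0.4, 0.2, 0.4, matching the answer:)

    import random

    # The toy web from the example: who links to whom.
    links = {
        "Reddit": ["Yahoo", "Google"],
        "Yahoo":  ["Google"],
        "Google": ["Reddit"],
    }

    # Start on a random page, then click random links and count
    # where each click lands.
    page = random.choice(list(links))
    visits = dict.fromkeys(links, 0)
    clicks = 1000000
    for _ in range(clicks):
        page = random.choice(links[page])
        visits[page] += 1

    for p, n in visits.items():
        print(p, n / clicks)  # ~0.4 Reddit, ~0.2 Yahoo, ~0.4 Google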


Nonetheless, linear algebra is the way it's done in real applications. It's how the literature treats the subject as well, so it's important to understand. This tool helps toward that goal, so I'm glad it's there.


In a real application, you don't even know the transition probabilities to begin with, so actually performing the computation is the least of your worries -- the first problem is coming up with the numbers, which isn't mentioned here. :)

So I feel that if the goal is to give a conceptual understanding, linear algebra is overkill. If the goal is to show how it's done in practice, then this isn't terribly useful, since it doesn't even show the tip of the iceberg!


Hey wfunction

I believe we know exactly what the matrix is, based on the hyperlink structure itself. It's updated and displayed in the visualization here based on that, too.

Btw, David Austin's article on PageRank is highly recommended: http://www.ams.org/samplings/feature-column/fcarc-pagerank


Fair enough. I'm thinking in particular of a paper I read that used PageRank for word sense disambiguation [1]. It's actually a variant of PageRank, so the high-level descriptions in terms of probabilities (I read several of them) didn't take me very far. To come to grips with the technique I ended up actually implementing it and trying it on some toy datasets. This tool definitely would have helped me in that case.

[1] http://www.aclweb.org/anthology/E/E09/E09-1005.pdf It's interesting research, but unfortunately it turns out that even the best results in word sense disambiguation aren't good enough for a lot of applications. I wish I had two years to do nothing but play around in this field.


I made a similar example a few years ago. It even works in IE6!

http://williamcotton.com/pagerank-explained-with-javascript


Thanks, and you even make great music!


This is awesome if you actually want to understand the linear algebra behind it.

Another useful paper is Bryan and Leise's "The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google."


Love me some power method :) I was amazed studying it in numerical analysis when the professor showed us how it works in the real world (Google).


Anyone else just getting a 404 "We're sorry, but you do not have access to this page." error?


Yup. If you google the URL you'll get the cached version though.


I get a 403 (Forbidden)!!1


What does G signify? I'm a bit confused as to why taking its eigenvector gives you the PageRank vector.


In essence, PageRank sorts pages by the probability that you'd arrive at them by randomly following links from other pages.

G is a matrix of transition probabilities - each entry is the probability of moving directly from one particular page to another. It is composed of three terms:

- the hyperlink matrix: the probabilities obtained by assuming a user randomly selects a link to follow from their current page, with an equal probability for each

- the dangling nodes matrix: to ensure that it is possible to leave every page, this adds an equal probability for moving from a page with no outgoing links to every other page

- the matrix U of all ones: this provides G with two desirable properties, by ensuring it is both irreducible/strongly-connected and aperiodic. It does this by making it possible to move from any page to any other page with some small probability (exactly how small is determined by the damping factor d).

The PageRank vector represents the equilibrium probability distribution - in other words, the probability of being on a particular page after starting on a random page and then spending a long time randomly moving between pages according to the probabilities in G.

Now, if at a given time your probability of being on each page is given by some vector X, the probability of being on each page after one random move is the transition matrix multiplied by X.

The equilibrium probability vector thus has the property that it is unchanged by multiplication by the transition matrix - it is therefore an eigenvector of the transition matrix, with an eigenvalue of 1.
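To make that concrete, here's a minimal numpy sketch under a couple of assumptions: it uses the three-site toy graph from upthread, and d = 0.85, the commonly cited default damping factor. It builds G from the three terms described above, then applies G repeatedly (the power method) until X stops changing:

    import numpy as np

    d = 0.85  # damping factor (0.85 is the commonly cited default)
    n = 3     # pages: 0 = Reddit, 1 = Yahoo, 2 = Google

    # Hyperlink matrix H: H[i, j] = probability of moving from page j
    # to page i by picking one of j's outgoing links uniformly.
    # Reddit -> {Yahoo, Google}, Yahoo -> {Google}, Google -> {Reddit}
    H = np.array([
        [0.0, 0.0, 1.0],   # ... to Reddit
        [0.5, 0.0, 0.0],   # ... to Yahoo
        [0.5, 1.0, 0.0],   # ... to Google
    ])

    # Dangling-nodes matrix D: a column of 1/n for every page with no
    # outgoing links (none in this toy graph, so D is all zeros here).
    dangling = (H.sum(axis=0) == 0)
    D = np.where(dangling, 1.0 / n, 0.0) * np.ones((n, n))

    # Google matrix: damped link-following plus uniform teleportation
    # (the all-ones term), which makes G irreducible and aperiodic.
    G = d * (H + D) + (1 - d) / n * np.ones((n, n))

    # Power method: applying G maps this step's distribution to the
    # next one; the fixed point GX = X is the PageRank vector.
    X = np.full(n, 1.0 / n)
    for _ in range(100):
        X = G @ X
    print(X)  # equilibrium probabilities for Reddit, Yahoo, Google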

Edit: After typing this, I see that explanatory hypertext appears when you mouse over the blue words.


Interested in more of the mathematics behind the algorithm? I implement and prove things about it on my website: http://jeremykun.com/2011/06/18/googles-pagerank-a-first-att...


For those who may come across this, I have added some extra stuff to let you watch how the power method converges (or doesn't converge) to the PageRank vector. Just click the "Explore Convergence" button over there where the PageRank vector is written out.


As old as this algorithm may be, it's still relevant/useful in many practical applications. It's crazy to think that these small examples on here scale to gigantic data sets, especially as each PageRank gets updated.


This is pretty cool. I'm playing around w/ different backlink structures. It seems that if you link back to a tier 1 page, it boosts your PR. Is this true in practice?


Hey Mike - can you give me some details on the examples you're playing with? If I get around to it, I was going to see about including some canned examples to show or disprove how things like that work (at least with this particular version of the algorithm). The inclusion of the dangling nodes matrix and that extra all-ones one can make it trickier to predict what will happen.


I built a basic pyramid and then tinkered w/ variations from that.


Thanks Mike. I've been playing with small networks as well (partly to confirm things are working correctly) - I wrote up a few notes on one little network with slightly counter-intuitive behavior - http://www.nowherenearithaca.com/2013/05/google-pagerank-eva...


That's interesting!

Sadly, it does remind me of how much math/engineering I've forgotten since I graduated.

Are you still a student?


Hi Mike

Thanks for taking a look at that. Linear algebra/eigenvector/eigenvalue stuff is useful and usually fairly (re-)approachable.

I am a student of interesting things :)


Try deleting all the nodes with DEL.

When deleting the last one, it crashes (tested in FF and Chrome).


Thx - I think I fixed that the other day.



