How about an easy explanation, one that a middle- or high-schooler who has just learned to solve a system of three equations in three variables and has seen some probability can understand?
Let's say you have three websites: Reddit, Yahoo, and Google.
Let's say Reddit links to Yahoo and Google, and Yahoo links to Google, and Google links to Reddit.
If you start on a random page and click randomly, what happens?
You'll land on Reddit with probability x, Yahoo with probability y, and Google with probability z.
Notice that if you continue clicking randomly, x, y, and z will stop changing after a while.
Okay, now stop clicking.
How likely are you to have landed on Google? (i.e., what's z?)
- You're on Reddit with probability x, and 1/2 of the links there will take you to Google.
- You're on Yahoo with probability y, and 1/1 of the links there will take you to Google.
- You're on Google with probability z, and 0/1 of the links there will take you to Google.
Therefore, z = x(1/2) + y(1/1) + z(0/1)
Similarly, x = x(0/2) + y(0/1) + z(1/1)
Similarly, y = x(1/2) + y(0/1) + z(0/1)
From the second and third equations, x = z and y = x/2. Since x + y + z = 1, that gives x + x/2 + x = 1, so: x = 2/5, y = 1/5, z = 2/5
What does that mean?
It means Reddit and Google are twice as "important" as Yahoo, since you're twice as likely to land on them.
There, we just learned how PageRank works with zero linear algebra business.
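If you'd rather check those numbers by experiment than by algebra, here's a minimal Python sketch of the random clicker. The link structure is exactly the toy example above; the click count is an arbitrary "long enough" number.

    import random

    # The toy web from above: each page maps to the pages it links to.
    links = {
        "Reddit": ["Yahoo", "Google"],
        "Yahoo":  ["Google"],
        "Google": ["Reddit"],
    }

    visits = {page: 0 for page in links}
    page = random.choice(list(links))      # start on a random page

    for _ in range(100_000):               # click randomly for a long time
        page = random.choice(links[page])  # follow a random link
        visits[page] += 1

    for p, count in visits.items():
        print(p, count / 100_000)          # hovers around x = 2/5, y = 1/5, z = 2/5

Run it a few times and the proportions settle near 0.4, 0.2, 0.4, matching the algebra.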
Nonetheless, linear algebra is the way it's done for real applications. It's also how the literature treats the subject, so it's important to understand. This tool helps toward that goal, so I'm glad it's there.
In a real application, you don't even know the transition probabilities to begin with, so actually performing the computation is the least of your worries -- the first problem is coming up with numbers, which wasn't mentioned. :)
So I feel that if the goal is to give a conceptual understanding, linear algebra is overkill. If the goal is to show how it's done in practice, then this isn't terribly useful, since it doesn't even show the tip of the iceberg!
Fair enough. I'm thinking in particular of a paper I read that used PageRank for word sense disambiguation [1]. It's actually a variant of PageRank, so the high-level descriptions in terms of probabilities (I read several of them) didn't take me very far. In order to come to grips with the technique I ended up actually implementing it and trying it on some toy datasets. This definitely would have helped me in that case.
[1] http://www.aclweb.org/anthology/E/E09/E09-1005.pdf

It's interesting research, but it unfortunately turns out that even the best results in word sense disambiguation aren't good enough for a lot of applications. I wish I had 2 years to do nothing but play around in this field.
Love me some power method :) I was amazed when studying it in numerical analysis, and the professor showed us how this works in the real world (Google).
In essence, PageRank sorts pages by the probability that you'd arrive at them by randomly following links from other pages.
G is a matrix of transition probabilities - each entry is the probability of moving directly from a particular page to another particular page. It is composed of 3 terms:
- the hyperlink matrix: the probabilities obtained by assuming a user randomly selects a link to follow from their current page, with an equal probability for each
- the dangling nodes matrix: to ensure that it is possible to leave every page, this adds an equal probability for moving from a page with no outgoing links to every other page
- the matrix U of all ones: this provides G with 2 desirable properties, by ensuring it is both irreducible/strongly-connected and aperiodic. It does this by making it possible to move from any page to any other page, with some small probability (exactly how small is determined by the damping factor d).
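To make that concrete, here's a rough numpy sketch of how those three terms might combine. The function name, the column-stochastic orientation, and d = 0.85 are my own illustrative choices, not necessarily what this tool does internally.

    import numpy as np

    def google_matrix(adj, d=0.85):
        # adj[i, j] = 1 if page j links to page i. The result G[i, j] is the
        # probability of moving from page j to page i (column-stochastic, so
        # that "one random move" is G @ x, matching the description below).
        adj = np.asarray(adj, dtype=float)
        n = adj.shape[0]
        out_degree = adj.sum(axis=0)           # outgoing links per page

        # 1. Hyperlink matrix: follow a uniformly random outgoing link.
        H = np.zeros((n, n))
        has_links = out_degree > 0
        H[:, has_links] = adj[:, has_links] / out_degree[has_links]

        # 2. Dangling-nodes matrix: pages with no outgoing links jump anywhere.
        D = np.zeros((n, n))
        D[:, ~has_links] = 1.0 / n

        # 3. All-ones matrix U: with probability (1 - d), jump to any page.
        U = np.ones((n, n))

        return d * (H + D) + (1 - d) / n * U

Every column of G sums to 1, so it really is a matrix of transition probabilities.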
The PageRank vector represents the equilibrium probability distribution - in other words, the probability of being on a particular page after starting on a random page then spending a long time randomly moving between pages according to the probabilities in G.
Now, if at a given time your probability of being on each page is given by some vector X, the probability of being on each page after one random move is the transition matrix multiplied by X.
The equilibrium probability vector thus has the property that it is unchanged by multiplication by the transition matrix - it is therefore an eigenvector of the transition matrix, with an eigenvalue of 1.
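That last observation is the whole power method: start from any distribution and keep multiplying by G until the vector stops changing. Here's a sketch, reusing the hypothetical google_matrix from above (the tolerance and iteration cap are arbitrary):

    def pagerank(G, tol=1e-10, max_iter=1000):
        n = G.shape[0]
        x = np.full(n, 1.0 / n)        # uniform start: "a random page"
        for _ in range(max_iter):
            x_next = G @ x             # one random move
            if np.abs(x_next - x).sum() < tol:
                return x_next
            x = x_next
        return x

    # The three-page example from the top of the thread,
    # ordered Reddit, Yahoo, Google (row i marks who links to page i).
    adj = [[0, 0, 1],    # Google -> Reddit
           [1, 0, 0],    # Reddit -> Yahoo
           [1, 1, 0]]    # Reddit -> Google, Yahoo -> Google
    print(pagerank(google_matrix(adj)))   # roughly [0.39, 0.21, 0.40]

The damping term is why these numbers differ slightly from the exact 2/5, 1/5, 2/5 of the undamped chain.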
Edit: After typing this, I see that explanatory hypertext appears when you mouse over the blue words.
For those who may come across this, I have added some extra stuff to let you watch how the power method converges (or doesn't converge) to the PageRank vector. Just click the "Explore Convergence" button over there where the PageRank vector is written out.
As old as this algorithm may be, it's still relevant and useful in many practical applications. It's crazy to think that these small examples on here scale to gigantic data sets, especially as each page's PageRank gets updated.
This is pretty cool. I'm playing around w/ different backlink structures. It seems if you link back to a tier 1 link, it will boost your PR. Is this true in practice?
Hey Mike - can you give me some details on the examples you're playing with? If I get around to it, I was going to see about including some canned examples to maybe show/disprove how things like that work (at least with this particular version of the algorithm). The inclusion of the dangling nodes matrix and that extra all-ones one can make it trickier to predict what will happen.