
In case anyone is interested, here's a clickable chart of all PG's essays showing explicit links between them:

http://www.solipsys.co.uk/new/PaulGrahamEssays.html?HN

I was prompted to provide these links by this item:

http://news.ycombinator.com/item?id=2011897

I do regularly revisit PG's essays, and I use these charts/graphs to see what catches my eye on any particular occasion.




I get the algorithm, but what tool(s) did you use to generate the page itself?


OK, here's the script that drives the process.

    echo Retrieving index
        curl http://www.paulgraham.com/articles.html > data0

    echo Extracting URLs
        grep -ho "<a href[^>]*>[^<]*<.a>" data0 | grep -v http | grep -v RSS > data1
ComputeRankings takes the list of essays, extracts outgoing links to create a digraph, and then computes the PageRank of each page.

CreatePGER takes the results and an HTML template and glues them together to create the rankings page.

    echo Creating rankings page
        ./ComputeRankings.py > ComputeRankings.out
        ./Create_PGER.py > PaulGrahamEssaysRanking
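ComputeRankings itself is too long to include here, but the digraph-building part is roughly this. (A minimal sketch, not the real code - the data1 format and URL pattern are assumptions on my part:)

    # Sketch: build the essay digraph from the extracted index.
    # Assumes data1 holds lines like <a href="avg.html">Beating the Averages</a>.
    import re
    import urllib.request

    BASE = "http://www.paulgraham.com/"
    essays = set(re.findall(r'href="([^"]+\.html)"', open("data1").read()))

    digraph = {}
    for page in essays:
        html = urllib.request.urlopen(BASE + page).read().decode("utf-8", "replace")
        # An edge for every link that points at another essay in the index.
        links = re.findall(r'href="([^"]+\.html)"', html)
        digraph[page] = [l for l in links if l in essays and l != page]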
MakeGraph outputs dot files for the giant component and the "other nodes" graph.

    echo Creating DOT file
        ./MakeGraph.py > links0.dot
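MakeGraph is likewise not shown, but emitting DOT is only a few lines. (A minimal sketch, continuing from the digraph dict above:)

    # Sketch: write the digraph in DOT format for neato.
    def write_dot(digraph, out="links0.dot"):
        with open(out, "w") as f:
            f.write("digraph essays {\n")
            for page, targets in sorted(digraph.items()):
                for t in targets:
                    f.write('    "%s" -> "%s";\n' % (page, t))
            f.write("}\n")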
Then I use neato to lay out the graphs. In each case I run it a few hundred times and pick the result that has the largest boxes. Since the overall image size is fixed, larger boxes mean a more compact layout, and that seems to be a good heuristic.

    echo Create Giant.png
        ./LayoutGraph Giant 11,40

    echo Create Other.png
        ./LayoutGraph Other   8,5
Finally I create the HTML you see using an HTML template.

    echo Create page with map
        ./Create_PGE.py > PaulGrahamEssays
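The template step is plain substitution - something in this spirit, where the file and placeholder names are invented for illustration:

    # Sketch: splice the generated image map into an HTML template.
    template = open("template.html").read()
    imagemap = open("Giant.map").read()
    with open("PaulGrahamEssays", "w") as f:
        f.write(template.replace("<!--IMAGEMAP-->", imagemap))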
There's a small lie in this. The web site is actually a statically generated "wiki". When it was first devised I didn't have the facility to run scripts on my host, so I generate the pages as plain text with some markdown, then generate the entire site off-line, and upload the parts that have changed.

If you try to edit, your suggested new version gets emailed to me, where it goes through an aggressive spam filter. Then it sits in my inbox until I decide whether it's a good change. If so, I trigger a refresh. Some people have passwords that trigger the refresh automatically, without my intervention, so it really does work as a limited-access wiki.

Er, right. Is that what you wanted to know?


Here's a stripped down version of the "LayoutGraph" script. In essence, lay out the graph 100 times and score each attempt according to the size of the biggest box. Save the attempt in a file with a name that has the score in it, then pick off the best. That way I can compare attempts with each other while it runs.

    # Arguments: graph name (the .dot file's basename) and a
    # width,height value for neato's -Gsize.
    GraphName=$1
    GraphSize=$2

    echo Laying out graph $GraphName

    LayoutParms="-Gstart=random -Nshape=box -Goverlap=false -Gsplines=true"

    # 100 attempts, each from a random starting configuration.
    for n in 9 8 7 6 5 4 3 2 1 0
    do
        for m in 9 8 7 6 5 4 3 2 1 0
        do
            neato ${GraphName}.dot $LayoutParms -Gsize="${GraphSize}" -Tpng -o ${GraphName}.png -Timap -o ${GraphName}.map
            # Score an attempt by the width of the widest box in its image map.
            score=`sed "s/,/ /g" ${GraphName}.map | gawk '{printf("%04d\n",$5-$3)}' | sort -rn | head -1`
            mv ${GraphName}.png ${GraphName}.png_$score
            mv ${GraphName}.map ${GraphName}.map_$score
            echo $n$m $score `ls -l ${GraphName}.png_* | tail -1`
        done
    done

    # Keep the highest-scoring attempt and discard the rest.
    cp `ls ${GraphName}.png_0??? | tail -1` ${GraphName}.png
    cp `ls ${GraphName}.map_0??? | tail -1` ${GraphName}.map

    ls -r ${GraphName}.png_0??? | tail -n +2 | xargs rm
    ls -r ${GraphName}.map_0??? | tail -n +2 | xargs rm


A small tip for the for loops:

    for n in {9..0}; do echo $n ; done

or

    for n in $(seq 9 -1 0); do echo $n ; done


Cool - thanks. I did actually know that, but I never got around to changing it to the more concise version, and it's nice to be reminded.


A little context may be required. I'm assuming that a bigger number means a page is more heavily linked? More important?


I kind of assumed everyone would know how the underlying page-rank algorithm works.

Put people on pages at random, and have them repeatedly click on an outgoing link at random - except that 20% of the time they instead teleport to a new, random page. What is their distribution in the long run?

The higher the number, the more likely you are to end up there. Roughly, that means you have more incoming links, or incoming links from pages that themselves rank highly. Incoming links from lower-ranked pages - pages with few incoming links of their own - don't give you much "Google Juice".
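
In case a code sketch helps, here's that random-surfer process as straightforward power iteration. (Illustrative only - this is not my actual ComputeRankings code:)

    # Sketch of PageRank by power iteration: follow a random outgoing
    # link 80% of the time, teleport to a random page 20% of the time.
    def pagerank(digraph, damping=0.8, iterations=100):
        pages = list(digraph)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new = {p: (1 - damping) / n for p in pages}
            for p, targets in digraph.items():
                if targets:
                    share = damping * rank[p] / len(targets)
                    for t in targets:
                        new[t] += share
                else:
                    # A page with no outgoing links: its surfer teleports.
                    for q in pages:
                        new[q] += damping * rank[p] / n
            rank = new
        return rank

The dict it returns is the long-run distribution described above.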

I'm curious, do people really not know this? Should I write it up? It's a standard piece of linear algebra to find the eigenvector of the appropriate matrix. I thought pretty much everyone would know it.


For those who are interested, here's a lengthier explanation I wrote that relates PageRank explicitly to solving the eigenvector equation (Mq = q, in the language of the article):

http://michaelnielsen.org/blog/lectures-on-the-google-techno...

Skip past the opening, and down to "Basic Description of PageRank". The article eventually gets somewhat technical, but hopefully this at least helps explain the basics. Incidentally, I don't use the term "eigenvector" in the article, but when we're analysing an equation like Mq = q, that's an eigenvector equation!
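
A tiny numerical check may make that concrete. For a made-up 3-page web, with M combining the link matrix and a teleportation term (damping value chosen arbitrarily), the stationary distribution really does satisfy Mq = q:

    import numpy as np

    # Made-up 3-page web; column j says where page j's links point.
    A = np.array([[0.0, 0.0, 1.0],
                  [0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0]])    # column-stochastic link matrix
    d, n = 0.85, 3
    M = d * A + (1 - d) / n * np.ones((n, n))

    # Power iteration converges to the eigenvector with eigenvalue 1.
    q = np.ones(n) / n
    for _ in range(100):
        q = M @ q
    print(np.allclose(M @ q, q))    # True: Mq = q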


Excellent - that saves me having to write it! Thanks.


Unfortunately, in the present context, only the early part of the notes will be really accessible to someone who has only just taken an intro class in linear algebra. The later parts, which deal with issues such as how fast the PageRank algorithm converges, really require people to have more mathematical experience.


But it's not the job of such a paper to teach them about these things. This sort of paper serves as context and motivation. The material itself can, and should, be learned from other texts designed to teach it. I think what you have is an excellent piece. Read carefully at first, then skim more and more as you get lost. Inspired, go and read about some of the necessary math, then come back and get further next time.


That is exactly why I took linear algebra in college. I (tried to) read the original Google paper and was like, "eigen-what??" and decided to take the class as an elective.

Linear algebra is stupendously useful material, and I think it really should be standard in any CS curriculum. It wasn't in mine.


I had no clue, and that sounds really interesting; a write-up would be a very nice thing!

Wasn't there an article on here a little while ago about how most jobs seem easy to the worker because the "common knowledge" is so thoroughly trained in?


The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google

http://www.rose-hulman.edu/~bryan/googleFinalVersionFixed.pd...


Write that up!


> It's a standard piece of linear algebra to find the eigenvector of the appropriate matrix.

Yes please!! I just learned about eigenpairs and the teacher refused to explain their significance. I have no idea what they are beyond their purely abstract definition. I would definitely read your post about this.



