Then I use neato to lay out the graphs. In each case I run it a few hundred times and pick the result that has the largest boxes. That makes it the most compact output, and it seems to be a good heuristic.
Finally I create the HTML you see using an HTML template.
echo Create page with map
./Create_PGE.py > PaulGrahamEssays
There's a small lie in this. The web site is actually a statically generated "wiki". When it was first devised I didn't have the facility to run scripts on my host. I generate the pages as plain text with some mark-down, generate the entire site off-line, and then upload the parts that have changed.
If you try to edit, your suggested new version gets emailed to me, where it goes through an aggressive spam filter. Then it sits in my inbox until I decide whether it's a good change. If so, I trigger a refresh. Some people have passwords that trigger the refresh automatically, without my intervention, so it really does work as a limited-access wiki.
Here's a stripped-down version of the "LayoutGraph" script. In essence, it lays out the graph 100 times and scores each attempt according to the size of the biggest box. Each attempt is saved in a file whose name includes the score, and the best one is picked off at the end. That way I can compare attempts with each other while it runs.
GraphName=$1
GraphSize=$2

echo Laying out graph ${GraphName}

LayoutParms="-Gstart=random -Nshape=box -Goverlap=false -Gsplines=true"

# 100 attempts; the loop counters just provide progress numbering
for n in 9 8 7 6 5 4 3 2 1 0
do
    for m in 9 8 7 6 5 4 3 2 1 0
    do
        # One layout attempt, emitting both the image and its image-map
        neato ${GraphName}.dot $LayoutParms -Gsize="${GraphSize}" \
            -Tpng -o ${GraphName}.png -Timap -o ${GraphName}.map
        # Score the attempt by the width of the widest box in the image-map
        score=`sed "s/,/ /g" ${GraphName}.map | gawk '{printf("%04d\n",$5-$3)}' | sort -rn | head -1`
        mv ${GraphName}.png ${GraphName}.png_$score
        mv ${GraphName}.map ${GraphName}.map_$score
        echo $n$m $score `ls -l ${GraphName}.png_* | tail -1`
    done
done

# Keep the attempt with the highest score, then discard the rest
cp `ls ${GraphName}.png_0??? | tail -1` ${GraphName}.png
cp `ls ${GraphName}.map_0??? | tail -1` ${GraphName}.map
ls -r ${GraphName}.png_0??? | tail -n +2 | xargs rm
ls -r ${GraphName}.map_0??? | tail -n +2 | xargs rm
I kind of assumed everyone would know how the underlying PageRank algorithm works.
Put people on pages at random, and have them click on an outgoing link at random. At each step, 20% of the time, teleport them to a new random page instead. What is their distribution in the long run?
The higher a page's rank, the more likely you are to end up there. A high rank generally means the page has more incoming links, and in particular incoming links from pages with high rankings of their own. Incoming links from lower-ranked pages - pages with few incoming links themselves - don't give you much "Google Juice".
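To make the random-surfer picture concrete, here's a toy power iteration in plain awk. The three-page graph, the 80%/20% split, and the iteration count are all just illustrative choices, not anything from the comment above:

```shell
# Toy PageRank by power iteration: surfers follow an outgoing link with
# probability 0.8 and teleport to a random page with probability 0.2.
# The 3-page graph (1->2, 1->3, 2->3, 3->1) is made up for illustration.
ranks=$(awk 'BEGIN {
    d = 0.8; N = 3                            # follow-a-link probability, page count
    link[1,2] = 1; link[1,3] = 1; out[1] = 2  # page 1 links to pages 2 and 3
    link[2,3] = 1;                out[2] = 1  # page 2 links to page 3
    link[3,1] = 1;                out[3] = 1  # page 3 links to page 1
    for (i = 1; i <= N; i++) q[i] = 1.0 / N   # start the surfers spread evenly
    for (t = 0; t < 50; t++) {                # iterate towards the limit
        for (j = 1; j <= N; j++) r[j] = (1 - d) / N
        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++)
                if (link[i,j]) r[j] += d * q[i] / out[i]
        for (j = 1; j <= N; j++) q[j] = r[j]
    }
    for (j = 1; j <= N; j++) printf("page %d: %.4f\n", j, q[j])
}')
echo "$ranks"
```

Page 3 comes out on top: it has two incoming links, one of them from the highly ranked page 1, while page 2's single incoming link leaves it with the lowest score.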
I'm curious, do people really not know this? Should I write it up? It's a standard piece of linear algebra to find the eigenvector of the appropriate matrix. I thought pretty much everyone would know it.
For those who are interested, here's a lengthier explanation I wrote that relates PageRank explicitly to solving the eigenvector equation (Mq = q, in the language of the article):
Skip past the opening, and down to "Basic Description of PageRank". The article eventually gets somewhat technical, but hopefully this at least helps explain the basics. Incidentally, I don't use the term "eigenvector" in the article, but when we're analysing an equation like Mq = q, that's an eigenvector equation!
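For reference, here's a sketch of how that equation is usually written out in the textbook formulation; the damping factor d and the notation below are the standard ones, not necessarily the article's:

```latex
% PageRank as an eigenvector problem.  q_i is the long-run fraction of
% surfers on page i; d is the damping factor (e.g. 0.85); N is the
% number of pages; J is the N-by-N all-ones matrix.
M = d\,A + \frac{1-d}{N}\,J,
\qquad
A_{ji} =
\begin{cases}
  1/\operatorname{outdeg}(i) & \text{if page $i$ links to page $j$,} \\
  0                          & \text{otherwise.}
\end{cases}
% M is column-stochastic, so by Perron-Frobenius its largest eigenvalue
% is 1, and the PageRank vector is the corresponding eigenvector:
M q = q, \qquad q \ge 0, \qquad \textstyle\sum_i q_i = 1.
```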
Unfortunately, in the present context, only the early part of the notes will be really accessible to someone who has only just taken an intro class in linear algebra. The later parts, which deal with issues such as how fast the PageRank algorithm converges, really require people to have more mathematical experience.
But it's not the job of such a paper to teach them about these things. This sort of paper serves as context and motivation. The material itself can, and should, be learned from other texts designed to teach it. I think what you have is an excellent piece. Read carefully at first, then skim more and more as you get lost. Inspired, go and read about some of the necessary math, then come back and get further next time.
That is exactly why I took linear algebra in college. I (tried to) read the original Google paper and was like, "eigen-what??" and decided to take the class as an elective.
Linear algebra is stupendously useful material, and I think it really should be standard in any CS curriculum. It wasn't in mine.
> It's a standard piece of linear algebra to find the eigenvector of the appropriate matrix.
Yes please!! I just learned about eigenpairs and the teacher refused to explain their significance. I have no idea what they are beyond their purely abstract definition. I would definitely read your post about this.
I like. You see pg's popular essays all the time and get a little tired of them. Beating the averages, yeah whatever... hackers and painters, uh huh. But the less popular ones are really good, I especially like "The Submarine".
(Maybe because it is my hobby to push back against dress codes. One time, I was visiting my company's Singapore office. "You can't wear tennis shoes here, it's against the dress code!" "Do you know who I am?" "Nope." "Can you fire me?" "Nope." "Then what are you going to do about it?" :)
It is what Google's ranking is said to be based upon, though they keep their exact algorithm secret, afaik.
In a nutshell, you give each page a base score, and have it give away part of its score to the pages it links to.
Assuming that the page author wouldn't link to a page with poor/irrelevant content, you get some sort of quality/popularity score for each page.
Interestingly, there's no font styling (or any CSS for that matter; he is using tables, for christ's sake) on that web page, so the reason you are seeing a serif font is your web browser's defaults more than anything else.
It's too old school to expect people to have their own personal default style settings for their browsers. There are barely any sites these days that don't define their own styles.
It's actually a neat idea of the original web, but it's of little practical value today.
He might mean within the graphs. I haven't played with asking neato to change the font being used, and I'm not sure what fonts I have available on my very, very elderly setup.
http://www.solipsys.co.uk/new/PaulGrahamEssays.html?HN
I was prompted to provide these links by this item:
http://news.ycombinator.com/item?id=2011897
I do regularly re-visit PG's essays, and I use these charts/graphs to see what catches my eye on any particular occasion.