Hacker News new | past | comments | ask | show | jobs | submit login
PG's essays, ranked "Google-style" (solipsys.co.uk)
85 points by RiderOfGiraffes on Dec 16, 2010 | hide | past | favorite | 27 comments



In case anyone is interested, here's a clickable chart of all PG's essays showing explicit links between them:

http://www.solipsys.co.uk/new/PaulGrahamEssays.html?HN

I was prompted to provide these links by this item:

http://news.ycombinator.com/item?id=2011897

I do regularly re-visit PG's essays, and I use these charts/graphs to see what catches my eye on any particular occasion.


I get the algorithm, but what tool(s) did you use generate the page itself?


OK, here's the script that drives the process.

    echo Retrieving index
        curl http://www.paulgraham.com/articles.html > data0

    echo Extracting URLs
        grep -ho "<a href[^>]*>[^<]*<.a>" data0 | grep -v http | grep -v RSS > data1
ComputeRankings takes the list of essays, extracts outgoing links to create a digraph, and then computes the Page-Rank of each page.

CreatePGER takes the results and an HTML template and glues them together to create the rankings page.

    echo Creating rankings page
        ./ComputeRankings.py > ComputeRankings.out
        ./Create_PGER.py > PaulGrahamEssaysRanking
MakeGraph outputs dot files for the giant component and the "other nodes" graph.

    echo Creating DOT file
        ./MakeGraph.py > links0.dot
Then I use neato to layout the graphs. In each case I run it a few hundred times and pick the result that has the largest boxes. That means it's the most compact output, and seems to be a good heuristic.

    echo Create Giant.png
        ./LayoutGraph Giant 11,40

    echo Create Other.png
        ./LayoutGraph Other   8,5
Finally I create the HTML you see using an HTML template.

    echo Create page with map
        ./Create_PGE.py > PaulGrahamEssays
There's a small lie in this. The web site is actually a statically generated "wiki". When it was first devised I didn't have the facility to run scripts on my host. I generate the pages as plain text with some mark-down, then off-line generate the entire site. Then I upload the parts that have changed.

If you try to edit then your suggested new version gets emailed to me, where it goes through an aggressive spam filter. Then it sits in my inbox for me to decide if it's a good change. If so, I trigger a refresh. Some people have passwords that trigger the refresh automatically, without my intervention, so it really does work as a limited access wiki.

Er, right. Is that what you wanted to know?


Here's a stripped down version of the "LayoutGraph" script. In essence, layout the graph 100 times and score each attempt according to the size of the biggest box. Save the attempt in a file with a name that has the score in it, then pick off the best. That way I can compare attempts with each other while it runs.

    GraphName=$1
    GraphSize=$2

    echo Laying out graph $1

    LayoutParms="-Gstart=random -Nshape=box -Goverlap=false -Gsplines=true"

    for n in 9 8 7 6 5 4 3 2 1 0
    do
        for m in 9 8 7 6 5 4 3 2 1 0
        do
            neato ${GraphName}.dot $LayoutParms -Gsize="${GraphSize}" -Tpng -o ${GraphName}.png -Timap -o ${GraphName}.map
            score=`sed "s/,/ /g" ${GraphName}.map | gawk '{printf("%04d\n",$5-$3)}' | sort -rn | head -1`
            mv ${GraphName}.png ${GraphName}.png_$score
            mv ${GraphName}.map ${GraphName}.map_$score
            echo $n$m $score `ls -l ${GraphName}.png_* | tail -1`
        done
    done

    cp `ls ${GraphName}.png_0??? | tail -1` ${GraphName}.png
    cp `ls ${GraphName}.map_0??? | tail -1` ${GraphName}.map

    ls -r ${GraphName}.png_0??? | tail -n +2 | xargs rm
    ls -r ${GraphName}.map_0??? | tail -n +2 | xargs rm


A small tip for the for loops:

  for n in {9..0}; do echo $n ; done
or

  for n in $(seq 9 -1 0); do echo $n ; done


Cool - thanks. I did actually know that, but I never got around to changing it to the more efficient version, and it's nice to be reminded.


A little context may be required. I'm assuming that a bigger number means that it's higher linked? More important?


I kind of assumed everyone would know how the underlying page-rank algorithm works.

Put people on pages at random, and have them click on an outgoing link at random. Then, 20% of the time, teleport them to a new, random page. What is their distribution in the long run?

The higher the number, the more likely you are to end up there. This means you have more incoming links, but they are incoming links from pages with higher rankings. Incoming links from lower ranked pages - pages with few incoming links - don't give you much "Google Juice".

I'm curious, do people really not know this? Should I write it up? It's a standard piece of linear algebra to find the eigenvector of the appropriate matrix. I thought pretty much everyone would know it.


For those who are interested, here's a lengthier explanation I wrote that relates PageRank explicitly to solving the eigenvector equation (Mq = q, in the language of the article):

http://michaelnielsen.org/blog/lectures-on-the-google-techno...

Skip past the opening, and down to "Basic Description of PageRank". The article eventually gets somewhat technical, but hopefully this at least helps explain the basics. Incidentally, I don't use the term "eigenvector" in the article, but when we're analysing an equation like Mq =q, that's an eigenvector equation!


Excellent - that saves me having to write it! Thanks.


Unfortunately, in the present context, only the early part of the notes will be really accessible to someone who has only just taken an intro class in linear algebra. The later parts, which deal with issues such as how fast the PageRank algorithm converges, really require people to have more mathematical experience.


But it's not the job of such a paper to teach them about these things. This sort of paper serves as context and motivation. The material itself can, and should, be learned from other texts designed to teach it. I think what you have is an excellent piece. Read carefully at first, then skim more and more as you get lost. Inspired, go and read about some of the necessary math, then come back and get further next time.


That is exactly why I took linear algebra in college. I (tried to) read the original Google paper and was like, "eigen-what??" and decided to take the class as an elective.

Linear algebra is stupendously useful material, and I think it really should be standard in any CS curriculum. It wasn't in mine.


I had no clue, and that sounds really interesting; a write-up would be a very nice thing!

Wasn't there an article on here a little while ago about how most jobs seem easy to the worker because the "common knowledge" is fairly trained in?


THE $25,000,000,000 ∗ EIGENVECTOR THE LINEAR ALGEBRA BEHIND GOOGLE http://www.rose-hulman.edu/~bryan/googleFinalVersionFixed.pd...


Write that up!


> It's a standard piece of linear algebra to find the eigenvector of the appropriate matrix.

Yes please!! I just learned about eigenpairs and the teacher refused to explain their significance. I have no idea what they are beyond their purely abstract definition. I would definitely read your post about this.


I like. You see pg's popular essays all the time and get a little tired of them. Beating the averages, yeah whatever... hackers and painters, uh huh. But the less popular ones are really good, I especially like "The Submarine".

(Maybe because it is my hobby to push back against dress codes. One time, I was visiting my company's Singapore office. "You can't wear tennis shoes here, it's against the dress code!" "Do you know who I am?" "Nope." "Can you fire me?" "Nope." "Then what are you going to do about it?" :)


Aren't you mister social.

Protip: pissing off random people in your company for childish reasons is a piss-poor career move.


Have you ever watched "Curb your enthusiasm"? I aspire to be Larry David.


What does "Google-style" mean? How are these numbers calculated and what do they mean?


He said he used PageRank algorithm

http://en.wikipedia.org/wiki/PageRank

It is what google ranking is said ot be based upon. though they keep their exact algorithm secret afaik.

In a nutshell, you give each page a base score, and have it give away part of its score to the pages it links to. Assuming that the page author wouldn't link to a page with poor/irrelevant content, you get some sort of quality/popularity score for each page.


Very cool.

Suggestion - please use a san-serif font.


Consider:

http://www.solipsys.co.uk/Giant.png

http://www.solipsys.co.uk/Other.png

These will change as I experiment. Comments welcome.


Interestingly there's no font-styling (or any css for that matter, he is using tables for christ's sake) on that webpage so the reason you are seeing Serif is because of your web browser more than anything else.


It's too old school to expect people to have their own personal default style settings for their browsers. There are barely any sites these days that don't define their own styles.

It's actually a neat idea of the original web, but it's of little practical value today.


He might mean within the graphs. I haven't played with asking neato to change the font being used, and I'm not sure what fonts I have available on my very, very elderly setup.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: