
Well, extracting links etc. is super fast with lxml's XPath. lxml is written in C, and I don't think writing your own parser would be faster.

For example, to extract links from the Hacker News homepage, you would just do

    doc.xpath('//tr/td[@class="title"]/a/@href')
This will be really fast. You can make it even faster with a more specific XPath. I extracted about 10k links per second from documents this way and was still network-bound. Usually you are primarily limited by websites throttling you.
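
For context, a minimal runnable sketch of that approach (HN's markup has changed over the years, so the exact XPath may need adjusting; requests is just one choice of HTTP client):

    # Fetch the page, parse it with lxml, and run the XPath above against it.
    import requests
    from lxml import etree, html

    resp = requests.get('https://news.ycombinator.com/')
    doc = html.fromstring(resp.text)

    # Compiling the expression once lets you reuse it cheaply across many
    # documents, which matters at ~10k extractions per second.
    extract_links = etree.XPath('//tr/td[@class="title"]/a/@href')
    print(extract_links(doc))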



I was using BeautifulSoup with the lxml backend, I believe; I should have mentioned that earlier. There was some other graph-manipulation work too, like favoring links with more inlinks, and keeping the web crawler polite but still busy by looking at other domains (roughly the scheduling idea sketched below). That is more expensive than extracting links, I guess. I had a submission deadline, and whatever I tried in that time with Python didn't work. It was just easier to write faster code in Go (except maybe where regexes are involved; now I remember I used some Go markup parser instead, one that is now in their library).
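
Roughly, the polite-but-busy scheduling looks like this (a sketch in Python for clarity, not my actual code, which was Go; `fetch` is a hypothetical callback that downloads a URL and returns its outlinks, and the 2-second delay is an arbitrary placeholder):

    # Keep a heap of domains keyed by the earliest time each may be hit
    # again, so the crawler only sleeps when *every* known domain is
    # cooling down.
    import heapq
    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    CRAWL_DELAY = 2.0  # seconds between requests to one domain (assumed)

    def crawl(seed_urls, fetch):
        queues = defaultdict(deque)  # domain -> pending URLs
        in_heap = set()              # domains currently scheduled
        ready = []                   # heap of (next_allowed_time, domain)
        seen = set()

        def enqueue(url):
            if url in seen:
                return
            seen.add(url)
            domain = urlparse(url).netloc
            queues[domain].append(url)
            if domain not in in_heap:
                in_heap.add(domain)
                heapq.heappush(ready, (0.0, domain))

        for url in seed_urls:
            enqueue(url)

        while ready:
            next_time, domain = heapq.heappop(ready)
            # Only blocks if even the soonest-allowed domain is cooling down.
            time.sleep(max(0.0, next_time - time.time()))
            for link in fetch(queues[domain].popleft()):
                enqueue(link)
            if queues[domain]:  # more work for this domain: reschedule it
                heapq.heappush(ready, (time.time() + CRAWL_DELAY, domain))
            else:
                in_heap.discard(domain)

Favoring links with more inlinks would just change how URLs are ordered within each per-domain queue; the heap only decides which domain gets hit next.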



