Show HN: Sukhoi – A flexible and extensible Webcrawler in Python (github.com/iogf)
131 points by iogf on July 13, 2017 | 34 comments



Name seems to reference a prominent Russian aerospace engineer or maybe that's just wishful thinking. https://en.wikipedia.org/wiki/Pavel_Sukhoi


Also it literally means "dry".


Interesting timing! I just started using Scrapy today for a project, and I'm trying to figure out how to elegantly piece together information from different sources. I'm glad to see that that problem is the focus of your README example.


How useful are scrapers that don't execute Javascript these days? I find Selenium + PhantomJS (now Chrome Headless I guess) is pretty easy to drive from Python, and it works everywhere because it's a real browser.


Phantom still gets blocked, as it reveals itself in the header.

I've had success with a headless Chrome instance in a virtual display (xvfb) driven with Selenium, backed by Postgres. It's as close as you can get to scripting a real browser.
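
Roughly, that setup looks like this (a sketch using pyvirtualdisplay to wrap Xvfb; the details here are illustrative, not the exact stack described above):

    # Chrome inside a virtual display, driven by Selenium
    # (assumes chromedriver is on the PATH).
    from pyvirtualdisplay import Display
    from selenium import webdriver

    display = Display(visible=0, size=(1366, 768))
    display.start()

    driver = webdriver.Chrome()        # runs inside the Xvfb display
    driver.get('https://example.com')
    html = driver.page_source          # fully rendered DOM, JS executed
    driver.quit()
    display.stop()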


You can set the user-agent with Phantom

    var webPage = require('webpage');
    var page = webPage.create();
    page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
http://phantomjs.org/api/webpage/property/settings.html
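
If you're driving PhantomJS from Python instead, roughly the same override goes through Selenium's desired capabilities (a sketch, assuming the phantomjs binary is on your PATH):

    from selenium import webdriver

    # Copy the default PhantomJS capabilities and override the user agent.
    caps = dict(webdriver.DesiredCapabilities.PHANTOMJS)
    caps['phantomjs.page.settings.userAgent'] = (
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36')

    driver = webdriver.PhantomJS(desired_capabilities=caps)
    driver.get('https://example.com')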


While that's true, user-agent isn't the only thing in the header that reveals PhantomJS[0].

You could take the time to build in spoofs for these issues. But for testing (and scraping), you're going to be better off if your headless browser is the same as your GUI browser.

0: https://blog.shapesecurity.com/2015/01/22/detecting-phantomj...


I think I read recently that Chrome/Chromium is now able to run without xvfb, so it's now truly headless.


We scrape a significant amount of highly structured data from a large number of websites (in our case, inventory from ecommerce sites), and have yet to find a site where we needed a headless browser. So far we've managed with just lxml. That includes driving some checkout processes as well (although we don't do this for everyone).

We used to use Selenium + Firefox for the checkout processes (run manually), but it was too much maintenance overhead so we switched to requests+lxml.

We generally find the more "single page app" a website is, the easier it is to scrape, because we can just use the API that's backing the SPA directly, rather than parsing data out of the HTML.
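
For example, something along these lines (the endpoint and fields are hypothetical, just to illustrate hitting the JSON API behind an SPA instead of parsing its HTML):

    import requests

    # Hypothetical endpoint found via the browser's network tab.
    resp = requests.get(
        'https://shop.example.com/api/v2/products',
        params={'category': 'shoes', 'page': 1},
        headers={'Accept': 'application/json'})
    resp.raise_for_status()

    for product in resp.json()['items']:
        print(product['sku'], product['price'])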


In my experience when the page is rendered with javascript, there is often a json "API", which is even easier to use. Web browsers are often too slow and load content I am not interested in.


I've done some scraping recently and 99% of sites don't have the json api


These are slower in most cases; it seems that for some situations the ones that don't execute JS would do the job better.


They're still surprisingly useful, though it depends a lot on your use case. In my case I've scraped quite a bit from Wikipedia (not everything is available in a clean API) and other sources this way.


Instead of scraping Wikipedia, have you had a look at http://wiki.dbpedia.org/? They provide a SPARQL endpoint for querying the knowledge graph on Wikipedia.
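
For example, a quick query against the public endpoint with plain requests (a sketch; dbo: is one of the prefixes DBpedia predefines on its endpoint):

    import requests

    # Ask DBpedia for a few programming languages and their designers.
    query = """
    SELECT ?lang ?designer WHERE {
      ?lang a dbo:ProgrammingLanguage ;
            dbo:designer ?designer .
    } LIMIT 10
    """

    resp = requests.get(
        'https://dbpedia.org/sparql',
        params={'query': query, 'format': 'application/sparql-results+json'})

    for row in resp.json()['results']['bindings']:
        print(row['lang']['value'], '-', row['designer']['value'])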


Depending on your use case you can also download full database dumps (https://dumps.wikimedia.org/).


Pretty cool project. It looks more enjoyable to use than BeautifulSoup.

How does it approach throttling or rate limiting? I didn't see this mentioned in the readme examples. Would be nice if there were some simple config to kick requests back into a queue to be re-run once limits aren't exhausted.

Minimal support for caching / ETag / etc would be a nice addition.
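
In the meantime, a generic way to bolt this onto any crawler is a minimal delay plus a retry queue (a sketch, not sukhoi's API):

    import time
    from collections import deque
    import requests

    MIN_INTERVAL = 1.0                    # at most one request per second
    queue = deque(['https://example.com/a', 'https://example.com/b'])
    last = 0.0

    while queue:
        url = queue.popleft()
        time.sleep(max(0, MIN_INTERVAL - (time.time() - last)))
        last = time.time()
        resp = requests.get(url)
        if resp.status_code == 429:       # rate limited: re-queue and back off
            queue.append(url)
            # assumes Retry-After is given in seconds
            time.sleep(int(resp.headers.get('Retry-After', 30)))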


The throttling can be set directly from the untwisted reactor (I'm planning to implement that soon, once I get untwisted on py3). I think the caching support is a really good idea too; I plan to implement it this week.


Awesome. It looks like you're reusing your own dependencies which is cool. Can you explain how untwisted relates to twisted a little more? I read the repo readme, but not sure I'm following.


Untwisted is meant to solve all the problems twisted solves, but it does so in quite a different way. They are two different tools that solve the same problems using different approaches, and untwisted doesn't share code or architecture with twisted.

In untwisted, sockets are abstracted as event machines; they are sort of "super sockets" that can dispatch events. You map handles to Spin instances, these handles are mapped upon events, and when those events occur your handles get called. The handles can spawn further events inside the Spin instances, so you can better abstract all kinds of internet protocols and consequently achieve a better level of modularity and extensibility. That is one of the reasons sukhoi's code is fairly short: it comes from the underlying framework on which it was written.
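
Not untwisted's real API, but a toy sketch of the pattern being described (handles mapped to events on a socket-like object, and handles spawning further events):

    class EventMachine:
        """Toy 'super socket': dispatches named events to registered handles."""
        def __init__(self):
            self.handles = {}

        def add_map(self, event, handle):
            self.handles.setdefault(event, []).append(handle)

        def drive(self, event, *args):
            for handle in self.handles.get(event, []):
                handle(self, *args)

    def on_data(machine, chunk):
        # Protocol layer: turn raw 'data' chunks into higher-level 'line' events.
        for line in chunk.splitlines():
            machine.drive('line', line)

    def on_line(machine, line):
        print('got line:', line)

    machine = EventMachine()
    machine.add_map('data', on_data)
    machine.add_map('line', on_line)
    machine.drive('data', 'hello\nworld')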


I'm not able to figure out the dependencies... is this pure Python? Or are you using one of gevent, libev, uvloop, etc.?

Since it is py2, I suppose asyncio is out of the picture.


untwisted is pure Python. It uses either select or epoll for scaling sockets.


This is very interesting. Did you consider using libev/uvloop, which are generally considered battle-tested async frameworks?

Is there anything missing that prompted you to reimplement?


It seems a good thing to do, indeed. I'll consider that.


> It looks more enjoyable to use than BeautifulSoup.

I don't believe BS is a full scraping solution; it's only the HTML parsing/querying, isn't it? In that case, this project actually uses lxml for that part, a relatively well-known alternative to BS.

I highly recommend lxml. The API isn't perfect, but in my experience it's much more powerful than BS, and significantly faster as well. We run custom scrapers for a large number of websites, and apart from a few where we use JSON feeds, the majority use lxml; it has been very useful.
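
For anyone who hasn't tried it, the basic usage is short (a minimal sketch with requests + lxml; the URL and selectors are made up):

    import requests
    from lxml import html

    # Fetch a page and pull data out with XPath.
    doc = html.fromstring(requests.get('https://example.com').content)

    titles = doc.xpath('//h2[@class="title"]/text()')
    links = doc.xpath('//a/@href')
    print(titles[:5], links[:5])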


Yes, BeautifulSoup is focused on parsing, not crawling (BS supports the lxml parser out of the box). Scrapy is more of an opinionated scraping framework, whereas BS is a parsing library for scraping. I think the choice depends on what exactly you're trying to build and scale. I like both personally, though I'd use BS for simple MVPs and Scrapy if I wanted to crawl thousands of pages.


Depending on your needs, it might sometimes be more interesting to start from there:

https://about.commonsearch.org/

and then scrape whatever is missing or not fresh enough. The scraping process can be quite intense on servers.


Is this Python 3 compatible? Searched but the wiki is empty and the readme has examples in Python 2.


It is py2 now; however, I'm gonna port it to py3 soon. I'm planning to write some better docs for it tomorrow.


That's great. I'll check it out. Thanks!


How does this differ from Scrapy?


Try to imagine how to solve the second example in the sukhoi README.md using scrapy; you'll notice you end up with some kind of obscure logic to achieve the json structure that second example outputs.


FWIW I don't believe this would be overly convoluted in scrapy. I'd probably scrape the tags and quotes in one pass...

Also, generator expressions would make the examples more readable IMO.

  self.extend((tag, QuoteMiner(self.geturl(href))) for tag, href in self.acc)


I would like to see that in scrapy. I think you may have a point about the generators, yea.


The way you construct your json structures in scrapy is different, and scrapy has a longer learning curve too. It seems sukhoi gets better performance as well.



