Interesting timing! I just started using Scrapy today for a project, and I'm trying to figure out how to elegantly piece together information from different sources. I'm glad to see that that problem is the focus of your README example.
How useful are scrapers that don't execute Javascript these days? I find Selenium + PhantomJS (now Chrome Headless I guess) is pretty easy to drive from Python, and it works everywhere because it's a real browser.
Phantom still gets blocked, as it reveals itself in the header.
I've had success with a headless Chrome instance in a virtual display (xvfb) driven with Selenium, backed by Postgres. It's as close as you can get to scripting a real browser.
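A minimal sketch of that setup, assuming chromedriver and Xvfb are installed and using pyvirtualdisplay to wrap xvfb (the Postgres side is omitted and the URL is a placeholder):

    # Sketch: Chrome inside an Xvfb virtual display, driven by Selenium.
    from pyvirtualdisplay import Display
    from selenium import webdriver

    display = Display(visible=0, size=(1920, 1080))  # starts Xvfb
    display.start()

    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com/")   # placeholder URL
        html = driver.page_source            # fully rendered DOM, JS already executed
        print(html[:200])
    finally:
        driver.quit()
        display.stop()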
While that's true, user-agent isn't the only thing in the header that reveals PhantomJS[0].
You could take the time to build in spoofs for these issues. But for testing (and scraping), you're going to be better off if your headless browser is the same as your GUI browser.
We scrape a significant amount of highly structured data from a large number of websites (in our case, inventory from ecommerce sites), and have yet to find a site where we needed to use a headless browser. So far we've managed with just lxml. That also includes driving some checkout processes (although we don't do this for everyone).
We used to use Selenium + Firefox for the checkout processes (run manually), but it was too much maintenance overhead so we switched to requests+lxml.
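For what it's worth, a rough sketch of the requests + lxml approach for a form-driven step (the URL, form id, and field names here are hypothetical):

    import requests
    from lxml import html

    session = requests.Session()
    resp = session.get("https://shop.example.com/checkout")   # hypothetical URL
    doc = html.fromstring(resp.content)

    # Collect hidden inputs (CSRF tokens etc.) so the POST looks like a real submit.
    form_data = {inp.get("name"): inp.get("value", "")
                 for inp in doc.xpath("//form[@id='checkout']//input[@type='hidden']")}
    form_data.update({"email": "buyer@example.com", "quantity": "1"})

    session.post("https://shop.example.com/checkout", data=form_data)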
We generally find the more "single page app" a website is, the easier it is to scrape, because we can just use the API that's backing the SPA directly, rather than parsing data out of the HTML.
In my experience, when the page is rendered with JavaScript there is often a JSON "API" behind it, which is even easier to use.
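For example, something along these lines once you spot the XHR endpoint in the browser's network tab (the endpoint and response shape here are made up):

    import requests

    resp = requests.get(
        "https://example.com/api/v1/products",      # hypothetical endpoint
        params={"page": 1, "per_page": 50},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()

    for product in resp.json()["products"]:         # assumed response shape
        print(product["name"], product["price"])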
Web browsers are often too slow and load content I am not interested in.
They're still surprisingly useful; however, it depends a lot on your use case. In my case I've scraped quite a bit from Wikipedia (not everything is available via a clean API) and other sources this way.
Instead of scraping Wikipedia, have you had a look at http://wiki.dbpedia.org/ ? They provide a SPARQL endpoint for querying the knowledge graph on Wikipedia.
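A small sketch of querying it from Python, assuming the public endpoint at https://dbpedia.org/sparql and standard SPARQL JSON results:

    import requests

    # Example query: a handful of programming languages and their designers.
    query = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?lang ?designer WHERE {
      ?lang a dbo:ProgrammingLanguage ;
            dbo:designer ?designer .
    } LIMIT 10
    """

    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["lang"]["value"], "-", row["designer"]["value"])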
Pretty cool project. It looks more enjoyable to use than BeautifulSoup.
How does it approach throttling or rate limiting? I didn't see this mentioned in the readme examples. It would be nice if there were some simple config to kick requests back into a queue to be re-run once the rate limits reset.
Minimal support for caching / ETag / etc would be a nice addition.
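For illustration, the kind of thing I usually end up hand-rolling for rate limits looks roughly like this (just a sketch, not part of this project):

    import time
    from collections import deque
    import requests

    def fetch_all(urls, delay=1.0):
        # Naive politeness/retry loop: on HTTP 429, push the URL back onto
        # the queue and honour Retry-After if the server sends one.
        queue = deque(urls)
        results = {}
        while queue:
            url = queue.popleft()
            resp = requests.get(url)
            if resp.status_code == 429:
                time.sleep(float(resp.headers.get("Retry-After", 30)))
                queue.append(url)        # re-run once the limit has reset
                continue
            results[url] = resp.text
            time.sleep(delay)            # simple fixed throttle between requests
        return results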
Throttling can be set directly from the untwisted reactor (I'm planning to implement that soon, once I get untwisted onto py3). I think the suggestion of caching support is really good too; I plan to implement it this week.
Awesome. It looks like you're reusing your own dependencies, which is cool. Can you explain how untwisted relates to twisted a little more? I read the repo readme, but I'm not sure I'm following.
Untwisted is meant to solve all the problems twisted solves, but it does so in quite a different way. They are two different tools that solve the same problems using different approaches; untwisted doesn't share code or architecture with twisted. In untwisted, sockets are abstracted as event machines: they are sort of "super sockets" that can dispatch events. You map handles to Spin instances, these handles are mapped onto events, and when those events occur your handles get called. The handles can spawn further events inside the Spin instances, so you can better abstract all kinds of internet protocols and consequently achieve a better level of modularity and extensibility.
That is one of the reasons sukhoi's code is sort of short; it is due to the underlying framework it was written on.
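Purely as an illustration of the event-machine idea (a toy dispatcher in the same spirit, not untwisted's actual API):

    # Toy event dispatcher in the spirit described above -- not untwisted's real API.
    class EventMachine:
        def __init__(self):
            self.handles = {}

        def add_map(self, event, handle):
            # Map a handle onto an event.
            self.handles.setdefault(event, []).append(handle)

        def drive(self, event, *args):
            # Dispatch an event; handles may themselves spawn further events.
            for handle in self.handles.get(event, []):
                handle(self, *args)

    machine = EventMachine()
    machine.add_map("DATA", lambda m, data: m.drive("LINE", data.strip()))
    machine.add_map("LINE", lambda m, line: print("got line:", line))
    machine.drive("DATA", "hello world\n")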
> It looks more enjoyable to use than BeautifulSoup.
I don't believe BS is a full scraping solution; it's only the HTML parsing/querying, isn't it? In that case, this project actually uses lxml for that part - a relatively well-known alternative to BS.
I highly recommend lxml. The API isn't perfect, but in my experience it's much more powerful than BS, and significantly faster as well. We run custom scrapers for a large number of websites, and apart from a few where we use JSON feeds, the majority use lxml; it has been very useful.
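For instance, a typical lxml snippet looks something like this (the URL and selectors are placeholders):

    import requests
    from lxml import html

    doc = html.fromstring(requests.get("https://example.com/products").content)

    # XPath is where lxml tends to pull ahead of BS for complex extraction.
    for row in doc.xpath("//div[@class='product']"):
        name = row.xpath(".//h2/text()")[0].strip()
        price = row.xpath(".//span[@class='price']/text()")[0].strip()
        print(name, price)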
Yes, BeautifulSoup is focused on parsing not crawling (BS supports the lxml parser out of the box). Scrapy is more of an opinionated scraping framework whereas BS is a parsing library for scraping. I think the choice depends on what exactly you're trying to build and scale. I like both personally, though I'd use BS for simple MVPs and Scrapy if I wanted to crawl thousands of pages.
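For example, pointing BS at the lxml parser is a one-liner (assuming lxml is installed):

    from bs4 import BeautifulSoup

    # BeautifulSoup can delegate parsing to lxml while keeping its own API.
    soup = BeautifulSoup("<ul><li>a</li><li>b</li></ul>", "lxml")
    print([li.text for li in soup.select("li")])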
Try to imagine how to solve the second example in sukhoi's README.md using scrapy; you'll notice you end up with some kind of obscure logic to achieve the JSON structure that the second example outputs.
The way you construct your JSON structures in scrapy is different, and scrapy has a longer learning curve too. It also seems sukhoi gets better performance results.
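For comparison, the usual Scrapy pattern for building a nested structure passes a partial item along through the request meta (the spider, URLs, and selectors below are hypothetical):

    import scrapy

    class StorySpider(scrapy.Spider):
        name = "stories"
        start_urls = ["https://news.example.com/"]   # hypothetical site

        def parse(self, response):
            for story in response.css("div.story"):
                item = {"title": story.css("a::text").get(), "comments": []}
                comments_url = story.css("a.comments::attr(href)").get()
                yield response.follow(comments_url, callback=self.parse_comments,
                                      meta={"item": item})

        def parse_comments(self, response):
            item = response.meta["item"]
            item["comments"] = response.css("div.comment p::text").getall()
            yield item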