
I agree, this is definitely a solved problem!

If you need to build a solid web scraping stack that is going to be maintained by many people and is critical to your business, you have two options: use Scrapy, or build something yourself.

Scrapy has been tried and tested over 6-7 years of community development, and it's the base infrastructure for a number of >$1B businesses. Not only that, but there's a suite of tools that have been built around it: Portia for one, but also lots of other useful open source libraries (http://scrapinghub.com/opensource/).

Right now most people still have to write XPath or CSS selectors to drive their crawls and extract the data, but that won't be true for much longer.
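
For anyone who hasn't seen it, this is roughly what that selector step looks like: a minimal sketch of a Scrapy spider run against quotes.toscrape.com (Scrapy's own demo site; the CSS classes below are the ones that page uses):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # CSS selectors do the extraction work: one per field you want
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").extract_first(),
                    "author": quote.css("small.author::text").extract_first(),
                }
            # Follow the pagination link, if the page has one
            next_page = response.css("li.next a::attr(href)").extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

You can run that without even creating a project: `scrapy runspider quotes_spider.py -o quotes.json`. The spider itself is trivial; the selectors are the part you have to hand-write for every new site, which is exactly the step the tools below try to eliminate.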

There are more and more ways of skipping this step and getting at the data automatically:

https://github.com/redapple/parslepy/wiki/Use-parslepy-with-...
https://speakerdeck.com/amontalenti/web-crawling-and-metadat...
https://github.com/scrapy/loginform
https://github.com/TeamHG-Memex/Formasaurus
https://github.com/scrapy/scrapely
https://github.com/scrapinghub/webpager
https://moz.com/devblog/benchmarking-python-content-extracti...
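
scrapely is a good concrete example of what "skipping this step" means: instead of writing selectors, you show it one annotated page and it induces an extraction template for similar pages. A sketch adapted from its README (the PyPI URLs and field names are just the README's illustration):

    from scrapely import Scraper

    s = Scraper()

    # Train on a single example page: a URL plus the data you expect back
    train_url = 'http://pypi.python.org/pypi/w3lib/1.1'
    s.train(train_url, {'name': 'w3lib 1.1', 'author': 'Scrapy project'})

    # scrapely infers a template and extracts the same fields
    # from other pages with a similar structure
    print(s.scrape('http://pypi.python.org/pypi/Django/1.3'))

No XPath, no CSS, just one labeled example per page layout.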

Scrapy (along with lots of other Python tools, likely a majority of them created by people who use it and BeautifulSoup) has lowered the cost of building web data harvesting systems to the point where one guy can build crawlers for an entire industry in a couple of months.



