Hacker News

You can't scrape with AJAX because of cross-domain security restrictions (the browser's same-origin policy).

One potential way to obey robots.txt might be to spawn multiple small EC2 instances with different IPs and have them coordinate with each other to share the crawl, so that no single instance exceeds the limits. (This is also useful for scraping sites that impose per-IP rate limits.)




robots.txt doesn't enforce itself, and its limits aren't per-IP; they apply to the crawler as a whole. Splitting the work across instances is still a violation, and no better than simply lowering the delay on a single scraper.
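For reference, a compliant crawler checks robots.txt and honors the stated delay regardless of how many IPs it crawls from. A minimal sketch using Python's standard-library urllib.robotparser (the robots.txt body and user-agent name here are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body into a RobotFileParser."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

rp = make_parser(ROBOTS_TXT)

# The rules apply to the crawler's user agent, not to any one IP.
print(rp.can_fetch("mybot", "/private/page"))  # False: path is disallowed
print(rp.can_fetch("mybot", "/public/page"))   # True
print(rp.crawl_delay("mybot"))                 # 10 (seconds between requests)
```

In practice a crawler would fetch the live robots.txt with `rp.set_url(...)` and `rp.read()`, then sleep `rp.crawl_delay(agent)` seconds between requests.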



