Hacker News

You can't scrape with AJAX because of cross-domain security restrictions (the browser's same-origin policy).

One potential way to obey robots.txt might be to spawn multiple small EC2 instances with different IPs and have them coordinate with each other to share the crawl, so that no single instance exceeds the limits. (This is also useful for scraping sites that impose per-IP rate limits.)




robots.txt doesn't enforce itself, and its limits aren't per-IP; they apply to the crawler as a whole. Splitting the work across instances is still a violation, and no better than simply lowering the delay on a single scraper.
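For reference, a compliant crawler checks robots.txt and honors the stated delay regardless of how many IPs it crawls from. A minimal sketch using Python's standard-library urllib.robotparser (the robots.txt body and user-agent name here are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

def make_parser(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body into a RobotFileParser."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

rp = make_parser(ROBOTS_TXT)

# The rules apply to the crawler's user agent, not to any one IP.
print(rp.can_fetch("mybot", "/private/page"))  # False: path is disallowed
print(rp.can_fetch("mybot", "/public/page"))   # True
print(rp.crawl_delay("mybot"))                 # 10 (seconds between requests)
```

In practice a crawler would fetch the live robots.txt with `rp.set_url(...)` and `rp.read()`, then sleep `rp.crawl_delay(agent)` seconds between requests.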



