Common CMSes are fairly good at caching and can handle high load, but quite often someone deems a badly programmed extension "mission critical". In that case a single one of your requests might trigger dozens of database calls. If multiple sites share a database backend, an accidental DoS might bring down a whole organization.
If the bot has a distinct IP (or a distinct user agent), a good setup can handle this situation automatically. If the crawler switches IPs to circumvent a rate limit, or for other reasons, it often causes trouble in the form of tickets and phone calls to the webmasters. Few care about some gigabytes of traffic, but they do care about overtime.
Some react by blocking whole IP ranges. I have seen sites that blocked every request from the network of Deutsche Telekom (a Tier 1 provider and former state monopoly in Germany) for weeks. So you might affect many others on your network.
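One way to avoid that is to make your bot easy to identify and to keep your own request rate low. A minimal sketch, assuming the requests library; the user agent string, contact address, and delay are placeholders:

    import time
    import requests

    # A descriptive user agent with contact information makes it easy for a
    # webmaster to reach you instead of blocking your whole IP range.
    HEADERS = {"User-Agent": "example-research-bot/0.1 (contact: crawler@example.com)"}

    def polite_get(url: str, delay: float = 5.0) -> requests.Response:
        """Fetch a URL, then pause so the request rate stays low."""
        response = requests.get(url, headers=HEADERS, timeout=30)
        time.sleep(delay)  # wait between requests instead of hammering the server
        return response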
So:
* Most of the time it does not matter whether you scrape all the information you need in minutes or overnight. For crawl jobs I try to avoid the times of day when I expect high traffic to the site. So I would not crawl restaurant sites at lunchtime, but 2 a.m. local time should be fine. If the response time suddenly goes up at that hour, it can be due to a backup job. Simply wait a bit.
* The software you choose has an impact: if you use Selenium or headless Chrome, you load images and scripts. If you do not need those, analyzing the page source (for example with Beautiful Soup) draws less of the server's resources and might be much faster (see the sketch after this list).
* Keep track of your requests. A specific file might be linked from a dozen pages of the site you crawl; download it just once (a deduplication sketch follows below this list). This can be tricky if a site uses A/B testing for headlines and changes the URL.
* If you provide contact information, read your emails. This sounds silly, but at my previous job we had problems with a friendly crawler with known owners. It tried to crawl our sites once a quarter and was blocked each time, because its owners did not react to our friendly requests to change the crawling rate.
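To illustrate the point about lightweight parsing: a minimal sketch, assuming the requests and beautifulsoup4 packages are installed; the URL and the selected tags are just placeholders.

    import requests
    from bs4 import BeautifulSoup

    # Only the HTML document itself is fetched -- no images, scripts or
    # stylesheets are loaded, unlike with Selenium or headless Chrome.
    response = requests.get("https://example.com/menu", timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Extract only what you actually need, for example all second-level headlines.
    for headline in soup.find_all("h2"):
        print(headline.get_text(strip=True))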
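And for keeping track of requests: a rough sketch of a deduplication check that keeps the set of already fetched URLs in memory; a real crawler would persist it, for example in a database.

    import requests

    seen_urls = set()  # a real crawler would persist this, e.g. in a database

    def fetch_once(url: str):
        """Download a URL only if it has not been requested before."""
        if url in seen_urls:
            return None  # already downloaded, skip the request entirely
        seen_urls.add(url)
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.content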
Side note: I happen to work on a Python library for a polite crawler. It is about a week away from a stable release (one important bug fix and a database schema change for a new feature). In case it is helpful: https://github.com/RuedigerVoigt/exoskeleton
If you use Selenium with the Chrome WebDriver, you can disable loading images with: AddUserProfilePreference("profile.default_content_setting_values.images", 2)
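That method name is from the C# bindings; in Python the same preference can be set via ChromeOptions, roughly like this sketch:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # 2 means "block": the controlled Chrome instance will not load images.
    options.add_experimental_option(
        "prefs", {"profile.default_content_setting_values.images": 2}
    )
    driver = webdriver.Chrome(options=options)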