Hello HN!
As part of my learning in data science, I need/want to gather data. One relatively easy way to do that is web scraping.
However, I'd like to do that in a respectful way. Here are three things I can think of:
1. Identify my bot with a user agent/info URL, and provide a way to contact me
2. Don't DoS websites with tons of requests.
3. Respect the robots.txt
What else would be considered good practice when it comes to web scraping?
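For concreteness, points 1 and 3 could look something like the minimal sketch below. It uses Python's standard-library urllib.robotparser; the bot name, info URL, contact address, and the sample robots.txt are all made up for illustration, and a real scraper would download the site's actual robots.txt instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity: a name, an info URL, and a contact address.
USER_AGENT = "my-data-science-bot/0.1 (+https://example.com/bot; bot@example.com)"

# Made-up robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(may_fetch(parser, "https://example.com/public/page"))
print(may_fetch(parser, "https://example.com/private/data"))
print(parser.crawl_delay(USER_AGENT))
```

The same USER_AGENT string would then be sent in the User-Agent header of every request, so a site owner can see who is crawling and how to reach me.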
Next, put yourself in their shoes and realize that they usually don't monitor their traffic that closely, or simply don't care as long as you don't slow down their site. It's usually only certain big sites with heavy bot traffic, such as LinkedIn or sneaker sites, that implement bot protections. Most others don't care.
Some websites are built almost as if they want to be scraped. The JSON API used by the frontend is ridiculously clean and accessible. Perhaps they benefit when people see their results and invest in their stock. You never fully know whether a site wants to be scraped or not.
The reality of the scraping industry, as it relates to your question, is this:
1. Scraping companies generally don't use an honest user agent such as 'my friendly data science bot'. They hide behind a set of fake ones and/or route their traffic through a proxy network. You don't want to get banned that easily by revealing your user agent when you know your competitors don't reveal theirs.
2. This one is obvious. The general rule is to scrape continuously over a long time period and add large delays between requests, at least 1 second. If you go below 1 second, be careful.
3. robots.txt is controversial and doesn't serve its original purpose. It should be renamed to google_instructions.txt, because site owners use it to guide Googlebot around their site. It is generally ignored by the industry, again because you know your competitors ignore it.
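To make point 2 concrete, here is a rough sketch of a throttle that enforces a minimum gap between requests. The Throttle class and its parameters are names invented for this example, not part of any library; the 1-second default just mirrors the rule of thumb above, and the clock and sleep functions are injectable only so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # seconds between requests (assumed default)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep until min_delay has passed since the last call.

        Returns the number of seconds actually slept.
        """
        slept = 0.0
        if self._last is not None:
            remaining = self.min_delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

In a scraping loop you would call wait() right before each request, so the first request goes out immediately and every later one is held back until at least min_delay seconds after the previous one.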
Just remember the rule of 'don't piss off the site owner' and then go ahead and scrape. Also keep in mind that you are in a free country and we don't discriminate here, whether on grounds of race or gender or on whether you are a biological or mechanical website visitor.
I've simply described the reality of the data science industry around scraping after several years of being in it. Note that this will probably not be liked by the HN audience, as they are mostly website devs and site owners.