It's hilarious to think there exist people who believe Googlebot does not get special treatment from website operators. Here's an experiment you can do in a jiffy: write a script that crawls any major website and see how many URL fetches it takes before your IP gets blocked.
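
Rough sketch of what I mean (the URL, user agent, and pacing are all placeholders; any big site with bot protection will do):

    import time
    import urllib.error
    import urllib.request

    URL = "https://example.com/"  # placeholder; substitute a large site

    fetches = 0
    while True:
        req = urllib.request.Request(
            URL, headers={"User-Agent": "toy-crawler/0.1"}
        )
        try:
            urllib.request.urlopen(req).close()
        except urllib.error.HTTPError as e:
            # 403/429 usually means the anti-bot layer noticed you;
            # some sites just drop the connection instead
            print(f"blocked after {fetches} fetches (HTTP {e.code})")
            break
        fetches += 1
        time.sleep(0.1)  # even polite pacing rarely helps without an allowlist entry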

Googlebot has a range of IP addresses that it publicly announces so websites can whitelist them.




> Googlebot has a range of IP addresses that it publicly announces so websites can whitelist them.

Google says[1] they do not do this:

"Google doesn't post a public list of IP addresses for website owners to allowlist."

[1] https://developers.google.com/search/docs/advanced/crawling/...


From that same page, they recommend using a reverse DNS lookup (and then a forward DNS lookup on the returned hostname) to validate that it is Googlebot. So the effect is the same for anyone trying to impersonate Googlebot (unless they can attack the DNS resolution of the site they're scraping, I guess).
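
If anyone wants to implement that check, a minimal sketch of the reverse-then-forward dance in Python (the googlebot.com/google.com suffixes are the ones Google's docs mention; the sample IP is just one from Googlebot's historic range and may not verify forever):

    import socket

    def is_googlebot(ip: str) -> bool:
        # 1. Reverse DNS: the PTR record should sit under googlebot.com or google.com
        try:
            host, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # 2. Forward DNS on that hostname must map back to the original IP,
        #    otherwise anyone controlling their own PTR record could spoof it
        try:
            addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
        except socket.gaierror:
            return False
        return ip in addrs

    print(is_googlebot("66.249.66.1"))  # sample address from Googlebot's range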


I don't whitelist Googlebot, but I don't block it either, because its crawler is fairly slow and unobtrusive. Other crawlers seem determined to download the entire site in 60 seconds, and then download it again, and again, until they get banned.
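
The ban those aggressive crawlers run into is usually nothing fancier than a per-IP sliding-window counter, something like this (the thresholds are made up):

    import time
    from collections import defaultdict, deque

    WINDOW = 60      # seconds; made-up threshold
    MAX_HITS = 100   # requests allowed per window; also made up

    hits = defaultdict(deque)
    banned = set()

    def allow(ip: str) -> bool:
        """Sliding-window check: ban any IP exceeding MAX_HITS per WINDOW."""
        if ip in banned:
            return False
        now = time.monotonic()
        q = hits[ip]
        q.append(now)
        while q and now - q[0] > WINDOW:   # drop hits outside the window
            q.popleft()
        if len(q) > MAX_HITS:
            banned.add(ip)
            return False
        return True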


I have almost never had that problem running Screaming Frog on big-brand sites, apart from one or two times.


I don't scrape websites often, but when I do, I use the user agent of a major browser.


Do any of them intersect with Google Cloud IP addresses? If so, set up a VPN server on Google Cloud.



