There are two main reasons why I say nobody besides Google is really allowed to crawl the web.

The first is that Google gets much more access to pages on websites than everybody else. You can see this by examining the robots.txt files of various websites[0]. I've been doing this for several years now, and Google has a consistent edge across the many thousands of websites I've looked at. It adds up to a significant advantage, and many search engine operators complain about how it hampers their ability to compete with Google[1].
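
If you want to check this yourself, a rough sketch using Python's standard library robots.txt parser works (the domain, path, and non-Google user agents below are just placeholders):

    from urllib.robotparser import RobotFileParser

    # Placeholder site; point this at whatever robots.txt you want to inspect.
    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    page = "https://example.com/some/article"
    for agent in ("Googlebot", "bingbot", "SomeNewCrawler"):
        # Compare which user agents are allowed to fetch the same page.
        print(agent, rp.can_fetch(agent, page))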

The second is that Google gets to ignore the crawl-delay directive in robots.txt while other search engines don't[2]. Website operators cannot tell Google how fast they want their website crawled; they can only request that Google slow down. If another search engine tried to do what Google does, it would likely be blocked by many important websites.
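
The same parser exposes whatever Crawl-delay line applies to a given user agent, so you can see the asymmetry directly (again, the domain and agent names are placeholders):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    for agent in ("Googlebot", "bingbot", "*"):
        # crawl_delay() returns the applicable delay in seconds, or None if no rule applies.
        print(agent, rp.crawl_delay(agent))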

If you would like to read more about this, please check out https://knuckleheads.club/

[0] https://pdf.sciencedirectassets.com/robots.txt

[1] https://www.nytimes.com/2020/12/14/technology/how-google-dom...

[2] https://www.seroundtable.com/google-noindex-in-robots-txt-de...




So, uh, don't respect robots.txt in your search engine? It's not like there's a law that says you have to, or that you can't pretend to be Googlebot. The only real obstacle I can imagine is that some firewalls might be configured to be more permissive with traffic originating from Google subnets.
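
For what it's worth, "pretending" is nothing more than sending Googlebot's published user-agent string; a minimal sketch (the URL is a placeholder, and as pointed out below this is trivially detectable):

    import urllib.request

    # Placeholder URL; the header is Googlebot's documented desktop user-agent string.
    req = urllib.request.Request(
        "https://example.com/some/page",
        headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                               "+http://www.google.com/bot.html)"},
    )
    with urllib.request.urlopen(req) as resp:
        html = resp.read()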


If you straight up ignored robots.txt files, many website operators would block you fairly quickly and you would no longer be able to access those websites. You might even end up being served cease-and-desist letters, and sued, if you persisted and kept trying to find ways around the blocks.


And what if you do respect it, but follow the rules set for Googlebot?


Applebot got away with doing exactly this, but I imagine that's because it's Apple, and websites knew Apple was about to send them enough traffic via Apple News to make it worth their while. I don't know whether other search engine operators have tried it, but I would imagine they'd get caught by rate limiters set for non-Google IPs and then be blocked.


Still, you keep saying all that as if most websites even notice that they're being crawled, and as if their operators know exactly when and by whom. As if the admin gets a notification with precise details every time a crawler comes by. I don't think it's nearly as serious as you're trying to make it look.


I've been part of a team that operated a large website, and I've been paged because of issues somebody caused by crawling it too aggressively. Many people in the web operations field have had the same experience. Generally speaking, the larger the website, the more sensitive its operators are about who is crawling it and why.


To add another data point for you: I have had one of my websites brought down by Yandex bots before. There are also dozens of no-name bots (often SEO tools like ahrefs, semrush, etc.) that can sometimes cause trouble.

For me, the problem was having lots of pages combined with a high cost per request (due to the type of website it was).

For other websites, it is not necessarily about the volume of traffic from bots, but the risk of web scrapers getting their proprietary data. They're fine with Google scraping their info because that's where their traffic comes from. They're not okay with some random bot scraping them because it could be taking their content and republishing it, or scraping user profile data, or using it for some nefarious/competitive purpose.


> the risk of web scrapers getting their proprietary data

That's some weird logic, to me at least. That data is literally given away to everyone but some people or organizations can't have it? If you want to control access to it, maybe at least require people to register before they can see it? Is it even proprietary if it's public with no access control whatsoever?

This for-profit internet is just really such a parallel universe to me.


This is a question the courts are working through with LinkedIn and hiQ (https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...), as well as the Van Buren case (https://themarkup.org/news/2020/12/03/why-web-scraping-is-vi...).

It’s a different world where there are no laws or prices or contracts really.


> This for-profit internet is just really such a parallel universe to me.

I know I have been a contrary commenter in this thread, but I hear you with this. What a monster we have built, and what always gets me is how trivial everything is. So much capital is flowing through these ephemeral software systems that, if gone tomorrow, would be ultimately inconsequential to mankind.


I mean, it's ridiculous to think about, but there's this giant, many-billion-dollar online marketing industry that I essentially don't exist for. If it were gone tomorrow, I would indeed not notice, but it'd be the end of the world for some.

> and what always gets me is how trivial everything is

Whenever I read about corporations and how they work, I inevitably ask myself where the hell enough work to keep this many people busy even comes from. Everything is ridiculously overengineered to meet imaginary deadlines.


> That data is literally given away to everyone but some people or organizations can't have it?

It's often a question of quantity. LinkedIn probably doesn't care about you scraping a few profiles, but if you're harvesting every bit of their publicly-available data, then they get a little scared that you're building something that's going to compete with them.

Same with Instagram, or Facebook, for example. Though in this case it's probably more of a user-privacy issue - at least that's what they say.

It's not really weird logic to me - seems to make sense.

> If you want to control access to it, maybe at least require people to register

Most of the time they can't do this because they need the Google traffic. LinkedIn wants a result in the SERP for Bob Smith when you search for "Bob Smith" because that helps them get signups. Google won't list the page if that content is gated by a sign-in/register page.


There are syndicated blacklists that get fed into automatic traffic filters. Not to mention a surprising amount of the web is fronted by Cloudflare and other CDNs, making that kind of traffic detection and blocking more effective and widespread than you might expect.


Google tells website operators how to verify Googlebot in a way that can't be spoofed.
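
The documented check is a reverse DNS lookup on the requesting IP, then a forward lookup to confirm the hostname resolves back to that IP. Roughly, as a sketch rather than production code:

    import socket

    def is_verified_googlebot(ip):
        # Reverse DNS: a genuine Googlebot IP resolves to a hostname
        # under googlebot.com or google.com.
        try:
            host, _, _ = socket.gethostbyaddr(ip)
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP,
        # so a forged PTR record alone isn't enough to pass.
        try:
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.gaierror:
            return False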


It's a situation where the rules seem obvious, but the practical realities mean Google has the advantage as the incumbent. No business that relies on search traffic would dare block Google, but some upstart search engine will quickly end up on blacklists even with reasonably slow crawling.



