Hacker News
Only Google is really allowed to crawl the web (knuckleheads.club)
957 points by skinkestek on March 26, 2021 | 346 comments



The bigger problem, to me, is not around crawling. It's the asymmetrical power Google has after crawling.

Google is obviously on a mission to keep people on Google owned properties. So, they take what they crawl and find a way to present that to the end user without anyone needing to visit the place that data came from.

Airlines are a good example. If you search for flight status for a particular flight, Google presents that flight status in a box. As an end user, that's great. However, that sort of search used to (most times) lead to a visit to the airline web site.

The airline web site could then present things Google can't do. Like "hey, we see you haven't checked in yet" or "TSA wait times are longer than usual" or "We have a more-legroom seat upgrade if you want it".

Google took those eyeballs away. Okay, fine, that's their choice. But they don't give anything back, which removes incentives from the actual source to do things better.

You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically. Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.

In short, I don't think the crawler is the problem. And I don't think Google will realize what the problem is until they start accidentally killing off large swaths of the actual sources of this content by taking the audience away.


They are not just taking away internet traffic, but in the flights example, they actually acquired an aggregate flight/travel company and so they are actually entering markets and competing with their own ad customers.

Then it comes full circle to Google unfairly using their market position vis-a-vis data, search and advertising. It’s a win-win: Google lets the data dictate which markets to enter, and on one hand they can jack up advertising fees on customers/competitors and on the other unfairly build their own service into search above both ads and organic results.


Be careful when using Google Flights. Last time I checked, they use significantly tighter margins between flights, so it’s shorter trips but much riskier.


You can get screwed any time you book a connecting flight on two different airlines, even if the times aren't tight. For instance, if one is cancelled.

If you use the same airline they will make sure you get to the destination.


That's true, but it can save you a ton of money. You just have to be aware of the risks and plan accordingly.

I have typically used this strategy when flying back to the US from the EU. Take an EZJet or similar low cost airline from random small EU city to a larger EU city like Paris, London, Frankfurt, etc... and book the return trip to the US from the larger city. I've also been forced to do this from some EU cities since there was no connecting partner with a US airline.


The difference is mind-boggling in some cases. On one trip in 2019 I had the following coach fare choices for SFO - Moscow return trip tickets booked 3 weeks prior to departure.

* UA or Lufthansa round trip (single carrier): $3K

* UA round trip SFO - Paris + Aeroflot round trip Paris - Moscow: $1K

No amount of search could reduce the gap. I went with the second option. The gap is even bigger if you have a route with multiple segments.


https://www.airtreks.com/ will do this for you with a person. phenomenal service.

To anyone from airtreks, I love you so much!


Yeah this strategy is good, but you need to allow a long layover like 6 hours if you have to go through immigration and change airports for the connection which happens pretty often with ryanair and ezjet. It’s a big pain, but it does save money.


If you're booking each leg with a different carrier, I find it best to pay the little extra with kiwi.com, and they give you a guarantee for the connection. I missed a connection twice and they always got me on the next flight to the destination for free.


In my ideal world, the software ITA wrote for airlines, which is now owned by Google, would be in the hands of consumers, and the airlines could have adapted to shifts in demand, probably without the abrupt cessation of services and the fatigue on industry employees caused when route optimisation analysis tempts executives with what I suspect are ultimately fictitious net present savings.


> even if the times aren't tight

Depending on the definition of "tight" each of us has. I remember having 40 mins in Munich, and that is a BIG airport. Especially if you disembark on one side of the terminal and your flight is on the far/opposite end. That's 25-30 mins of brisk walking. With 5000 people in between, you could well miss your flight. No question of stopping to get a coffee or a snack: you'll miss your flight.


It doesn't really matter if it's on the same airline, it just has to be in the same reservation. Usually, that is the same thing; however on international and hyper-local (the kind that end with a Cessna) flights you'd often have several airlines with codeshares, and you could buy two separate tickets on the same airline if you wanted.


Another approach that's interesting is "buy long, fly short". Sometimes buying A->B->C and getting off at B is cheaper than just buying A->B. But airlines can cancel the A->B->C flight, replace it with an A->B and an A->C, and place you on the A->C flight.


Can you elaborate on this? Do you mean shorter layovers?


It sounds like it - and third-party companies will often show you flights that involve different companies on the different legs - which can leave you in a pickle because technically each airline's job is to get you to the end of THEIR flight, not the entire journey.


And sometimes with a change of airport!


I remember when in Germany some budget airlines used to say they'd fly to "Frankfurt" (FRA) but actually flew to "Frankfurt-Hahn" (HHN) - 115km away. After arrival in HHN they put you on a bus to FRA that took about 2 hours.


Oh don’t worry, you have 15 on-paper minutes to go from A1 to A70 in Detroit... in January... and the shuttle is down.


Even before it gets to that point, they routinely display snippets off regular websites and show ads next to it.

Keeping users from clicking through to organic results helps them generate more revenue.


Here’s a thought: most companies don’t actually want to serve a ton of extra pages. For example, airlines just want to fly passengers. They don’t care who puts those butts in seats and they would fully acknowledge that they aren’t able to deliver a better flight search than Google can. I mean, sure, some small team of web developers at every airline is pissed, but the CEO needs butts in seats to keep the pilot and service unions off her back. Her own web dev team is the least of her problems.


"For example, airlines just want to fly passengers. They don’t care who puts those butts in seats"

Not sure who you talked to, but I've never heard that before. They all want to sell more direct and forego GDS fees and/or other types of fees and commissions. I'd love to see a quote from an airline VP or above that they don't care about their distribution model, boosting direct sales percentages, etc.


Sure, but if you aren’t controlling the experience of getting butts in seats, it’s harder to upsell and make even _more_ money from those butts.


the OP wasn't talking about google competing with airlines, but with other flight aggregating and search/booking services, by abusing their monopoly on web search.


> Google lets the data dictate which markets to enter and on one hand they can jack up advertising fees on customers/competitors and unfairly build their own service into search above both ads and organic results.

Just like Amazon with Amazon Basics.


Isn't the solution to disaffective acqui-hiring the restoration of the ability to IPO companies like ITA Software, so there's another way to reach financial security for talented programmer-entrepreneurs? For that matter, do the conditions for recreating the frequency of lower-level programmer millionaires (hardly a family home debt free today) like Microsoft created also require the recreation of senior executive abuse of option schemes? It seems to me that making it a reasonable chance of becoming at least financially secure for not irrational amounts of dedication and 90 weeks, and underwriting that with the greater robustness of larger companies and hence livable salaries, would be better than trying to sustain the apparent startup free-for-all figuring to a common heat death of the advertising budgetary universe.


Aren't there anti trust laws to prevent this kind of thing?


Antitrust laws are hard to enforce in the United States.

Monopolies themselves aren't illegal. To be convicted of an antitrust violation, a firm needs to both have a monopoly and needs to be using anticompetitive means to maintain that monopoly. The recent "textbook" example was of Microsoft, which in the 90s used its dominant position to charge computer manufacturers for a Windows license for each computer sold, regardless of whether it had Windows installed or was a "bare" PC.

Depending on how you define the market, Google may not even have a monopoly. It's probably dominant enough in web search to count, but if you look at its advertising network it competes with Facebook and other ad networks. In the realm of travel planning (to pick an example from these comments), it's barely a blip.

Furthermore, Google can potentially argue it's not being anticompetitive: all businesses use their existing data to optimize new products, so Google could claim that refraining from doing so would be an artificial straitjacket.


It's not that hard, we're just out of practice due to the absurd Borkist economic theories we've been operating under for 40+ years. The laws are all there if the head of the DOJ antitrust division has the gumption to go reverse some bad precedents.

> In the realm of travel planning (to pick an example from these comments), it's barely a blip.

They used their monopoly in web search to gain non-negligible market share in an entirely unrelated industry. That's textbook anti-competitive behavior.

Google can argue whatever they want, but the argument that they're enabling other businesses is a bad one. It casts Google as a private regulator of the economy, which is exactly what antitrust laws are intended to deal with.


Is web search even a "market" independent of ads?


yes


Where's the money?


That's like arguing that newspapers are not a market because they make money from ads.


No, I was asking if web search was a market independent of ads.

BTW, newspapers also make money from subscriptions and sales of copies, so your analogy is doubly wrong.


In collecting and selling your data to 3rd parties.


Just tell people to stop using google. Go direct.


Upvoted - regardless how pointless some people might think this comment is, it really is the ONLY way that Google is going to drop out of its aggregate lead position.

Enough people realizing Google is trapping and cannibalizing traffic to the other sites it feeds off of, and choosing to do other things EXCEPT touching Google properties, is THE ONLY way they'll be unseated.

No clear legal path to stop a bully means it's an ethical / habit path.

Not saying there's any easy way, just that this is it.


I find those little snippets actually mostly worthless, maybe because I’ve seen enough of them taken out of context or basically using a snippet from someone who figured out SEO properly, meanwhile the correct information may be down a couple links or not there at all.


Laws are one thing and enforceability another.

In fact Google had to make certain concessions in order for the Google Flights acquisition to get regulatory approval.

IIRC a Chinese wall between Google data and Google Flights... but like many regulations, they were likely written by Google lobbyists, aka the industry experts. Because at the end of the day Google Flights: 1. Still has the built-in widget above organic results and 2. Still bids on its own ad spots, jacking up costs on competition, which are ultimately passed on to consumers.


Antitrust actions in the US tend not to hit the big tech players as much as they do other sectors. Also, there is actually a debate in the judicial system about the extent of antitrust laws themselves.


The Chicago school basically published a bunch of position papers that made feudal corporations a legal entity that "aren't monopolies" because they give things away for free. Because the consumer isn't paying, it can't be bad.


The current anti-trust doctrine in the US has a goal of protecting consumers - not competition. What Google is doing is arguably great for consumers but awful to their competitors/other organizations. Technically, companies can simply block Google using robots.txt - but in reality that will lose them more money than the current partial disintermediation by Google is costing them - and Google knows this.

It's a tall order to convince the courts that Google's actions harm consumers, or are illegal: after all, being innovative in ways that may end up hurting the competition is a key feature of a capitalist society - proving that a line has been crossed is really hard, by design.
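
For reference, the robots.txt opt-out mentioned above is only a few lines long. Below is a minimal sketch, assuming a made-up site (example.com) that turns away Google's crawler while allowing everyone else. The rules and URL are hypothetical, and the check uses Python's standard-library urllib.robotparser rather than anything Google runs. It shows how cheap the block is technically, which is why the lost traffic is the real cost.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site; not copied from any real airline or publisher.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/flight-status/UA123"  # made-up URL
    print("Googlebot allowed:", is_allowed("Googlebot", url))
    print("SomeOtherBot allowed:", is_allowed("SomeOtherBot", url))
```

Note that robots.txt is a request that well-behaved crawlers honor, not an access control mechanism.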


consumers are in this case the advertisers.

google has a monopoly on search ads and does enforce it, being a drain on the economy since in many fields you only succeed if you spend on search ads


> consumers are in this case the advertisers.

If someone could convince the courts that this is correct, then I'm sure Google would lose. However, I bet dollars to donuts Google's counter-argument would be that the people doing the searching and quickly finding information are also consumers, and they outnumber advertisers and may be harmed by any proposed remediation in favor of advertisers.


Google's answer to this at yesterday's hearing:

Search isn't a single category. If you break it down, they aren't a monopoly. For example, half of PRODUCT SEARCHES begin on Amazon. It's probably hard to argue Google is a monopoly if who they see as their main competitor has half the market share.


The US is so behind in identifying markets in technology, which is what is leading to this dominance by a few companies and their resulting monopoly-like power. We had already figured out that you can be dominant in only a subset of a market. For example, Disney was forced to sell off Fox sports channels when it purchased Fox because it already owned ESPN and would have dominated sports TV. That’s the thing: it wasn’t even just TV but a subset, sports TV. That identification is where we are behind. As of now, no one in the FTC knows what makes a market, or why, say, YouTube and Facebook may both show large amounts of video content but are absolutely not competitors in the video content space. It is because the functions are completely different. YouTube is barely a social network at all despite having users and comments and pages. Facebook is hardly a video platform at all because it isn’t profitable for users to focus on Facebook videos and make ad money.


Anecdotally, it's true for books too. Amazon is a great way to figure out which books are the best reviewed before deciding to get one (whether paper, kindle or 'other means').


That's so disingenuous there should be a new term for it.


gSplain or gWash


Which is shortsighted. If competition did not benefit consumers, there would be no need for it anyway.


Anti-trust above all recognizes competition benefits consumers. And so “unfairly competing” is prohibited, because it is bad for the market, thus bad for consumers.


Yes, but they lack enforcement.


That depends, would Google let us know?


not if they could avoid it


Wikipedia isn't monetized. Doesn't it benefit them if Google is serving their content for free and people are finding the information they want without having to hit Wikipedia??

And also, isn't Google the largest sponsor for Wikipedia already? In 2019 - Google donated $2M [1]. In 2010, Google also donated $2m [2].

[1] https://techcrunch.com/2019/01/22/google-org-donates-2-milli...

[2] https://en.wikipedia.org/wiki/Wikimedia_Foundation


> Wikipedia isn't monetized.

No, but they often ask for donations when you visit the site, which people won't see if they just see the in-line blurb from Wikipedia on the Google results page.

> In 2019 - Google donated $2M [1]. In 2010, Google also donated $2m [2].

$2M is a pittance compared to what I expect Google believes is the value of their Wikipedia blurbs. If Wikipedia could charge for use of this data (which another commenter claims they are working on doing), they could easily make orders of magnitude more money from Google.

Of course, my expectation is that Google would rather drop the Wikipedia blurbs entirely, or source the data elsewhere, than pay significantly more.


Unlikely that Wikipedia will be able to charge for content, seeing as all of their content is CC-BY-SA licensed. https://en.wikipedia.org/wiki/Wikipedia:Licensing_update

They may be able to charge for bandwidth (if you want to use a Wikipedia image, you can use Wikipedia's enterprise CDN instead of your own), but their licensing allows me to rehost content as long as I follow the attribution & sublicensing terms.

Google has no problem operating their own CDNs, so I find it unlikely that Wikipedia will be able to monetize Google search results in such a manner as you described.

Disclaimer: I work for Google; opinions are my own.


Fine, the content is free. But if your crawlers want access to the content, then pay! Simple as that.


Will it be a flat fee, so that I, a lowly one-man crawler developer will not be able to afford it? Will it be that only Google can afford it, thus making their monopoly position even stronger?

Is there a Wikipedia crawling "welfare" program if I'm not a trillion dollar mega company?


Actually, yes! The current API is free. The new Enterprise API is paid.


Sure! Apply to become a crawler. And if you meet certain criteria and your crawlers don’t exceed a quota then have at it. The key is not to make it technically challenging, but to erect a legal barrier.


Wikimedia recently announced Wikimedia Enterprise for "organizations that want to repurpose Wikimedia content in other contexts, providing data services at a large scale".

So they're pretty clearly looking to monetize organizations which consume their data in a for-profit context.


monetizing != for-profit

You could e.g. just cover operational cost and/or improve the service quality from it.


I think they may have meant "(organizations) (which consume their data in a for-profit context)."


Google was/is also the largest sponsor of Mozilla. This doesn't stop Google from sabotaging Mozilla.

2 mln is probably Google's hourly profit. For that they get one of the biggest knowledge bases in the world. It's basically free as far as Google is concerned.

The instant Google becomes confident they can supplant Wikipedia, they will.


> 2 mln is probably Google's hourly profit.

You don't have to guess, their numbers are public. In 2020 they made $40B in profit, so it takes them about 26 minutes to make $2M in profit.
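
A quick sanity check of that arithmetic, as a rough sketch. It assumes the $40B figure quoted above is earned evenly around the clock over the 366 days of 2020 (a simplification), and the function name is just for illustration.

```python
def minutes_to_earn(target_usd: float, annual_profit_usd: float, days_in_year: int = 366) -> float:
    """Minutes needed to accumulate target_usd, assuming profit accrues evenly all year."""
    minutes_in_year = days_in_year * 24 * 60
    profit_per_minute = annual_profit_usd / minutes_in_year
    return target_usd / profit_per_minute


if __name__ == "__main__":
    # $2M donation vs. the $40B annual profit figure quoted in the comment above.
    print(round(minutes_to_earn(2_000_000, 40_000_000_000), 1))  # prints 26.4
```

Either way, it is well under an hour, so the "hourly profit" guess was, if anything, generous.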


NOT a sponsor of Mozilla. Google buys web traffic (as default search engine) for ~$300M and turns it into several times that $ in ad revenue.


Not sure why you're being downvoted; I completely agree with what you're saying (modulo questionable usage of "sponsor"). If Wikipedia were to try to charge for this use of their data, Google would likely make it a priority to drop the Wikipedia blurbs, either without replacement, or with data sourced elsewhere.


> Google would likely make it a priority to drop the Wikipedia blurbs, either without replacement, or with data sourced elsewhere.

That's an odd way of phrasing things. If Wikipedia were to take away free access to their data, Google wouldn't be dropping Wikipedia, Wikipedia would be dropping Google. This line of thinking "you took this when I was giving it away for free, but now I want to charge for it, so you are expected to keep paying for it" is incorrect.


Given the scale that Google already operates at, I don't doubt that they would just take a copy of the content and rebrand it as a Google service, complete with user contribution.

Then, after two or five years, let it fester then abandon it. Nobody gets promoted for keeping well oiled machines running.


Remember Knol? https://en.wikipedia.org/wiki/Knol?wprov=sfti1

It was actually good for writing stuff when I tried it. Never brought in enough traffic. Killed.


> Google was/is also the largest sponsor of Mozilla. This doesn't stop Google from sabotaging Mozilla.

Google isn't a sponsor of Mozilla, they're a customer. Do people think Google is "sponsoring" Apple with $1.5 billion a year too?


> they're a customer.

The cynic in me thinks the product is anti-trust insurance.


$1.5 billion a year? You're off by an order of magnitude; the number is thought to be over $10 billion a year.


Google being Apple's customer doesn't mean Google isn't sponsoring Mozilla.

These are two very different companies with a very different relationship with Google. And very different influences on Google.

Google wants to be on iOS. It brings customers to Google. A lot of them. iOS is possibly more profitable to Google than Android even with all the payments Apple extracts from them.

Google needs Mozilla so that Google may pretend that there's competition in browser space and that they don't own standards committees. The latter already isn't really true, and Google increasingly doesn't care about the former.


Well then they can't nag users to donate to Jimmy Wales' trust fund.


Couldn't you make a similar argument about for-profit uses of free/libre software? The software serves a useful purpose, who cares where it came from?


>>Google donated $2M [1]. In 2010, Google also donated $2m [2].

$2 Million a year? Now I know why Googlers complained about having one less olive in their lunch salad.

How much does Google PROFIT from Wikipedia, and how much does Wikipedia lose in fundraising when Google fails to send users to the info provider?


Wikipedia is drowning in money so this whole line of discussion is weird.

And most of the value of wikipedia is created by its unpaid users, not Wikimedia foundation.


> You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically.

Wikipedia visitors, edits, and revenue are all increasing, and the rate that they're increasing is increasing, at least in the last few years. Is this a claim about the third derivative?

> Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.

The Wikimedia Enterprise thing seems like it has nothing to do with missing visitors and that companies ingesting raw Wikipedia edits are an opportunity for diversifying revenue by offering paid structured APIs and service contracts. Kind of the traditional RedHat approach to revenue in open source: https://meta.m.wikimedia.org/wiki/Wikimedia_Enterprise


See https://searchengineland.com/wikipedia-confirms-they-are-ste... from 2015. Google's widgets that present Wikipedia data do reduce visitors to Wikipedia.

Or see page views on English Wikipedia from 2016-current: https://stats.wikimedia.org/#/en.wikipedia.org/reading/total... Looks pretty flat, right? Does that seem normal?

As for Wikimedia Enterprise, you do have to read between the lines a bit. "The focus is on organizations that want to repurpose Wikimedia content in other contexts, providing data services at a large scale".


The first link doesn't seem quite conclusive (see the part at the bottom), and also doesn't give evidence that Google's widgets are to blame.

The flattening of users could also be due to a general internet-wide reduction in long-form (or even medium-form) non-fiction reading. How are page views for The New York Times?

Seems like it should be simple to A/B test, though. Obviously Google could do it themselves by randomly taking away the widget, but one could also see whether referrals from non-Google search engines (though they are themselves a tiny percentage) continue to increase while Google remains flat.


Edit: Removed bad "simple english graph", thanks. Though the regular english wikipedia traffic is flat from 2016-present.

As for NYT, is there a better proxy to compare to? There's no public pageview stats and they have a paywall.


That first graph is Simple English, not English, and is in millions, not billions. They also explicitly call out the methodology change in 2015...


I'm not sure I agree with this. I think airline websites are so garbage filled that they've driven people to use the simple alternative of the google flights checkout.

It's a bit of a vicious cycle, but in general most websites are so chock-full of crap that not having to click into them for real is a relief!


BA had some tracking request inline on the “payment processing” page which, when blocked by my pihole, prevented me from ever getting to the confirmation page; you just have to refresh your email and hope for the best.

I have no idea how these companies, which make quite a decent amount of money at least up until 2020, can have such utterly poor sites.

I once counted some 20+ redirects on a single request during this process heh..


I don’t know what they’re doing but most every single sign on tool I’ve seen redirects 10-20 times during the sign on process (and then dumps you to the homepage to navigate your way back).


Probably to get first party cookies on a handful of domains


Yeah, the Google flights issue is difficult. On one hand, the business practice is problematic. On the other hand, Google flights is so much better than its competitors it's ridiculous.

If there was a way to split Google flights into a separate company and somehow ensure it wouldn't devolve into absolute trash like its competitors, that would be a good thing.


It was ITA, and prior to Google buying them, they did a pretty good business selling backend flight shopping services to aggregators and airlines.

Shopping for flights is a surprisingly technically difficult thing to do well.


I'm talking about flight status. Not Google Flights, shopping, or booking.

There are events associated with flight status that Google doesn't know. Like change fee waivers, cash comp awards to take a later or earlier flight, seat upgrades, etc.


It's not Google's prerogative to scrape a website and display its content, no matter how awful the website.


If 1 airline let me view information in a friendly fashion and the other didn't I would do business with the first.

Lest we forget, the money in that scenario is from butts in seats, not clicks on a website. The particular example is ill chosen, as Google is actually taking on a cost, taking nothing, and gifting the airline a better UI.


If you make an awful website that can be scraped, it's a matter of when, not if, someone will take your data and give it to your consumers, whether you're trying to upsell them or not...


>However, that sort of search used to (most times) lead to a visit to the airline web site.

I don't think that's correct. In the old days you'd either call a travel agent or use an aggregator like expedia.

Google muscles out intermediaries like Expedia, Yelp, and so on. It's not likely much better or worse for the end user or supplier. Just swapping one middleman for another.


I can't prove it was that way, but I spent a lot of time in the space. For a long time, the airline's site used to be the top organic result, and there was no widget. Similar for other travel related searches (not just airlines) over time. Google has been pushing down organic results in favor of ads and widgets for a long time...and slowly, one little thing at a time. Like no widgets -> small widget below first organic result -> move the widget up -> make it bigger -> etc.


I don't think google muscling out intermediaries like Expedia is a good thing.

Just for example, Expedia is probably 5% of Google's total revenue and Google doesn't like slim margin services by and large that can't be automated.

Travel is fairly high-touch - people centric. It doesn't fit Google's "MO".

But... it's shitty that Google can play all sides of the markets while holding people to ransom for massive sums of money to pay to play on PPC where Google doesn't... I think that's where the problem shines.

In essence, you're advocating that eBay goes away because google could do it... they could.. and eBay is technically just an intermediary, but do we want everything to be googlefied?

Google bought up/destroyed other aggregators - remember the days of fatwallet, priceline, pricewatch, shopzilla and such when they used to focus on discounts/coupons/deals and now they're moving more towards rewards/shopping/experience - it used to be I could do PPC on pricewatch and reach millions of shoppers at a reasonable rate, but now that Google destroyed them all, the PPC rate on "goods" is absurdly high and not having an affordable market means only the Amazons and Walmarts can really afford to play...

it used to be you could niche out, but even then, that's getting harder


>In essence, you're advocating that eBay goes away because google could do it... they could.. and eBay is technically just an intermediary, but do we want everything to be googlefied?

I don't think I'm really advocating for it as much as I see it as a more or less neutral change.

That said, I'm pretty ambivalent about Google. Their size is a concern, but they also tend to be pretty low on the dark pattern nonsense. eBay, to use an example you gave, screwed me out of some buyer protection because of poor UX and/or bug (I never saw the option to claim my money after the seller didn't respond). In this specific instance Google ends the process by sending you to the airline to complete the booking. That, imho, is likely better than dealing with Expedia.


Companies opt in to sites like Expedia and list their properties/flights/vacations on their marketplace and they pay a commission for those being booked. Expedia doesn't just crawl them and demand a royalty for sending them traffic...

Google has a huge pay 2 play problem with PPC... i've worked for Expedia so that's the only reason i know this :)

It's the reason companies work with Expedia many times because they don't have the leverage expedia group does...

i see it as unnatural change btw... "borg" if you will.


It's actually pretty different because another middleman can basically arise only if it's a big success in the iOS App Store because coming up in Google searches would be impossible and more or less the same in the Play Store. So, Google is not just yet another intermediary.


Only if Google stays around long term. I wouldn't be surprised if each free product in its graveyard took down a dozen competing products before it was killed off.


Then someone can start a competitor up again, right? Assuming there's actually a market for it.


Not every market is lucrative in the extreme and it can take a long time to recover from being "disrupted". I think it is also a common practice for larger shopping chains to dump prices when they open a new location in order to clear out the local competition, so the damage it causes is well understood to be long lasting.


In regards to airlines, Google and Amadeus have a partnership, I believe. Amadeus is the main source of data for many of these airline websites. If Google gets the data from Amadeus directly and not these websites, they are just cutting out the middleman. I don't shed a tear for any of these middlemen (together with their Dark Pattern UX design).


Amadeus isn't a source of flight status. It is a source for (some) planned schedules and fares. Global distribution systems are a complex topic that's hard to sum up on HN. For flight status, Google is pulling from OAG and FlightAware, and also from airline websites, though they don't show airline sites as a source.


> And I don't think Google will realize what the problem is until they start accidentally killing off large swaths of the actual sources of this content by taking the audience away.

What makes you think they care? Killing off the sources of content might even be their goal. If they kill off sources of content, they'd be more than happy to create an easier-to-datamine replacement.

Hypothetically, if they killed off wikipedia, they are best placed to use the actual wikipedia content[1] in a replacement, which they can use for more intrusive data-mining.

Google sells eyeballs to advertisers; being the source of all content makes them more money from advertisers while making it cheaper to acquire each eyeball.

[1] AFAIK, wikipedia content is free to reuse.


I'm not suggesting it's illegal. There are a great many practices that are legal that I dislike.


You're wrong on a lot of facts here. Google Flights doesn't get its data just by crawling, they get it from Sabre, the FAA, Eurocontrol, etc. Airlines are, obviously, extremely pleased to disseminate this information. Google Flights "gives back" in the exact same way as any other travel outlet: they book passengers.

As for Wikipedia, the WMF is quite happy that most of their traffic is now served by Google. WMF is in the business of distributing knowledge, not in the eyeballs business. Serving traffic is just a cost for them. The main problem has been that the average cost for Wikipedia to serve a page has gone up, because many readers read it via Google, and more people who visit Wikipedia are logged-in authors, which costs them more to serve. I'm sure there's an easy solution to this problem (for example, beneficiaries of Wikipedia can donate compute facilities and services, or something along those lines).


They don't get individual flight status (what I was talking about) from Sabre or the FAA or Eurocontrol. I didn't get into fares and planned schedules and Google Flights, that's a different topic. I was talking about the big widget you get for queries on status for a particular flight, which is not Google Flights.

They have relented in some ways, rolling out stuff in the widget like: "The airline has issued a change fee waiver for this flight. See what options are available on American's website"

But obviously, that kind of stuff isn't shown on Google for quite some time after it exists on the source site. And the widget pushes the organics off the fold unless you have a huge monitor.

As for Wikipedia, I was referring to this: https://news.ycombinator.com/item?id=26487993

"Airlines are, obviously, extremely pleased to disseminate this information"

In the same way that publishers love AMP, yes. They don't actually like it, but they are forced to make the best of it.


Oh, status. I was thinking of schedules. Still, what is the point for the consumer of being directed to an airline's terrible status page? And are they even capable of being crawled? Looking at American's site (it was the most ghastly airline that sprang to mind) I don't see how a crawler would be able to deal with it, and indeed the Google snippet for AA flight status, on the aa.com result which is far down in the results page, just says "aa.com uses cookies" which is about what you'd expect.

In this case, I want to be sent literally anywhere but aa.com.


"what is the point for the consumer of being directed to an airline's terrible status page?"

One example...

If you back up a bit, the widget didn't use to tell you there was a change fee waiver when the flight was full, while aa.com did.

That's an actual, tangible benefit that a consumer might want, worth real money. You can also even often "bid" on a dollar amount to receive if you're willing to change flights. Google doesn't present that info today.

There are more examples. My perspective isn't that Google should lead you to aa.com, but I do feel it's a bit dishonest that the widget is so large it pushes aa.com below the fold. It doesn't need to be that large.


That's the result of the crawling, and it preventing competition. Google would much prefer that people complain about the details while ignoring the root cause.


I don't understand that. The crawling access is mostly the same as it ever was. Google's SERP pages are not. A mutually beneficial search engine that respects its sources would still crawl the same. Google used to be that.

The core problem is incentives: http://infolab.stanford.edu/~backrub/google.html "we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm."


That's incorrect. Before the search oligopolies formed, new search engines could start up. There was Excite, HotBot, AltaVista, and more. Now they don't have access. Search these comments for census.gov.


There are companies that do pretty well in this space, like ahrefs, for example. They do resort to trickery, like proxy clients that look like home computers or cell phones. But, if a small entity like ahrefs can do it, anyone can do it.

In a nutshell, though, I don't see equal access for all crawlers changing anything. Maybe that's the first barrier they hit, but it isn't the biggest or hardest one by far. Bing has good crawler access, but shit market share.


I swear something like 50% of those digests are totally incorrect as well. It's amazing they have kept the feature because it has never had a very high signal-to-noise ratio. I never trust what's presented in these digests without double-checking the source page.


I remember when rich snippets (one type of those widgets) came out there were a lot of funny examples. One for a common query about cancer treatments that pulled data from a dodgy holistic site saying that "carrots cured most types of cancer" (or something like that).

There was a similar one where Google emphatically claimed a US quarter was worth five cents in a pretty and large snippet graphic.


The most memorable rich snippet humor I've seen is a horse breeder sharing a story of how her searches gave snippets with my little ponies as the preview image.


I recall that in the last UK election Google got the infographic of party leaders about 60-70% wrong.

And quite often a "People also ask" refinement is just some random guy's comment from Reddit.


Have you heard the story of Thomas Running? It’s a story Google will tell you.

(Search who invented running)


> The airline web site could then present things Google can't do. Like "hey, we see you haven't checked in yet" or "TSA wait times are longer than usual" or "We have a more-legroom seat upgrade if you want it".

If I'm a passenger, there are plenty of ways for airlines to notify me. If I'm searching for a flight status online, it's because I'm picking someone up. If I want more information, I'll click through. I don't see how either the airline or I am hurt.

> You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically. Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.

Why is that even a problem for Wikipedia?


It's a progression. It's not a huge problem, by itself, for either. But Google shareholders want to continue the same YoY gains. The only cash cow is search, so they continue to take screen real estate that used to go to others, and take it for themselves. Whether that's more ads, or more widgets, or whatever.

Yes, it's legal. But it does reduce visitor interactions for those sites. Reduced visitor interactions aren't good for web sites... they take away incentives, reduce brand value, reduce revenue. Eventually, that is not great for consumers.

Ever been a middleman? Squeeze your suppliers enough, and you kill them. The next supplier will pre-emptively cut quality, features, etc, because they know you're going to try and squeeze them to death.

"If I'm searching for a flight status online, it's because I'm picking someone up"

That's one use case, it's not all of them. There's middle ground too, like the bones they throw Wikipedia in the form of links for more info.


> You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically. Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.

Do you have a link of that product/service from Wikipedia?



Thanks


> In short, I don't think the crawler is the problem.

Except that if you allow other companies to crawl and compete, you can take eyeballs away from Google (which may well then return eyeballs to Wikipedia, so long as the Google competitors don't also present scraped data).


They're making it easier to search for flights and arrange a trip. It's UX and makes me not hate the airlines/travel process as much. And I end up buying the flight from the airline anyways, and in many cases doing the arranging on the airline site in the end once it's determined, so Google is giving that back. They're not taking stuff from the airlines; I mean, what ads and stuff are on the airline sites anyways, specifically during the search process? Where they are taking away is from the Expedias and other aggregation sites that offer a garbage/hodgepodge experience that drives people crazy.


You're talking about Google Flights, which is completely unrelated to flight status.


The way that the web has been fundamentally broken by Google and other companies is one of the reasons I am excited about an alternative protocol called Gemini. It doesn't replace the web entirely, but for basic things like exchanging information, it's great. https://gemini.circumlunar.space/


>Google's widgets have been reducing traffic to Wikipedia pretty dramatically.

But wouldn't this be a good thing? Since wikipedia is a nonprofit aiming to provide knowledge, google stealing & caching their content might help serve information to more while reducing wikipedia's server load, so IMO it might not be so bad for wikipedia.


Large swaths of web are garbage. Wasting people's time and attention on visiting pointless sites for something presentable in a small box is obviously not economical.

And if some of the sources somehow die? New sources will spring up. It doesn't matter.


> You see this recently with Wikipedia. Google's widgets have been reducing traffic to Wikipedia pretty dramatically. Enough so that Wikipedia is now pushing back with a product that the Googles of the world will have to pay for.

What product is that?


I think flight arrivals/departures is a bad example. A good example might be putting flights.google.com on the first page, or even allowing it to exist.


Not sure what you mean. Both for flight status as well as flight shopping, Google drops a huge widget at the top and pushes everything else down, below the visible fold.


Standardized interoperability enables overall progress.

Every airline doesn't need their own webpage. They could all provide a standard API.


"Every airline doesn't need their own webpage. They could all provide a standard API."

That's sort of how it works in the corporate booking tool world. It is decidedly not a better experience for end users, IMO.

There's quite a lot about each airline that is different, so any unified approach is a lowest common denominator. You'll notice things like loyalty points, for example, have more rich data on the airline's website. And that some fares are ONLY on the website. Or that seat maps have more useful detailed info, etc.

And that's all shopping/booking. Departure control, flight status, upgrade/downgrade, check-in, seat upgrades, standby, etc, are for the most part only on the airline's website.


The way to look at this from Google’s point of view is to realise that most websites are slow and bad[1], so if Google sent you there you would have a bad experience with a bad slow website trying to find the information you want. Google want to make it better for you.

[1] it feels like Google have contributed a lot to websites being slow and bad with eg ads, amp, angular, and probably more things for the other 25 letters of the alphabet.


> Google want to make it better for you.

Hehe, sure, nothing nefarious or greedy here... move along, move along, nothing to see...


I've noticed that sometimes Google had updated flight information before the displays at the airport.


For the most part individual airports own that infrastructure. So it's hard to generalize. For most types of notable flight status/time changes, however, airlines usually know first.

There are exceptions, like an airport-called ground stop.


Does the concierge of a hotel take anything away when he informs you that your flight has been delayed?


It's hard for me to make that an apt analogy. She's not well known as a portal to find websites, which is what Google had been for most of its existence.

It's pretty difficult to come up with a non-computer analogy for how Google works now. Pick a different space, and the power imbalance is quite clear. If they wanted, they could destroy StackExchange very quickly with these widgets.


Only if you are presuming they are going to make a question answering widget, too, since the content on stackexchange doesn't materialize out of thin air


Perhaps I am misunderstanding or over simplifying things but it always surprises me that there are legal cases brought against companies who scrape data when so many of Google's products are doing exactly this.

It definitely feels like one set of rules for them and a different set for everyone else.


I mean it's not that weird that a company would authorize major search engines scraping them but no one else.

I don't really see this as Google playing by different rules so much as economic incentives being aligned in Google's favor.


Google doesn't scrape anything that the site owner objects to.


https://knuckleheads.club/the-googlebot-monopoly/ has actual details.

> Let’s take a look at the robots.txt for census.gov from October of 2018 as a specific example to see how robots.txt files typically work. This document is a good example of a common pattern. The first two lines of the file specify that you cannot crawl census.gov unless given explicit permission. The rest of the file specifies that Google, Microsoft, Yahoo and two other non-search engines are not allowed to crawl certain pages on census.gov, but are otherwise allowed to crawl whatever else they can find on the website. This tells us that there are two different classes of crawlers in the eyes of the operators of census.gov: those given wide access, and those that are totally denied.

> And, broadly speaking, when we examine the robots.txt files for many websites, we find two classes of crawlers. There is Google, Microsoft, and other major search engine providers who have a good level of access and then there is anyone besides the major crawlers or crawlers that have behaved badly in the past that are given much less access. Among the privileged, Google clearly stands out as the preferred crawler of choice. Google is typically given at least as much access as every other crawler, and sometimes significantly more access than any other crawler.
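
To make the two-class pattern concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt text is a made-up illustration of the pattern the article describes, not the actual census.gov file, and `FooCrawler`, `example.gov` and the `/private/` path are assumptions introduced only for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt following the "deny everyone, then carve out
# exceptions for named crawlers" pattern. This is NOT the real census.gov file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /private/

User-agent: bingbot
Disallow: /private/
"""


def build_parser(text):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def access_report(parser, agents, url):
    """Return {agent: True/False} for whether each agent may fetch url."""
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = ["Googlebot", "bingbot", "FooCrawler"]
parser = build_parser(ROBOTS_TXT)

# A named crawler may fetch ordinary pages; an unnamed one falls through
# to the "*" group and is denied everything.
public_report = access_report(parser, AGENTS, "https://example.gov/data/table.html")
private_report = access_report(parser, AGENTS, "https://example.gov/private/draft.html")
```

Under these assumptions, the named crawlers are only kept out of `/private/`, while any crawler not listed by name is denied the whole site.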


Broadly speaking, robots.txt files are often ignored. I used to run a fairly large job ad scraping organization, and we would be hired by companies (700 of the Fortune 1000 used us) to scrape the job ads from their career pages, and then post those jobs on job boards. 99 times out of 100, the robots file would disallow us from scraping. Since we were being paid by that company's HR team to scrape, we just ignored it, because getting it fixed would take six months and 22 meetings.


> Broadly speaking, robots.txt files are often ignored.

If you wanna go nuclear on people who do that, include an invisible link in your HTML and forbid access to that URL in your robots.txt, then block every IP that accesses that URL for X amount of time.

Don't do this if you actually rely on search engine traffic though. Google may get pissed and send you lots of angry mail like "There's a problem with your site".
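
For what it's worth, the trap described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `/trap/` path, the one-hour ban, and the `HoneypotBlocker` name are all made up, and a real deployment would hook `allow()` into whatever web server or middleware is actually in use.

```python
import time


class HoneypotBlocker:
    """Ban any client IP that requests a URL robots.txt told it to avoid."""

    def __init__(self, trap_path="/trap/", ban_seconds=3600, clock=time.monotonic):
        self.trap_path = trap_path      # hypothetical trap URL prefix
        self.ban_seconds = ban_seconds  # the "X amount of time"
        self.clock = clock              # injectable so the logic is testable
        self._banned_until = {}         # ip -> time at which the ban expires

    def robots_txt(self):
        """robots.txt body that forbids the trap URL for every crawler."""
        return f"User-agent: *\nDisallow: {self.trap_path}\n"

    def hidden_link_html(self):
        """Invisible link to embed in pages; humans never see or click it."""
        return (f'<a href="{self.trap_path}" style="display:none" '
                f'rel="nofollow" aria-hidden="true"></a>')

    def allow(self, ip, path):
        """Return False if the request should be rejected."""
        now = self.clock()
        until = self._banned_until.get(ip)
        if until is not None:
            if now < until:
                return False            # still banned
            del self._banned_until[ip]  # ban expired
        if path.startswith(self.trap_path):
            # The client ignored robots.txt: start a ban.
            self._banned_until[ip] = now + self.ban_seconds
            return False
        return True
```

Keying the ban on the IP address is the weak point: a scraper that rotates addresses simply shows up as a new client.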


> Don't do this if you actually rely on search engine traffic though. Google may get pissed and send you lots of angry mail like "There's a problem with your site".

Ah, but of course you would exclude Google's published crawler IPs from this restriction, because that is exactly what they want you to do.


We would occasionally have a customer try doing that. AWS has lots of IP addresses :-).


Nice insight - use a different IP address for hidden links!


So from the website's point of view there is no difference between 'crawling' and 'scraping'. Census.gov I assume has a ton of very useful information which is in the public domain which a host of potential companies could monetize by regularly scraping census.gov. Census.gov's purpose to make this information available to people is served by google, yahoo and bing. On the other hand if I have a business which is based on that data, in fact I'm at cross purposes to them.


The census data is available for bulk download, mostly as CSV (for example [1]). Scraping census.gov is worse for both the Census Bureau (which might have to do an expensive database query for each page) and for the scraper (who has to parse the page).

Blocking scrapers in robots.txt is more of a way of saying, "hey, you're doing it wrong."

It's also worth noting that the original article is out of date. The current robots.txt at census.gov is basically wide-open [2].

[1] https://www.census.gov/programs-surveys/acs/data/data-via-ft...

[2] https://www.census.gov/robots.txt


Scrapers don't care about robots.txt. I have scraped multiple websites in a previous job and the robots.txt means nothing. Bigger sites might detect and block you but most don't.


I'm generally anti business. But I have to disagree. "The Public" that the government serves includes businesses. Businesses (ignoring corporate personhood bullshit) are owned and operated by people.

I do not want the government deciding "what purposes", e.g. non-commercial, serve the public good. The public gets to decide that. (Charging a license fee for commercial use is maybe OK, assuming supporting that use costs the government "too much".)

And I very much do not want the current situation, with the government granting access to a handful of anointed corporations (the farthest thing from the public possible) and denying everyone else, including all of the actual public.


A specific case where this favorite-picking by government enables corruption: https://en.wikipedia.org/wiki/Nationally_recognized_statisti...

And an example from the quickly approaching future, when there will be Nationally Recognized Media Organizations who license "Fact-Checkers," through which posts to public-facing sites will have to be submitted for certification and correction.


Favorite-picking by the government is corruption by itself already.


> I do not want the government deciding "what purposes" e.g. non-commercial, serve the public good. The public gets to decide that.

the public's "decision" on things like this is made manifest by government policy, no?


In theory. In practice, is every single policy that our government upholds currently popular with the majority of people?

It's possible to have government policies that the majority of people disagree with, that remain for complicated reasons related to apathy, lobbying, party ideology, or just because those issues get drowned out by more important debates.

Government is an extension of the will of the people, but the farther out that extension gets, the more divorced from the will of the people it's possible to be. That's not to say that businesses are immune from that effect either -- there are markets where the majority of people participating in them aren't happy with what the market is offering. All of these systems are abstractions, they're ways of trying to get closer to public will, and they're all imperfect. But government is particularly abstracted, especially because the US is not a direct democracy.

I'm personally of the opinion that this discussion is moot, because I think that people have a fundamental Right to Delegate[0], and I include web scraping public content under that right. But ignoring that, because not everyone agrees with me that delegation is right, allowing the government to unilaterally rule on who isn't allowed to access public information is still particularly susceptible to abuse above and beyond what the market is capable of.

[0]: https://anewdigitalmanifesto.com/#right-to-delegate


In the case of Census.gov, they offer an API to get the data[0]. It's actually pretty nice. Stable, ton of data, fairly uniform data structure across the different products. Very high rate limits, considering most data only needs to be retrieved once a year. I think they understand the difference between crawling and scraping.

[0] https://www.census.gov/data/developers.html


But Google, Yahoo and Bing are also monetizing the data. Why are they allowed to provide “benefits” but “scrapers” are not? Why is it wrong to monetize public data?


Having data in the right format as a download or via an API would be the best way to go for public data.

If people have to 'scrape' that data from a public resource, I'd say they're presenting the data in the wrong way.


I used to run a fairly large job ad scraping operation. Our scraped data was used by many US state and federal job sites. "Scraping" is just using software to load a page and extract content. "Crawling" is just loading a page, finding hyperlinks (hmm... a kind of content), and then crawling those links. Crawling is just a kind of scraping.
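
A rough sketch of that distinction in Python, standard library only. The `PageScraper` and `crawl` names, the in-memory `SITE` dictionary, and the `example.com` URLs are all invented for illustration; `fetch` stands in for whatever HTTP client you would really use.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageScraper(HTMLParser):
    """Scraping: load one page and extract content (text and links)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Hyperlinks are just another kind of content to extract.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(start_url, fetch, max_pages=10):
    """Crawling: scrape a page, then repeat on the links it contained."""
    seen, queue, pages = set(), deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)          # fetch returns None for unknown URLs
        if html is None:
            continue
        scraper = PageScraper(url)
        scraper.feed(html)
        pages[url] = " ".join(scraper.text)
        queue.extend(scraper.links)
    return pages


# Hypothetical three-page site held in memory instead of fetched over HTTP.
SITE = {
    "https://example.com/": '<h1>Jobs</h1><a href="/a">A</a><a href="/b">B</a>',
    "https://example.com/a": '<p>Engineer</p><a href="/">home</a>',
    "https://example.com/b": "<p>Editor</p>",
}
pages = crawl("https://example.com/", SITE.get)
```

The crawler is nothing more than the scraper run in a loop over its own link output, which is the point being made above.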


Is it legal for a government entity to issue a robots.txt like that? Maybe the line between use and abuse hasn't been delineated as well as it needs to be.


Is failure to honor a robots.txt a crime? Or rather, would it be unlawful to spoof a user agent to access this publicly available data? After the LinkedIn [0] case it seems reasonable to think not.

[0]: https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...


Spoofing user-agents hasn't worked in a long time for anything but small operations because search engines publish specific IP ranges their scrapers use.
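
As a sketch of why the spoofed header alone doesn't get you far: a site can compare the connecting address against the ranges a crawler operator publishes. The `ExampleBot` name and the ranges below are placeholders (they are the address blocks reserved for documentation), not any real search engine's list.

```python
import ipaddress

# Hypothetical published ranges. 192.0.2.0/24 and 2001:db8::/32 are reserved
# for documentation; a real check would load the operator's published list.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def is_verified_crawler(claimed_agent, remote_ip, published=PUBLISHED_RANGES):
    """True only if the User-Agent names a known bot AND the IP is in its ranges."""
    ip = ipaddress.ip_address(remote_ip)
    for name, cidrs in published.items():
        if name.lower() in claimed_agent.lower():
            # A v4 address is never "in" a v6 network (and vice versa),
            # so mixed lists are safe to test against.
            return any(ip in ipaddress.ip_network(cidr) for cidr in cidrs)
    return False  # unknown agent: not a verified crawler
```

With this check in place, a request claiming to be `ExampleBot` from an address outside the published ranges is treated like any other unverified client.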


The CFAA is so broad and broadly interpreted that I would assume that failure to honor any site's robots.txt file may incur criminal liability if the U.S. government can claim American jurisdiction (e.g., because the site's owners are U.S. persons or a U.S. corporation, or because the site's servers are located in the U.S.).


> Is it legal for a government entity to issue a robots.txt like that?

I may be wrong (this isn't my area), but I was under the impression that robots.txt was just an unofficial convention? I'm not saying people should ignore robots.txt, but are there legal ramifications if ignored? I'm not asking about techniques sites use to discourage crawlers/scrapers, I'm specifically wondering if robots.txt has any legal weight.


Yep, it's defined at https://www.robotstxt.org/

Looks like Google is trying to turn it into an RFC though.


Perhaps there could be some kind of 'Crawler consortium'?

Under this consortium, website owners would be allowed to either allow all crawlers (approved by the consortium) or none at all (that is, none that is in the consortium, i.e. you could allow a specific researcher or something to crawl your website on a case-by-case basis).

This consortium would be composed of the search engines (Google, MS, other industry members), as well as government-appointed individuals and relevant NGOs (Electronic Frontier Foundation, etc.?). There would be an approval process that simply requires your crawl to be ethical and respect bandwidth usage. Violations of ethics or bandwidth limits could imply temporary or permanent suspension. The consortium could have some bargaining or regulatory measures to prevent website owners from ignoring those competition and fairness provisions.


> Perhaps there could be some kind of 'Crawler consortium'?

An industry-wide agreement not to compete for commercially valuable access to suppliers of data?

Comprised of companies that are current (and in some cases perennial) focusses of antitrust attention?

I think there might be a problem with that plan.


I don't see the problem. If a bunch of non-google companies pooled resources to make a crawl, that would reduce market concentration, not increase it.


Well, yes, and a common solution to antitrust cases, that I know of, is some kind of industry self-regulation. In this case I wouldn't trust the industry alone to self-regulate; hence, they should at least invite governments and civil society (NGOs and other organizations) to participate, in a minority but not insignificant position.

Could you better describe your objections?


Are there any actual repercussions for just ignoring robots.txt?


There are, if you are doing it for work. For example, your company could get sued if you are found using that data and ignoring the ToS. If you are a public figure, you could get your name tarnished for doing something unethical, or the media may call it "hacking". If you are rereleasing the data, then you risk getting a takedown notice.


robots.txt is not a terms of service. Even if it was, it wouldn't be enforceable for a public website. You would need to prove that a web crawler is maliciously causing disruption to your service, and that is not easy.


All it takes is for your company execs or lawyers to be afraid of a stern letter and ask you to cancel your project. If you're violating their robots.txt, you're probably violating their terms of service that are hidden somewhere. And your company doesn't want to risk having to pay hundreds of thousands to fight a court case. There are also venues besides courts for them to attack you, like contacting the publishers or hosting platforms for your derivative works. It's a chilling effect.

And I'm not making this up. This kind of stuff has happened to me many times.


Your crawler's IP might get banned, eventually.


Sometimes website admins will also try to report your ips to the service provider as a source of attacks (even if not true).


Given how often I had misbehaving crawlers slow down servers in the early 2000s, I do not see how a crawler that disobeys robots.txt is not an attempted attack.


On a related note, Cloudflare just introduced "Super Bot Fight Mode" (https://blog.cloudflare.com/super-bot-fight-mode/) which is basically a whitelisting approach that will block any automated website crawling that doesn't originate from "good bots" (they cite Google & Paypal as examples of such bots). So basically everyone else is out of luck and will be tarpitted (i.e. connections will get slower and slower until pages won't load at all), presented with CAPTCHAs or outright blocked. In my opinion this will turn the part of the web that Cloudflare controls into a walled garden not unlike Twitter or Facebook: In theory the content is "public", but if you want to interact with it you have to do it on Cloudflare's terms. Quite sad really to see this happen to the web.


On the other hand, I do not want my site to go down thanks to a few bad 'crawlers' that fork() a thousand http requests every second and take down my site, forcing me to do manual blocking or pay for a bigger server/scale-out my infrastructure. Why should I have to serve them?


You can use the same rate-limiting for all crawlers, Google or not.


Googlebot is pretty careful and generally doesn’t cause these problems.


Right, then they shouldn't be affected by the rate-limiting, as long as it's reasonable. If it was applied evenly to all clients/crawlers, it'd at least allow the possibility for a respectful, well-designed crawler to compete.
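
To sketch what "applied evenly" could look like: one token bucket per client, with identical parameters for everyone. The class names, the rate and burst values, and the idea of keying on a client identifier are all assumptions for illustration; a real setup would more likely use the rate limiter built into the web server or CDN.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second, up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.updated = clock()

    def try_acquire(self):
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class UniformLimiter:
    """Same limits for every client, whether it is Googlebot or FooCrawler."""

    def __init__(self, rate=1.0, burst=5, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self._buckets = {}  # client id (e.g. IP or verified bot name) -> bucket

    def allow(self, client_id):
        bucket = self._buckets.get(client_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.clock)
            self._buckets[client_id] = bucket
        return bucket.try_acquire()
```

A well-behaved crawler that stays under the shared rate never notices the limiter, while an aggressive one gets refused regardless of whose name is in its User-Agent.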


The problem is, if you own a website, it takes the same amount of resources to handle the crawl from Google and FooCrawler even if both are behaving, but I'm going to get a lot more ROI out of letting Google crawl, so I'm incentivized to block FooCrawler but not Google. In fact, the ROI from Google is so high I'm incentivized to devote extra resources just for them to crawl faster.


We know that. No one claims websites are doing this for no reason. It's explicitly written in the article.

But this sub-thread is about misbehaved crawlers.


Agree... this entire argument seems to think I give a rat's ass about 9000 different crawlers that give me literally zero benefit and only waste server resources. Most of those crawlers are for ad-soaked, piss-poor search engines. I'd rather just block them all and only allow crawlers that know how to behave.


In the early 90s there were various nascent systems for what were essentially public database interfaces for searching.

The idea was that instead of a centralized search, people could have fat clients that individually query these APIs and then aggregate the results on the client machine.

Essentially every query would be a what/where or what/who pair. This would focus the results.

I really think we need to reboot those core ideas.

We have a manual version today. There's quite a few large databases that the crawlers don't get.

The one place for everything approach has the same fundamental problems that were pointed out 30 years ago, they've just become obvious to everybody now.


I wonder what happens to RSS feeds in this situation. Programs I run that process RSS feeds will just fetch them over HTTP completely headlessly, so if there are any CAPTCHAs, I'm not going to see them.


I've found that Cloudflare isn't great at this. I even found cases where my site was failing to load to googlebot (a "good" bot that they probably have the IPs for) because they were serving a captcha instead of my CSS.

So your best bet is setting a page rule to allow those URLs.


In my experience, those either get detected(?) and let through (RSS can be aggressively cached after all) or you're out of luck because the website owner set up e.g. WordPress (which automatically includes RSS URLs) but did not configure Cloudflare to let RSS through.


That will be interesting to see with regards to legal implications. If they (in the website operator's name) block access to e.g. privacy info pages to a normal user "by accident", that could be a compliance issue.

I don't think mass blocking is the right approach in general. IPs, even residential ones, are relatively easy and relatively cheap to obtain. At some point you're blocking too many normal users. Captchas are a strong weapon, but they too have a significant cost in annoying the users. Cloudflare could theoretically do invisible-invisible captchas by never even running any code on the client, but that would be wholesale tracking and would probably not fly in the EU.


Cloudflare is an agent for website owners. Nearly everything is configurable and the defaults are permissive.


How hard is it to ask Cloudflare to let you crawl?


It's not Cloudflare who is deciding it. It's the website owners who request things like "Super Bot Fight Mode". I never enable such things on my CF properties. Mostly it's people who manage websites with "valuable" content, e.g. shops with prices who desperately want to stop scraping by competitors.


I can say this will give a lot of businesses a false sense of security. It is already bypassable.

The web scraping technology that I am aware of has already reached its endgame: unless you are prepared to authenticate every user/visitor to your website with a dollar sign, or to lobby Congress to pass a bill outlawing web scraping, you will not be able to stop web scraping in 2021 and beyond.


But what about captchas?

Due to aggressive no-script and uBlock use I, browsing the website as a human, keep getting hit by captchas and my success rate is falling to a coinflip. If there's a script to automate that I'm all ears.


100% doable. Like I said, this type of blanket throttling seems to be the new trend, but it's already defeated.

I just no longer see it possible to 1) put information on the web (private or public) 2) give access outside your organization (customers or visitors) 3) expect your website will not be scraped.

ToS is NOT the law unfortunately.


So, one more reason to hate Cloudflare and every single website that uses it.


Or maybe don’t “hate” folks who are just trying to put some content online and don’t want to deal with botnets taking down their work? You know, like what the internet was intended for.


> don’t want to deal with botnets taking down their work

Botnets and automated crawling are completely different things. This isn't about preventing service degradation (even if it gets presented that way). It's an attempt by content publishers to control who accesses their content and how.

Cloudflare is actively assisting their customers to do things I view as unethical. Worse, only Cloudflare (or someone in a similarly central position) is capable of doing those things in the first place.


Internet was certainly not intended for centralization. I hit Cloudflare captchas and error pages so often it's almost sickening. So many things are behind Cloudflare, things you least expect to be behind Cloudflare.


It's easy enough to bypass most Cloudflare “anti-bot” with an unusual refresh pattern or messing with a cookie. (It's easier to script this than solve the CAPTCHAs.)


Anyone malicious who is determined enough will just pay their way through any captcha. Yet for me, as a legitimate user, these "one more step" pages feel downright humiliating. At this point, if I see one, I either just nope out of it, or look for a saved copy on archive.org.


Can we take a moment to talk about this club's business model?

There's not even any information to see what the "private forum access" that you have to pay for is about, what kind of people are in it...or even to know about what happens with the money.

For me, this sounds like a scam.

I mean, no information about any company. No imprint. No privacy policy. No non-profit organization. And just a copy/paste wordpress instance.

I mean, srsly. I am building a peer-to-peer network that tries to liberate the power of google, specifically, and I would not even consider joining this club. And I am the best case scenario of the proposed market fit.


Not being set up as a 527 nonprofit[0] is the biggest red flag - no donation or membership money has to be spent for political purposes. They also use memberful for their membership/payment system, which doesn't require owning a business, so you might be paying out to the owner directly instead of to a business with its own bank account. Maybe the owner is looking at HN and can clarify.

To add, there are a lot of businesses that use the terms 'Knucklehead' so finding their business on secretary of state business searches might be impossible.

0: https://www.irs.gov/charities-non-profits/political-organiza...


They want you to pay them to "research" google's web crawling monopoly. It's really just a donation, but they don't frame it like that. Probably more credible than using a crowd funding website, because it sounds like they're pushing for actual legislation.

> Meet with legislators and regulators to present our findings as well as the mock legislation and regulations. We can’t expect that we can publish this website or a PDF and then sit back while governments just all start moving ahead on their own. Part of the process is meeting with legislators and regulators and taking the time helping them understand why regulating Google in this way is so important. Showing up and answering legislators’ questions is how we got cited in the Congressional Antitrust report and we intend to keep doing what’s worked so far.


I'd like to see some data on their claim that website operators are giving googlebot special privileges. As far as I can tell it would be a huge pain in the ass to block crawler bots from my servers, not that I've tried. I have some weird pages that tend to get crawlers caught in infinite loops, and I try to give them hints with robots.txt but most of the bots don't even respect robots.txt.

If I actually wanted to restrict bots, it would be much easier to restrict googlebot because they actually follow the rules.

I don't disagree in principle that there should be an open index of the web, but for once I don't see Google as a bad actor here.


A company I worked for ~7 years ago ran its own focused web crawler (fetching ~10-100m pages per month, targeting certain sections of the web).

There were a surprising number of sites out there that explicitly blocked access to anyone but Google/Bing at the time.

We'd also get a dozen complaints or so a month from sites we'd crawled. Mostly upset about us using up their bandwidth, and telling us that only Google was allowed to crawl them (though having no robots.txt configured to say so).


Isn't that the website owner's right though? I'm not sure I understand the problem here.

If Google is taking traffic and reducing revenue, a company can deny in robots.txt. Google will actually follow those rules - unlike most others that are supposedly in this 2nd class.


Yup, no problem here, was just making an observation about how common such blocking was (and about the fact that some people were upset at being crawled by someone other than Google, despite not blocking them).

The company did respect robots.txt, though it was initially a bit of a struggle to convince certain project managers to do so.


> Isn't that the website owner's right though?

No. The internet is public. Publishers shouldn't get any say in who accesses their content or how they do it. As far as I'm concerned, the fact that they do is a bug.


No, it's not. I can set up a login page and keep you out if I want. And I can do it however I want.


But your login page will be public and subject to being crawled.


My server, my rules.


I usually recommend setting only Google/Bing/Yandex/Baidu etc to Allow and everything else to Disallow.

Yes, the bad bots don't give a fuck, but even the non-malicious bots (ahrefs, moz, some university's search engine etc) don't bring any value to me as a site owner, take up bandwidth and resources and fill up logs. If you can remove them with three lines in your robots.txt, that's less noise. Universities especially, in my opinion, often behave badly and are uncooperative when you point out their throttling does not work and they're hammering your server. Giving them a "Go Away, You Are Not Wanted Here" in a robots.txt works for most, and the rest just gets blocked.
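For anyone who wants to see what that kind of allow-list does in practice, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt text, the bot names and the URL path are hypothetical examples, not taken from any real site, and of course this only shows what a well-behaved crawler would conclude from the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: named search engines may crawl, everyone else may not.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Bingbot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A crawler that honors robots.txt would ask this before every fetch.
googlebot_allowed = parser.can_fetch("Googlebot", "/some/page")
other_allowed = parser.can_fetch("ExampleResearchBot", "/some/page")

print("Googlebot allowed:", googlebot_allowed)
print("ExampleResearchBot allowed:", other_allowed)
```

Bots that ignore robots.txt are unaffected by any of this, which is the point made above about the rest getting blocked by other means.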


> they're hammering your server

Why can't you just ratelimit IPs that are "too active" for your server to handle?


From some I could, but why would I? If they're not adding value and they don't want to behave, I don't see a reason to spend money to adapt my systems to be "inclusive" towards their usage patterns.


In context, you're justifying blocking all automated traffic, even that which does behave, by pointing out that some of it doesn't. That attitude seems lazy at best, malicious at worst.


CGNAT


Now that's a really good point. I wonder why there isn't a standard protocol for signalling upstream that a particular connection is abusive and to please rate limit the path at the source on your behalf? It would certainly add complexity, but the current situation is hardly better.


When you operate commercial sites at scale, bots are a real thing you spend real engineering hours thinking about and troubleshooting and coding to solve for.

And yes, that means google gets special treatment.

Think about the model for a site like stackoverflow. The longest of long tail questions on that site: what’s the actual lifecycle of that question?

- posted by a random user - scraped by google, bing, et al - visited by someone who clicked on a search result on google - eventually, answered - hopefully, reindexed by google, bing et al - maybe never visited again because the answer now shows up on the google SERP

In the lifetime of that question how many times is it accessed by a human, compared to the number of times it’s requested and rerequested by an indexing bot?

What would be the impact on your site of three more bots as persistent as google bot? Why should you bother with their requests?

So yes, sites care about bot traffic and they care about google in particular.


See figure I.4 on page 24 of this UK government report: https://assets.publishing.service.gov.uk/media/5efb1db6e90e0...

Additional evidence here: https://knuckleheads.club/the-evidence-we-found-so-far/


A lot of news websites restrict any crawler other than Google. And this does not happen only via robots.txt.


Indeed, years ago I had scripts to automatically fetch URLs from IRC and I quickly realized that if I didn't spoof the user agent of a proper web browser many websites would reject the query. Googlebot's UA worked just fine however.


> Googlebot's UA worked just fine however

They obviously don't care enough then - Google says you should use rdns to verify that googlebot crawls are real[0]. Cloudflare does this automatically now as well for customers with WAF (pro plan).

0: https://developers.google.com/search/docs/advanced/crawling/...


Spoofing your user-agent as googlebot is a common way to bypass paywalls, is (was?) a way to read Quora without creating an account, etc. Publishers obviously need to send their page/article to Google if they want it to be indexed but may not want to send the same page content to a normal user: https://www.256kilobytes.com/content/show/1934/spoofing-your...

This was common even back in the mid-2000s:

https://www.avivadirectory.com/bethebot/

https://developers.google.com/search/blog/2006/09/how-to-ver...



Google aren't the bad actor in the sense that they are actively doing something wrong, but they are definitely benefiting from the monopoly that they created and work on maintaining. If this continues then nobody will really ever be able to challenge them, which means possibly "better" products will fail to penetrate the market.


> but for once I don't see Google as a bad actor here.

As inflammatory as the headline of the page looks, they literally admit it's not google's fault in the smaller text lower down:

"This isn’t illegal and it isn’t Google’s fault, but"


LinkedIn profiles and Quora answers are accessible to Googlebot without signing in.


The studies and data to support their claim is in the first paragraph of the article you "read" before posting the question.


It's hilarious to think there exist people who think googlebot does not get special treatment from website operators. Here is an experiment you can do in a jiffy: write a script that crawls any major website and see how many URL fetches it takes before your IP gets blocked.

Googlebot has a range of IP addresses that it publicly announces so websites can whitelist them.


> Googlebot has a range of IP addresses that it publicly announces so websites can whitelist them.

Google says[1] they do not do this:

"Google doesn't post a public list of IP addresses for website owners to allowlist."

[1]https://developers.google.com/search/docs/advanced/crawling/...


From that same page they recommend using a reverse DNS lookup (and then a forward DNS lookup on the returned domain) to validate that it is google bot. So the effect is the same for anyone trying to impersonate googlebot (unless they can attack the DNS resolution of the site they’re scraping I guess).
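A rough sketch of that reverse-then-forward check in Python, for illustration only. The accepted hostname suffixes are an assumption based on my recollection of Google's documentation, so check the linked page before relying on them. The lookup functions are injectable so the logic can be exercised without touching DNS, and the default forward lookup only handles IPv4.

```python
import socket

# Assumed suffixes for genuine Googlebot hostnames; verify against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if ip reverse-resolves to a Google hostname that
    forward-resolves back to the same ip."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        # gethostbyname_ex returns (hostname, aliases, ipv4_addresses)
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        hostname = reverse_lookup(ip)
    except (OSError, KeyError):
        return False  # no PTR record, so it cannot be verified

    if not hostname.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False  # PTR record points somewhere other than Google

    try:
        addresses = forward_lookup(hostname)
    except (OSError, KeyError):
        return False

    # The forward lookup must include the original address; otherwise
    # anyone controlling their own PTR record could claim a Google name.
    return ip in addresses
```

The forward step is what makes a spoofed PTR record useless: an impersonator can name their own IP anything, but they cannot make Google's DNS point that name back at them.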


I don't whitelist googlebot, but I don't block them either because their crawler is fairly slow and unobtrusive. Other crawlers seem determined to download the entire site in 60 seconds, and then download it again, and again, until they get banned.


I have never had that problem running screaming frog on big brand sites apart from one or two times.


I don't scrape a website often, but when I do, I'm using a user agent of a major browser.


Do any of them intersect with Google Cloud IP addresses? If so set up a VPN server on Google Cloud.


The idea of a public cache available to anyone who wishes to index it is ... kind of compelling.

If it was the only indexer allowed, and it was publicly governed, then enforcing changes to regulation would be a lot more straightforward. Imagine if indexing public social media profiles was deemed unacceptable, and within days that content disappeared from all search engines.

I don't think it'll ever happen, but it's interesting to think about.


Common Crawl is attempting to offer this as a non-profit: https://commoncrawl.org


o/t but what the hell are they doing to scroll on that page? I move my fingers a centimeter on my trackpad and the page is already scrolled all the way to the bottom.

Hijacking scroll like this is one of the biggest turnoffs a website can have for me, up there with being plastered with ads and crap. It's ok imo in the context of doing some flashy branding stuff (think Google Pixel, Tesla splashes) but contentful pages shouldn't ever do this.


Add *##+js(aeld, scroll) to your uBO filters. That will stop scroll JS for all websites.


> If it was the only indexer allowed, and it was publicly governed

Which would put it under government regulation and be forever mired in politics over what was moral, immoral, ethical or unethical and all other kerfuffle. To an extent, it’s already that way, but that would make it worse than it is currently.


Here's an idea... what if search became a peer-to-peer standardized protocol that is part of the stack to complement DNS? E.g. instead of using DNS as the primary entry point, you use a different protocol at that level to do "distributed search". DNS would still play a role too, but if "search" was a core protocol, the entry point for most people would be different.

Similar to some of the concepts of "Linked Data", maybe - https://en.wikipedia.org/wiki/Linked_data.

The problem is getting to a standard, it would essentially need to be federated search so a standard would have to be established (de facto most likely).

Also, indexes and storage, distribution of processing load.. peer-to-peer search is already a thing, but it doesn't seem to be a core function of the Internet.

This is basically the same concept as making an "open" version of something that is "closed" in order to compete, I guess.



I'd have to look more but maybe running a cache isn't dead simple. I can imagine that the benefits of manipulating what's in the cache either adding or removing would be very high. Google and the others are private companies so they're not required to do everything in the public view.

A public cache wouldn't be able to, and indeed shouldn't, play cat and mouse games with potential opponents. I suspect most of the games played require not explaining exactly what you're doing.


An alternative but similar idea, apply your own algorithms to a crawler/index. That's half the problem with these large platforms commanding the majority of eyeballs, you search the entire web for something and you get results back via a black box. Alternatives in general are most definitely a good thing.

Knuckleheads' Club at the very least are doing a great job of raising awareness and the potential barriers to entry for alternatives.


That would be a very cool use case for something like STORJ or IPFS.


Imagine if Donald Trump decided that indexing Joe Biden's campaign site was unacceptable.

A mandated singular public cache has potential slippery slopes.


>A mandated singular public cache has potential slippery slopes.

That may be, but it seems like everything has a slippery slope - if the wrong person gets into power, or if the public look the other way out of complacency, ignorance, indifference, etc. It shouldn't stop us evaluating choices on their merits, and there is a lot of merit to entrusting 'core infrastructure' type entities to the government - or at least having an option.


Imagine if Donald Trump decided to tax campaign donations to Joe Biden's campaign at 100%.

I am unconvinced by the "slippery slope" argument being deployed by default to any governmental attempt to combat tech monopolies.


This is an argument against centralization more than it is against government.

"One index to rule them all" seems more fraught with difficulty than, "large cloud providers are unhappy that crawlers on the open web are crawling the open web".


If the impact stopped at "large cloud providers" being unhappy, I think that you're correct. But I think we've seen considerable downstream "difficulty" for the rest of society from search essentially being consolidated into one private actor.


So outlaw web scraping entirely?


While I don’t disagree with the idea that all crawlers should have equal access, we also need to address the quality of many crawlers.

Google and Microsoft have never hammered any website I’ve run into the ground. Crawlers from other, smaller, search engines have, to the point where it was easier to just block them entirely.

Part of the problem is that sites want search engines to index their site, but don't want to allow random people to scrape the entire site. So they do the best they can, and forget that Google isn’t the web. I doubt it’s shady deals with Google, it’s just small teams doing the best they can and sometimes they forget to think ideas through, because it’s good enough.


I think this is a problem which should be solved by automatic rate-limiting and throttling at the application/caching layer (or just individual web server for smaller sites). Requests with a non-browser UA get put into a separate bots-only queue that drains at a rate of ~1/sec or so. If the queue fills up you start sending 429s with random early failures for bots (UA/IP/subnet pairs) that are overrepresented in the traffic flow.

I don't know if such software exists, but it should. It would be a hell of a lot healthier for the web than "everyone but Google f*ck off", and it creates an incentive for bots to throttle themselves (as they're more likely to get a faster response than trying to request as fast as possible).
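To make the idea concrete, here is a toy sketch of such a bots-only queue in Python. Everything in it is an assumption for illustration: the class name, the capacity, the drop rule (probability = queue fill times the bot's share of the queue), and the use of a caller-supplied clock in seconds starting at 0. It is not a description of any existing software.

```python
import random
from collections import Counter, deque


class BotQueue:
    """Bounded queue for non-browser requests that drains at a fixed rate."""

    def __init__(self, capacity=10, drain_per_sec=1.0, rng=None):
        self.capacity = capacity
        self.drain_per_sec = drain_per_sec
        self.queue = deque()      # bot keys (e.g. UA/IP/subnet) in arrival order
        self.counts = Counter()   # how many queued requests each bot has
        self.last_drain = 0.0
        self.rng = rng if rng is not None else random.Random()

    def _drain(self, now):
        # Serve as many queued requests as the elapsed time allows.
        served = int((now - self.last_drain) * self.drain_per_sec)
        if served <= 0:
            return
        self.last_drain += served / self.drain_per_sec
        for _ in range(min(served, len(self.queue))):
            self.counts[self.queue.popleft()] -= 1

    def offer(self, bot_key, now):
        """Return 200 if the request was queued, 429 if it was rejected."""
        self._drain(now)
        depth = len(self.queue)
        if depth >= self.capacity:
            return 429  # queue full: reject everyone
        # Random early failure: the fuller the queue and the larger this
        # bot's share of it, the more likely the request is rejected.
        fill = depth / self.capacity
        share = self.counts[bot_key] / depth if depth else 0.0
        if self.rng.random() < fill * share:
            return 429
        self.queue.append(bot_key)
        self.counts[bot_key] += 1
        return 200
```

A bot that paces itself rarely has anything sitting in the queue, so its share stays near zero and it almost never sees a 429, which is the incentive described above.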


I suspect that at least some of the bots use web server response times and response codes as part of the signal for ranking. If your website does not appear capable of handling load then it won't rank as highly, because it is not in their best interests to have search results that don't load.


We've had the Bing crawler make an obscene number of requests quite often but fortunately it doesn't bring us down.


> Let’s take a look at the robots.txt for census.gov from October of 2018 as a specific example to see how robots.txt files typically work. This document is a good example of a common pattern. The first two lines of the file specify that you cannot crawl census.gov unless given explicit permission.

This was eyebrow-raising. Actually seems like this is not (any longer?) true:

https://census.gov/robots.txt:

  User-agent: *
  User-agent: W3C-checklink
  Disallow: /cgi-bin/
  Disallow: /libs/
  ...

That first line wildcards for any user agent but does nothing with it. It should say "Disallow: /" on the next line if it blocked all unnamed robots. It looks like someone found out about it and told the operators, rightfully so, that government webpages with public information (especially the census) shouldn't have such restrictions. They then removed only the second line and left the first. Leaving the first line has no impact on the meaning of the file.


Actually, thinking more about this, I think they might be misconfigured, because they clearly don't want robots touching /cgi-bin etc (reasonable!) but they are actually only asking the named robots to do that, all other bots have no guidance about what not to touch



Interesting that the most comments it got before was 11, and today it succeeds and makes it to the front page! This is a good illustration of how whether or not submissions get any traction can be fairly stochastic.

On topic, stack overflow does exactly what the article is talking about; They lock down their sitemap and make special exceptions for the Google bot:

https://meta.stackexchange.com/a/98087

https://meta.stackexchange.com/questions/33965/how-does-stac...

I can understand SO's reasoning but it only perpetuates the incumbents' stranglehold on the internet.


I think it's partly because they created a website which reported on the status of the Ever Given, which rose to #1 on the front page.

I feel like I often see submissions which are, even tangentially, related to front page material rise very quickly.

Regardless, congrats to Knuckleheads Club for fighting the good fight.


You are right, that was how I found it.


> They lock down their sitemap and make special exceptions for the Google bot:

Their robots.txt, on the other hand, is more restrictive of Googlebot:

https://stackoverflow.com/robots.txt

  User-agent: Googlebot-Image
  Disallow: /*/ivc/*
  Disallow: /users/flair/
  Disallow: /jobs/n/*
  ..


Wasn't aware of that.

Resubmitting interesting content that hasn't got traction earlier on is however explicitly allowed in the guidelines IIRC.


And linking past threads on the same subject is helpful.


Hooray! Looks like I'm one of today's lucky 10,000. :)

https://xkcd.com/1053/


This is not really about Google.

Websites block crawlers because they get abused / crashed by crawlers. In the early days (2000-2010) Google not only got banned by some websites, it even got DNS-banned for abusing some DNS domains. You see, Google has already built the "megacrawlers" described in this article, it can melt any website on the Internet, even Facebook - the largest, and they paid a high price for letting the early Google crawlers run free.

Google today has a rate-limit for every single website and DNS sub-domain on the internet. For small websites the default is a handful of web pages every few seconds. Google has a very slow (days) algorithm to increase its crawl rate, and a very fast (1d) algorithm to cut the rate limit if it's getting any of the errors likely due to website overload.

To summarize, Google has several layers of congestion control custom-designed into the crawl application. Most small web crawlers have zero.

None of these other crawlers have figured this out, so they abuse websites, causing all small-scale crawlers to get banned.

- ex-Google Crawl SRE
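For small crawler authors wondering what "slow to speed up, fast to back off" might look like, here is a minimal per-host sketch in Python using additive increase and multiplicative decrease. This is not Google's algorithm; the class name, the constants and the set of status codes treated as overload signals are all assumptions chosen for illustration.

```python
# Status codes treated as signs of an overloaded site (an assumption).
OVERLOAD_STATUSES = frozenset({429, 500, 502, 503, 504})


class CrawlRateController:
    """Tracks an allowed request rate (requests per second) for one host."""

    def __init__(self, initial=0.2, floor=0.01, ceiling=5.0, step=0.05, cut=0.5):
        self.rate = initial
        self.floor = floor      # never go slower than this
        self.ceiling = ceiling  # never go faster than this
        self.step = step        # small additive increase after a success
        self.cut = cut          # multiplicative decrease after an overload signal

    def record(self, status_code):
        """Update the rate from one response and return the new rate."""
        if status_code in OVERLOAD_STATUSES:
            # Back off quickly: halve the rate (with the default cut).
            self.rate = max(self.floor, self.rate * self.cut)
        else:
            # Speed up slowly: add a small constant.
            self.rate = min(self.ceiling, self.rate + self.step)
        return self.rate

    def delay(self):
        """Seconds to wait before the next request to this host."""
        return 1.0 / self.rate
```

A crawler would keep one controller per host (or per subdomain), sleep for delay() between requests to that host, and call record() with each response status.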


Thank you for those insights, it's a topic I'm interested in. Agree with what you're saying about naive bots hitting websites/hosts/subnets too hard, in the context of site owners being hit by multiple bots for multiple reasons and them questioning the return they'll get.

I'd be interested to know more info wrt DNS lookups. Did you apply a blanket rate limit on the number of DNS requests you'd make to any particular server?

From past experience I know the .uk Nominet servers would temp-ban if you were doing more than a few hundred lookups per second. At the next host level down, was there a blanket limit or was it dependent on the number of domains that nameserver was responsible for?


I believe there is a lot of hidden fight behind the scenes for Google to monopolize the web.

There are a lot of expectations from the public that Google maintains. Apart from delivering the best search results, they have to respect robots.txt, limit crawling frequency, deliver search results really fast, etc.

At the same time, lots of people want to game Google and spam search results so they can make money with ads. The competition is not fair at all - the websites can tell which traffic came from GoogleBot and craft legitimate responses while Google is not allowed to publicly crawl the website with a fake User Agent.

Many websites are concerned about their data being crawled (like Amazon, they certainly wouldn't be happy if all their price information is dumped as a database), and Google has to make sure that no robot can crawl too much of a website by automatically searching Google, and to that end they invented reCAPTCHA.

It's not easy at all to build a search engine that behaves responsibly both to websites and to users. It's the ability to deal with all these matters really well that gives Google the monopoly power.


I think there are plenty of other people crawling the web. There's Common Crawl, there's the Wayback machine... it's not just Google. Then there is a very long tail of crawlers that show up in the logs for my small-potatoes personal website. Whatever they're doing, they seem to be existing in peace, at the very least.

To some extent, I agree with this site that people are nicer to Google than other crawlers. That's because the crawl consumes their resources but provides benefits -- you show up on Google, the only search engine people actually use. But at the same time, they are happy to drag Google in front of Congress for some general abuse, so... maybe there is actually a little bit of balance there.


Even that won't change much. There is no way Google can be out-googled by other search engines because of its market dominance: more traffic means more clicks, more clicks mean better search results, better search results will drive more traffic.

I try bing and DDG for a week or so every 6 months. I always switch back to google eventually because the results are so much better.

Google can only be disrupted if something new is invented, something different than search but delivering way better results. I have no clue what that might be. But I hope someone is working on it.


I've had the exact opposite reaction to the comparison between Google and DuckDuckGo. I use the latter daily and only rarely revert to Google. Even then I usually don't find the results to be any better and often find them to be worse.

In my estimation, Google's search results have significantly declined in recent years.


Agreed. I've fully changed over to DDG on my phone and rarely add the !g to get a google search.


Ha, maybe I should give it a try again :) My 6 months period is almost over again.


Yup. My opinion has long been that the only thing that will take down google is a massive increase in NLP, such that the historical click data can be outperformed by a straight up really good NLP model


That's interesting. Is anyone working on this already? SV startup? And: don't you think Google is in the best position to build such a thing?


Around a decade ago, I was part of the team responsible for msnbot (a web crawler for Bing). There used to be a robots.txt file (I forget the exact extension now). Most websites were giving 10-20x higher limits to googlebot than to the other crawlers.

Google definitely has an unfair advantage there.

Bing and DuckDuckGo still provide very reasonable results with 10-20x fewer resources, though not on par with Google.


Maybe it would be nice if some sort of simple central index of "URLs + their last updated timestamp/version/eTag/whatever" would exist, updated by the site owners themselves? ("push"-notification)

Meaning that whenever a page of a website would be created or updated, that website itself would actively update that central index, basically saying "I just created/deleted page X" or "I just updated the contents of page X".

The consequence would be that...

1) ...crawlers would no longer have to actively (re)scan the whole Internet to find out if anything has changed, but they would only have to query that central index against their own list of URLs & timestamps to find out what needs to be (re)scanned.

2) ...websites would not have to just wait & hope that some bot would decide to come by to have a look at their sites, nor would they have to answer over and over again requests that are just meant to check if some content has changed.
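A tiny in-memory sketch of how that could work, in Python. The class and function names are made up for illustration, timestamps are plain numbers, and a real index would obviously need authentication so that only a site's owner can publish updates for its URLs.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeIndex:
    """Hypothetical central index: URL -> timestamp of its last change."""
    last_modified: dict = field(default_factory=dict)

    def publish(self, url, timestamp):
        # Called by the site itself when a page is created or updated.
        self.last_modified[url] = timestamp

    def remove(self, url):
        # Called by the site itself when a page is deleted.
        self.last_modified.pop(url, None)


def urls_to_refetch(index, crawled):
    """Compare a crawler's own records (URL -> timestamp of its stored copy)
    against the index. Returns (stale_or_new_urls, deleted_urls)."""
    stale = [
        url
        for url, changed_at in index.last_modified.items()
        if crawled.get(url, float("-inf")) < changed_at
    ]
    gone = [url for url in crawled if url not in index.last_modified]
    return sorted(stale), sorted(gone)


index = ChangeIndex()
index.publish("https://example.com/a", 100)
index.publish("https://example.com/b", 200)
index.publish("https://example.com/d", 300)

# What this crawler fetched earlier, and when.
crawled = {
    "https://example.com/a": 100,
    "https://example.com/b": 150,
    "https://example.com/c": 50,
}

stale, gone = urls_to_refetch(index, crawled)
print("refetch:", stale)
print("drop:", gone)
```

The crawler only touches the pages in the first list, and the site never has to answer "has anything changed?" requests for the rest.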


I have seen sites behave differently if you use a Googlebot UA, but am I missing something or does this merely mean that anyone doing something like this

curl -A 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

will get Google-level crawler access?


That would work on websites that have a naive check for just the user agent. Google also publishes the IP address ranges their crawlers run on. A lot of websites check for that, and there's no way around that.

https://developers.google.com/search/docs/advanced/crawling/...


I can't really trust a website that spells its own name wrong on their homepage. "Knucklesheads’ Club"

Edit: https://imgur.com/a/inqYrjV


Everyone makes mistakes


If the shared cache ever became significant enough to matter it would be devastated by marketers, scammers and other abusers. Google employs the groomers that make their index at least tolerable, if still clearly imperfect. Without that cadre of well compensated expertise to win the arms race against such abusers the scheme is not feasible.

I suppose this could be crowdsourced if I didn't know about politics and how any attempt at delegating the responsibility for blessing sites and their indexes would become a controversy. Google takes lots of heat about its behavior already, but Google is a private entity and can indulge its private prerogatives for the most part. Without that independence this couldn't function.


I don't really understand your comment. Marketers, scammers and other abusers already publish to the web with the intention to be included in a crawl. Postprocessing crawl data is already a thing.

Assuming this hypothetical shared crawl cache were to exist, it does not preclude google (and all consumers of that cache) doing their own processing downstream of that cache. Does it?

What's the new attack vector?


> I don't really understand your comment.

If you don't then you fail to appreciate the amount of labor it takes to thwart bad actors from ruining indexes. Abusers do publish to the web, and we enjoy not wallowing in their crap because a small army of experienced and expensive people at a select few Big Tech companies are actively shielding us from it.

It's easy to anticipate the malcontent view; 'Google spends all its resources on ads and ranking and we don't need all that.' That is naïve; if Google completely neglected grooming out the bad actors people wouldn't use Google and Google's business model wouldn't be viable.

So the obvious question is: where is this mechanism without Google et al.? Will the published caches be 99% crap (and without an active defense against crap you can bet your life they will) and anything derived from them hopelessly polluted? If so then it isn't viable.

Now the instinct will be to find a groomer. Guess what; that's probably doomed too. No selection will be impartial to all, so you get to fight that battle. Good luck.


>Will the published caches be 99% crap

Yes. It will be exactly as crap as whatever's published on the web.

And the utility of google's search engine would be to perform their proprietary processing on top of the publicly-available crawl results. Analogous to how their search is already performing proprietary processing on top of a crawl cache.

>If you don't then you fail to appreciate the amount of labor it takes to thwart bad actors from ruining indexes.

Did you miss the part where I said "Assuming this hypothetical shared crawl cache were to exist, it does not preclude google (and all consumers of that cache) doing their own processing downstream of that cache. Does it?"


I tried to set up YaCy [1] at home to index a few of my favorite smaller websites, so I could quickly search just them. That turned out to be a bad idea. Some ended up blocking my home IP address and others reported me to my ISP. None of these sites were that large, and I wasn't continuously crawling them...

[1] https://yacy.net/


I have been running my own Searx instance in AWS for a while and have not gotten blocked yet anywhere


Wow, I've never heard of Searx - https://searx.space/

It's privacy friendly proxied search results, a lot like the Nitter project for Twitter. https://github.com/zedeus/nitter/wiki/Instances


How often were you searching?


I was regularly searching, but I was rarely indexing any of the sites. I struggled to even get an initial index of many of the sites, due to being blocked or being reported.


If you were getting blocked on an initial load, you were either hit with rate limiting or an unrecognized user agent


I just don't see this working out legally. How would it even work?

From the "learn more"

> Sometime soon we will be publishing what we think should happen and what we think will happen. These two futures diverge and we believe that, while the gap between them exists, it will entrench Google’s control over the internet further. We believe that nothing short of socialization of these resources will work to remove Google’s control over the internet. Our hope is that in publishing this work right now we will let the genie out of the bottle and start a process towards socialization that cannot be undone.

Sorry, but I am deeply skeptical of this. This sounds like the first step towards a non-free internet. At the end of the day, it is your box on the web, and if you want or don't want someone/something to crawl it, that is your call to make.


Not sure why http://commoncrawl.org/ wasn't mentioned.


I've definitely scraped by this problem on several occasions. Recently I was writing a tool to check outgoing links from my site, to see which sites are offline (it's called notfoundbot). What I found was that many sites have "DDoS Protection" that makes such an effort impossible; other sites whitelist the cURL headers, and others like it when you pretend to be a search engine.

Basically writing some code that tests whether "a website is currently online or offline" is much, much harder than you think, because, yep, the only company that can do that is Google.
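
For what it's worth, here is a minimal sketch of that kind of check in Python, standard library only. The `USER_AGENT` string, the `classify` and `check` names, and the status-code buckets are my own assumptions, not how notfoundbot actually works. The point is that a 403, 429, or 503 often means "the bot was refused" rather than "the site is down", so it gets its own verdict:

```python
import urllib.error
import urllib.request

# Hypothetical User-Agent for this sketch; some sites reject Python's default one.
USER_AGENT = "Mozilla/5.0 (compatible; link-checker-sketch)"


def classify(status):
    """Map an HTTP status code (or None for a network failure) to a verdict."""
    if status is None:
        return "unreachable"
    if 200 <= status < 400:
        return "online"
    if status in (401, 403, 429, 503):
        # Typical of bot protection or rate limiting; humans may still get through.
        return "blocked-or-throttled"
    if status in (404, 410):
        return "gone"
    return "error"


def check(url, timeout=10):
    """Send a HEAD request and classify the outcome."""
    request = urllib.request.Request(
        url, headers={"User-Agent": USER_AGENT}, method="HEAD"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify(response.status)
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses arrive as exceptions but still carry a status code.
        return classify(exc.code)
    except OSError:
        # DNS failures, refused connections, and timeouts all end up here.
        return classify(None)
```

Calling `check("https://example.com/")` returns one of the verdict strings; anything in the blocked bucket probably needs a human to look at it.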


This "club" charges a membership fee of $10 a month (or $100 a year) to comment.

Does this go to some sort of nonprofit or holding entity that's governed by its members? Or do people have to trust the owner?


I'm not sure it is a good thing to have a public cache of everything that Google has. The issue is that websites will simply stop serving content to Google to protect their content from being accessed by their competitors. This in turn will make search much worse and will force us back to the pre-search dark ages of the internet. The sites may even serve an even more crippled version of their content just to get hits, but there is no doubt search quality will suffer.

We're left with a monopoly that is Google; destroying it now could be foolish.


I'm working on a search-related project. There is a paradox here. Many content producers hate Google taking their content without giving them a fair share of the revenue that search generates. But they block any competitors to Google from spidering or crawling their websites, or rate-limit programmatic queries.

The way things work in practice, much of the web tries to prevent any type of programmatic access through a combination of edge-tech and robots.txt policies.
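
To make the robots.txt side concrete, here is a sketch using Python's standard `urllib.robotparser`. The policy text, the example URL, and the `allowed` helper are hypothetical, but this is the general shape of a file that welcomes Googlebot and turns everyone else away:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything, everyone else nothing.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
# parse() takes the file's lines directly, so no network request is needed here.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent, url="https://example.com/articles/1"):
    """Return True if the parsed policy lets this user agent fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

With this policy `allowed("Googlebot")` is true and `allowed("SomeOtherBot")` is false. robots.txt itself is only advisory; the edge-tech part is what actually enforces it.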

Content producers are addicted to Google and block competition, even though it is killing them. As Shakespeare put it: "Like rats that ravin down their proper bane, A thirsty evil; and when we drink we die."


Maybe a naïve question, but what prevents Knuckleheads from ignoring robots.txt and crawling the site anyway? And if it's so easy to do, how does Google have a monopoly on crawling then?


On smaller sites, nothing usually. But on bigger sites you will be blocked. You will probably be blocked even if you do follow robots.txt


It's just rude to do so, and there are some technical issues with doing that as well (such as crawling an admin panel, which might trigger backend alarms/security alerts). Google also doesn't have a legal monopoly on crawling, only a natural monopoly, thanks to a lot of websites independently choosing to allow only Google and Bing because of the many issues with third-party crawlers (e.g. crawling all pages at once, costing money/slowing down the site[0]).

0: https://news.ycombinator.com/item?id=26593722


Their IP range can be blocked. Google has known IP ranges, and a website admin who wants to can allow crawlers based on those rather than on robots.txt, which would really mess up everyone else.
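
A rough sketch of what that allow-list looks like on the server side, using Python's `ipaddress` module. The networks below are placeholder documentation ranges, not Google's real ones, and `is_allowed_crawler` is a made-up name:

```python
import ipaddress

# Placeholder networks from the reserved documentation blocks.
# These are NOT Google's real ranges; a real list would need regular refreshing.
ALLOWED_CRAWLER_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_allowed_crawler(remote_addr):
    """Return True if the client address falls inside an allow-listed network."""
    address = ipaddress.ip_address(remote_addr)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixing both families in one list is safe.
    return any(address in network for network in ALLOWED_CRAWLER_NETWORKS)
```

Anything outside those networks gets refused no matter what its User-Agent header says, which is why a new crawler can't get in just by being polite.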


And those spiders can sometimes inadvertently act as a DDoS attack.

https://security.stackexchange.com/questions/16609/spider-at...


They really missed an opportunity to get creative with their own `robots.txt` implementation.


A somewhat offtopic question: why isn't there a lighter version of HTTPS that only signs messages, but doesn't encrypt them? This way it would be possible to cache the content and use the cache for indexing.


> Only a select few crawlers are allowed access to the entire web, and Google is given extra special privileges on top of that.

Hmm, so set up a VPN on the Google Cloud so you have a Google IP address, use a Google User-Agent, and go!


https://developers.google.com/search/docs/advanced/crawling/...

describes the procedure for checking "is this source Google". You couldn't fake it just by running on GCP.


Public cache is an awful idea. It initially sounds like a great public good, but many of us can see where it will go:

- Google will still have some "extra" crawling to keep its monopoly

- everyone else would be fighting for access to said cache, which will not be able to carry everyone who wishes. So, there will be rationing and favors

- that will quickly become a bureaucracy which would envelop every aspect of internet activity

- "undesirable" sites will be easily ejected from cache and forgotten forever

- you will end up with a single entity, paid for by the public and unable to go bankrupt, in charge of the whole internet. Google will go bankrupt if it doesn't satisfy people; the people running this entity don't even have to do a good job, because there's nobody else on the market (remember, we started with the need to eject Google? This will eventually happen).


This argument would be made a lot stronger with numbers...

What percentage of the top million webpages allow Googlebot? What percentage allow other robots?

Why not simply pretend to be Googlebot? After all, browsers pretend to be Mozilla...


Coincidentally, this item [1] has just turned up in HN - Common Crawl

[1] https://news.ycombinator.com/item?id=26594172


This isn’t illegal and it isn’t Google’s fault

Right there in the article..


Again, with critical context.

This isn’t illegal and it isn’t Google’s fault, but this monopoly on web crawling that has naturally emerged prevents any other company from being able to effectively compete with Google in the search engine market.


> There Should Be A Public Cache Of The Web

This might be closest to it: https://commoncrawl.org/


The irony is that they bitch about you not scraping search or other platforms without paid plans & want to do the same to you


Or use MaidSAFE where you get paid to serve your website as opposed to the other way around.


I disallow scanning on all my projects. After GDPR I also removed all analytics - I realised it is just a time sink - instead of focusing on content I would often focus on getting the bigger numbers. I am not a marketer, so it didn't have much value to me and it would just enlarge Google's dataset without any payment. I get that you cannot find my projects in search engines. I am okay with that :-)


Any word on or opinions about Brave's initiative to challenge search?


Yes, there was a large discussion here https://news.ycombinator.com/item?id=26328758


dupe/posted earlier etc

I also got confused about this page as there's another project of theirs around right now about RIP Google Reader that's on a separate domain...

Funny a site that's all about google this and that doesn't have clear URL/pages for their articles that can be linked to easily geez

Original post/discussion from the source, 3 months ago: https://news.ycombinator.com/item?id=25417067


https://knuckleheads.club/introduction/

That seems like an easy link?


so you want money. what about telling who you are and what the money specifically supports. and why not a free join tier?


How about a system whereby we tell others whether or not we want to be crawled/not crawled by them? /s


I think the solution here is everybody masquerades as Googlebot so we can render the whole thing moot


Ignoring robots.txt is trivial, that's why some(many?) sites enforce it by verifying source IP and recognize Googlebot from its IP addresses - how will you get access to one of those?


What does "recognize Googlebot from its IP addresses" mean? If I'm a human and I access a site, I have some other IP than Googlebot, how should this site know if I'm a human or knuckleheadsbot?



if you're claiming to be User-Agent: Googlebot, but your IP doesn't seem like it belongs to Google, don't you think it's a clear sign that you're FAKING IT?

The check itself could be implemented, for example, with an ASN or reverse DNS lookup, or by hard-coding Google's known IP ranges (though that's prone to becoming stale)
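
A sketch of the reverse DNS variant in Python: reverse-resolve the IP, check the hostname's domain, then resolve the hostname forward and make sure it points back at the same IP. The suffix list is my assumption (check Google's docs for the current one), the function name and the injectable lookup parameters are made up for illustration, and `gethostbyname_ex` only covers IPv4:

```python
import socket

# Assumed hostname suffixes for Google's crawlers; verify against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    # The lookups are injectable so the logic can be exercised without real DNS.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        hostname = reverse_lookup(ip)
        if not hostname.lower().endswith(GOOGLE_SUFFIXES):
            return False
        # Forward confirmation defeats someone who controls their own PTR record
        # and simply names their machine "something.googlebot.com".
        return ip in forward_lookup(hostname)
    except OSError:
        # Failed lookups (socket.herror, socket.gaierror) count as "not verified".
        return False
```

The forward lookup is the important half: anyone can set whatever reverse DNS name they like on their own IP block, but they can't make googlebot.com resolve back to it.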


I'm looking for $ 576


Money earns more money. Privilege begets more privilege.

This is not just true in the case of Google but in other domains as well, like the financial markets.

Would you blame capitalism?


I have an idea: remove the art of web crawling from the domain of a single company and instead create an international group of interested parties to run it. I'm thinking broadly along the lines of the Bluetooth SIG. Maybe it will be a bit more complicated, and require international political efforts, but it will make the search engine market way more democratic.


Seems like a private cache of the web would solve the problem? Why does it need to be public?


Seriously? Google is a private cache of the web. That is the problem.


Google doesn’t give anyone access to said cache. I mean one crawler with a shared API among competitors. So exactly the same as the public cache, but run by a private company and accessed for a small fee


I don't think you're quite clear on what the words "public" and "private" mean. "Public" is not a synonym for "run by the government" and "private" is not a synonym for "closed to everyone but the owner". Restaurants, for example, are generally open to the public, but they are not public. A restaurant owner is, with a few exceptions, free to refuse service to anyone at any time.

If it's "exactly the same as a public cache" then it's public, even if it is managed by a private company. The difference is not in who has access, the difference is in who decides who has access.


Ok I am not clear then, but I’m less clear after your comment! In a public cache, who would you want to decide who has access? Is simply saying “anyone who pays has access” enough to qualify as public? if so, then I agree and this was my (possibly poorly phrased) intention in the original comment.

But imo the restaurant model is also fine; in most cases people have access and it works.


> Is simply saying “anyone who pays has access” enough to qualify as public?

No because someone has to set the price, which is just an indirect method of controlling who has access.

> the restaurant model is also fine

It works for restaurants because there is competition. The whole point here is that web crawling/caching is a natural monopoly.

A better analogy here would be Apple's app store, or copyrighted standards with the force of law [1]. These nominally follow the "anyone who pays has access" model but they are not public, and the result is the same set of problems.

[1] https://www.thebrandprotectionblog.com/public-laws-private-s...


> run [by] a private company and accessed for a small fee

That is exactly the opposite of a public cache.


Not really. It serves the same function. Either you pay this hypothetical company or ??? pays to keep up the public one.


Just because it serves the same function does not mean the implementation is the same. Private military contractors and a US infantry squad serve the same function, but the implementation completely changes their context.

That being said what I think you're arguing for would be the implementation of a public utility or private-public business. If that's the case then yes, what you're saying is correct.


You can pull Google search results through an API to make a meta-search engine if you want to, but it's like $5 / 1k requests.


> Google doesn’t give anyone access to said cache.

It would also be useful for deep searches, exceeding the 1000 result limit, empowering all sorts of NLP applications.


Google makes $150+ billion from Google Search per year. Google Search could likely be operated for (much less than) $10 billion per year.

So, Google is in effect taxing us all $140 billion per year.

It's not dissimilar from how Wall Street effectively taxes us all for an even larger amount.

In both cases, we could use some kind of non-profit open system to facilitate web search and stock trading.

The Great Lie that Google is doing a good thing by charging money to insert "relevant ads" above the search results is totally wrong. If those ads are the most relevant, they should just be the top organic results, obviously.

Google mostly solved search 20 years ago. There's really nothing that impressive about Google Search in 2021. It should be relatively easy to replace it with something open, leveraging the massive improvements in hardware and software. It could operate like Wikipedia or Archive.org. The hard part is probably getting the right team and funding assembled.



