Google Has Officially Killed Cache Links (gizmodo.com)
234 points by f_allwein 3 months ago | 113 comments



This sucks. Cache basically guaranteed that whatever Google thought was on the page could actually be found.

These days Google will offer a result where the little blurb (which is actually a truncated mini cache) shows a part of the information I'm looking for, but the page itself does not. Removing cache means the data is just within reach, but you can't get to it anymore.

Internet Archive is a hassle and not as reliable as an actual copy of the page where the blurb was extracted from.


Google used to have a policy requiring sites to show Google the same page that they serve users. It seems that has been eroded.

I'm not sure how that serves Google's interests, except perhaps that it keeps them out of legal hot water vs the news industry?


It's called cloaking and it's still looked down upon from an SEO perspective. That said, there's a huge gray area. Purposefully cloaking a result to trick a search engine would get penalized. "Updating" a page with newer content periodically is harder to assess.


There's also "dynamic rendering," in which you serve Google/crawlers "similar" content to, in theory, avoid JS-related SEO issues. However, it can just be a way to do what the parent commenter dislikes: render a mini-blurb unfound on the actual page.

Shoot, even a meta description qualifies for that - thankfully Google uses them less and less.


Google will reliably index dynamic sites rendered using JS. And other search engines do the same. There's really no good reason to do this if you want to be indexed on search engines.


Agreed. Yet whether it should be done is different than whether it is done. Google was recommending it in 2018, and degraded it to a "workaround" just two years ago. Sites still do it, SaaS products still tout its benefits, and Google does not penalize sites for it. GP's gripe about SERP blurbs being missing is still very much extant and blessed by Google.


People on HN have repeatedly stated that Google is "stealing their content" from their websites so it seems like this is a natural extension to that widespread opinion.

Isn't this the web we want? One where big corporations don't steal our websites? Right?


> People on HN have repeatedly stated

It is definitely an area where there is no single "people of HN" - opinion varies widely and is often more nuanced than a binary for/against⁰ matter. From an end user PoV it is (was) a very useful feature, one that kept me using Google by default¹, and I think that many like me used it as a backup when content was down at source.

The key problem with giving access to cached copies like this is when it effectively becomes the default view: users are held in the search provider's garden instead of the content provider being acknowledged, never mind visited, while the search service makes money from that through adverts and related stalking.

I have sympathy for honest sites when their content is used this way, though those that give search engines full text but paywall most of it when I look, and then complain about the search engine showing the fuller text, can do one. Same for those who do the "turn off your stalker blocker or we'll not show you anything" thing.

----

[0] Or ternary for/indifferent/against one.

[1] I'm now finally moving away, currently experimenting with Kagi, as a number of little things that kept me there are no longer true and more and more irritations² keep appearing.

[2] Like most of the first screen full of a result being adverts and an AI summary that I don't want, just give me the relevant links please…


Cached was only a fallback option when the original site was broken. When the original site works, nearly everyone clicks on it.


People on HN will always find a way to see a good side to Google's terrible product and engineering


Which is great: that's one less of their products, and it was terrible anyway.


ownership over websites does not work the way people expect nor want

I'm tired of saying this, yelling at clouds


I most commonly run into this issue when the search keyword was found in some dynamically retrieved non-body text - maybe it was on a site's "most recent comments" panel or something.


That policy was never actually enforced that way, however. They'd go after you if you had entirely different content for Google vs. for users, but large scientific publishers already had "full PDF to Google, HTML abstract + paywall to users" 20 years ago and it was never an issue.

It makes some sense, too, because the edges are blurry. If a user from France receives a French version on the same URL where a US user would receive an English version, is that already different content? What if (as usually happens) one language gets prioritized and the other only receives updates once in a while?

And while Google recommends treating them like you'd treat any other user when it comes to e.g. geo-targeting, in reality that's not possible if you do anything that requires compliance and isn't available in California. They do smartphone and desktop crawling, but they don't do any state- or even country-level crawling. Which is understandable as well: few sites really need to or want to do that, it would require _a lot_ more crawling (e.g. in the US you'd need to hit each URL once per state), and there's no protocol to indicate it (and there probably won't be one because it's too rare).


> It makes some sense, too, because the edges are blurry. If a user from France receives a French version on the same URL where a US user would receive an English version, is that already different content?

The recommended (or rather correct) way to do this is to have multiple language-scoped URLs, be it a path segment or entirely different (sub)domains. Then you cross-link them with <link> tags using rel="alternate" and hreflang (for SEO purposes) and give the user some affordance to switch between them (only if they want to do so).

https://developers.google.com/search/docs/specialty/internat...
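
As a rough sketch of what that cross-linking might look like (a minimal Flask example with hypothetical example.com/en and example.com/fr URLs; the names and markup here are illustrative, not taken from the linked docs):

  # Minimal sketch, assuming hypothetical /en/ and /fr/ language-scoped URLs.
  # Each version cross-links its alternates with rel="alternate" hreflang tags
  # and offers a visible way to switch languages manually.
  from flask import Flask, render_template_string

  app = Flask(__name__)
  LANGS = {"en": "English", "fr": "Français"}

  PAGE = """<!doctype html>
  <head>
    {% for code in langs %}
    <link rel="alternate" hreflang="{{ code }}" href="https://example.com/{{ code }}/">
    {% endfor %}
    <link rel="alternate" hreflang="x-default" href="https://example.com/">
  </head>
  <body>
    <p>Content in {{ langs[lang] }}.</p>
    <nav>{% for code in langs %}<a href="/{{ code }}/">{{ langs[code] }}</a> {% endfor %}</nav>
  </body>"""

  @app.route("/<lang>/")
  def page(lang):
      if lang not in LANGS:
          return "Not found", 404
      return render_template_string(PAGE, lang=lang, langs=LANGS)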

Public URLs should never show different content depending on anything else than the URL and current server state. If you really need to do this, 302 Redirect into a different URL.

But really, don't do that.

If the URL is language-qualified but it doesn't match whatever language/region you guessed for the user (which might very well be wrong and/or conflicting, e.g. my language and IP's country don't match, people travel, etc.) just let the user know they can switch URLs manually if they want to do so.

You're just going to annoy me if you redirect me away to a language I don't want just because you tried being too smart.


> Public URLs should never show different content depending on anything else than the URL and current server state.

As a real-world example: you're providing some service that is regulated differently in multiple US states. Set up /ca/, /ny/, etc. and let them be indexed, and you'll have plenty of duplicate content and all sorts of trouble that comes with it. Instead you'll geofence like everyone else (including Google's SERPs), and a single URL now has content that depends on the perceived IP location, because both SEO and legal will be happy with that solution, and neither will be entirely happy with the state-based URLs.


> You're just going to annoy me if you redirect me away to a language I don't want just because you tried being too smart.

So what do you propose such a site show on its root URL? It's possible to pick a default language (e.g. English), but that's not a very good experience when the user's browser has already told you they prefer a different language, right? It's possible to show a language picker, but that's not a very good experience for all users either, as their browser has already told you which language they prefer.


See sibling comments.


What about Accept headers?


Quoting MDN https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Ac...

> This header serves as a hint when the server cannot determine the target content language otherwise (for example, use a specific URL that depends on an explicit user decision). The server should never override an explicit user language choice. The content of Accept-Language is often out of a user's control (when traveling, for instance). A user may also want to visit a page in a language different from the user interface language.

So basically: don't try to be too smart. I'm more often than not bitten by this as someone whose browser is configured in English but who often wants to visit pages in their native language. My government's websites do this and it's infuriating, often showing me broken English webpages.

The only acceptable use would be if you have a canonical language-less URL that you might want to redirect to the language-scoped URL (e.g. visiting www.example.com and redirecting to example.com/en or example.com/fr) while still allowing the user to manually choose what language to land in.
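
A minimal sketch of that acceptable use, assuming a hypothetical Flask app with /en/ and /fr/ routes (nothing here is from MDN or Google's docs): only the bare root redirects, an explicit user choice (stored here in a cookie) always wins, and Accept-Language is used purely as a hint.

  # Sketch: the language-less root redirects once; an explicit choice beats the
  # Accept-Language hint, and the language-scoped URLs never redirect.
  from flask import Flask, redirect, request

  app = Flask(__name__)
  SUPPORTED = ["en", "fr"]

  @app.route("/")
  def root():
      chosen = request.cookies.get("lang")  # explicit user choice, if any
      if chosen not in SUPPORTED:
          # Accept-Language is only a hint; fall back to a default on no match.
          chosen = request.accept_languages.best_match(SUPPORTED) or "en"
      return redirect(f"/{chosen}/", code=302)

  @app.route("/<lang>/")
  def page(lang):
      # Language-scoped URLs always serve their own language, no guessing.
      return f"Content in {lang}" if lang in SUPPORTED else ("Not found", 404)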

If I arrive through Google with English search terms, believe it or not, I don't want to visit your French page unless I explicitly choose to do so. Same when I send some English webpage to my French colleague. This often happens with documentation sites and it's terrible UX.


"Accept" specifies a MIME type preference.


You said accept headerS and since the thread was about localization I assumed you meant Accept-Language.

To answer your comment: yes, you should return the same content (resource) from that URL (note the R in URL). If you want to (and can), you can attend to the Accept header to return it in another representation, but the content should be the same.

So /posts should return the same list of posts whether in HTML, JSON or XML representation.

But in practice content negotiation isn't used that often and people just scope APIs in their own subpath (e.g. /posts and /api/posts) since it doesn't matter that much for SEO (since Google mostly cares about crawling HTML, JSON is not going to be counted as duplicate content).
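
For what it's worth, a bare-bones sketch of that kind of negotiation (Flask again, with made-up post data): one /posts URL, one resource, with the HTML or JSON representation picked from the Accept header.

  # Sketch: one URL, one resource, two representations chosen via Accept.
  from flask import Flask, jsonify, request

  app = Flask(__name__)
  POSTS = [{"id": 1, "title": "Hello"}, {"id": 2, "title": "World"}]

  @app.route("/posts")
  def posts():
      best = request.accept_mimetypes.best_match(["text/html", "application/json"])
      if best == "application/json":
          return jsonify(POSTS)  # JSON representation
      items = "".join(f"<li>{p['title']}</li>" for p in POSTS)
      return f"<ul>{items}</ul>"  # HTML representation (the default)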


Why are XML and JSON alternates of the same resource but French and German are two different resources?


Because the world is imperfect, and separate URLs instead of content negotiation make for a far better user and SEO experience, so that's what we do in practice.

IOW pragmatism.


Giving the content in the user agent's configured language preference also seems pragmatic.


In what way is ignoring this pragmatic?

https://developers.google.com/search/docs/specialty/internat...

> If your site has locale-adaptive pages (that is, your site returns different content based on the perceived country or preferred language of the visitor), Google might not crawl, index, or rank all your content for different locales. This is because the default IP addresses of the Googlebot crawler appear to be based in the USA. In addition, the crawler sends HTTP requests without setting Accept-Language in the request header.

> Important: We recommend using separate locale URL configurations and annotating them with rel="alternate" hreflang annotations.


Content delivery is becoming so dynamic and targeted that there is no way that can work effectively now -- even for a first impression, as one or more MVTs (multivariate tests) may be in place.


>Internet Archive is a hassle and not as reliable

The paradox of the Internet Archive's existence is that if all that data were easily searchable and integrated (i.e. if people really used it), they would cease to exist, both from having no more money for bandwidth and from lawsuit hell. So they exist to share archived data, but if they really shared archived data they would not exist.

And so it is a wonderful, magical resource, absolutely, but your "power user" level as well as your "free time" level has to be such that you build your own Internet Archive search engine and Google cache plugin alternative... and not share it with anyone, for the above existential reasons.


I don't have a current example of this, but I'd just like to add that they also do this for images.

E.g. images deleted from Reddit can sometimes be seen in Google image results, but when following the links they might lead to deleted posts.


That is because the image and the post are treated as separate entities, with the post providing metadata for the image so the search engine will find it when you query. Even on Reddit's side, if the post is gone the image may remain, so you could be being served that by Reddit (or a third party) rather than from Google's cache (any thumbnail will come from Google, of course).


I've been puzzled by this move for a long time, since they first announced it. None of their provided reasoning justifies removal of such a useful feature.

The simplest answer could be that making the cache accessible costs them money and now they're tightening their purse strings. But maybe it's something else...

For sites that manipulate search rankings by showing a non-paywalled article to Google's search bot, while serving paywalled articles to regular users, the cache acts as a paywall bypass. Perhaps Google was taking heat for this, and/or they're pre-emptively reducing their legal liabilities?

Now IA gets to take that heat instead...


I assume it's to stop people using the cached copies as source material for LLMs.

The cache is arguably a strategic resource for Google now.


If it is a scraping thing, I'd rather they added captchas than took the feature away entirely. I know captchas can be bypassed, but so can paywalls.


I always wondered if it was legal for Google to store those cached pages.


The killing presumably has something to do with the legal costs of maintaining this service.


This breaks my heart a bit. My first browser extension, Cacheout, was around 2005, back in the days of sites getting hugged to death by Slashdot. The extension gave right-click context menu options to try loading a cached version of the dead site. It tried Google cache first, then another CDN caching service I can't remember, and finally the Wayback Machine. The extension even got included in a CD packaged with MacWorld magazine at one point.

This has always been one of Google's best features. Really sad they killed it.


> It tried Google cache first, then another CDN caching service I can't remember, and finally the Wayback Machine.

Coral Cache maybe? The caches you listed were my manual order to check when a link was Slashdotted.

Google's cache, at least in the early days, was super useful in that the cache link for a search result highlighted your search terms. It was often more helpful to hit the cached link than the actual link since 1) it was more likely to be available and 2) it had your search terms readily apparent.


I remember CacheFly being popular on Digg for a while to read sites that got hugged.


Coral Cache! Thanks. Totally forgot that one.


I am surprised that no one has mentioned the most obvious alternative: Bing Cache.

It is not as complete as Google's, but it is usually good enough.


Yandex also has a pretty extensive cache, although recently they seem to have disabled caching for reddit. Otherwise it is good for finding deleted stuff, I've seen cached pages go as far back as a couple of years for some smaller/deleted websites.


I only ever used cache to find what google thought was in the site (at the time of crawling) as these days it is common to not find that info in the updated page. For everything else, there is the Internet Archive.


Thanks! I never go to Bing but I probably will now.


I recall the links disappearing quite a while ago. It's a bummer because cached links are genuinely useful - they help one visit a site if it's temporarily or recently down, can sometimes bypass weak internet filters, and can let one view the content of some sites without actually visiting the server, which may be desirable (and maybe undesirable for the server if they rely on page hits).


The article is from February 2024, so you probably noticed them going around the time it was published! For some reason people seem to be talking about it again as though it only just happened; I've seen this and similar articles/threads posted in a couple of other places this week.


It's odd, because I haven't seen cache links on Google for years. I used to rely on them quite a bit and once in a while would try again and run into "oh yeah, they seem to have dropped this feature." This whole thread is strange to me, sounds like they've been around for people much more recently? Or maybe moved location and I haven't found them (which is weird cause I looked...)


Not just you, I haven't seen links to the feature, even when I've gone looking for it, in years. Even the link on archive.is to use Google's cache if the page wasn't already archived hasn't worked in quite a while.


> It's odd, because I haven't seen cache links on Google for years

For quite a while they stopped being a simple, obvious link but were available in a drop-down list of options for results for which a cached copy was available.


Not OP, and yes they "hid" it that way too. But I got the distinct sense that they removed it many years ago for certain websites (and more and more over the years I guess till now). They probably had some sort of flag on their analytics dashboard that website owners were given the privilege of changing so that people couldn't see the cache. Or for all we know it was some sort of "privacy" feature similar to "right to be forgotten".


For some sites. For years many search results didn't have a cache link in the drop down.


Same here, I found this submission really odd since I haven't seen them in years. Maybe they did some slow rollout by country?


That was never Google's job anyway. It boggles my mind how there is very little public investment in maintaining information, while tons of money is wasted keeping ancillary things alive that nobody uses. We should have multiple publicly funded internet archives, and a public communication infrastructure fallback, like email.


That's part of the idea behind the EU's "very large online platform" rules: basically, if your platform gets big enough that it's effectively important infrastructure, that comes with responsibilities.

I would welcome some rules and regulations about this kind of stuff. Can you imagine if Google wakes up one day and decides to kill Gmail? It would cause so many problems. It's already the case that even as a paid Gmail user you can't get proper support if something goes awry. Sure, you can argue they can decide whatever they want with their business. But if you have this many users, I do think at some point that comes with some obligations of support, quality and continued service.


I wouldn't trust google to decide what information should be stored and what should be condemned to damnatio memoriae


What is Google's job? Is it only to leech off the public internet?


Broker ads intelligence


The cache link predates Google's ads business


As a private-sector company, their job is maximizing revenue forever. Their old slogan was "don't be evil"; they have a new slogan, which is "vanquish your enemies, to chase them before you, to rob them of their wealth, to see those dear to them bathed in tears".


Use the internet for a week without any search engine.


Google is not, in fact, the only search engine.

For most users the internet has 5, maybe 10 web sites. I can use Wikipedia search or LLMs when I have questions.


I see your point with Wikipedia, but the writing is on the wall for LLMs: since they are replacing search engines for some users, it's only a matter of time before that experience gets polluted with "data-driven" ad clutter too.


Every search engine and also LLMs engage in the same problematic behavior.

Maybe just use Wikipedia search only then!


Compared to the experience of using Google without an ad blocker?


Well, you can put it like that, or you can answer in good faith.


Most people want to know what is happening here and now, and if they want information about the past they prefer the latest version. Archival is a liability, not an asset, in Google's case


But they made it their job, got people to depend on it, and then yanked it away without telling anyone first.


This article is from February. Since then, the IA partnership did materialize, and the "on its way out" `cache:` search workaround (which is still wholly necessary imho) still works.

https://blog.archive.org/2024/09/11/new-feature-alert-access...


3 days ago - "Google partners with Internet Archive to link to archives in search" - https://news.ycombinator.com/item?id=41513215

Looks like cached pages just got more useful, not less.


Search “cache:https:// gizmodo(.)com/google-has-officially-killed-cache-links-1851220408” on Google... the cache is still around, just the links are gone. Also, this article is from February.


From the article:

  For now, you can still view Google’s cache by typing “cache:” before the URL, but that’s on its way out too.


Oops, didn’t see that


> There’s another solution, but it’s on shaky ground. The Internet Archive’s Wayback Machine preserves historic copies of websites as a public service, but the organization is in a constant battle to stay solvent. Google’s Sullivan floated the idea of a partnership with the Internet Archive, though that’s nothing close to an official plan.

Man, wish the Internet Archive hadn't staked it all tilting at copyright windmills...

(see e.g. https://news.ycombinator.com/item?id=41447758)


TBH this is why I'm partial to Microsoft Recall or something similar, because inevitably it's going to get monetized to address link rot... and private data. Too bad there isn't a P2P option where you can "request" screenshots of cached webpages from other people's archives. Maybe it's all embedded in LLM training data sets and will be made public one day.


Hah, this is definitely going to happen. First LLMs kill the original public internet by simultaneously plagiarizing and disincentivizing everything original; then, after it disappears, they can sell it back to us again by unpacking the now-proprietary model data, which has become the only "archive" of the pre-LLM internet. In other words: a product so perfect that even avoiding the product requires you to use the product. What a complete nightmare.


I think I've seen an extension (?) that would auto-save every webpage to your device, probably on r/datahoarder, that I'm still trying to find. I also have used a relatively easier auto-archive-to-wayback-machine extension that's probably close enough for most people.


I have this set up with ArchiveBox. Unfortunately, if you do much browsing, it will very quickly saturate memory and CPU on whichever machine is running ArchiveBox. It also gets really big, really fast. There are also increasingly websites that block it, so when you look at the archive it is either empty or worthless. Still worth it to some people, but it does have its challenges.


Another one to be added to the list:

https://killedbygoogle.com/


"Google Has Officially Killed Cache Links" (Feb 2024)

The cache is often still accessible through a "cache:url" search. There's been no official announcement, but it does seem like that could go away at some point too. That is even more likely now that Google has partnered with the Internet Archive.

What I'd really like to see, and maybe one good possible outcome of the mostly bogus antitrust suits, is a continuously updated, independent crawl resource like Common Crawl.


Misleading: article from February.

Lots of discussion then:

https://news.ycombinator.com/item?id=39198329

More recently:

New Feature Alert: Access Archived Webpages Directly Through Google Search

https://news.ycombinator.com/item?id=41512341


Good open, decentralized project opportunity.


I once had to reconstruct a client's website from Google's cache links. It was a small business that had paid for a backup service from their ISP that turned out never to have existed.


Cached pages were amazingly useful in my prior role where a main objective was to detect plagiarism. There were only a handful of cheater sites in play, and 100% of them were paywalled.

So searching them in Google was exactly how students found the answers, I assume, but we wouldn't have had the smoking gun without a cached, paywall-bypass, dated copy. $Employer was definitely unwilling to subscribe to services like that!

(However, the #1 most popular cheat site, by far, was GitHub itself. No paywalls there!)


“There’s another solution, but it’s on shaky ground. The Internet Archive’s Wayback Machine preserves historic copies of websites as a public service, but the organization is in a constant battle to stay solvent. Google’s Sullivan floated the idea of a partnership with the Internet Archive, though that’s nothing close to an official plan.”

Too lazy to find a link, but this is now public and live, although pretty well hidden. Three dots menu for a search result -> More about this page.


It seems like just a templated link in a hidden corner.

"The Wayback Machine has not archived that URL."

A large part of the usefulness of the cache links came from the inherent freshness and completeness of the Google indexing.


If the partnership with the Internet Archive happens, I would be glad that IA will get better funding to keep operating. But I am also concerned about a Firefox-like situation happening with IA, where Google pulling funding might pose an existential risk to IA.


If Google doesn't want to maintain their own cache why would they pay to maintain someone else's cache?


Weird, it still seems to be working for me: https://webcache.googleusercontent.com/search?q=cache:http:/...

It's been invisible in the search UI for some time now, but the service itself is still accessible.


I’m behind a corporate proxy. This means that a very very large portion of the internet is now unavailable to me.


If you need to access these sites for work, I suggest requesting them one by one. Generally, filters don't get adjusted until people complain. After you become the number one ticket creator for mundane site requests, they'll usually bend the rules for you or learn to adjust the policy.

The reality is that people who create these filter policies often do so with very little thought, and sans complaints, they don’t know what their impact is.


If your company actually does this, that's impressive. The vast majority of big corporates that I have seen do not even really review these requests unless they come from a high-ranking person. When they do actually review them, it's usually a cursory glance or even just a quick lookup of the category their web filters put it in, followed by a rapid and uninformed decision to deny the request. Oftentimes they won't even read the justification written by the employee before they deny the request.

God help you if you need something that's not TCP on port 443. Yes, I'm still a little bit bitter, but I have spent a lot of time explaining the difference between TCP and UDP to IT guys who have little interest in actually understanding it, ultimately won't understand it, and will just deny the request. Sometimes after conferring with another IT person who informs them that UDP is insecure and/or unsafe, just like anything not on port 443.


I see this as a continued sad slide away from Google as research tool towards Google as marketing funnel.



That explains why they added a link to the Internet Archive in the results' additional info.

And some people considered that a "victory" for IA.

They'll just foot the bill while Google reaps the rewards.


I don't think I've used a cache link in some time. It stopped being reliable years ago, and the archive.ph type of services seemed to pick up the slack and do a much better job.


Really just one more, if not the final, nail in Google Search's coffin tbh.

It's VERY rare these days that a Google search result actually contains what was searched for - for anything with a page number in the URL, the cache was guaranteed to be the only way to access it.

Combine that with the already absolutely epic collapse of their search result quality and MS Copilot locally caching everything people do on Windows, and this may well be recorded in history as the peak of Google before its decline.

very sad day.


Remember that the clients of most search engines are advertisers, which incentivizes the engines not to serve the most relevant results right away. You could give paid search engines a (free) try and see if they would be worth your money.


Didn't they do that like... six months ago? Thus why they partnered with the Internet Archive recently.


Wonder if this is really just a cost cutting measure. Those “cache links” were essentially site archives.


Seems like a really good opportunity for a browser extension to offer links to other sources


Just paid for 1 year of kagi. See ya


fwiw, Yandex still frequently has cached versions, and you can save the cache in archive.today.


so depressing. but bing still provides a link back to the cached version.


Many complaints about the passive voice are overblown: it’s a perfectly fine construction and most appropriate in some places. (It’s also frequently misidentified, or applied to any evasive or obfuscatory sentence, whether grammatically active or passive.)

But here is an instance where all the opprobrium is justified:

> So, it was decided to retire it.

“It was decided”? Not you decided or Google decided, but it was decided? Come on.


What??? Oh no. I love that feature so much. What should I use in the future then? IA can be a solution, but often the link I am interested in is not there - for example, foreign news from a developing country.


Nothing; modern website owners think their IP is being stolen by Google, IA and similar sites archiving it, and the law agrees.

You wouldn't want to be a thief... right?


I see. I was sad but you are right. This is an understandable decision.


[flagged]


What do you mean? I'm not supporting Google at all, it's great that another of their services has been turned off so they won't evilly download webpages anymore.


You have a not-very-smart way of demonstrating you don't support Google: blaming the content creators.


Why must it be a binary? Either you support Google or you support the content creators?

You don't think it's possible for someone to simultaneously think that Google has made a terrible call, while also thinking that the IP industry, copyright people, and yes, many content creators have gotten insane with their "rights"? (And I say this as a content creator.)


> You don't think it's possible for someone to simultaneously think that Google has made a terrible call

Yes, it is possible, but it is not what you did, in not just one but multiple posts. And it is also off-topic, so your justification is nonsense.


The Bing search engine has a cache.


JFC ... another nail in the coffin ...


I left google a while ago, removing cache is yet another reason to leave.



