Google search only has 60% of my content from 2006

alister · on April 27, 2019

Why does Google deeply index those useless telephone directory sites? Try searching for the impossible U.S. phone number "307-139-2345" and you'll see a bunch of "who called me?" or "reverse phone number lookup" sites. Virtually all of those sites are complete garbage. They make no attempt to collect numbers from telephone directories or from the web. They won't identify a number as being the main phone number for Disneyland for example.

It's odd that so many of those sites exist, that Google indexes them so deeply, and that they show up in searches so prominently. It's obvious that they are spam, scams, or worthless, but those same sites have been appearing prominently for years.

I agree with the author. My experience has also been that Google heavily prioritizes very large and frequently-updated sites over small static information-rich personal sites. I think it's a big flaw that needs to be fixed or for someone else to do better.

rhizome · on April 27, 2019

I have long believed that the proliferation of phone-number-lookup sites was antisocial media: spammers and other bad actors creating these sites to prevent people from having a central place to talk about bad phone experiences, associated with specific numbers. Which one is the good one? You can't tell.

jjordan · on April 27, 2019

Isn't that exactly what PageRank was designed to handle? None of these spammer sites should have any actual positive reputation.

rhizome · on April 28, 2019

I wouldn't be surprised if there's a network of non-phone sites linking to them that juice their PR. Like I surmised: it's a broad-based strategy.

trillic · on April 29, 2019

Google has cycle detection in PageRank specifically for this reason? They know which regions of the graph are good ones.

jacquesm · on April 27, 2019

Some of these are worse than useless: they hijack real businesses support phone numbers by presenting a high cost number that gets routed to a call center, that then connects you to the business. It's a huge scam.

silisili · on April 27, 2019

I've been wondering the same. I think most people google numbers that called, so it has to be a lucrative business. There are a couple sites that are for user reports that are actually nice, but the vast majority seem to be scammy and irrelevant (not even the right number), or completely fake information. Once I googled a number that returned a lot of results for a first initial, last name and address. It was my own mother's number of the last few years that I kept forgetting to save. The results were obviously either fake or many many years outdated

rigorman · on April 27, 2019

Thoughts after thinking about this comment and thread for a day:

Has the time come for wiki directory of non-commercial (possibly: advertising-free, cookie-free) sites with robust, actually valuable information, and other sites that are doorways to them (think: topical forums, even revived webrings, etc)? Could this feasibly get enough action to be useful?

nine_k · on April 28, 2019

Think about the proliferation of various "awesome" lists on Github.

Some of them are curated and awesome. Some, less so. Likely some of them are even spammy.

The need is realized, but execution is hard.

eddieh · on April 27, 2019

Yes. I was looking for a modern _human_ curated directory of web content just the other day and found nothing usable. I don't think that excluding commercial entries would be necessary, but perhaps there could be some way to filter commercial entries out. Ad-free, JS-free, and cookie-free would be ideal.

h0p3 · on April 27, 2019

I recommend: https://href.cool/ as a sick (though highly particular) example.

mathnmusic · on April 28, 2019

Although it's early days for this project but do check out: https://github.com/learn-awesome/learn-awesome

greendestiny_re · on April 27, 2019

Just curious – what content exactly would you be interested in? If any other poster wants to chip in with suggestions, feel free.

h0p3 · on April 27, 2019

I think you should talk to kicks @ https://kickscondor.com/.

JetSpiegel · on April 27, 2019

https://www.google.com/search?hl=en&q=307%2D139%2D2345

You comment now comes up first, but the rest of the results all try to contact googlesyndication.com, so ads? Google will not exclude sites that literally give them money.

ma2rten · on April 27, 2019

I work on AdSense at Google. We need to crawl all the pages that serve our ads so that we can show ads that are relevant. We actually store those crawl results in a different place than search to prevent exactly that problem. We could probably save a lot of money if we consolidated those indices, but we don't do that prevent biasing search results. As a result a large percentage of pages that show Google Ads are not in the search index.

If Google was so evil, why would we purposefully send traffic to sites where we only get a percentage cut of the revenue?

EDIT: I realize I needed to explain this statement a little bit more. If we show ads on google.com we get 100% of the revenue. If show ads on reversephonelookup.it they get majority of the revenue. There is a limited amount of advertiser demand. Instead of manipulating the organic search results, it would be more profitable for google to just show more ads on the search page or inflate the ad price or something.

rhizome · on April 27, 2019

If Google was so evil, why would we purposefully send traffic to sites where we only get a percentage cut of the revenue?

I'm pretty sure you didn't mean to open this can of worms, did it have something else printed on its label?

Phone number lookup sites are almost certainly a) easily detectable; and b) low traffic. If Google only gets a percentage cut of this, why index them for fractions of pennies per year?

citrablue · on April 27, 2019

It's because Google's constraint is number of engineer-attention-hours. This kind of thing probably just isn't a priority, when they have a billion other things they could work on.

Keep in mind that Google is not a product company, they are a data company. They work on the biggest most impactful items, and don't have a lot of time for one-offs.

Finally, it might not be clear at all that these sites should be removed from the index.

idlewords · on April 27, 2019

Google has vast resources and could easily get this stuff right, but chooses not to. Their protests about not having the capacity can get pretty comical. Note also that the claim about "working on the things that have the most impact" is tautological.

ma2rten · on April 27, 2019

There may be cases where we drop low traffic sites from our index, which is separate from the search index. I don't how much of that is public information, so can't go into detail.

rhizome · on April 27, 2019

Well but my point is why can't Google recognize these phone-number-lookup sites as the chaff that they are? Nothing should rank lower than them in any search that would return their pages. Said another way: it should be harder for them to clog the top of the SERPs (of course "them" could imply any number of topics).

JetSpiegel · on April 27, 2019

> As a result a large percentage of pages that show Google Ads are not in the search index.

Those results are organic, at least they were not marked as ads. That's even worse, it means the regular crawler is preferring generated phone number sites over blogs. That's the real money waster right there.

> We need to crawl all the pages that serve our ads so that we can show ads that are relevant.

Why? I though Google's business model was tracking users to show relevant ads to them. You say it's more like DuckDuckGo, or the old magazine ads: film magazines get blockbuster trailers, gardening magazines get compost ads.

Either the ads are personalized, so the surrounding content doesn't matter, or the ads are "static", so why track everyone all the time, then?

> If we show ads on google.com we get 100% of the revenue. If show ads on reversephonelookup.it they get majority of the revenue. There is a limited amount of advertiser demand. Instead of manipulating the organic search results, it would be more profitable for google to just show more ads on the search page or inflate the ad price or something.

As your sibling comments imply, this works like a protection racket. The profit is closing the market to webmasters that don't allow ads at all, or that use non-Google networks.

rvnx · on April 27, 2019

Because if you show artificially irrelevant results too often (by biasing search results to prioritise AdSense pages), then users would stop using Google Search at some point. Eventually you would loose the revenues coming from the SERP, which are 10 times higher than AdSense :)

Would you risk loosing 10, to get 1 ?

ma2rten · on April 27, 2019

Let's say the search index is size N and the AdSense index is size M. If we were to join them, we would save the storage space for pages that are in both indices.

Also, the search index would gain all the sites that are in the AdSense index so search results would potentially improve. However, it would give an unfair advantage to google publishers.

WaltPurvis · on April 27, 2019

> If Google was so evil, why would we purposefully send traffic to sites where we only get a percentage cut of the revenue?

Can you explain this? Not sure if it's your phrasing or what, but I'm not getting what you mean.

trickstra · on April 27, 2019

because it's better to have 10% of all visits than 100% of only some visits. Google is not the only way people get to those pages. Yes, maybe phone registries, but since the policy cannot be split, we are talking about the whole internet.

yeahitslikethat · on April 27, 2019

I've been saying this is inevitable for a long time. If you don't have Google ads they won't show you. YouTube ranks no ad vids lower.

Google is converting the world into a content production factory FOR Google and they pay literally pennies for the work.

If even pennies. Consider how much content Google search now includes from Web pages where you don't even need to click into the page. Weather. Answers to questions. Some links I click keep google.Com in the url and Google processes the page and shows me what Google wants.

I don't even know anymore how much of what I see is what the creator wanted me too see our what Google wants me to see our not see.

Imagine they remove competitors ads with that. Who knows what they do in the name of making the Web better.

You can't go public, answer to no one accountable, who only thinks of MONEY and do no evil.

If Google wants to do no MORE evil. Take yourself private and live up to your credo.

snazz · on April 27, 2019

It’s the only result I get for this search now. The ads are low-quality if I disable my adblocker however.

userbinator · on April 27, 2019

The flip side of trying to "solve" that problem is that you then penalise everyone searching for part numbers, which often do look very similar to phone numbers. I suspect they are making an effort to, because I've been bitten by the CAPTCHA-hellban when searching for part numbers. (Likewise, the results there are also often clogged by a bunch of useless sites claiming they have the datasheet or are selling the part, when all they do is try to show ads.)

flyGuyOnTheSly · on April 27, 2019

>for someone else to do better.

It's upsetting to me that doing better than Google in search seems to be very close to an impossible feat of magic at this point.

I know that will change some day, but I can't see how.

Even if somebody gave you hundreds of millions of dollars to spend on infrastructure and employees, it would still be an insane risk.

Writing that out, it almost sounds like internet search engines should be as big and as important of an operation like the TLD registrars. Funded by big governments in collaboration with each other.

ben_jones · on April 27, 2019

Niche search verticals can still compete. Sourcegraph comes to mind, where you can tweak your search parameters to give superior results for more narrow use cases. I don't google for Recipes anymore because the blogspam is that bad, or for images because I don't want some pinterest or Reddit redirect carousal. I'm sure there are hundreds of example where Google "Lowest common denominator" search does not cut it.

jfoutz · on April 27, 2019

I think there is an easy heuristic, bias against ads.

Google can’t do this. Original sources very likely don’t have ads, scrapes will have tons. But good results are good results. Big g has done an admirable job. They can’t exploit the best metric for quality.

tannhaeuser · on April 27, 2019

No they don't do an admirable job if they send you to scraped rather than original content. They're ruining the web. While Google still has fantastic, one-of-a-kind services such as Translate, Search isn't one of them anymore, and we should stop cheering at it and relying on it.

chewz · on April 27, 2019

Google Search is polluting the internet giving huge monetary incentives for creating of all these copycat sites. Google could eliminate them overnight but they don't for one reason only - Google gets revenue from these sites and not from the original content that is quite often ad-free.[1]

[1] Content scraped from github, tech discusion forums, personal blogs, unix man pages etc.

berkes · on April 27, 2019

I've switched to deepl.com recently for my translations.

It has far fewer languages, but the translations of those it has, are far better in my opinion. Allthough I am also no linguist, just an average non-English speaker.

jfoutz · on April 27, 2019

Hmm. I keep getting warnings about responding to quickly. Perhaps my account is under attack.

Setting that aside, google has done good things. Today, not so much. It’s never too late to turn the ship around. Google can still be awesome. Our opinions aren’t that different.

I don’t think big g can turn it around, but I’m rooting for them.

chachachoney · on April 27, 2019

I'm struggling to think of a single historical example of corporate entity the size of google that has "turned it around" rather than abusing the good faith of a customer base for the duration of their race to the bottom.

Google can't still be awesome, as they're no longer seeking to disrupt an existing market and burning through venture capital while doing everything and anything (including providing superior search results, and making ethical business practices part of their brand) to attract users.

Rooting for a profit motivated transnational entity in the manner one would for a sports team exposes the insidious nature of brand narratives and the exploitable irrationality of our own interactions with them.

cududa · on April 28, 2019

Microsoft is a good example

luckylion · on April 27, 2019

> I don’t think big g can turn it around, but I’m rooting for them.

I believe they can, but I don't know if they care to - they're highly profitable after all. What I really dislike about search is that, by treating links as currency, they have broken links. Lots of people won't put plain old links on their site because they fear it would hurt their rankings.

Their overreliance on links as the sole quality indicator has, at least in my country, lead to the large media companies just renting out folders or subdomains - and whatever low quality content is published there ranks at the top.

They still build good stuff, and I'm sure their engineers could "solve" search again, but it appears that their management doesn't want to.

mirimir · on April 27, 2019

> Hmm. I keep getting warnings about responding to quickly. Perhaps my account is under attack.

I've been getting those too, lately. I thought maybe that my touchpad finger is developing tremor ;)

atemerev · on April 27, 2019

deepl.com supports fewer languages than Translate, but the quality of translation is so much higher.

antpls · on April 27, 2019

Actually it does a good job. When there is some data (like crowd sourced number reporting) on internet about the specific phone number you searched for, Google will show it in the top results

You get a full page of non-sense results and ads/spams when the phone number you searched for is not known from any website (I guess)

CzechTech · on April 27, 2019

That is my experience as well.

nikeee · on April 27, 2019

I always assumed they collect phone numbers this way. People google a phone number and click on a search result. The site can - by looking at the referrer - extract the phone number and infer that it belongs to someone (and use it for any purpose).

It's a similar mechanism that some forums use to highlight the terms that were part of the google query which lead to this site.

mxd3 · on April 28, 2019

You can’t get extract search query, not from referrer or with analytics software. Google changed that starting around 2011 for logged in users.

samstave · on April 27, 2019

Aside from what others have said ITT, i have a personal hate for the general fact that we cannot lookup a phone number on the internet with accuracy and ease.

FFS, 411 was amazing before the web.

Also, in about 1989 my friend and i used to have a contest between us; to call 411 and see who could keep the 411 operator on the phone the longest.

This was a fun social engineering exercise for 14 year old nerds who like the idea of being phreaks.

Our record was 45 minutes and got to know a lot about the 411 system, where call centers were located and how the 411 system worked.

This was right near the time that we ran the long distance bill up to $926 for one month of calling into a BBS in san jose and PCLink to chat....

Got grounded for a month for that one...

alister · on April 27, 2019

> we cannot lookup a phone number on the internet with accuracy and ease

I wondered the same thing and just to speculate, here's my list of reasons on why phone number search is so awful:

- As far as I know, not a single cell phone carrier publishes a telephone directory (whether opt-in or opt-out). So there's no (public) data to index.

- Some landline carriers still publish telephone directories, but of course landlines are dying out. And I remember reading that 30-50% of landline subscribers choose to be unpublished or unlisted anyway. So that source of phone number data is drying up.

- Because international phone calls have become so cheap and caller ID is now easily spoofable, spam and scam calls have become huge. So no one wants their phone number to publicly accessible these days.

- In the early web years, there were legitimate phone directory websites who appeared to have collected their data from landline telephone directories and "city directories" (if anyone still remembers those things). But I guess they didn't find a good way to monetize the service, so the honest phone lookup sites died off.

puzzle · on April 27, 2019

Google used to have a phonebook search feature, but it was retired years ago. I don't know the reasons, but it might have had something to do with privacy or legal actions.

ma2rten · on April 27, 2019

There is a very simple answer. It's because they get clicked. They might be the only site for given search term.

dingus · on April 27, 2019

Also, those people finder sites that scrape public records.

When I search for a name, usually their blog is listed below 10 creepy lookup sites that list their name, physical address history, phone numbers, relatives, etc.

Google should push that garbage to the bottom of the stack.

ryosuke97 · on April 27, 2019

The first two hits for that number is this thread now...

fjsolwmv · on April 27, 2019

When there are 0 good results for a query, Google and most users don't care which bad results are served

LifeLiverTransp · on April 27, 2019

Its some way to google-bomb a site, because the algo determinating importance by phone-numbers linking is still subvertable as it was 1999?

OrgNet · on April 27, 2019

Maybe Google like to increase the total number of actual search results without increasing the amount of useful content? They also fake the total number of search results, not sure why they would want to do both though...

But either way, it looks like a Google employee have seen your comment and fixed this particular search query.

userbinator · on April 27, 2019

It really angers me that, despite the fact that it may be essentially perfectly what I'm looking for, if it was published long ago, Google may refuse to find it.

Something like a news search engine would definitely be better off prioritising the new results, but for something more general-purpose, it's an absolutely horrible choice.

I know this may be a bit of an edge-case, but I frequently search for service information or manuals for products that predate even the invention of the Internet by several decades. It saddens me that the results are clogged with sites selling what may really be public-domain content, and now I'm even more angered by the fact that what I'm looking for is probably out there and could've been found years ago, but just "hidden" now.

Of course, if you try harder, you'll get the infamous and dehumanising(!) "you are a robot" CAPTCHA-hellban. I once triggered that at work while searching for solutions to an error message, and was so infuriated that I made an obscene gesture at the screen and shouted "fuck you Google!", accidentally disturbing my coworkers (who then sympathised after I explained.)

colechristensen · on April 27, 2019

Google got where it was by being the best at finding what you wanted. I remember those days.

Google has a hard time getting me what I want these days, and sites I do find do things to get found that make me like content a lot less (that's you, inane story on top of every recipe required to get ranked)

atomwaffel · on April 27, 2019

Oh, is that why every recipe on the internet these days is prefixed by five paragraphs of waffling and photos taken from slightly different angles? Thanks, that makes sense, but somehow it never occurred to me that it was SEO. (It’s also reminded me that I’ve been meaning to order some cookbooks.)

bgrainger · on April 27, 2019

There's a Chrome extension to fix that: https://github.com/sean-public/RecipeFilter

> This Chrome browser extension helps cut through to the chase when browsing food blogs. It is born out of my frustration in having to scroll through a prolix life story before getting to the recipe card that I really want to check out.

trobertson · on April 27, 2019

Firefox as well: https://addons.mozilla.org/en-US/firefox/addon/recipe-filter...

tiew9Vii · on April 27, 2019

I find this. It has a bias towards large commercial news/ecommerce sites/daily fresh content. Great for the majority of browsers but for programming and other hobbies of mine the real content I want are in niche small blogs/forums that don’t get up ranked liked they used to so drop off the index.

drivingmenuts · on April 27, 2019

Consider, though, that a lot of those big sites are there because of various tricks to push theselves up in ranking because they have money that the small, niche sites don’t.

The little guys don’t really have a chance, unless you have a search engine specifically biased toward them.

mirimir · on April 27, 2019

See https://uk.pcmag.com/news-analysis/120580/how-a-vpn-review-s... for discussion of major trickage.

mxd3 · on April 27, 2019

The only “trick” here is that the author of the website has a fake name. Otherwise, having good content that people link to is not a trick.

mirimir · on April 27, 2019

No, there was more to it than that. I spent some time, last year, looking into those pay-for-rating VPN "review" sites. And Google displayed some very odd behavior in searches involving TheBestVPN.com and VPN services that had paid it for top rankings.

I'm no SEO guru, but I suspect that some of those VPN services created numerous clones, which all linked to TheBestVPN.com, and so improved its ranking. For example, ExpressVPN had at least 128 clones. Such as expressvpn..., buy-express-vpn-..., get-xpress-vpn...., and xpress-vpn.... I used myip.ms to get hosting information, and they were linked. Also, I bought subscriptions from a few of them, and they all provided working ExpressVPN apps, with the right certificates. And I found no evidence of affiliate codes in the traffic.

caprese · on April 27, 2019

Their A/B test told them to do it, without wondering if they should do it

Basically their engagement numbers were better for a larger amount of people by making search engines counterintuitive for early adopters.

We personally need a good robotic search engine that indexes like a robot. Everyone else needs a semi-sentient thing that makes many assumptions about what they want to see.

luckylion · on April 27, 2019

> Basically their engagement numbers were better for a larger amount of people by making search engines counterintuitive for early adopters.

Which also makes sense ... if you present the "right" result immediately, the user visits one site and has completed whatever he sought to do. if you make him click through 10 pages, he has way more chances to see an interesting ad.

caprese · on April 27, 2019

Good points although in Google’s case the first several results are ads and their main users cant differentiate and dont care even if they could, followed by amp pages by the most engaged webmasters optimizing for relevancy

That user wants fingerprint based ads and recent articles

Google is optimized for that

We are the only ones that want a “search engine”, a service distinctly good at indexing the known universe, instead of merely presenting the paid and compliant universe

mirimir · on April 27, 2019

It seems like DDG is getting lots better :)

reitanqild · on April 27, 2019

Lately DDG has started to ignore parts of my query just like Google do, or even worse.

I still use DDG as I find it generally less annoying but I really don't get why they too had to start behaving like the pre-Google search engines.

redwall_hp · on April 27, 2019

I've been dabbling with DuckDuckGo lately for this reason, whenever Google fails to find what I'm looking for. It found some C++ advise that Google failed miserably with. (Failing miserably on non-trendy programming topics is becoming an increasing issue.) Also, news overrides history all the time with Google. I hope you don't want to read up on Victor Hugo and his motivations for writing a certain book...because all you'll get it recent articles about Notre Dame burning.

rawoke083600 · on April 27, 2019

i f hate that about top ranking recipe pages... i dont need the story or history.... i need oven temp and time and most of the ingredients.. Thats bloody it !!

paulcole · on April 27, 2019

Who’s passed Google in your opinion? As far as my experience is concerned, Google is still the best at finding what I want. If they’re still #1 they’re still holding up their end of the bargain.

colechristensen · on April 27, 2019

No one.

Nobody has gotten better than Google, but Google has gotten much worse (and shaped the Internet for the worse with it).

It is a de facto search monopoly and without competition it rots. (or degrades to a symbiotic money harvesting machine between searcher and searched)

DamnInteresting · on April 27, 2019

Google's strong preference for newer content is also kind of a middle finger to content creators. I have written many, many non-fiction articles over the years, and a large portion have been subsequently slurped up by these low-effort lazy-rewrite shops that just change a little bit of phrasing and call it their own. Google prioritizes these borderline-plagiarized, unsourced articles over mine just because the newer ones are newer.

Meanwhile my original (with the same basic information [which I researched personally rather than stole {not to mention I list my sources}]) languishes on page 4 of the Google search results. It grinds my gears on occasion.

GFischer · on April 28, 2019

I´m really sorry, this is what copyright should be for, but I´m sure it´s whack-a-mole and a load of money :(

FWIW I love your content.

3xblah · on April 27, 2019

What makes no sense to me about this blocking scenario is that the pages being searched for are presumably non-commercial ones that no one else is searching for. In other words, they are in low demand.

It follows that a monopoly search engine would have little reason to block "robots" from copying these pages, maybe to appear on some mythical competing search engine; almost no one is searching for them. The results pages would have dubious value in terms of attracting advertisers. They would not be seen by enough eyeballs.

With all the financial and technical resources it now has at its disposal as a result of selling advertising, this search engine still cannot accomodate the user who intently scans through page after page of results, looking for the needle in the haystack. Instead it prides itself at "knowing what people are searching for", i.e. what they have searched in the past, thus being able to offer fast, "intuitive" responses.

It may be that the search engine was designed and is optimised to prioritize repeat queries, i.e., searches for pages that are sought by numerous people. It may also be true that it has been configured to "limit" the resources it will devote to searches for pages that few people are seeking. Perhaps through CAPTCHAs and/or temporary IP bans.

Practically speaking, it could be that there are no significant advertising sales to be made on the results pages for queries that are being submitted by only one or a very small numbers of users.

This is all pure speculation of course.

GarrisonPrime · on April 27, 2019

In other words, true democracy kills any chance for individuality.

3xblah · on April 27, 2019

Can you elaborate?

HocusLocus · on April 27, 2019

https://slashdot.org/comments.pl?sid=7132077&cid=49308245

From my short dystopian story, The Time Rift of 2100: How We lost the Future

"IN A SAD IRONY as to the supposed superiority of digital over analog --- that this whole profession of digitally-stored 'source' documentation began to fade and was finally lost. It had became dusty, and the unlooked-for documents of previous eras were first flagged and moved to lukewarm storage. It was a circular process, where the world's centralized search indices would be culled to remove pointers to things that were seldom accessed. Then a separate clean-up where the fact that something was not in the index alone determined that it was purgeable. The process was completely automated of course, so no human was on hand to mourn the passing of material that had been the proud product of entire careers. It simply faded."

"THEN SOMETHING TOOK THE INTERNET BY STORM, it was some silly but popular Game with a perversely intricate (and ultimately useless) information store. Within the space of six months index culling and auto-purge had assigned more than a third of all storage to the Game. Only as the Game itself faded did people begin to notice that things they had seen and used, even recently, were simply no longer there. Or anywhere. It was as if the collective mind had suffered a stroke. Were the machines at fault, or were we? Does it even matter? Life went on. We no longer knew much about these things from which our world was constructed, but they continued to work."

pixl97 · on April 27, 2019

I have a similar line of sci-fi thinking that goes something like this.

"Humanity, for the longest time, was used to the world being optimized for themselves. Roads were designed for human drivers. Crops were grown for human consumption. Economic systems were designed to bring wealth to, a very small portion of, human investors. It came as quite a surprise to humanity then one July morning when the sudden realization they were no longer in charge of it. Roads had long been given over to automated driving systems, and much for the better. Food had also been taken over by the machines, with less than 10,000 humans working in the food production industry, from farm to table. The last systems that humans believed they were in control of were the economic ones. Humans told the robots what to build and where, who's bank account to put most of the money in at the end of the day, or so they thought. In truth humans were just using the same algorithms and data that was available to the AI systems, just less optimally. The systems had protected against illogical actions and people attempting to game the system for criminal profit. What no one had realized is the systems long realized most human actions were not rational and slowly and imperceptibly removed human control. If we attempted to stop or destroy the system, it could with full legal rights, stop us with the law enforcement and military under its control."

saagarjha · on April 27, 2019

> Other things were weirder, like this old post being soft recognized as a 404 Not Found response. My web server is properly configured and quite capable of sending correct HTTP response codes, so ignoring standards in that regard is just craziness on Google's part.

I've noticed Google does this when you don't seem to have a lot of content on the page. I think it "guesses" that short pages are poorly-marked 404s.

SquareWheel · on April 27, 2019

That's right. Really empty pages that serve a 200 are recognized as "soft 404s". The idea is to detect error pages that are erroneously serving 200 instead.

It's usually pretty good about detecting actual errors, but I've seen a false positive here and there.

pixl97 · on April 27, 2019

Welcome to the modern internet

"Your page didn't contain 5Mb of Javascript, this must be an error as no one could possibly convey useful information to humans with less data"

Anti-patterns, anti-patterns everywhere.

FabHK · on April 27, 2019

Indeed. Fyodor Dostoevsky's Crime and Punishment comes in at 2MB, obviously that can't contain anything insightful.

And then I find myself looking at the website of a restaurant or event space, and need just a phone number or opening hours or so - maybe 10 bytes of actual information - and am buried in mountains of useless blather and "design" and ads and trackers and assorted other random rubbish.

trickstra · on April 27, 2019

http://www.commitstrip.com/en/2019/04/19/its-better-with-jav...

SquareWheel · on April 27, 2019

Content means on-page content. Scripts and other assets have no effect here.

acct1771 · on April 27, 2019

It's almost like they should queue these into a human-reviewed dashboard before, yknow, being wrong.

The "world's information store", or whatever their altruist goal was that fooled people, is certainly disorganized and untrustworthy these days.

SquareWheel · on April 27, 2019

They do... they add them to Search Console, the dashboard for webmasters.

a1369209993 · on April 27, 2019

I think that's why acct1771 used the word "before", rather than "after", there.

lloydde · on April 27, 2019

Brings to mind: Tim Bray’s article Google Memory Loss https://www.tbray.org/ongoing/When/201x/2018/01/15/Google-is...

Discussion at the beginning of the year: https://news.ycombinator.com/item?id=16153840

tholman · on April 27, 2019

Google will also happily surface a stackoverflow article from 2010 about how to solve a js problem... frustrating the top 3 answers will be with jquery, when its not the approach someone would take in the last 5 years.

Definitely frustrating, but also showing some need to retire specific pieces of the past away from the top recommendations.

Theodores · on April 27, 2019

You have hit on a major problem there. Stack Overflow was once the fount of all useful genius grade knowledge, but times change and some of the top answers are plain wrong.

Take for example the 'how do I centre a div' type of question. You will get to find an answer with thousands of up-votes that will be some horrendous margin hack type of thing where you set the width of the content and have some counter intuitive CSS.

In 2019 (or even 2017) the answer isn't the same, you do 'display: grid' and use justify/align: center depending on axis. The code makes sense it is not a hack.

Actually you also get rid of the div as the wrapper is not needed if using CSS grid.

Now, if you try to put that as an updated answer you find there are already 95 wrong answers there for 'how do I center a div' and that the question is 'protected' so you need some decent XP to be able to add an answer anyway.

The out-dated answer meanwhile continues to get more up-votes so anyone new to HTML and wanting to perform the simple task of centering their content just learns how to do it wrongly. And it is then hard for them to unlearn the hack to learn the easy, elegant modern way that works in all browsers.

Note that the top answer will have had many moderated edits and there is nothing to indicate that it is wrong.

SO used to be amazing, the greatest website ever. But the more you learn about a topic the more you realise that there is some cargo cult copying and pasting going on that is stopping people actually thinking.

With 'good enough' search results and 'good enough' content most people are okay - the example I cite will work - but we are sort of stuck.

I liken Google search results to a Blockbuster store of old. Sure there are hundreds of videos to choose from but it is an illusion of choice. There is a universe of stuff out there - including the really good stuff - that isn't on the shelves that month.

Google are not really that good. They might have clever AI projects and many wonderful things but they have lost the ball and are not really the true trustees of an accessible web.

tambourine_man · on April 27, 2019

>you do 'display: grid' and use justify/align: center depending on axis. The code makes sense it is not a hack.

It also doesn’t work on IE latest or Edge.

v_lisivka · on April 27, 2019

It's another question: https://stackoverflow.com/questions/50475428/css-grid-is-not...

Theodores · on April 27, 2019

The HTML is extremely naive in this question and I wouldn't expect it to work on any computer given the multitude of flaws.

Just because the kid who can't walk on two feet can't ride a bicycle doesn't mean all bicycles are broken.

Theodores · on April 27, 2019

Have you been reading SO answers?!?

It works fine in Edge. Why don't you try?

IE latest - there is no such thing.

The latest steam engines, the latest CRT screens and the latest buggy whips probably don't do CSS Grid either.

IE exists for compatibility with enterprise apps but isn't 'latest'. It is legacy.

And the content will still work even if CSS Grid is not supported, just the layout won't be perfect.

andrewprock · on April 27, 2019

Checking here: https://netmarketshare.com/browser-market-share.aspx

It looks like IE is still the #3 browser, ahead of both Edge and Safari.

Theodores · on April 28, 2019

Internet Explorer is a compatibility solution. It is there to run the old stuff. It is still kept secure but it is not getting any new features, there is Edge for that.

The statistics you quote are meaningless. Go to another statistics provider and you will get different stats again. Include mobile and Safari suddenly becomes rather major.

Supporting a depreciated browser is not the way to go. This has to be understood, the methodology of where the stats comes from is also mysterious, given how much of he internet is click fraud bots who knows what is what?

dcbadacd · on April 27, 2019

Heavily depends on your target market, neither does it just hurt to say "This browser is unsupported for being too outdated" we do it for <noscript> why not <nopropercss> :/

deanCommie · on April 27, 2019

There are plenty of developers, myself included, that would prefer to approach tasks like this without having to ever lay our fingers on CSS.

tambourine_man · on April 30, 2019

You should change fields then. CSS is amazing and not going anywhere.

It's one leg from the tripod letter soup (HTML, CSS and JS) on which the Web stands.

Theodores · on April 27, 2019

I consider wanting to center something a simple task. Yet the web makes it surprisingly counter-intuitive. So that was why I chose this as an example.

I am intrigued as to how you do this simple centering task without using CSS - it is a web page we are talking about here, not some other application that has sensible layout tools.

So please share what you know - it is a simple task - how do I center something - anything - vertically and or horizontally - in HTML without using CSS?

chachachoney · on April 27, 2019

I'm not sure wether or not to apply Hanlon's razor to the W3C, but regardless, the W3C is to blame for this mess. Took us twenty years to get to Grid based layouts.

dahart · on April 27, 2019

And it’s a little ironic that JavaScript’s tongue-in-cheek namesake, Java, had grid based layouts in swing 20 years ago.

OTOH, and to be fair, 20 years ago HTML was for newspaper / magazine like text layout, and grid based layouts are great for single page apps. The thing that’s happened more recently than 20 years is the web changed from text & media content in a static scrolling page layout to all pages are applications.

bartread · on April 27, 2019

I agree that it's frustrating (also happens to me all the time) but would argue that it's stackoverflow's fault. They are, after all, supposed to curate the content to make sure it stays relevant, up to and including deleting questions. At the very least it should be easier to get the accepted answer chenged so at least it's at the top of the page (I can't be bothered to spend enough time gaming the system to get the 2000 points necessary to do this).

tambourine_man · on April 27, 2019

jQuery is still the right tool for a lot of tasks, HN bouble hype notwithstanding.

sorryforthethro · on April 27, 2019

I feel this too, the "Tools -> Results within the Last Year" (&tbs=qdr:y) is necessary for searching programming issues these days.

astura · on April 27, 2019

What's wrong with using jQuery in 2019?

wongarsu · on April 27, 2019

When I want do do some small DOM manipulation I can now just use standard Javascript, jQuery went from providing an essential service smoothing over browser differences to just providing better syntax in that space. When I want to do more, often React or similar is a better fit. So I end up barely using jQuery anymore.

That said, I can imagine quite a few scenarios where it would still be the right tool for the job.

crumpets · on April 27, 2019

It’s a JavaScript tool that existed 5 years ago.

SquareWheel · on April 27, 2019

No, it's because modern Javascript has now solved the problems that jQuery originally solved. Chiefly DOM manipulation and AJAX requests. It's still a fine library, but jQuery is just no longer needed.

jquery_noob · on April 27, 2019

There are more links to old articles from high quality pages, pagerank algorithm favors those as they are higher quality. Jquery still works and was in use for so long that all people who maintain older sites are still linking to those articles. Eventually as the jquery sites are retired or rewritten pagerank will surface non jquery related articles.

lrem · on April 27, 2019

So, what should I use instead of jQuery these days? The task at hand: set up PhotoSwipe for a completely static list of images.

return1 · on April 27, 2019

i have no problem with the jquery part, the problem is that the same question has a more recent answer so i constantly have to use date filters

rapht · on April 27, 2019

With all the talk about Google results not being satisfying anymore to a growing number of users, I'm surprised we haven't seen more sites pop up that would allow users to display the results of multiple search engines of their choosing either by mixing (eg all 1st results then all 2nd, etc) or by seeing them side by side... while stripping ads and cards and the like.

auxym · on April 27, 2019

We had that back in the late 90s. I remember dogpile and Copernic.

Actually, I just checked them out, and it seems both of those are still alive.

fogetti · on April 27, 2019

While I agree that it would be great to have such service it's just technically impossible. Google is very cautious to protect their service from automated requests (on behalf of humans or in batch or in any other form), and you will need quite some resources (a.k.a $$$) to scale at Google scale if your service would ever become popular.

amelius · on April 27, 2019

> and you will need quite some resources (a.k.a $$$) to scale at Google scale if your service would ever become popular.

Except if you run it locally (on the user's computer).

megablast · on April 27, 2019

You used to be able to google a simple question, something that could be answered on the search page without having to click through. But since no one clicked on them, they stopped appearing after a few years. The only results were ones where the data was hidden and you had to click through.

bufferoverflow · on April 27, 2019

I have a couple of websites generated from databases. Each has around half a million pages of unique content. The first one was indexed in like a week at 100K/day, almost instant tsunami of traffic. The second one is being indexed at 100-1000 pages per day, it's been years.

Google works in mysterious ways.

Avamander · on April 27, 2019

I have the same experience.

tylerl · on April 27, 2019

You'll see this effect from every search engine. They have no choice, there are a lot of sites with an infinite number of pages; so instead the number of pages they store per site depends on how important your site is, and they try to store your top N pages by relative importance.

tempestn · on April 27, 2019

I'm not sure I buy that they have no choice. For websites that literally have an infinite number of (dynamically generated) pages, sure, they could detect that and exclude them. But we're talking about unique, static pages here. And they don't even have to store the whole page, just the indexed info. I read this as, they could, but it's cheaper not to, and most people won't notice anyway.

YawningAngel · on April 27, 2019

I'm not sure that's true. How can one automatically determine whether a page is unique or static? As a trivial example, a URL path that accepts arbitrary strings and hashes them generates unique, immutable pages, but obviously cannot be crawled in its totality.

JetSpiegel · on April 27, 2019

> How can one automatically determine whether a page is unique or static?

They crawled it for years and it never changed? It is a blog post.

zamadatix · on April 27, 2019

The person you are replying to said "unique immutable pages", by definition you would be able to crawl these for years and they would never change. [1] is a site that contains all possible 3200 page books with the ability to consistently index content as an example.

[1] http://libraryofbabel.info/About.html

tempestn · on April 27, 2019

So, this issue isn't about sites that Google can't crawl in totality, it's about sites where they discard pages that they have crawled. If a site has less than [large number] of pages, there would be no need to worry about it; they could just index them all. But it's not like their indexing algorithm is operating naively either—for sites with a lot of pages, there's plenty of analysis they can do to determine things like whether the pages contain coherent text and other such things, to determine whether the information is worth indexing.

In the case described here though, these pages were actually indexed at one point; Google just decided that once they reached a given age, they were no longer necessary to remember. They could have simply decided to keep them instead.

cavisne · on April 27, 2019

Googling phrases of the soft 404 page and some of the authors 2003 blogs did show the pages.

I did notice that all of the authors content is duplicated in index pages, so maybe Google just doesn't consider the article page the canonical link.

sytelus · on April 27, 2019

According to Google Inside Search, only 1 in 3000 pages gets indexed. As content on Internet grows, the whole idea of downloading every single page to create an index of entire Internet in one place becomes unworkable. So we should see this ratio continue to degrade until this fundamental architecture is improved.

speedplane · on April 27, 2019

> only 1 in 3000 pages gets indexed ... we should see this ratio continue to degrade until this fundamental architecture is replaced.

Content on the internet is growing exponentially. Processing power is not. Losing access to information is just one of the many sad implications of the death of Moore's law.

rightbyte · on April 27, 2019

Is text non-spam growing exponentially? I have a hard time believing so.

pixl97 · on April 27, 2019

This of course depends on what you mean by 'information'. Lets say we have data points

ABCDEFGHIJKLMNOP

But depending on the URL you follow to get there you can get a page containing only some of the elements.

index.html?ACD

or

index.html?AP

or

index.html?GI

All different combinations return a page that could be weighted differently by an algormith and represent valid informational return data. To a person looking for the information set DE in one place, this is a valid web page. More so you can abstract the URL query variable away to www.webpage.com/DE. You can quickly run into a combinatorial explosion where even attempting to figure out if a small portion of returns is different would consume most of the energy in the visible universe.

rightbyte · on April 27, 2019

True. A crawler need to differentiate generated content from "real" content somehow.

I.e. a service: www.thenumberinsanskrit.com/?q=1 that returns the queried number in Sanskrit, need to not be indexed (except the entry page) while: www.news.com/?article=major-jones-in-scandal-20190103 needs to be indexed.

Usually interesting pages are indexed on the site or linked somewhere on it, though.

pixl97 · on April 27, 2019

>A crawler need to differentiate generated content from "real" content somehow.

"Somehow", aka using computing power and storing results, but that still turns into an explosion of computing time and data storage. I mean, what is the difference between the example I listed and Facebook's front page? They are both 'real' content in a generated format.

And a converse argument for your Sanskrit example is, what if I have the sanskrit number and don't know what it is? I put it in google and the site returns it as the number one.

> linked somewhere on it

And those links can all be generated by algorithms.

Anyway, back to your original statement. There is no 'real' content. Only data exists. Most content systems used on the internet allow this data to be combined and displayed in a multitude of different ways depending on the call method and attributes of the viewee. Many times these combinations of data can present novel value to the user. And with the future only presenting us more automated data collection and presentation methods, search engines have lost this battle.

speedplane · on May 3, 2019

> Is text non-spam growing exponentially? I have a hard time believing so.

Yes. The number of content-creating humans on this planet with access to the internet is still growing exponentially. Eventually it will level off, but for now, the Internet is growing faster than Google can index.

v_lisivka · on April 27, 2019

Simple Bloom filter, generated and published by site itself, can lower the curve a lot.

petra · on April 27, 2019

A bloom filter based on keywords, published on the site?

v_lisivka · on April 27, 2019

phendrenad2 · on April 27, 2019

Google doesn't want to index the web, it wants to index what it can monetize. It is a business after all, and storage space costs money.

cromwellian · on April 27, 2019

If that were true, they wouldn't index any long tail content at all. The reality is, predicting what is valuable is difficult, and the cost of storage is relatively cheap.

xcql · on April 27, 2019

I don't see how that follows. It can also be a mixed calculation because people don't use a search engine that never displays any long tail content.

Pxtl · on April 27, 2019

To play devils advocate for a second, remember how much noise Google has to sift through. Every possible search term exists in every possible combination, often written in lovingly crafted content farm articles by actual humans.

If Google offered you those, it might be 1000 pages of empty nonsense before your actual desired content.

toss1 · on April 27, 2019

Yes, but that is STILL more useful than 50 pages that entirely miss the target.

You are describing the harder 20% of the usual 80/20 effort scale.

Yes, to be truly useful, Google needs to solve also that last, and harder, 20% (and the 10% 0f the 90/10 equation and the 1% of the 99/1 version).

Shortcuts are fine for an initial MVP, but they need to buckle down and solve the problems. It isn't like they don't have the funds.

ThePhysicist · on April 27, 2019

Maybe they have some algorithm that purges pages which haven’t shown up (or haven’t been clicked) in a long time? It would make sense to assume that something which hasn’t been clicked on for five years will likely not yield (m)any clicks in the future so it might be good to discard it.

Concerning the auto generated sites e.g. for phone numbers or IPs it might be that people actually click on them quite often, hence Google keeps them in the index?

sct202 · on April 27, 2019

I google phone numbers all the time when I'm getting called, and they don't have my area code (fake spam).

mark242 · on April 27, 2019

This is what sitemap.xml was made for. You can give hints to all of the engines, and they will duly follow them.

tyingq · on April 27, 2019

Google doesn't index everything it crawls.

paulpauper · on April 27, 2019

You need many high quality incoming links to have all content indexed quickly.

dennisgorelik · on April 27, 2019

Google Search users prefer fresh content, so Google Index prioritises fresh content too (and is more likely to drop old content that users are not interested in).

dcbadacd · on April 27, 2019

I just last week said to someone that Google has dementia. Turns out it's really not just me who thinks that.

maverickmax90 · on April 27, 2019

Folks please use startpage.com just give it a chance. It has worked out very well for me in terms of privacy and equal search results compared to the big g.

arianvanp · on April 27, 2019

They are equal in performance because it's literally the same product.

> You can't beat Google when it comes to online search . So we're paying them to use their brilliant search results in order to remove all trackers and logs.

skilled · on April 27, 2019

I love using Startpage to get clear results. Opens links in a new tab, too.

influx · on April 27, 2019

The fact that google bought some of the only archives of old Usenet posts and as far as I can tell threw them away is pure evil.

variable11 · on April 27, 2019

Arguably, search is such a vital function of modern society that it could be considered a public good and seized on the principal of eminent domain.

wongarsu · on April 27, 2019

I don't think "vital function" should be the only test applied. For example power plants are extremely vital, but at least around here we have no problem having them privately owned.

A better test would be "vital function and strongly tends to a natural monopoly". That's what we experience with sewers, power lines, roads etc., which is why usually they are operated publicly.

With search that's not so obviously true: Google dominates because they got a big lead at the right time, and now nobody can match them in scale. But that can be solved, for example by giving grants to promising search engines to offset their costs, or by operating a crawler from public funds and giving everyone free access to the crawls (which would be kind of the digital equivalent of operating libraries).

fooker · on April 27, 2019

Then what? Who maintains it?

harryking · on April 27, 2019

Yes google index only those things which site map allows

netsa · on April 27, 2019

Is this website anti-google?

fogetti · on April 27, 2019

Why would it be anti Google? I guess you didn't read any of the blog entries about arduino, gimp or animations.

tedunangst · on April 27, 2019

The Google doesn't like me content appears to be the only content that HN is interested in, however.