Ask HN: What are these low quality “code snippet” sites?
594 points by endofreach on Dec 1, 2021 | 316 comments
Whenever I am trying to google a code issue I have, there are countless low-quality sites just showing SO threads with no added value whatsoever. It is so annoying it actually drives me mad.

Does anyone know what's up with that?

I am really disappointed because the guys creating these sites (I guess for some kind of monetization) must have some relation to coding. But I feel this is an attack against all of us. Every programmer should be grateful for the opportunity to find good-quality content quickly. Now my search results are flooded with copy & paste from SO. They are killing that.

Am I the only one experiencing this or being that annoyed by it?

P.S: I don't name URLs because if you don't know what I am talking about already, you probably don't have that issue.




For years now I've run a programming site (stackabuse.com) and have closely followed the state of Google SERPs when it comes to programming content. A few thoughts/ramblings:

- The search results for programming content have been very volatile over the last year or so. Google has released a lot of core algorithm updates in the last year, which has caused a lot of high-quality sites to either lose traffic or stagnate.

- These low-quality code snippet sites have always been around, but their traffic has exploded this year after the algorithm changes. Just look at traffic estimates for one of the worst offenders - they get an estimated 18M views each month now, which has grown almost 10x in 12 months. Compare that to SO, which has stayed flat or even dropped in the same time-frame.

- The new algorithm updates seem to actually hurt a lot of high-quality sites as it seemingly favors code snippets, exact-match phrases, and lots of internal linking. Great sites with well-written long-form content, like RealPython.com, don't get as much attention as they deserve, IMO. We try to publish useful content, but consistently have our traffic slashed by Google's updates, which end up favoring copy-pasted code from SO, GitHub, and even our own articles.

- The programming content "industry" is highly fragmented (outside of SO) and difficult to monetize, which is why so many sites are covered in ads. Because of this, it's a land grab for traffic and increasing RPMs with more ads, hence these low-quality snippet sites. Admittedly, we monetize with ads but are actively trying to move away from it with paid content. It's a difficult task as it's hard to convince programmers to pay for anything, so the barrier to entry is high unless you monetize with ads.

- I'll admit that this is likely a difficult problem because of how programmers use Google. My guess is that because we often search for obscure errors/problems/code, their algorithm favors exact-match phrases to better find the solution. They might then give higher priority to pages that seem like they're dedicated to whatever you searched for (i.e. the low-quality snippet sites) over a GitHub repo that contains that snippet _and_ a bunch of other unrelated code.

Just my two cents. Interested to hear your thoughts :)


I wonder if we'll see a comeback of hand-curated directories of content? I feel like the "awesome list" trend is maybe the start of something there.

I would be willing to pay an annual fee to have access to well-curated search results with all the clickbait, blogspam, etc. filtered out.

Until then, I recommend uBlacklist[0], which allows you to hide sites by domain in the search results page for common search engines.
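For the curious, a personal blocklist in it can be as simple as a couple of entries like the following (the first is a match pattern for a made-up snippet-farm domain, the second a regex meant to catch Pinterest's many ccTLD variants; treat it as a sketch and double-check the syntax against uBlacklist's docs):

    *://*.example-snippet-farm.com/*
    /^https?:\/\/([^/]+\.)?pinterest\.[^/]+\//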

0 - https://github.com/iorate/uBlacklist


> hide sites by domain

This gives me the idea to build a search engine that only contains content from domains that have been vouched for. Basically, you'd have an upvote/downvote system for the domains, perhaps with some walls to make sure only trusted users can upvote/downvote. It seems like in practice, many people do this anyway. This could be the best of both worlds between directories and search engines.


I don't think this would change a lot; you would probably just raise big sites (Pinterest, Facebook) a lot higher in the rankings, as the 99% of users who aren't programmers would vouch for them.

You could counter that somewhat by having a "people who liked X also like Y" mechanism, but that quickly brings you back to search bubbles.

In that sense Google probably should/could do a better job by profiling you: if you never click through to a page, lower it in the rankings. Same with preferences: if I am mainly using a specific programming language and search for "how to do X", they could give me results only for that language.

In the end that will probably make my search results worse, as I am not only using one language ... and sometimes I actually click on Pinterest :-(


You don't need an upvote/downvote. If someone searches for X and clicks on results, you just record when they stop trying sites or new search terms, since you can assume the query has been answered. Reward that site. Most of them are already doing this in some form.
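To make that concrete, here's a toy sketch of the "satisfied last click" idea in Python. The session format, dwell threshold, and domains are all made up for illustration; no real engine is anywhere near this simple.

    # Reward the result a user clicked last before they stopped reformulating
    # the query, treating a long final dwell time as "the query was answered".
    from collections import defaultdict

    def satisfied_click_scores(sessions, min_dwell_seconds=120):
        """sessions: list of search sessions, each a list of (domain, dwell_seconds)."""
        scores = defaultdict(int)
        for clicks in sessions:
            if not clicks:
                continue
            last_domain, dwell = clicks[-1]
            if dwell >= min_dwell_seconds:
                scores[last_domain] += 1
        return scores

    sessions = [
        [("snippet-farm.example", 5), ("stackoverflow.com", 300)],
        [("stackoverflow.com", 240)],
        [("snippet-farm.example", 8), ("snippet-farm.example", 4)],
    ]
    print(dict(satisfied_click_scores(sessions)))  # {'stackoverflow.com': 2}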


This is what Google already does, does it not? Why else would they be recording outbound clicks?

Unfortunately, this doesn't entirely solve the problem. Counting clicks doesn't work because you don't know if the result was truly worthwhile or if the user was just duped by clickbait.

As you say, clicking when they stop trying sites is better, but I don't know how good that signal to noise ratio is. A low-quality site might answer my question, sort of, perhaps in the minimal way that gets me to stop searching. But perhaps it wouldn't answer it nearly as well as a high-quality site that would really illuminate and explain things and lead to more useful information for me to explore. Both those scenarios would look identical to the click-tracking code of the search engine.


If I click on link 1 then click on link 2 several minutes later, 1 probably sucked. The difficulty is if I click on 1 and then 2 quickly, it just means I’m opening a bunch of tabs proactively.


Often you don’t know if a site is legit or not without first visiting it.

And new clone sites launch all the time, so I’m always clicking on search results to new clone sites that I’ve never seen before so can’t avoid them in results.


Yeah, when I get caught by these SEO spam sites it's because the SO thread they ripped off isn't ranked nearby, so it isn't immediately apparent.


> This gives me the idea to build a search engine that only contains content from domains that have been vouched for.

Just giving us personal blocklists would help a lot.

Then if search engines realize most people block certain websites they could also let it affect ranking.


You can access one without paying a dime.

http://teclis.com

Problem is people usually want one general search engine, not a collection of niche ones.


It’s not a directory. Hand-crafted descriptions, instead of random citations and/or marketing from the site itself, are what make a directory. This one is a search engine. Maybe it’s a good one for its purpose, but who knows; without an ability to navigate you can’t tell.

> Problem is people usually want one general search engine, not a collection of niche ones.

In my opinion, the reason they want a general search engine is that they're thinking inside that box (search -> general search). What they really want is a way to discover things and get quick summaries about them: “$section: $what $what_it_does $see_also”. Search engines abuse this need and suggest deceitfully summarized ads instead.


The trouble is, how do you prevent Sybil attacks? The spammers might vote for their own sites

https://en.wikipedia.org/wiki/Sybil_attack


It would be better if my votes only affected my own search results.


There's an interesting dilemma here - if the algorithm were to favor "pages with lots of content alongside the exact-match phrase you're looking for" then it would incentivize content farms that combine lots of StackOverflow answers together. And if you favor the opposite, where there's less content alongside the exact-match phrase, you incentivize content farms that strip away content from StackOverflow before displaying it. Ideally, of course, you'd rely on site-level reputation - is Google just having trouble recognizing that these are using stolen content?


> is Google just having trouble recognizing that these are using stolen content?

It's very possible. In general Google will penalize you for duplicate content, but that might not apply to code snippets since code is often re-used within projects or between projects.

The code snippet sites also typically have 2 or more snippets on the same page. When combined, it might then look unique to Google since their algorithm probably can't understand code as well as it understands natural text. Just a guess


Also, while I imagine Google probably has (or at least easily could have) code-specific heuristics at play, it seems like it may be harder to reliably apply duplicate content penalties to source code listings, especially short code snippets.

Between the constrained syntax, literal keywords, standard APIs and common coding conventions it seems like even independently-authored source code listings will be superficially similar. Certainly the basic/naive markov-chain-style logic that works really well for detecting even small examples of plagiarism in natural language content isn't going to be as effective on `public static void main(String[] args)` or whatever.

Obviously there are strategies that could distinguish superficial/boilerplate stuff from truly duplicated code, but between the volume of actual (and often legitimate) duplication on the internet and the (maybe?) low ROI for duplicate content penalties for code relative to other value metrics/signals, maybe this just isn't terribly important as a quality factor?
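To put a toy number on that (made-up snippets, naive tokenization), here's roughly what shingle-overlap scoring sees when two independently written listings share nothing but Java boilerplate:

    # Two snippets that only share boilerplate already look quite similar
    # to n-gram ("shingle") overlap scoring.
    import re

    def shingles(text, n=4):
        tokens = re.findall(r"\w+|\S", text)
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    snippet_a = "public static void main(String[] args) { System.out.println(greet()); }"
    snippet_b = "public static void main(String[] args) { System.out.println(total(xs)); }"

    print(round(jaccard(shingles(snippet_a), shingles(snippet_b)), 2))
    # ~0.58: well over half the shingles overlap, purely from shared boilerplate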


Wikipedia is a good example. Google considers it a high-value website (less so than before, but still), while it only has a single article about each topic. Other projects have entire websites dedicated to that single topic, with products, experts on staff, live support. As an example, I presented to some Google engineers the Wikipedia page on suicide, which explains popular ways of doing it, vs. dedicated prevention projects. Today (for me) the topic ranks: 1-3) news articles, 4) World Health Organization statistics, 5) Wikipedia, 6) suicide prevention.

ABC News, the NYT, WP and the WHO are considered high profile, but the topic is not their area of expertise. None of them would consider themselves the go-to place for people searching for it.


> as it seemingly favors code snippets, exact-match phrases

If only... it seems like the search results have gotten far worse for exact-matching. I regularly search for part numbers and markings, somewhere where exact matches are pretty much the only thing I want, and I can clearly see the decline over the years as it starts including more and more results that don't even have the string I'm looking for.


Funny, it was the "lots of internal linking" bit that felt wrong to me. Not that these low-quality sites don't do that, but I'm surprised to hear that the new algorithm rewards internal links. I'm certainly not a full-time SEO guy, but I happen to have or work with a few sites - some fairly established - that make extensive use of internal links for design/editorial reasons. As far as I can tell they are helpful for (a) user navigation and (b) getting search engines to discover/index the pages but I don't think I've seen any notable advantage in or even impact on ranking based on those internal links (whether in the body copy or in head/foot/side "navigation" areas).

Searching just now I do see some other sources making a similar claim, so maybe I'm just out of the loop. But in my cursory scan I haven't found much detail or real evidence beyond "google says they matter" either. I mean, that's not why those internal links were created in the first place, but it sure would be nice to get a ranking boost out of them too. I wonder what I'm doing wrong :)


...which even applies if you enclose the specific term with quotes. These used to help against unrelated results, but not so much anymore. I don't know why. Same thing with DDG and Bing.


> I'll admit that this is likely a difficult problem because of how programmers use Google

It's beyond simple for Google to fix. Just drop those sites from the search index. But Google won't do that because it's in their interests to send you to those shit holes because they're littered with Google ads.


Recently I've just punted and begun searching SO and Github directly.

One thing Google has gotten really good at lately is knowing that when I search for "John Smith" I mean John Smith the baseball player not the video game character or vice-versa.


I've always just searched 'John Smith baseball' and that works well in DDG too.


you can also add 'site:github.com' or 'site:stackoverflow.com' to your search


It's not as good, especially for Github.

GH also breaks down the results into types which is very helpful when you only want code or are looking for documentation or discussion.


>it's hard to convince programmers to pay for anything

Offtopic but I'm curious why this is the case? Is the Free Software movement responsible for this mindset?


It's been a hard habit for me to break, but when you know that you could do something and theoretically do it better, it can bias you against paying for something. None of that is meant to sound arrogant - every company could do a better job given more time and resources, just as I could. But my time isn't infinite and I've found that paying for solutions in my life is a good thing. Sure, I could run my own email, but I'll just pay for it. Sure, I could create an app that does exactly what I want, but this one is close enough and it's $10.

With knowledge, the problem can be worse. You can't even evaluate whether it's any good until it's been given to you. At that point, you don't need to pay for it because you already have it. A surprising number of the paid knowledge products I see don't really have good information, and avoid all the real problems they promised they were going to solve for you.

I think sites can build trust with users, but it can mean creating a lot of free content in the hopes of converting some people to paid users. Of course, if that model is too successful, then there will be an army of people dumping free content hoping to do the same, but then your paid content is competing with the free content available from many trying to do the same business model. If Bob creates a business doing this, do Pam, Sam, Bill, James, and Jess try to do that a few months later which then means that the amount of free content is now 5x what it was and there's no need to pay for it because it'll be a freebie on one of those sites?


I train programmers, and strongly recommend they buy books or do other types of money/time investment to make themselves better (and more highly-paid programmers).

They won't do it.

I've had multiple programmers literally shocked and avoid my outstretched book. Once I got a question, said "I just read the answer to this in this book right here"... and the programmer refused to read the book to answer his question.

I don't get it.

This, coupled with companies' lack of investment in their expensive engineers, is mystifying.

None of the above has anything to do with the FSF.


Maybe books are too low density. Like the “quantity of information” per word is lower than in a blog post, for example.

I dunno, I’m not really a reading type but I do own programming-related books. It’s the only type of book I own. I learned a lot from books like The Clean Coder, The Pragmatic Programmer and some 90’s book about Design Patterns with c++ examples and I don’t even write c++


If anything, a good book (or video course) will be a high-density, concentrated pill of everything you need to know about the subject to know what everything is and how it interacts. By comparison, reading blog posts and random YouTube videos is more akin to "grazing" - sure you can learn, but not as fast and you'll be missing context until the lightbulb goes off.


Good programming books are some of the highest information density of any writing you can find. College professors tend to prefer the lower density versions which might be biasing people.


Is that true? Or even your actual experience?

The recommendations I got from lecturers at university were thick academic textbooks. The 'good programming books' popular on HN etc. - without taking anything away from them - tend to be more waffley blog-turned-book type things.

Aside from classics from the former category that are also popular in the latter. Sipser, the dragon book, Spivak, etc.


This comes to personal preferences, I generally prefer reference books vs educational books. Find a good fit between what you’re working on and what the book is trying to do and they can be extremely helpful

Best example that comes to mind is the first edition of SQL in a Nutshell. It was a thin reference that covered SQL and the major differences between different SQL databases, which I picked up back in 2002-ish. Quite useful if you're working with several at the same time, but not something to teach you SQL. I haven't looked at a recent edition because I am not in that situation any more, but I presume it's still useful for that purpose.

Granted most books are significantly more verbose, but the physical limitations of books can promote extremely precise writing with minimal extraneous information.


It's my experience that this is often the case. I don't like high-theory books, so that means that things like the dragon book aren't likely to be in my sample.


Some of the highest, but nothing like quantum mechanics, where my head explodes at page 2.


Do you have some book recommendations for web development?


The content I want already exists. It’s provided for no charge on stackoverflow, reddit, Wikipedia, cppreference, and a handful of other high quality sources. All of which do not charge users a fee, most of which obtain the content from users in the first place.

So as far as I see it, the problem is not that the content is uneconomical to produce. The problem is that searching doesn’t discover the content I want. It brings up geeksforgeeks or worse.


Precisely. For me, the value in paying for a course or a book is not that they produce this knowledge, but that they collate it, filter through subject matter experience they already have to remove misleading additions, and add in any missing parts that may take a long time to grasp without a "Eureka!"-inducing explanation.


The service worth paying for is turning the collection of facts into usable information.


Exactly. Filtering information from misinformation is also a good selling point.


I think a lot of it has to do with the sheer friction of paying. In a corporate context it's going to be a days-long battle, at best, to get a new piece of software approved, just from an expense POV; we aren't even talking about the IT and infosec teams. If it's technical content on a website, sure, maybe I can pay out of pocket, but it's actually a lot less friction to just move on to the next site than to open up my wallet, get charged, and maybe get what I am looking for that justifies the price.


Most developers would rather spend 2 days building something than pay $60 for something better.


They're not wrong. When I build something, it's mine and I control it. I get to learn about all sorts of interesting stuff, making me a better programmer in the process. If I publish it on GitHub, it becomes résumé material, helps other people and might even start a community if enough people start contributing. I get to contribute to the state of the art in free computing, advancing it not just for myself but also for everyone else.

If I pay for something else, I don't get source code, I get borderline malware that spies on me constantly, I'm subjected to idiotic non-negotiable terms and conditions and I'm at the mercy of some gigantic corporation that can kill the product at any time.

We don't pay for "something better" because it's not actually better at all. We're not laymen, we actually understand the fact it's a shitty deal. We're not some mindless consumer companies can just dump these products on. We have the power to create. The more we use this power, the better off we are.


I suspect the reason is that not all developers live in the Bay Area; $60 is good money for them, and could be worth more than 2 days.

Also if you code for fun anyways, you might as well build what you need, and get chance to use that shiny new technology while you do it. You save money, have fun, improve your resume, share projects with your friends and communities for kudos, all at the same time.


If I can do it in a way that's better suited to my use case, learn something from it and/or entertain myself, that may not be as bad a tradeoff as it may otherwise seem.

Reinventing wheels out of plain curiosity has exposed me to a variety of problems and their solutions, and I believe it exercises my general problem solving skills.


A big part of programming is learning. You need to learn. It's not only about learning a programming language. A programming language is just a tool. What you need to learn is:

1. How to use that tool effectively

2. How to build better products with it.

You are never done learning those. And the best way to learn is to (at least try) to build it yourself. Therefore I think it makes sense for programmers to try to build it.


Teenagers are likely to do just that. Their money is expensive, while their time is cheap. If they get to learn in the meantime, even better.


The Free Software Movement is for free as in free speech and not free as in free beer.

It's about freedom to access, modify and distribute the source code.


It's hard to convince anyone to pay for anything.


I've seen programming newsgroups, those things from the 90's, with what can best be described as MITM attacks having taken place when coders have been looking for solutions to problems and the solutions have not been correct. Most newsgroups were never secure so vulnerable to MITM from day 1 and what is being reported today is just the latest variation in that attack process.

I've also seen Bing & Google citing StackOverflow, where the replies on SO awarding or agreeing on a solution come straight from this "textbook", "The Gentleman's Guide To Forum Spies": https://cryptome.org/2012/07/gent-forum-spies.htm

Perhaps it would be useful to dig into a posters history on a site and then decide who you trust instead of just trusting a random on the internet?

How many people have downloaded code from SO into VS and found it doesn't even do what it's purported to do? I've seen plenty of that.

Resource-burning the population, in this case programmers, is a perfectly valid technique for a variety of reasons, the main one being: you are stuck in front of a computer, and that means you can't get into mischief away from work. Religions have been using that technique for hundreds of years; colloquially it's known as "the devil makes work for idle hands to do", or something to that effect.

Choose carefully what you want to believe and trust.


> I've seen programming newsgroups, those things from the 90's, with what can best be described as MITM attacks having taken place when coders have been looking for solutions to problems and the solutions have not been correct. Most newsgroups were never secure so vulnerable to MITM from day 1 and what is being reported today is just the latest variation in that attack process.

Well, that's also the side effect of taking Cunningham's Law to heart, which says "the best way to get the right answer on the Internet is not to ask a question, it's to post the wrong answer."


> It's a difficult task as it's hard to convince programmers to pay for anything

I wonder if there's an opportunity for a paid bundle of programming-related sites. I indeed will not pay for a single site (nor for a news site for that matter), but a $10-20/month subscription that covers quality programming websites could be interesting.


> The programming content "industry" is highly fragmented (outside of SO) and difficult to monetize

I think part of the problem is that the content producers are trying to cater to everyone: the Google algo (by artificially inflating word count, using specific keywords), beginner programmers, advanced programmers, potentially paid users, ephemeral users that arrive on the site via a referral or via googling. In the end you end up catering to no one.

As a side note, RealPython.com is going to go down even more if they're going to keep their "register to view" policy I've started to see recently.


I guess that explains why I've been seeing so many results that are nothing more than scraped content lately :-(


I don't know what is going on with Google but I thought their main idea is to give points to sites which other sites link to. Are these code-sites linking to each other perhaps?


They’re called PBNs or “private blog networks”. A bunch of low quality sites all backlinking to each other.

I don’t like the SEO industry very much.


I pay for Interview Cake.

If you could create a site with both interview prep and quality code tips I'd pay for it.


Who is the offender?


Please consider releasing any paywalled content after some deadline, e.g., after a year. It makes content practically inaccessible to a lot of people, including non-adults and people in developing countries.


Somewhat tangential but I believe Google Search is going downhill, they seem to be losing the content junk spam SEO fight. Recently, I've had to append wiki/reddit/hn to queries I search for because everything near the top is shallow copied content marketing.


They seem to keep putting more and more confidence in their semantic language AI.

Quoted words in the search should return only pages that actually have that text on the page. That used to work. Now it often shows pages that don't contain it at all, not in the search results page itself, nor on the page when you go there. So you'd think it just ignored the quoted text and gave me results without it instead, right? Wrong. Changing the quoted text to unquoted text, or removing it entirely, both give different results. So it DOES process the quotes SOMEHOW. But it's not a clear RULE, because it's probably also just processed as an input to the semantics engine. Just tell me you don't have any pages with the text instead, please. Which it actually does do sometimes...

It's an unfixable mess. And I don't think this can be turned back. Building a competitor costs hundreds of billions. A competitor will likely end up taking the same approach anyway.

I just wanted a list of google's big search engine updates but even searching this I get SEO-d blogspam ABOUT THE SEO IMPACT OF THE UPDATES.


Google Search has been going downhill _fast_ over the past 5 years or so. Since it's probably incompetence rather than malice, what the heck is going on at Google Search? It seems to me the "let's use AI for everything" camp has taken over the entire place even though it's making Search worse than it ever was.

And please spare me the excuse that now Google can answer questions. It can't, it just answers with snippets extracted from websites it deems relevant, and often the answer is flat out wrong or irrelevant.

I just hope the ML/AI mania that has taken over these big tech companies proves to be a fad that just goes nowhere like it did in the 70s and we return to plain old algorithms and good software engineering.


>It can't, it just answers with snippets extracted from websites it deems relevant, and often the answer is flat out wrong or irrelevant.

I recently searched the name of a YouTuber I was watching, the top "People also ask" suggestion was "Is [name of youtuber] married?"

I didn't particularly care but clicked out of curiosity. It expanded to show a snippet from an old (1980s) NY Times article about a completely different person getting married long before this particular person was even born.

Google AI using other companies' content to provide wrong answers to questions I didn't even ask... That says it all.


Just like pretty much every other thing online, Google (search) is constantly in danger of being run over by people acting in bad faith to game the system.

Of course their actions have not been perfect but it is a mistake to say that their search would be better as a "plain old algorithm" no matter how well-engineered it is.

I'm certain that search results would be worse than they are now if the algorithm was just "grep but for the whole internet". Or that, in that case, the careful complexity necessary in each search to exclude all the SEO garbage would be unbearable.


Eh, I don't know. There was spam but search was useful.

Now there's still spam, and search is useless.

Also, I don't think people remember how much interesting stuff is on the internet. There used to be tons of results from small sites or blogs which are still there, but not listed on Google anymore. Modern Google has made the internet incredibly smaller. Everything is still out there, hidden from our searches. It's like their algo has been tuned to favour silos (spam or content silos, that is), to the detriment of the long tail of small, hobbyist websites with a high signal-to-noise ratio.


If I search "who gave away Anne Frank's hiding place?", Google confidently gives me the answer "Miep Gies".

I don't know why Google would even suggest this -- Miep was one of Anne's helpers. Imagine all the other people out there having their names unfairly smeared by Google's algorithm.


Interesting. Using that query on Google Denmark (google.com/?gl=dk) gives a more plausible answer, Van Maaren, but Google US (google.com/?gl=us) gives the wrong answer, Miep Gies.


That is interesting! Unfortunately if I try Google Netherlands (google.com/?gl=nl) I also get Miep, so it's not just a US thing.


Happy to report that Kagi Search (a newcomer and one of the three search engines that can answer questions like this) gave the right answer. Both Google and Bing got it wrong.


No way I'm going to pay $10/mo for search cleaning (they're just a search proxy); if it were reasonably priced at something like $1/mo I'd consider it.


For a counterpoint, in my world search is a close third after my development toolchain and my browser so I would find it reasonable to pay - given that I get better service from the paid service.


You have to ask the really hard questions. Like the ones every 5-year-old could answer without thinking. E.g. try: "What can you do with a car/book/gun?"

Be surprised the answer isn't drive/read/shoot^^.


Maybe some PM realized:

Worse search results -> more Google searches -> more ad views


I don't think it is a PM. Googlers are generally nice.

I think it is a second level machine learning system that has gone haywire and nobody knows how to fix it, or they lost the keys, see https://news.ycombinator.com/item?id=5397797 for a real life (or not : ) story.


Good point. Users only know about Google at this point. And won't even consider another search engine.


And if the spammy, garbage pages now appearing on results page #1 also use Google ads....


That backfired then (well, for me). I use DDG now because if my results are going to suck either way, I’d rather have privacy.


I've been using DDG for years, and its quality has recently gone down too. I was accustomed to using !g to search Google, but now I'm very surprised to find that the Google results are worse!

Google has completely lost the plot if DDG is the engine with the better results.


Yeah, it’s at the point where search engines are now closer to shortcut bars for me. It’s rare that I don’t have to append “wiki” or “Reddit” to my search in order to find a decent result (which I assumed existed before I even searched it).

I wonder what the consequences are for discoverability on the web. It can’t be good. Maybe I’m waxing nostalgic, but it seemed like there was a time I discovered new and cool web sites. Now all I discover is ad spam clogging up the tubes.


> I wonder what the consequences are for discoverability on the web. It can’t be good. Maybe I’m waxing nostalgic, but it seemed like there was a time I discovered new and cool web sites. Now all I discover is ad spam clogging up the tubes.

I feel the exact same way too. I can't even discover blogs with solutions any more (even ones I know exist because I've seen them before), it's all just spam.


>I just hope the ML/AI mania that has taken over these big tech companies proves to be a fad that just goes nowhere like it did in the 70s and we return to plain old algorithms and good software engineering.

It's now being used by law enforcement. Sponsored by an errant swat raid near you. Don't worry, they'll prosecute you anyway to cover their ass. Can't risk losing their pensions, ya know. /scared

https://www.policechiefmagazine.org/product-feature-artifici...


The plummet began around '08 or '09, to my recollection. It was as if, over a period of a few months, they just completely gave up trying to fight webspam shit. Simply surrendered.


It reminds me of the scene from the original Willy Wonka from that era, where they are trying to use a computer to predict where the next winning ticket will be.

https://www.youtube.com/watch?v=tMZ2j9yK_NY


I just want a search engine that will let me mark sites and links and let me search in that subset when I want so I can add curated sources to my little index.


“Engagement with the site is up!”

Translates to: “The users can’t find what they’re looking for, so they’re clicking around a lot.”


https://programmablesearchengine.google.com/about/

I have stack exchange, github, and all the canonical documentation sites for my projects in mine.


Or a good support for search results block lists, then users can deal with SEO spam same way they deal with ads. Pinterest will be sooo gone...


> they seem to be losing the content junk spam SEO fight

Google clearly isn't even trying.

So many of these sites are polluting search results for months. It isn't a case of sites that pop up for a few hours until Google notices and blacklists them.

Google Search has gone so far downhill. I'm not sure what they're optimising for. Long-term irrelevance, it seems.


Well I think you already nailed it with “not even trying”. You don’t need to try or optimize when you’re a monopoly. Competition forces that function, but a lack thereof enables mediocrity and complacency.


Oh come on! I am no fan of Google and their privacy policies etc., and I agree that results like the sites mentioned in this thread are annoying but I am not so cocky as to think I have any idea what Google is fighting.

There are a lot of very smart people from all over the world putting everything they have into getting to the top of that site. It's not a trivial task for them.


I disagree. Google has absolutely dropped the ball - I'm not saying that it's easy to algorithmically rank good/bad content, but they're not even trying.

A stop-gap solution would be to simply penalize anything with ads. Legitimate websites will still have enough "SEO juice" (for the lack of a better term) to offset the penalty, but brand new copycats with otherwise no inbound links to them from other legitimate websites (and no other business model beyond scammy ads) will never be able to rank high, essentially removing the incentive from setting up these sites in the first place.

Not to mention, these problems can be identified, prioritized and then tackled manually one-by-one. Stackoverflow copycats can be dealt with by downloading the SO data dump, parsing it and then severely penalizing any website where the bulk of the content matches the dump. Pinterest can be penalized by simply excluding it from image searches until they actually display the searched image without asking for login. So on and so forth.

You can't win them all, and you can't train an algorithm to win every time either, but you can manually observe what's happening and deal with the biggest problems.

The problem however is that the current status-quo is good for Google. They've got the monopoly on search (every other one is typically even worse when it comes to these issues), and spammy copycat websites happen to have Google ads or analytics so Google actually benefits from them too.


Well, quite. Why would a company that makes its money selling ads penalise sites that show its ads?

And when you get to that state, you've pretty much turned the entire web into content marketing for your ad sales business.

So - of course search quality suffers. Search quality is not the point.


How do you know Google is not already doing all those things? You said it yourself, you can't win them all and we're witnessing the result of that. If Google truly did nothing it would be an unimaginable chaos.


Because I’m not seeing the results. Let’s ignore SO copycats for now because that solution requires at least a slight amount of effort and look at the elephant in the room - Pinterest. Dealing with it is a simple “if image_search and domain == pinterest.com then skip” and they’re not even doing that.

They have a site that’s been polluting image search results for years, isn’t even trying to hide (unlike the SO copycats which could technically churn out an infinite amount of domains to work around bans) and they can’t even deal with that.


Consider it from a management/budgetary point of view. Super smart people? Oh hell yeah, and I don't fault them for this one bit. But bean counters? Why the hell invest in hiring those people, or budgeting for search quality improvements, when you have essentially no competition to speak of? Makes more sense to put those same smart people on projects that correlate people's activities outside of Google with activities they can observe either directly or indirectly to create a more holistic picture of the consumer for advertisers. Those are much harder problems to solve, especially now that the public is finally becoming slowly more aware of the massive abuses of privacy the bean counters like I described above have been pushing for years.

At the end of the day Google doesn't exist for the good of the public, they exist to return the highest possible investment for shareholders. It's literally the opposite of their job to improve search quality when competition doesn't force them to (because investors see that as a waste of money that could have been spent on further eroding people's privacy to enable higher and higher ad revenues.)


Fair point. But again, that would be less of an issue if there were like 3 or 4 big search engines with different algorithms, all changing and adapting on their own schedules. Would be harder to game the system then.


You're so right. It's sad that even if you remove 99% of something, to the outside it looks like you are removing 0% because you have no reference.


One of the problems with things like this is how do you know which site has copied from another, especially if you don't want a list of hard-coded exceptions.

Related is if you have a lot of content from git repositories that are mirrored from different locations (GitHub, GitLab, etc.), all of which are showing the same content. Or if different sites are hosting versions of public domain texts. You don't want to derank those results, even if they are similar to the "copy a popular site" websites.


> One of the problems with things like this is how do you know which site has copied from another, especially if you don't want a list of hard-coded exceptions.

The core function of that is actually pretty simple:

1. Strip all X/HTML tags

2. Run `diff`
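Something like this, say (hypothetical page contents, with Python's difflib standing in for `diff`):

    import re
    from difflib import SequenceMatcher

    def strip_tags(html):
        # Naive tag stripping; fine for an illustration, not for production.
        text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html, flags=re.S)
        text = re.sub(r"<[^>]+>", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    def similarity(page_a, page_b):
        return SequenceMatcher(None, strip_tags(page_a), strip_tags(page_b)).ratio()

    original = "<html><body><p>How do I split a string in Python?</p></body></html>"
    copycat = "<div class='ad'><p>How do I split a string in Python?</p></div>"
    print(round(similarity(original, copycat), 2))  # 1.0 for a straight copy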

Sure, it's not perfect, but an organization that pursues academic quantum computing research can sure as hell afford to run the results of the above against an AI to check for similarities.


The issue is not figuring out if site X has a copy of a page or text from site Y. The issue is how do you know which site is the source of the information. In other words, the technical bit is easy but determining what to do with that is hard. Especially if you don't want a lot of hard coded exceptions.

In the case of things like other sites copying from stack overflow, github, etc. you can figure out which is the source of the information.

Let's say you have decided to make GitHub the source of that information, and to derank any other sites that also have that information. As a result, you will derank GitLab for having git mirrors of projects, as the source will match that on GitHub. You will also derank sites like lkml, as that contains the commit message descriptions, and the patches will partially match content from the kernel source hosted on GitHub.

Let's also say you've decided to make Wikipedia and other Wikimedia sites the source of information. Congratulations, you've now deranked Project Gutenberg for hosting the same public domain texts as Wikimedia, along with other sites like the official sites for authors like Jack London. Plus any blog that includes portions of these to discuss or analyze them.

Do you want a blackbox AI making these decisions?


> The issue is how do you know which site is the source of the information. In other words, the technical bit is easy but determining what to do with that is hard.

I interviewed with a 3 or so person company that did just that I think.

That was over a decade ago.


Google knows instantly how much time you spend on a page and how quickly you return to the search results because it is a useless copy/paste page from SO.


And probably they track content age too? Since they can somehow look up by content (they can detect duplicated content), there's probably a date remembered too.

So they could know if a code snippet was already X years old? And from where, originally. (Unless it got edited a lot but that's not the case here?)


> Google Search has gone so far downhill. I'm not sure what they're optimising for. Long-term irrelevance, it seems.

I assume they're optimizing for the most common denominator: standard, non-power users. Early search didn't return great results for human-like questions such as "What are the wavelengths of the colors on the visible spectrum?" The result might be within the first two pages, but that query had too many irrelevant search terms. A better query would have been "wavelengths color visible spectrum". That query only has the necessary key terms. Sometimes queries required the user to know search operators (e.g. exact match, date range, synonyms) just to get relevant results.

The average person probably didn't know that early searches gave better results when constructed in the second way. Google changed search to adapt to how normal people search. Now the human-like query will return good results. Combine that with locales, search history, and personal interests, and even the most basic user can get worthwhile results from asking Google a question. The cost is that power users who understand operators and the power of key terms get less relevant, but likely still correct, results.


I'm skeptical that it's all that much better for any group of users. I search for normal things all of the time and it doesn't do a good job at returning relevant results for things like businesses, recent movies / TV shows, etc., and natural language syntax doesn't make the results less bad.

What it looks suspiciously like to me is a lack of an effective feedback loop for user frustration — if it takes a number of queries to get correct results or someone doesn't stop using Google search entirely, this would be easy to confuse with improved engagement and I'd especially believe that managers whose job it is to get a number to go up are not in a hurry to question whether that growth is meaningful.


If Google actually used its search history, it would see I was a sw eng power user, and return the types of results that users like me want to see.

Then again, I have been using Google Maps for 16 years and it still cannot show me distances in km rather than miles. A boolean setting ffs that I've had to manually switch probably a thousand times.

If billions of dollars of AI R&D can't even figure that out...


Funny how people complain about Google tracking them and then get angry when they are not being tracked enough.


Pretty much this. It’s likely down to the lack of incentives for leaders and line workers. Ie: search spam volume isn’t a measure by which search team performance is graded


Is it legally a good idea for Google to go and blacklist individual sites?


They killed off the Wikipedia clones that way. It's funny that they haven't come back even though the SO clones do the exact same thing.


Wikipedia clones probably got enough traffic to show up on someone's dashboard/report. SO clones are probably a small fraction of all traffic, even though they seem to show up in a large percentage of the searches we perform.


I guess Wikipedia kind of got their rampant deletionism under control while Stack Exchange did not?


Child porn is the extreme. www.chillingeffects.org is the other.


Exactly, change the word blacklist to censor, and then it seems wrong.


But that's not censorship. Google is not a government, and no one's speech is being suppressed by filtering out literal copies of something and leaving the original.


Censorship can be conducted by private or public entities.

https://www.aclu.org/other/what-censorship


Change the flag button on this post to censor and I bet people wouldn’t click it. But people want some level of censorship, so long as they convince themselves they’re not against free speech.


Blacklisting by a private entity is not censorship. It's even sillier when you consider that what's being removed are copies and the original content is still there.


Isn't that a bit narcissistic to think that your one problem in particular out of millions is a sign that they aren't doing anything at all?


I also see a lot of strange domains (.it, .so) and tangentially related domain names. Clicking them gives a really weird jumble of scraped content that seems AI-generated. My malware extension also went on alert. And this is in the top ten of the search results!


I see tons of "non-standard" TLDs now, even in articles posted here, and I'm always wary of visiting them. To be perfectly honest, I tend to completely ignore TLDs with more than 3 characters.


Yeah, in general I do too.


The other day, I searched for something and got a whole HN thread copied to some crap site but it had just enough veneer to seem legit. I figured it out when I searched for something from that thread and found the actual original HN thread that had way more context. Google has been good to me overall, but in the past few weeks I've really noticed the SEO spam, and I'm not sure what to do about it so that I don't find those dodgy sites. I think Google search is a marvel of modern software, but I'm a firm believer that their advertising business has warped their other products in a negative way, though I'm not the person to suggest a fix.


> found the actual original HN thread that had way more context.

So the nitwits just harvested too early, before the HN discussion was fully grown?


It seems like they harvested from a permalink to a comment (the link you get from clicking the "x minutes ago" time) rather than the entire comment section, and what I found later was the whole comment section here, which gave better context for my query.


Here's an alternative explanation: the quality of the content has gotten worse. Everyone and their mother is doing SEO now. They're trying to maximize clicks, not quality content.


Google will only get worse with time because the core idea of PageRank relies on honest linking. Google saw the web before link farms were a thing, which allowed it to compute authority from a clean graph. As time passes and linking becomes gamed, the web graph becomes messier and the reliance on PageRank diminishes. AI doesn't replace PageRank, but it can be pretty good at predicting the results you're likely to click on with enough behavioural data - which doesn't work for long tail queries really
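For anyone who hasn't seen it spelled out, here's a toy power-iteration PageRank over a made-up five-page graph; the three "spam" pages manufacture rank simply by linking to each other. This is obviously nothing like Google's actual system, just the textbook idea.

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {p: 1 / len(pages) for p in pages}
        for _ in range(iterations):
            new = {p: (1 - damping) / len(pages) for p in pages}
            for p, outs in links.items():
                if not outs:
                    continue  # dangling node; its rank just leaks in this toy version
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            rank = new
        return rank

    links = {
        "honest.example": [],
        "blog.example": ["honest.example"],
        "spam1.example": ["spam2.example", "spam3.example"],
        "spam2.example": ["spam1.example", "spam3.example"],
        "spam3.example": ["spam1.example", "spam2.example"],
    }
    for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
        print(page, round(score, 3))  # the spam cluster ends up on top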


Algolia should look into becoming the next search engine.

I want it to absolutely destroy Google/Microsoft.


> Google Search is going downhill

Discussion: https://news.ycombinator.com/item?id=29392702.


This can be somewhat alleviated by running searx and adding domains to a filter list. You can also merge results with DDG and other search engines.


I have the Algolia HN search bookmarked. For specific types of searches I go straight to that


ddg has the same issue though


Not only SO threads, I particularly hate the ones that mirror GitHub Issues. They don't even link back to the original thread, for Christ's sake!


And as I had mentioned in another thread, DDG so far ignores them, at least in my 1 case, but Google has those GitHub issues mirror sites as the first results. So they are clearly gaming Google SEO.


I'm getting lots of these in DDG too. Right now I'm doing searches about home renovation and I see lots of pages which are just a rip off of an online forum page (which I find elsewhere in my results, usually). Which, unlike SO, is not licensed for people to copy.


My experience has been finding more garbage on DDG, not less. I ended up getting too tired of those results being above the good ones that I switched back to Google.


Try this, the exact query I was referring to. I leave it to the reader to decide which is better. The top 2 Google results are scraped clones (or whatever they should be called) of the 1st result from DDG (GitHub issue #162 for docker-calibre-web).

The closest Google comes in the top 4 links is a link to the entire GitHub issues list for docker-calibre-web.

DDG: https://duckduckgo.com/?t=ffab&q=FakeUserAgentError+calibre-...

Google: https://www.google.com/search?hl=en&q=FakeUserAgentError%20c...

DDG Results:

1) https://github.com/linuxserver/docker-calibre-web/issues/162

2) https://www.nas-forum.com/forum/topic/67746-tuto-calibre-web...

3) https://pypi.org/project/fake-useragent/

4) https://github.com/linuxserver/docker-calibre-web/issues/

Google Results:

1) https://githubmemory.com/repo/linuxserver/docker-calibre-web...

2) https://issueexplorer.com/issue/linuxserver/docker-calibre-w...

3) https://github.com/linuxserver/docker-calibre-web/issues

4) https://github.com/janeczku/calibre-web/issues/1527


Interestingly, as of 02 December 2021 ~0300 US/Eastern, this thread is the third entry in your Google search. It doesn't appear at all on the first page of the DDG results.


Wow, I checked and that is the funniest thing ever and totally unexpected!!

FWIW, I see it as 4th item in Google search results, but that is not important. It's just weird that that will show up at all.

Again, I'll say, that's a win for DDG results, at least IMveryHO.


Wow total opposite experience for me. I find Google much better for casual searches, but DDG oceans better for anything technical.


I'd blame Google more than the website, since Google search results are generally getting worse and less relevant.


Totally. This is exactly what I was thinking about yesterday in the thread where someone was asking if Google results were getting worse.

There's no reason the official docs for Python should be lower in the results than a shitty docs clone / spam site when searching for a common package/function in the standard library.


"There's no reason the official docs for Python should be lower in the results than a shitty docs clone"

If Google still used/respected the original page rank algo, the official docs would never be ranked lower than a spammy clone site. I just wish they reverted back to the "power of the crowd" algo, let each node vote with their links and reputation.

Nowadays I almost stopped using Google completely. I first noticed it with torrent/streaming censorship ~5 years ago, then the political/ideological censorship started to feel unbearable. I just want my search engine to show me the most relevant results BASED ON THE QUERY I SUBMITTED. No moral judgement crap.

My search stack nowadays is a mix of Bing, DDG, Yandex and, if everything else has failed, Google. It's a sad reality.


Sample of 1 as well but I do see them on DDG, not nearly as much as Google though.


Just found out about github clones today. It threw me off.

At least for SO you can get back to SO to search, but the GitHub issues search engine is something else.


uBlacklist is a great tool to block sites you don't want from showing up on your search results on google (and a couple others including bing and duckduckgo). It supports regular expressions as well, I use /pinterest\..*/ to block all pinterest-related content.

https://addons.mozilla.org/en-US/firefox/addon/ublacklist/

https://chrome.google.com/webstore/detail/ublacklist/pncfbmi...


That's a great tool, thanks for the recommendation. Pity it doesn't work with Ecosia's image search. I wrote my own extension to remove Pinterest from image results, and I've been thinking of extending it to block Instagram and Facebook too. I hate how these companies point their sewer pipe at search engines, so you get into a walled garden where you have to sign up to see the indexed content.


I wish Google would add that feature natively. Google could even use it to help detect spam sites.


Google used to make it possible to filter domains from results. I was sad to see it disappear


You can still do "-site:spam.com".


For sure! Thanks for the reminder. There was a point (as you may know) when the domains were saved into your profile, so you didn't have to add it.


I didn’t know.


When they removed the domain blacklist feature, search went downhill fast.

https://searchengineland.com/google-brings-back-blocking-sit...


This works, but bad actors like pinterest have found a way around that by having a domain on every TLD and ccTLD out there.


I found out by experiment that wildcards work, e.g. -site:pinterest.*


They used to have personal blocklist [1] and blocking garbage domains actually had an impact on search results.

1. https://googleblog.blogspot.com/2011/02/new-chrome-extension...


There's also Personal Blocklist, which appends a small "block site.etc" under every google search result.

If there is a site that is clogging up your search results, just click it and it's gone.

https://addons.mozilla.org/en-US/firefox/addon/personal-bloc...


To me, this comment made it sound like this "block site.etc" was a feature that Personal Blocklist has and uBlacklist doesn't, but if I open the ublacklist page the first screenshot of the extension shows "Block this site" next to each search result. Is the "block site.etc" in Personal Blocklist different from the one in uBlacklist?


I haven't used uBlacklist, this was a solution I had independently stumbled into and I wanted to share it.

Based on the dialogue from the posters I was responding to, it seemed like uBlacklist would require some ongoing maintenance or memorization of input fields to work.

If it also offers a single-click permanent block of all sites in a domain from a Google search then that's just as cool, but that feature wasn't obvious from the conversation.


It'd be pretty neat if DDG let you configure this as a setting.

They don't have "accounts" as such, but they do provide users with a one-time password that lets you save and restore settings.

Alternatively, if they at least supported "do not include" filters in regular searches, you could build a client that stored those kinds of search defaults locally. The best you can do now is strip out those results on the client side, which isn't bad but seems like a hack.
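That client-side stripping amounts to something like this (the result list and blocklist are made up for illustration):

    from urllib.parse import urlparse

    BLOCKED = {"pinterest.com", "example-snippet-farm.com"}

    def allowed(result_url):
        host = urlparse(result_url).hostname or ""
        return not any(host == d or host.endswith("." + d) for d in BLOCKED)

    results = [
        "https://stackoverflow.com/questions/123",
        "https://www.pinterest.com/pin/456",
        "https://sub.example-snippet-farm.com/how-to-foo",
    ]
    print([r for r in results if allowed(r)])
    # ['https://stackoverflow.com/questions/123']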


We built this into Kagi Search, natively, I invite you to try it out. You can both “mute” and “prefer” domains in your searches. Currently in beta!


They can't do that since they are getting their results from Bing.


Do you have any subscription lists you would recommend? [1] is the only related information I could find. Would be great to expand this.

[1] https://news.ycombinator.com/item?id=27994059


If you are using Safari, the DevonAgent app[1] that was linked to in another thread also has a denylist feature

[1]:https://www.devontechnologies.com/apps/devonagent


Great thanks, it even supports iOS. Will try it!

https://github.com/iorate/uBlacklist


Sites that auto-translate original SO threads and pretend it's their own original content are the worst. Google sometimes prefers to give me those results instead of the actual SO thread because I'm not in an English-speaking country. I have to waste time just to realize it's a stolen SO thread. And it's not even that useful, because some of them AUTO TRANSLATE THE CODE.


DDG allows you to set your own country or any country easily (without needing an account)


Lol. Google's absolute insistence that you must want local language results has been a pain in the ass for years.

It's impossible for them to understand. Clearly they have zero people working and living in countries other than their own. /s


IMHO the biggest offender:

    https://www.geeksforgeeks.org/
They are driving me insane with the modal login demand. I wonder if Google has downgraded the authoritative standing of Stack Overflow?


These definitely aren't what the OP is talking about. They focus on original content, but the quality is not great.

The other sites, every page is a mirror of an SO page, with worse CSS. Maybe you haven't encountered them yet!


geeksforgeeks actually has human-written articles. The quality varies (kind of like W3Schools).

With uBlock Origin I have not seen any modal logins.

For example this article on red black trees: https://www.geeksforgeeks.org/red-black-tree-set-1-introduct... does not seem that horrible with proper references to CLRS and everything

OP is talking about ugly looking automated Stack Overflow copies. No idea how those end up so high in rankings.


Do they not have original content? I'm always frustrated when I click too fast and land on geeksforgeeks.org, because their code examples are so low-quality and lack explanations.

I didn't notice SO being this bad.


This site has fooled me once or twice. I think there are other domains that point to the same data-set, because that's not the only domain serving up this exact content.


See this reddit answer on how to circumvent the login prompt: https://old.reddit.com/r/programming/comments/q0oqai/what_is...


You can use uBO's element zapper, but I recommend staying away from that site except for leetcode-type questions. The content quality is mixed at best.


I’ve just disabled JavaScript. The site works fine and the prompt never appears.


Just to chip in with a very minor annoyance, I hate how google puts up w3schools results above MDN for anything related to JS/HTML.


Hell, yesterday I searched for `python string split`.

Stupid w3schools was the first hit. The official library reference documentation on python.org (which was what I was looking for) didn't turn up until nearly the bottom of the 2nd page of search results!! Beyond frustrating.


Check out tomdn.com. I created it to make it easier to get to MDN. E.g., instead of googling array.map, just type tomdn.com?array.map. It will take you straight to MDN docs on Array.prototype.map. Other things work as well, like tomdn.com?object.keys, tomdn.com?css.color, tomdn.com?htmlel.button, etc.

If there is a pattern it doesn't recognize, you end up on the MDN search results page.

edit: More info. about how it works is available here: https://github.com/tayler/tomdn
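
For the curious, the gist is just a tiny prefix-to-path map with a fallback to MDN's search page. A simplified sketch only; the real file handles more patterns and edge cases:

    // Illustrative sketch of the idea, not the actual implementation.
    const PREFIXES = {
      'array':  'Web/JavaScript/Reference/Global_Objects/Array',
      'object': 'Web/JavaScript/Reference/Global_Objects/Object',
      'css':    'Web/CSS',
      'htmlel': 'Web/HTML/Element',
    };

    function toMdn(query) {                       // e.g. "array.map"
      const [prefix, ...rest] = query.split('.');
      const base = PREFIXES[prefix.toLowerCase()];
      return base
        ? 'https://developer.mozilla.org/en-US/docs/' + [base, ...rest].join('/')
        : 'https://developer.mozilla.org/en-US/search?q=' + encodeURIComponent(query);
    }

    // In the page itself: location.href = toMdn(location.search.slice(1));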


What value does this provide over simply Googling or DDG'ing "array.map mdn"? It's shorter and the first result for me is (unsurprisingly) MDN on both, and I don't have to run my searches through an unknown third party.

DDG also has !bangs for this specific purpose and !mdn does, as expected, lead you to MDN.


Valid questions. I just think it's faster. One action on tomdn.com instead of 1. DDG "array.map mdn", 2. Find MDN result (usually first), 3. Click result, 4. See docs.

Bangs are great, but it's still shorter to use tomdn.com:

- `!mdn array.map` leads me to https://developer.mozilla.org/en-US/search?q=array.map, which I still have to scan and click through.

- tomdn.com?array.map leads me to the exact page in the docs: https://developer.mozilla.org/en-US/docs/Web/Javascript/Refe...

This may or may not be valuable, but I like saving the extra steps for something I do many times a week. Basically this calculation (https://xkcd.com/1205/) works out for me + the fun of working on a new project.

edit: >I don't have to run my searches through an unknown third party

Very sympathetic to this concern. Feel free to host it yourself -- it's a single HTML file. https://github.com/tayler/tomdn


`! mdn array.map` takes you right to MDN


Totally had no idea. Thanks for the tip. I guess tomdn.com is for non-ddg users only.


I've seen several developers google "w3schools someDomEvent" because they are conditioned by those search results. Has it accidentally become a halfway useful site?


> Has it accidentally become a halfway useful site?

It has. It is nowhere near the quality of MDN, but also nowhere nearly as bad as it used to be.

W3Schools has less information than MDN. This is bad for the experienced, but a relief for people who are entering the trade.

As an example, consider an inexperienced web developer that wants to do something when a button is clicked. They google for it and find these two pages:

- https://www.w3schools.com/jsref/event_onclick.asp . Right there in the middle of the page is an example snippet that will get the job done. Scrolling down shows other ways to do it, all of them fully contained (i.e. you can copy-paste them into a page and they'll just work; see the sketch at the end of this comment).

- https://developer.mozilla.org/en-US/docs/Web/API/Element/cli... . There's a table of things I don't understand (Bubbles? Cancellable?), two lengthy discussions about weird edge cases in some browsers, and (finally!) a single not-so-simple example that is not even enclosed in <script> tags.

What website do you think they'll pick?
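
For reference, this is roughly the kind of fully self-contained snippet I mean (not the exact w3schools example, just the same idea). Paste it into an .html file and it works:

    <!doctype html>
    <button onclick="greet()">Click me</button>
    <script>
      // Runs when the button is clicked.
      function greet() {
        alert('Button was clicked');
      }
    </script>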


It's fine for simple things. I referenced it a lot when I was first learning JS and HTML because it was usually at the top of Google results. I think MDN is strictly better.


IMO it is a legitimately useful site, but strictly for beginners, and probably only for html/css. It teaches you the basics and has the sandbox to test things out. It's fine, just not what you want to see when you're looking for actual documentation.


I used it to learn CSS way back in the day and it still seems to be a good balance between too little and too much for the beginner, so I still send my newbies there.

Not surprised it shows higher than the actual spec. I think the HN community is just getting to the greybeard stage (I'm still cutting out grays), and this is our version of JUST READ THE MAN FILE


I do the same, but with "mdn" instead of "w3schools"


Are you positive the first term wasn't prefixed by a hyphen-minus?


I prefer W3Schools to MDN.


Really ticks me off that Google allows itself to be so easily gamed; it's your core business, for Christ's sake.


I used to work for a company that gamed Google like this, but with product search results (that were, themselves, CPC ads). The company got bought for scrap because Google got good at downranking their pages in search results and finally taking shopping seriously.

A lot has changed since then, but I doubt that Google has decided these kinds of content farms are suddenly OK. It should also give pause to people asking for "open ranking algorithms." Without some amount of secrecy, content farms will pop up for literally every keyword they can sell ads against, and they'll know how to rank more often rather than just blindly guessing.


It does give me pause, but do you really think there is no way to have both an open algorithm and good results? What if users assessed sites as they used a search engine, and that assessment was broadcast back to the engine?

I have a feeling there is a way to introduce randomness into a search engine that smooths out the game-theoretic arms race of gaming an algorithm. This goes against our natural urge to have a fully deterministic algorithm that always brings the most pertinent sites right to the top. However, because humans are trying to bring order to things for their own benefit, throwing in chaos could prevent people from even trying to game the algorithm.


Markets don't work when everyone knows what I am THINKING. I.e., if I know with 100% certainty you are going to buy something for $1, I just bid $1.0001. Same with Google search: if I know with 100% certainty how you rank pages, I can game them. Markets only work when all the data is available but not all the opinions.

Introducing random elements just makes it worse for everyone.


So why was Google more effective before? Is this a product of the financial struggle of our society, or do you believe it is an inevitable consequence of a dominant search engine?


It is a lot more effective for me. I usually find what I am looking for in the first few results, there are flights, calculators, wiki pages, etc. So I disagree it has gotten worse.

The task of search has gotten exponentially harder as well. There are billions more web pages, more spam, more sophisticated attacks, more gaming of the system, etc. They could have gone from 80% relevant results to 99% and you'd still perceive a 10% drop in relevancy. It's a hard problem.


Neural network weighting loops and adversarial inputs say yes, it is impossible.


No it isn't. Ads are their core business. Those crap sites almost all run ads from Google.


That's a shortsighted view because it would imply that the quality of search doesn't have any bearing on their success. I highly doubt any Googler would agree with that. Especially not the ones who have been working on search specifically.


> That's a shortsighted view because it would imply that the quality of search doesn't have any bearing on their success. I highly doubt any Googler would agree with that. Especially not the ones who have been working on search specifically.

Google is the go-to search engine today and the competition is virtually non-existent, so they don't have to compete anymore. Hence the quality of their search engine going downhill, the conflicts of interest, and whatnot. It's not shortsighted: it would take billions to compete with Google search, and even Bing isn't really trying anymore.

When I look at a Google search results page today I see loads of ads, then their shopping stuff, image search stuff, a list of questions (related or not to my search) with answers extracted from third-party web pages, and only after scrolling a full page do I see actual search results. Google search wasn't like that 15 years ago.

Googlers are working so that Google makes more money, and Google is an ad company first and foremost.


But that's pretty much how Google pushed out the then-popular search engines: with search quality and a minimalist UI. It would be irrational for them to stray too far from that quality focus.


Extrapolating from an employee's feelings to the behavior of a huge public company is a terrible way to model reality.


This whole topic is extrapolating from people's feelings to the behaviour of a huge public company. I've not seen a single piece of actual data anywhere so far.


That's not what I did. I was implying that those who have a stake in Google's success also have a stake in their search quality.


We have to assume that Google devs see these crappy spammy code websites too on a daily basis, right? What's up with the incredibly strong programmer culture over there, gone already after 20 years? You'd think someone would take these weak search results personally...


As someone who knows a handful of devs and hears their near-daily grievances about Google, it seems like the culture over there has died and it's just a bog-standard Big Corp(tm) now. From how they describe it, a lot of the people (but not all, of course) who deeply care about the ‘craft’ of development have retired or moved on to more interesting roles at newer and more exciting companies, and were replaced by people whose main driving force is climbing the corporate ladder at a FAANG. As a result, the perf cycles/promotions seem to be heavily based on high-impact projects/launches, leaving the “little things” like bad search results or other maintenance work forever on the back burner, because working on those things won't get you a promotion/more money.


Desktop web search is probably less the core business these days than making search work well on phones/mobile devices with specific applications, such as Maps, etc.


Why would they do that though? Why would they show you a low quality result if there was a better one available? I have pondered some of these sites as well. I think the answer is that they are actually helpful to some people.


Businesses are profit seeking entities run by human beings who tend to prefer conservation of energy. If a business goal can be met with X effort, X+1 effort is a waste and the resources can be directed elsewhere or kept as profit.


I don't think they track 'quality'; they just see people clicking and use that as a proxy for it. (Quoted because what does quality even mean?)


I seem to recall reading that they (used to?) track abandonment of the search results page as a successful answer to your question from the embedded page snippets or info boxes. So if you get a search result page that's complete garbage and just give up, they increase the ranking of those sites.


Search hasn't been Google's core business for decades; they're in the business of selling advertising space.


The answer is, if someone can make money by doing something shitty but not illegal, they will do it.

Almost everything on the web is some scheme to put ads in your face so someone can make some money.


Yeah, there's a whole mini scam network that copies programming how-to videos on YouTube and puts new thumbnails on them. So you watch the ad and then get 30 seconds into the video before realizing it's garbage, and they've made a few stupid pennies :(


My pet peeve is ApiDock, which has managed to SEO itself so high up the rankings when searching for anything connected to Ruby or Rails that it is actually quite difficult to get to the legitimate, official documentation.

What's worse is most of the results are outdated so you're looking at web-scraped API docs for Rails 3 or something.

Really frustrating.


And the site is so slooow for me.


It’s an easy way to make money. Scrape a popular site like Stack overflow or Wikipedia and add a bunch of advertisements.

One of the many ways that scum ruin the web.


Wonder if there's any way to get advertisers to veto scraped-content sites. It's not really in their interest to boycott eyeballs, but if advertisers were actively avoiding those sites, it'd dry up the incentive to make them.

This could backfire, but fining advertisers that show up on those sites might work. The difficulty would be all the claim verification and the process of determining what exactly counts as a "scraped site", and the backfire scenario would be yet another hurdle for non-established sites and more centralization of content. But if you targeted the advertisers rather than the sites themselves, the advertising networks would be incentivized to do that identification themselves.


Contacting the advertisers to tell them that being shown on those sites makes you want to avoid their products might work, although that would involve not blocking the ads.


I really hate these. Especially when I'm trying to figure something out and I'm struggling to find answers, I end up haplessly finding the exact same wrong answer on three different sites.


I've 100% noticed it in the past months. Have a new employee that i'm trying to train up a bit with python and googling has gotten more annoying. fucking codegrepper.com & shit. FUG EM . hilarious when you find the exact same snippet over and over. Even more hilarious those asswipes are probably making $$ (potentially lots) with all the stupid clicks n that.


Oh god, yeah. I wish SO got better at blocking scrapers; I'm pretty sure services like Cloudflare have anti-scraping protection. These sites need killing off ASAP!


An index of these sites would be helpful for mass-blacklisting them with the uBlacklist extension.

Anyone up for creating one so everyone can contribute to it?

The extension allows subscribing to blacklists via links, so a single txt file would be enough.
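
For example, a shared list could just be a plain text file of match patterns, one per line (if I remember uBlacklist's rule format correctly; the domains are just ones named in this thread):

    *://*.codegrepper.com/*
    *://*.pretagteam.com/*
    *://*.issueexplorer.com/*

Host it anywhere that serves plain text and point the subscription URL at it.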


I spoke to a VP at Google in London in 2006 and discussed using a combination of curation and entropy to flush out duplicates. He seemed pretty excited by the idea, but I don't think anything materialized. Which is another way of saying this is not new; in those days these sites were copying newsgroups too.


Well, as I wrote, I understand that they're trying to monetize it.

But why the sudden explosion? I feel like more of these sites are going live all the time. Many times they make up 80% of the first pages of search results, just repeating the SO threads listed before them. So it's really getting difficult.

Something must be done…


Maybe these sites share some code? If you've already captured 2% of searches with one site, why not capture 1.5 + 1.5 = 3% of searches by running two sites?


Maybe someone figured out The Google Algorithm (TM/all hail the algorithm...) is allowing source code on dodgy sites, so they've created, and marketed to spammers, a script/bot that creates these kinds of sites (or all the sites just belong to them) to farm that ad money.

A bit like how, when "hoverboards" got popular, Chinese factories started churning them out. And then when fidget spinners happened, same thing: https://www.buzzfeednews.com/article/josephbernstein/how-to-...


Everyone should install an ad blocker so these types of "businesses" become unprofitable. Installing an ad blocker is the ethical choice.


Search spam sites can be reported to google: https://developers.google.com/search/docs/advanced/guideline...


Quoting: "While Google does not use these reports to take direct action against violations"

Given, e.g., Pinterest in the Google results, I find it difficult to imagine a surer way to waste your time than reporting SEO spam sites to Google. It is obvious they do not care the slightest bit.


The biggest problem is that they waste your time even when the content is ostensibly helpful, since the search result is usually listed after the Stack Overflow page it was crawled from anyway. Each click steals a few seconds of developers' time, which adds up, given how frequently these results pop up on Google search lately. That makes them worse than useless: they actively subtract value.


Not related to coding, but I've noticed a lot of "best of" and "top ten" sites that appear to be of the same ilk, possibly automated, that just combine pictures with paragraphs ranging from ad copy to pure drivel, on topics ranging from bicycles to Linux distros.


Important tidbit: SO's content is CC-licensed and this is probably completely legal (apart from those who fail to add a link to the original). Not that I don't want those sites to burn in hell, but they are not even in a grey zone legally.


Here are some of the sites that have gotten traction in my SERPs lately and that I can't stand:

https://pretagteam.com/

https://www.codegrepper.com/

https://issueexplorer.com/

They are all scraping Stack Overflow / Github in some fashion


Pretagteam & codegrepper were actually the sites that made me explode & post this here…


As others have said: SO content is ripped-off (poorly) and mirrored. The page games Google's algorithm and shows up as a 'legit' result.

Probably more complex than simple keyword stuffing, which isn't supposed to work these days.


I don't understand how these sites rank so highly though, they can't get much organic traffic at all. Most people who click on them will bounce immediately.


Generally they show up as one of the few options if you look for very specific subqueries - that is when I tend to find them.


Scrape-and-paste is one of the easiest ways to make significant money if it takes off, and that's why these sites are made.

I think Google does well in general with coding or SO questions, but will show you these low-quality sites when the questions are new or very specific and difficult to answer. Maybe it's time to apply your head more.

*been on both sides


People are crawling content that is searched for frequently, then using SEO to rank higher in the results than the original content to make money from the ad revenue.

Code and recipes are two examples.

I'm also seeing politicians posting Tweets containing a link to their personal website, which has ads.


I've found that Bing does a better job at detecting spam like this. Not perfect, just better.


I’ve had to switch to !py on ddg because the official Python docs never make the first page. It’s really frustrating. :/


I wonder if there's a market for a software-engineering-specific search engine. Skip the shitty content farms, include code from open source projects, and potentially be smarter about finding package usages.


I've noticed that Google Alerts for my open source projects have been useless for years. Full of snippet sites as well as outright scam sites which take code from SO or my blog or just mixed up tech words and repost it.

Here's the Google Alert from yesterday (scammy URLs redacted):

Guestmount qcow2 - Casino en ligne fiable It uses libguestfs for access to the guest filesystem, and FUSE (the ``filesystem in userspace'') to make it appear as a mountable device.

Stdin 1 libguestfs error usr bin supermin exited with error status 1 - Aritco Since libguestfs 1. sudo apt-get install libguestfs-tools mkdir sysroot # Just a test file. Supermin and Docker-Toolbox #14. DIF/DIX increases the ...

Edit Qcow2 Image - A-ONE HEALTH BRIDGE The libguestfs is a C library and a collection of tools on this library to create, view, access and modify virtual machine disk images in Linux. img


Another thing I've noticed recently is that a lot of queries about computer graphics, especially tied to Unity's render pipelines, bring up what look like blog posts full of code snippets but the actual "article" seems nonsensical and impossible to follow. I suspect they are machine translated and they're really annoying.

edit: after doing a single Google search for "urp rendercontext" I found this: https://programmerall.com/article/71251053239/

Looking at it closely, there seems to be some common thread, and the images and code snippets do seem to follow a logical progression, but the text itself is a complete mess. I can tell it sometimes references things from the code snippets and hints at things I can see in the images, but it's certainly not informative.

Their site description says "Programmer All, we have been working hard to make a technical sharing website that all programmers love." I'm sorry, but I really don't.


I am with you on this. Lately I've noticed that I google for an issue, find a low-quality site with relevant results, and later discover that it's just a copy of the GitHub issues page from the original project. Why didn't the GitHub issue itself make it to the first page of Google, while this crappy knockoff, with no link back to the source material, beat it? So frustrating.


Just putting this out there... try Brave Search. The best answers from Stack Overflow etc. show up as snippets, and their results are getting better and better every day. I got sick of Google after they made the BERT update. Really happy I switched (except for Google Maps data; Google is still winning that game).


I made this chrome extension because of this exact issue.

https://chrome.google.com/webstore/detail/search-noise-filte...


I mean, while we're at it, can we get rid of blogspam? Try googling for instructions for installing cellulose insulation. It takes AGES to find a site that isn't just vague garbage content. It should be possible to detect and demote this stuff; it is so obvious.


I find that for instructions in how to do common mechanical tasks, searching on YouTube is effective. There are lots of carpenters, mechanics, electricians, etc. who want to tell us the best way to do something.


One thing that makes SO an easy target for this is that they let you download all their data, so you don't even need to crawl and scrape the content from the website. Just download a dump, put it in a database, slap an HTML template on top of it, splash a few ads, and boom.
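
To make it concrete, here is a rough sketch of that pipeline in Node, assuming the data dump's Posts.xml layout of one <row .../> per line (which is how the dumps I've seen are formatted). A real clone would obviously join in the answers and wrap everything in an ad-stuffed template:

    // sketch.js -- rough illustration only, not a real site generator.
    const fs = require('fs');
    const readline = require('readline');

    // Minimal entity unescaping; the dump stores the HTML bodies XML-escaped.
    const unescapeXml = s => s
      .replace(/&lt;/g, '<').replace(/&gt;/g, '>')
      .replace(/&quot;/g, '"').replace(/&#xA;/g, '\n').replace(/&amp;/g, '&');

    const attr = (line, name) => {
      const m = line.match(new RegExp(' ' + name + '="([^"]*)"'));
      return m ? unescapeXml(m[1]) : null;
    };

    async function main() {
      fs.mkdirSync('out', { recursive: true });
      const rl = readline.createInterface({ input: fs.createReadStream('Posts.xml') });
      for await (const line of rl) {
        if (attr(line, 'PostTypeId') !== '1') continue;   // questions only
        const id = attr(line, 'Id');
        const title = attr(line, 'Title');
        const body = attr(line, 'Body');                  // already HTML once unescaped
        fs.writeFileSync('out/' + id + '.html',
          '<!doctype html><title>' + title + '</title><h1>' + title + '</h1>' + body);
      }
    }

    main();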


The worst of all are those websites that only show the content to the search engine. When you click through to their pages, you only see random text that has nothing to do with the search result at all. There must be some really narcissistic programmers behind these.


At least the SO clones will still probably have content that you can make use of in some way; what's worse is when you search for an error code and you get back tons of pages that don't even have the exact code you searched for, which seems to be increasingly common recently.

It also used to be the case that you could dig into the second, third, ... sometimes even 20-30 pages in and hit the jackpot. Now, the results are even less relevant there, you soon get to "the end", and if you change your query slightly and try again a few times to search harder for what you want, you'll get hit with the unsolvable CAPTCHA hellban.


Same thing with various GitHub issues lookalikes.


There was an SO outage this last year, and I only found that out when I tried to go to SO and couldn't get to it. I checked page 2 of Google and found one of the mirrors that you're talking about. I grabbed the content from there and continued with work.

I think it's a matter of perspective. If you _know_ that you want a specific site, use google's 'site:' specifier. If you're looking and find something that is from SO, redo the search and get to the SO Q/A. As for me, I'm moderately grateful for the decentralized backups.


This is a rather contrived argument. In the rare instances that SO was down, I had to make do with looking up the original documentation for whatever language I was using.

But these scrape sites add nothing of value at all, not even as an involuntary backup/archive, since they are usually laden with ads.


The issue is actually pretty old. There was a time when Google introduced blacklisting of search results and the revenue of those sites dived. Sadly, Google later rolled back the blacklist.


Any idea why they rolled it back?


Because Google is not interested in serving the best possible search results but rather in serving those that will make them the most money.


It was effective and those affected were able to get the feature axed.


All user-contributed content on Stack Overflow is under a CC-BY-SA license. So what the sites are doing is allowed under the license, as long as they're providing attribution.

Is it annoying? Sure. But neither Stack Overflow nor the authors of the content can do anything about it since they gave away a license to do it.

One of the things you have to accept when you release something under an open-source or Creative Commons license is that other people can take it and use it in ways that you don't like.


They get fed into a web crawler and then into a giant hopper whence they become the backbone of that shiny "No Code" technology you've been hearing about.


I have this problem, and contrary to a lot of people I don't protect a lot of my PI from Google. It used to be Google was good at giving me stuff I wanted in ads, especially in gmail but they don't really anymore. You would think the more they knew about you they would be able to give you better results, so maybe a large scale test should be done - if Google knows your PI do you get better or worse results, or doesn't it matter at all.


I suspect they were always there, but google and ddg are getting gamed more now. The quality of results has dropped quite a bit in the past 4 or 5 years in this regard.


If I recall correctly, the Stack Overflow dataset is open source, or at least made available to download, so I assume all these sites just download that information regularly.


Does anyone know of a good list of these copy sites? I just came across this Firefox extension which makes it possible to filter sites from search results: https://addons.mozilla.org/en-US/firefox/addon/hohser/. Would be great with a community blocklist like those for pi.hole


For sure; I'm even using an extension mainly to hide this stuff.

They mirror not only SO but also its siblings (like Server Fault and Ask Ubuntu), and other sites like GitHub.

But the most annoying part is that search keeps showing these mirrored and machine-translated pages, which offer little to no benefit to me, and I've already been force-trained to spot them at a glance.

They even show up when I'm searching in other languages, argh.

edit: formatting


I use the Personal Blocklist extension.

It lets you remove entire sites from your search.

https://chrome.google.com/webstore/detail/personal-blocklist...

I've also removed a bunch of trash news sites and w3schools, and it's really sanitized my search.


- 1990 no big data, no data, google indexing a porn and no ads and black hack market, no open source code, no seo articles, no market, bbs only - $

- 2000 censorship, business, big data and ads ads - $$

- 2010 code learning projects, quora, reddit, iphone, spam indexing and seo ads ads ads ads - $$$

- 2020 ai indexing everywhere, no-code indexing and code is a porn of no-code now so ads and ads ads ads seo seo seo - $$$$$$

- 2030 profit $googleplex?


FYI, people have been posting about this on stackoverflow meta going back at least 8 years: https://meta.stackexchange.com/questions/200177/a-site-or-sc...


I wonder if Google as an organization cares enough to start a new spam crusade, but here's what happened the last time: https://googleblog.blogspot.com/2011/01/google-search-and-se...


Based on the current state of GMail I don’t think anyone at Google gives a fuck about spam.


Plugging my own FF/Chrome browser extension that lets you add domains you want to block and will simply prepend matching text links with an angry emoji and prompt you to confirm whether you want to visit the page or not:

https://github.com/fnune/nay


This reminds me of the Yahoo! Answers clones from 10+ years ago. To get traffic and cheat the search engine, they would index the Yahoo! Answers site for a specific niche category, create a garbage website from the questions and answers without crediting the source, and cram the site with ads everywhere to earn massive revenue.


I believe Google has hit a sweet spot (for them) where they can keep you browsing a specific topic for a long time while still showing you mildly interesting results. Since the results are consistently on topic, you are shown ads that interest you time and time again, which results in a lot of clicks and a lot of revenue.


I really hope that is not true but I guess if you are optimizing for a metric then that could make sense.


This is why I often search for solutions directly on Stack Overflow, not via Google. Or I add "site:stackoverflow.com" to my search. Generally SO has all I ever need... I find vendor forums to be a total wasteland for help (i.e. Power BI forums), so I don't need them as part of my Google results.


> Every programmer should be grateful for the opportunity to find good quality content quickly

Totally. There should be a better way to index SO.

you.com seems to try doing it that way. For most code issues, it's easier to navigate and decide what's worth reading from You than from google IMO.


Recently I've found duckduckgo to provide better coding results than Google, which really surprised me. I was only using duckduckgo at home for privacy, and Google at work because "best tool for the job", but I think that might not be the case anymore.


It's quite simple. SO has a huge easily indexable database of answers, and SEO scammers can make a quick buck by copying it all and making it seem like they have an answer for unanswered questions. Nothing to see here, blame your search engine.


I wonder if you are logged in to Google and allow your search history to be saved (assuming you're talking about Google search)? Because I don't have that kind of problem, and I know Google uses your search history to improve your personal results.


I’ve got into the habit of clicking the 3-dot icon next to the search entry (these sites are often number 1 in Google’s results) and reporting them as scrapers stealing content from SO.

Maybe if we all did this, Google might eventually take notice?


It will probably take some time, but sites such as roseindia and expertsexchange also clouded search results in a similar manner. They are now history because Google and others deranked them to the depths of hell.


I do a lot of this sort of search.

I have no idea what you are talking about (except for Apple's efforts at astroturfing, but that is not what you mean, I think).

I use DuckDuckGo; is that why this does not bother me?

I have not used Google for search for years.


Spammers. They mirror SO stuff on their own sites and put google ads on them


Google search is going downhill in my experience.

My question: What is the alternative?


I‘m using DDG most of the time but when I have to use Google hoping the results will be better, I try to specify words that need to be in there. However, the more niche your problems are, the less you will find.


DDG is a gimmick.

They claim to be running their own search engine; that never checked out for me.

More covered here:

https://news.ycombinator.com/item?id=4817576

Anyone seeing them crawling your site?

Is bing the best alternative? I've seen some pretty compelling evidence DDG copies / uses / has a deal with bing.


You don't need to search for "compelling evidence", it's not like they're hiding it from you.

https://help.duckduckgo.com/results/sources/


Reminds me a little of an oldie but a goodie: expertsexchange.com


Annoyed by that too. Weirdly enough, they pop up for certain queries and not for others. I have also seen it for GitHub issues.

But I switched to https://www.ecosia.org/ as my main search engine, and I like the results much more than on Google: nothing special, but somehow more reliable/predictable. And meanwhile you plant some trees :P


It's for all kinds of sites lately. The uBlacklist extension has solved it for me - one click and you can remove an entire domain from future results.


This is just because Google doesn't need real search anymore. They're the portal now. Market cap is what drives them, not some geeks' needs.


Why don't you name the URL? Share so we know to avoid. It is not like we are going to dox the guy or something.


It is just Google, with its amazing algorithms that rank established websites way higher than random spam websites.


They are doing that because it's technically not violating the license the way they do it.


I hate those websites which just proxy github or npm with a different stylesheet so much.


The content farms get ahead of the organic results in many other areas too. Search for programming questions isn't so bad; at least the garbage is easy to recognize. Queries about products and services probably have the worst results.


Yep, just try searching for “Best <consumer product>”: nothing but fake review sites stuffed with Amazon affiliate links.


Maybe code snippets are "enriched" in these sites?


They are not. All the context and conversations are stripped.


This is an obvious case of SEO spam. But there are tons and tons of other examples worth mentioning.

For example, many news sites have soft paywalls that can easily be circumvented with a few clicks. The reason they don't have an _actual_ paywall is likely so they still come up in search results. So essentially they spam search results and obscure the content for technically illiterate users instead of just paying for ads. They want to have their cake and eat it too.

Now this whole dynamic is super weird. We often talk about these issues as if Google was some kind of public service that should make useful and fair search suggestions. Sure they have the incentive to do so, but they have conflicting interests at the same time.


Baeldung is the worst offender for the Java ecosystem.


I actually saw a job posting by them on SO a while back for “part-time article writing”. My guess is that even if they legitimately hire people to do it, those people just end up stealing content to meet their writing demands.


stackoverflow.com doesn't have Google ads. Those copy sites do. What is Google's motive to fix that?


This is just the same shit as Quora and Pinterest, W3Schools and ApiDock.


site:stackoverflow.com <query>

That's what I do


Why not just search on stack overflow at that point?


Is it bad that I've actually found my answer on some of these sites? Haha. But yeah, they're pretty low quality in general.


Yes; if the content were original, or the different context increased discoverability, that'd be OK, but the duplicates of content posted elsewhere almost always just obscure the original, better post, with its better context and real users to follow up with.

I've googled things and gotten an answer from one of these types of sites too; I'm not saying it's something to beat yourself up about. But if you're playing devil's advocate and suggesting there's discoverability value added that outweighs the other value lost, I disagree.


This is more of an issue with google results than the content itself.

Google is a shit product and you get shit results when you use it.


I have many problems with google as a company, but as a product, google search is fantastic. To say it's shit is a huge, huge, huge over-reaction.


Just whitelist Stack Overflow in your head and avoid splogs / spamblogs?



