Filters to block and remove copycat websites from DuckDuckGo, Google and others (github.com/quenhus)
338 points by gleb_the_human on Feb 17, 2022 | 109 comments



I've recently been searching for some very specific keywords and bumping into a lot of sites that seem to be just reformatted copy-and-paste of Stack Overflow and various mailing lists, adding zero value and clogging up the top 100 search results.

Now that I see the HUGE number of copycat sites in the stackoverflow_copycats.txt file, I am beginning to understand what's going on.

Thanks!


It's crazy the community has had to come together to create rag-tag tools like this to patch up what is supposed to be a trillion dollar search engine that should be the pinnacle of human civilization. But then I remember it's the same sort of crazy whereby the richest country in the world has homeless people, child hunger, and decaying infrastructure :/


It's not a lack of resources, it's a lack of will. You can't force parents to use the food programs. Hell, around where I live you can't stop the parents from stealing from their kids to buy drugs. Send a coat home with a kid on Friday; Monday they show up with no coat.


Political will, yes, but if you think people aren't willing to receive help and that all the resources we need are already there, you are so misguided, my friend. We can't even fund student lunches in many states. The primary problem across the board is a lack of committed resources. Similarly, the child allowance tax credit was literally lifting people out of poverty, but we don't have the political will to renew it.

People accept help when you give it. This is why, quite embarrassingly, one third of GoFundMe campaigns are to fund medical expenses that wouldn't even have happened with a decent healthcare system. Don't come to me with this argument when literally millions of people are resorting to begging on GoFundMe.

If a child goes home with a coat and comes back without it, they probably have bills and a financial situation at home so dire that they sold the coat.

How is a family of four making $25k/yr supposed to deal with a $120k medical bill when they lack health insurance because this gig economy and the 3 shift jobs they work don't provide it?


There is no good reason a kid should go hungry in the US. Federal, state, school, and charity food programs make it possible for every kid to eat. But a parent who would steal his kid's coat to pay his medical bills (your scenario, not mine or likely anyone else's) is the kind of parent who doesn't care enough about their kid to sign up and participate in these programs. You should hang out on r/teachers some time and get a glimpse into what teachers are seeing these days.


It isn't a good argument for why you shouldn't give out resources. Even if parents are going to "steal" the funds, as you say, you are still giving them that much more overhead, and research shows the vast, vast majority of people aren't going to behave like you describe (educate yourself, consume some real leftist content)


Your counter argument would be more persuasive if you provided proof instead of what seems like just another opinion.


I think these are two completely separate issues, and patching them together is a lot more about spinning a narrative than addressing them.


Could you elaborate a little on the similarities you see between SEO spam and child hunger?


Both are unnecessary and confusing to see when set against the massive success of their creators.


In both scenarios you have a massive entity with near-infinite resources acting like it's so underfunded it can't take care of its own shit.


In capitalism, rich/powerful entities don't necessarily have incentives to make the world (or even their own direct neighbourhood) a better place, and most big entities act only on their incentives, not on morals and ethics.

They however have strong incentives to invest all available resources into their own survival, prolong their own life and/or earn more money.

Depending on how the management is in turn incentivised, they will typically also prefer short-term success over decisions that would make more sense long term. For example, in the short term a search engine in a monopoly position saves money by not spending a ton on quality, while in the long term it could bite them.


I never understood why Google isn't blocking these crap results; it's really making my experience of search bad for a lot of my searches.


They used to be superb at detecting duplicated content. They were also extremely good at detecting spam/ham. Nowadays it feels like they don't even care anymore, and whatever filters they have are either broken or untrained.


Bring back the panda, I say!

https://en.wikipedia.org/wiki/Google_Panda


It is an arms race.

I wonder how many players there are on the page generation side? The economics of it must be marginal, I guess.


Copycat sites also used to be extremely careful not to appear to be copycat sites, or to duplicate content on the same site. I am surely not alone in recalling the old mantra of not duplicating content.

Copycat sites don't seem to care anymore.

I don't believe there is an overwhelming number for Google et al to deal with, as it's often the same names topping search results, which such filters can remove through semi-manual user action.

Which leads to the conclusion: Google doesn't care about duplicate content any more.


Were they? I remember having to manually block a lot of those copycat Wikipedia/Stack Overflow sites myself back in 2011 or 2012, when the domain-blocklist option was available for users. When the feature was removed, it all came back.

Maybe the problem is just that there are more of those now.


Google removed that option without even trying to spin it as a pro-consumer change. The only problems I can think of that it brought to Google are clueless users complaining that they can no longer see microsoft.com in their results, and a negative impact on unethical advertisers.


I had a lot of problems with "duplicated content" from sites that published the same content as I did and outranked my site.


Do they do a good job at getting clickthroughs on Google ads on their site? :-/

Does the rate of ad-clicking on the results page increase if most of the "natural" results are crap? :-(


I've noticed a recent trend where the copy cat/adware sites are "up-ranked" relative to original content. This would be the expected behavior of a search engine optimizing for clicks and revenue.


Don't be evil


>Dont, be evil

For an Alphabet company, they sure don't know where to put the apostrophe


I think they shut down that app.


Part of the problem might have been that Stack Overflow has been busy shooting themselves in both feet for years.

For a while (maybe around 2012-2017 or something) it felt like it was almost the rule that if you found a really useful question on Stack Overflow, it would always be marked as low quality.

Eventually I guess those questions were pruned, and that might explain a bit of why the copycats rose.

They annoyed me too, though, as they often mixed together unrelated questions on the same page and got hits for very specific queries that were unrelated.


They should give the YouTube audio fingerprint team a shot at it.

But seriously, Google doesn't need to make anything besides bringing back the option to hide certain domains from the results forever. Even if they don't analyze what domains people are hiding, it would dramatically improve the usability.
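
In the meantime, a per-domain cosmetic filter gets you most of the way there. A rough sketch (an illustrative rule, not the repo's exact syntax; copycat-example.com is a made-up domain, and .g is a container class Google has used for organic results):

    ! hide any Google result block that links to the copycat domain
    google.*##.g:has(a[href*="copycat-example.com"])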


Google started adding other quantitative measures for ranking results. That's how some of these crap websites manage to rank so high.


They probably are. But the clone sites are designed specifically to avoid being blocked by google.


Looks great, and much better than my piecemeal efforts, although I recommend linking to a specific commit of all.txt so you aren't opening up your browser's uBlock Origin filter list to arbitrary remote control. Like:

https://raw.githubusercontent.com/quenhus/uBlock-Origin-dev-...
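
(For anyone unfamiliar with the pattern: a raw.githubusercontent.com URL naming a branch follows whatever the branch currently contains, while one naming a full commit hash is an immutable snapshot. The paths below are illustrative, not necessarily the repo's exact layout:)

    https://raw.githubusercontent.com/<user>/<repo>/main/dist/all.txt
    https://raw.githubusercontent.com/<user>/<repo>/<full-commit-sha>/dist/all.txt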


As the author of the filter, I strongly agree with you. However, I believe it would be too tedious for most people to update the filter "by hand". I think I'm going to add this important security information in the README.


>so you aren't opening up your browser's ublock origin filter list to arbitrary remote control.

What's the worst that could happen? It seems like uBlock already treats filter lists as semi-untrusted. There's not much a list can do other than block stuff.


Exactly... "Dynamic filtering: precedence" : https://github.com/gorhill/uBlock/wiki/Dynamic-filtering:-pr...


I know there's some serious chicken/egg issue here, and this solves a different problem than Google or DDG do, but I think what I need is a way to "Like" or "Favorite" a page I'm on (maybe just a bookmark?) that triggers my personal search indexer, which will then index that page.

Then maybe optionally "follow" other people's favorites as part of your own results. Imagine someone crawling Twitter or Stack Overflow: you just "follow" their index, it gets merged with your results, and you can disable/enable it for specific searches.

I've been thinking about it a lot lately, because I'm often trying to remember something that helped me or that I wanted to remember, but I can't find it in my FF history. And bookmarks feel too clunky.


In the end there is no perfect solution because the whole thing is gamed too much. Eventually spam sites will start paying popular users to like their spam page.


This attitude is too defeatist, and I'm not convinced. The OP's suggestion was that people would choose which other "favorite lists" or "indexes" they follow, which implies these are public. Who's going to follow someone who puts a StackOverflow clone filled with ads on their list? And if they do, and it shows in my results, I should be able to discover that someone I'm "following" has been compromised, and I'll unfollow them.


The same would have been said a million times about social media, and yet most large personalities are filled with paid sponsorships and no one really cares.


Are large personalities not “unique”, and followed for the uniqueness of their character?

Whereas the "moderator of a filter" is followed on the basis of their output, the filter itself.

I don’t disagree with your statement, but I don’t find it a compelling counterargument to the suggested solution.


Try a bookmark service like pinboard.in; they have a search engine where you can search by keyword ...


Google tried that with their plus button, but it went away along with Google Plus.


I really don't like these websites and I smash that back button as soon as I realize I've landed on one.

That said, I'm amazed they are still showing up at the top of Google search. My understanding was that that kind of behavior (which I think at least some other people do too), combined with the fact that they are just copying another, much higher-ranked website, would mean they are highly unlikely to rank above the relevant Stack Overflow article they are duping. So what is happening here?


Reminds me of going to Google Images and getting sent to Pinterest... which is not where the image is actually sourced from.


Pinterest is one of those sites that really makes me want to strangle someone. It’s just an abhorrent walled garden of other people’s property.


If I were on Pinterest looking up things, fine. But I'm on Google, not trying to find a mirror of what I want, I want what I want.

Edit: I have a friend who works there, but not as an engineer, haha. I'm pretty sure I've told him my woes with Pinterest. My wife loves Pinterest though; it allows her to come up with amazing design ideas and art ideas.


The main problem is really just that they show up for basically every search, a bunch of times, and when you click through, you can't get to the image without signing up/in.


Pinterest is one of those sites that I would block forever, if I could customise my search experience.


Add this to your browser's search engine list and make it default:

    {google:baseURL}search?q=%s+-site:pinterest.*
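
The {google:baseURL} placeholder is Chrome-specific. On browsers that take a plain URL template with %s (Firefox, for instance), something like this should behave the same way:

    https://www.google.com/search?q=%s+-site:pinterest.*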


If on Firefox, do you need to add the optional search bar (with the magnifying glass) in order to access "Change Search Settings"? Because I'm not finding it anywhere.


Doesn't help with reverse image searches, which account for probably 90% of the times I end up clicking on a Pinterest link.


That's a brilliant solution, thanks!


As much as I avoid giving google any of my money I would literally pay them to be able to block domains from results.


Why doesn't someone sue them then?


I am not sure how GOOG weighs what happens after you click on a result. It would be clever of them to notice how quickly you click on another result for the same search and slightly downgrade the first link (though what happens if you open the first three links in new tabs before you actually visit them, say?). My assumption was that they just count clicks as upvotes, so if these scammers can make it onto the first page of results, they will tend to stay.


This is not even the first or second time Google has rolled out changes that allowed SEO spam sites copying Stackoverflow or Wikipedia to rank higher than the original.

They did fix this at one point in time by figuring out which site posted the content first and penalizing the copycats, but it appears the fix is once again broken.


I figure they must just be monitoring the original content and republishing it before it's indexed by Google. The searches are so specific and niche that generally ranking isn't hard; it's beating the OG that's hard.

I just don’t know how they are managing to get indexed before the big name established sites. Perhaps they are succeeding on some small percentage and that is what we are seeing?

Perhaps they have an additional trick to make it look like they posted the content first, perhaps internal links or something.


I find that hard to believe; the SO questions are often years old, the GH ones months.


So how are they doing it?


For a long time now, Google has weighted behavioral signals similar to what you describe. "Bounce Rate" is the percentage of users who quickly leave your site after clicking. "Dwell Time" is the amount of time a user spends on a page.

There's even a cottage industry around gaming these signals. See SerpClix and the like.


So are you suggesting that GOOG is using bounce rate on the target site (assuming it uses gAnalytics, I take it; otherwise how would they know?) to alter the ranking of search results (like "90% of users bounce in the first two seconds on scam.copycat.com/seemed-useful, so we're demoting its rank by 20%")? That would be interesting. I was musing more about whether GOOG does any session tracking on their end to try to infer how happy/unhappy users are with a given set of results. Of course they could do both at the same time.


Stop using stock ticker names to refer to companies. It's cringe and not even easier to type.


I agree, but even more generally: avoid abbreviations. A reader has to spend time thinking about what you are talking about. It's distracting.


So how come pinterest is still on top for many searches?


Why not? Pinterest is a popular site, with lots of content all linked to each other. Many people probably spend a long time there after clicking a result.


The inordinate amount of time it takes to click away all of the dark-UI login screens just to see the content, before deciding it's not what you wanted, already pushes the dwell time longer than on other sites.


I've been using uBlacklist which adds a little block this site button to the google search results. Handy.

It is a browser extension and I haven't looked too deeply into it, so if that's important to you, perhaps have a browse over their repo etc. before installing.


If you scroll down on the OP, two of the source lists are actually uBlacklist block lists.


I use that too, great for blocking pinterest from flooding all of your google image search results.


I use Kagi as a search engine and can just block the site from the search results.


I started using Kagi recently, and so far haven't had to block a single site. Their filters are great!


Kagi filters are great for programming, but still evolving for others. I still see a lot of pinterest results. You can block domains by adding them to Kagi blocklist through Settings -> Personalized Results -> Blocked Domains.


There's also a boost/block button right on the search results page.


FYI whoever made this, you can create clickable links to import filters. For example: https://subscribe.adblockplus.org/?location=https://raw.gith...

Quick edit: I know the domain is ABP but ublock origin picks it up.
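
If you want to build such a link yourself, the general shape is (the list URL goes in percent-encoded; the title parameter is optional, if I remember right):

    https://subscribe.adblockplus.org/?location=<url-encoded list URL>&title=<list name>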


That's actually really cool, thanks! However, the link doesn't work from HN for me. For the link to work, users need to click from a trusted domain listed here https://github.com/gorhill/uBlock/blob/bba4732c6b47134c3f54e...


From 1.41.5b2 and above, it works on non-"trusted" sites: right-click the link and there will be an entry in the contextual menu to import the list.

For older versions of uBO, you can already use the old way:

    abp:subscribe?location=[...]


Thanks for your incredibly fast response and help. (Do you have a notification when your username is invoked in HN?) And of course, thanks for uBO!

In my case, the issue is that GitHub doesn't allow the abp|ubo protocol in links. However, no problem, I can use the method with "subscribe.adblockplus.org". Still, subscribing from the contextual menu is a great feature.


Hey, it’s the creator! Thanks for uBO.


Not unless you have both enabled... then ABP picks it up


Cool idea! I was surprised that Wikipedia mirrors aren't included, as I encounter them constantly and they drive me bonkers. I opened an issue: https://github.com/quenhus/uBlock-Origin-dev-filter/issues/2...


Yeah, it's really frustrating when I read a poorly sourced Wikipedia article and try to search for other sources on the claims in the article, but all I get is clones of the Wikipedia article.


I'm seeing a lot of Github clones on DDG lately. Some even stealing the GH favicon. I can't fathom what their goal is other than to inject exploits into devs who are high value targets. You'd think a responsible search engine would be culling these for the sake of general security.


I've been seeing that a ton on Google as well. It seems to be the GitHub topics.


Another useful extension like this is https://iorate.github.io/ublacklist/


Bless you. This took 30 seconds to put on my phone and laptop and has already improved my results so much.


Phone? Can you elaborate more on this? Are you on Android or iOS?


iOS. I use Safari and adguard. It supports custom block lists.


Which list or url are you using? I have adguard also. Thanks for the input. Did not know you could do that.


The one listed in the OP repo:

https://raw.githubusercontent.com/quenhus/uBlock-Origin-dev-...

Adguard picks up 588 rules.


I'm absolutely going to try this. The number of times I get blatant copies of Stack Overflow posts, only with the formatting stripped, is infuriating.


Missing from the title is:

> Specific to dev websites like StackOverflow or GitHub.

Before I noticed that, I had searched for pinterest and found nothing. Even marking the HN title with "dev" would be good.

If this were my list, I'd add w3schools because, to me, it's low quality, especially compared to Mozilla.


> I had searched for pinterest and found nothing

So it's working as intended and blocking low effort spam sites


Does anyone have an idea how to make this work in Brave without uBlock? I added the block list to the custom filters (brave://adblock/) but results for those spam sites are still shown in DuckDuckGo.


Two important questions I don't see answered in the readme:

1. What are the criteria for a GitHub copycat?

2. What is the process for having a website removed?

I ask #1 because many businesses use their own git-hosting solutions to host their code but also use github as a mirror. It would be very easy for a competitor to get the website of a rival business listed if the criteria is not strict and specific enough. I recommend never blocking the entire domain unless the website is a repeat offender (to avoid mistakenly harming innocent businesses and the liability that it may cause).

I ask #2 because many of these domains will likely be registered by honest people once the scammers are finished with them. There needs to be a way for the new owners to get their new domains delisted.


For #2, I assume a PR to remove the site could be sent and merged.

For #1, it seems to mostly be a list of SEO-gaming sites that I've personally found to be supremely irritating and that deserve to be on the list. They basically just mirror Stack Overflow and GitHub issues, and provide obscured links back to the original source to make sure people stay on their site. You can peruse the list in the code yourself; it's just a text file.


I don't like low effort clones either. My only concern is the susceptibility of this project to being weaponized by dishonest actors. Some of us have been on the receiving end of false-flag spamming campaigns by competitors and have been penalized unfairly for it. If these filter lists take off, and I hope they do, they must have a well-documented process for accepting patches and another for handling malicious actors.


Often enough I’ll see 5+ sites at the top of results for windows errors and they’re basically all trying to sell me golden hammer software to “fix it”.


How do I convert these filters into ABP-compatible syntax? The adblocker in Brave's browser only reads that, for some reason.


Hadn't thought about using uBlock Origin for this. I'll use the same technique to filter out Pinterest too then.


Making it this obvious how to set this up made me finally do it. No more junk results (well, less junk...). Thanks!


Anybody know of a way I could bulk import these into NextDNS?


NextDNS is for blocking the site when you access it. It won’t hide it from the search engine results.

You still need a browser extension adblocker (uBlock, AdGuard) that can modify the contents of any webpage for this to be effective.


Oh sweet, thank you. Been looking for something like this.


Because filters will fix the Broken Google problem?


???


Great, I will try that soon. These websites are infuriating


To be fair, they are quite nice when the official website is down or blocked...


Archival platforms not based on fraud or deception are far preferable.

Internet Archive or Archive Today.


... or DMCA'd.

Over the past year, I've noticed that quite a few repos that I used to track have disappeared. I keep a local bookmarks list now because if a "starred" project is removed or DMCA'd, Github does not tell you about it and they remove any mention of the repo from the "starred" list.


Google's cached version of pages can be another useful option: click the ellipsis to the right of a search result's address, then "Cached" in the bottom right-hand corner of the "About this result" box.


Came here to post a similar sentiment. I've rescued very useful content from mirroring sites that was gone from the original. You can filter them out if you want, but don't forget you're doing that or you may not find what you're after.



