Filters to block and remove copycat websites from DuckDuckGo, Google and others (github.com/quenhus)
338 points by gleb_the_human on Feb 17, 2022 | 109 comments



I've recently been searching for some very specific keywords and bumping into a lot of sites that seem to be just reformatted copy-and-paste of Stack Overflow and various mailing lists, adding zero value and clogging up the top 100 search results.

Now that I see the HUGE number of copycat sites in the stackoverflow_copycats.txt file, I am beginning to understand what's going on.

Thanks!


It's crazy the community has had to come together to create rag-tag tools like this to patch up what is supposed to be a trillion dollar search engine that should be the pinnacle of human civilization. But then I remember it's the same sort of crazy whereby the richest country in the world has homeless people, child hunger, and decaying infrastructure :/


It's not a lack of resources, it's a lack of will. You can't force parents to use the food programs. Hell, around where I live you can't stop the parents from stealing from their kids to buy drugs. Send a coat home with a kid on Friday; Monday they show up with no coat.


Political will, yes, but if you think people aren't willing to receive help and that all the resources we need are already there, you are so misguided, my friend. We can't even fund student lunches in many states. The primary problem across the board is a lack of committed resources. Similarly, the child allowance tax credit was literally lifting people out of poverty, but we don't have the political will to renew it.

People accept help when you give it. This is why, quite embarrassingly, one third of GoFundMe campaigns are to fund medical expenses that wouldn't even have happened with a decent healthcare system. Don't come to me with this argument when literally millions of people are resorting to begging on GoFundMe.

If a child goes home with a coat and comes back without it, they probably have bills and a financial situation at home so dire that they sold the coat.

How is a family of four making $25k/yr supposed to deal with a $120k medical bill when they lack health insurance because this gig economy and the 3 shift jobs they work don't provide it?


There is no good reason a kid should go hungry in the US. Federal, state, school, and charity food programs make it possible for every kid to eat. But a parent who would steal his kid's coat to pay his medical bills (your scenario, not mine or likely anyone else's) is the kind of parent who doesn't care enough about their kid to sign up and participate in these programs. You should hang out on r/teachers some time and get a glimpse into what teachers are seeing these days.


It isn't a good argument for why you shouldn't give out resources. Even if parents are going to "steal" the funds, as you say, you are still giving them that much more overhead, and research shows the vast, vast majority of people aren't going to behave like you describe (educate yourself, consume some real leftist content)


Your counter argument would be more persuasive if you provided proof instead of what seems like just another opinion.


I think these are two completely separate issues, and patching them together is a lot more about spinning a narrative than addressing them.


Could you elaborate a little on the similarities you see between SEO spam and child hunger?


Both are unnecessary and confusing to see when set against the massive success of their creators.


In both scenarios you have a massive entity with near-infinite resources acting like it's so underfunded it can't take care of its own shit.


In capitalism, rich/powerful entities don't necessarily have incentives to make the world (or even their own direct neighbourhood) a better place, and most big entities act only on their incentives, not on morals and ethics.

They however have strong incentives to invest all available resources into their own survival, prolong their own life and/or earn more money.

Depending on how the management is in turn incentivised, they will typically also prefer short-term success over decisions that would make more sense long term. For example, in the short term a search engine in a monopoly position saves money by not spending a ton on quality, while in the long term it could bite them.


I never understood why Google isn't blocking these crap results; it's really making my experience of search bad for a lot of my searches.


They used to be superb at detecting duplicated content. They were also extremely good at detecting spam/ham. Nowadays it feels like they don't even care anymore, and whatever filters they have are either broken or untrained.


Bring back the panda, I say!

https://en.wikipedia.org/wiki/Google_Panda


It is an arms race.

I wonder how many players there are on the page generation side? The economics of it must be marginal, I guess.


Copycat sites also used to be extremely careful not to appear to be copycat sites, or to duplicate content on the same site. I am surely not alone in recalling the old mantra of not duplicating content.

Copycat sites don't seem to care anymore.

I don't believe there is an overwhelming number for Google et al to deal with, as it's often the same names topping search results, which such filters can remove through semi-manual user action.

Which leads to the conclusion: Google doesn't care about duplicate content any more.


Were they? I remember having to manually block a lot of those copycat Wikipedia/Stack Overflow sites myself back in 2011 or 2012, when the domain-blocklist option was available for users. When the feature was removed, it all came back.

Maybe the problem is just that there are more of those now.


Google removed that option without even trying to spin it as a pro-consumer change. The only problems I can think of that it brought to Google are clueless users complaining that they can no longer see microsoft.com in their results, and a negative impact on unethical advertisers.


I had a lot of problems with "duplicated content" from sites that published the same content as I did and outranked my site.


Do they do a good job at getting clickthroughs on Google ads on their site? :-/

Does the rate of ad-clicking on the results page increase if most of the "natural" results are crap? :-(


I've noticed a recent trend where the copy cat/adware sites are "up-ranked" relative to original content. This would be the expected behavior of a search engine optimizing for clicks and revenue.


Don't be evil


>Dont, be evil

For an Alphabet company, they sure don't know where to put the apostrophe


I think they shut down that app.


Part of the problem might have been that Stack Overflow has been busy shooting themselves in both feet for years.

For a while (maybe around 2012-2017 or something) it felt like it was almost the rule that if you found a really useful question on Stack Overflow, it would always be marked as low quality.

Eventually I guess those questions were pruned, and that might explain a bit of why the copycats rose.

They annoyed me too, though, as they often mixed together unrelated questions on the same page and got hits for very specific queries that were unrelated.


They should give the YouTube audio fingerprint team a shot at it.

But seriously, Google doesn't need to make anything besides bringing back the option to hide certain domains from the results forever. Even if they don't analyze what domains people are hiding, it would dramatically improve the usability.
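
In the meantime, a per-domain cosmetic filter gets you most of the way there. A rough sketch (an illustrative rule, not the repo's exact syntax; copycat-example.com is a made-up domain, and .g is a container class Google has used for organic results):

    ! hide any Google result block that links to the copycat domain
    google.*##.g:has(a[href*="copycat-example.com"])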


Google started adding other quantitative measures for ranking results. That's how some of these crap websites manage to rank so high.


They probably are. But the clone sites are designed specifically to avoid being blocked by google.


Looks great, and much better than my piecemeal efforts, although I recommend linking to a specific commit of all.txt so you aren't opening up your browser's uBlock Origin filter list to arbitrary remote control. Like:

https://raw.githubusercontent.com/quenhus/uBlock-Origin-dev-...
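
(For anyone unfamiliar with the pattern: a raw.githubusercontent.com URL naming a branch follows whatever the branch currently contains, while one naming a full commit hash is an immutable snapshot. The paths below are illustrative, not necessarily the repo's exact layout:)

    https://raw.githubusercontent.com/<user>/<repo>/main/dist/all.txt
    https://raw.githubusercontent.com/<user>/<repo>/<full-commit-sha>/dist/all.txt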


As the author of the filter, I strongly agree with you. However, I believe it would be too tedious for most people to update the filter "by hand". I think I'm going to add this important security information in the README.


>so you aren't opening up your browser's ublock origin filter list to arbitrary remote control.

What's the worst that could happen? It seems like uBlock already treats filter lists as semi-untrusted. There's not much a list can do other than block stuff.


Exactly... "Dynamic filtering: precedence" : https://github.com/gorhill/uBlock/wiki/Dynamic-filtering:-pr...


I know there's some serious chicken/egg issue here, and this solves a different problem than Google or DDG do, but I think what I need is a way to "Like" or "Favorite" a page I'm on (maybe just a bookmark?) that triggers my personal search indexer, which will then index that page.

Then maybe optionally "follow" other people's favorites as part of your own results. Imagine someone crawling Twitter or Stack Overflow: you just "follow" their index, it gets merged with your results, and you can disable/enable it for specific searches.

I've been thinking about it a lot lately, because I'm often trying to remember something that helped me or that I wanted to remember, but I can't find it in my FF history. And bookmarks feel too clunky.


In the end there is no perfect solution because the whole thing is gamed too much. Eventually spam sites will start paying popular users to like their spam page.


This attitude is too defeatist, and I'm not convinced. The OP's suggestion was that people would choose which other "favorite lists" or "indexes" they follow, which implies these are public. Who's going to follow someone who puts a StackOverflow clone filled with ads on their list? And if they do, and it shows in my results, I should be able to discover that someone I'm "following" has been compromised, and I'll unfollow them.


The same would have been said a million times about social media, and yet most large personalities are filled with paid sponsorships and no one really cares.


Are large personalities not “unique”, and followed for the uniqueness of their character?

Whereas the "moderator of a filter" is followed on the basis of their output, the filter itself.

I don’t disagree with your statement, but I don’t find it a compelling counterargument to the suggested solution.


Try a bookmark service like pinboard.in; they have a search engine where you can search by keyword ...


Google tried that with their plus button, but it went away along with Google Plus.


I really don't like these websites and I smash that back button as soon as I realize I've landed on one.

That said, I'm amazed they are still showing up at the top of Google search. My understanding was that that kind of behavior (which I think at least some other people do too), combined with the fact that they are just copying another, much higher-ranked website, would mean they are highly unlikely to rank above the relevant Stack Overflow article they are duping. So what is happening here?


Reminds me of going to Google Images and getting sent to Pinterest... which is not where the image is actually sourced from.


Pinterest is one of those sites that really makes me want to strangle someone. It’s just an abhorrent walled garden of other people’s property.


If I were on Pinterest looking up things, fine. But I'm on Google, not trying to find a mirror of what I want, I want what I want.

Edit: I have a friend who works there, but not as an engineer, haha. I'm pretty sure I've told him my woes with Pinterest. My wife loves Pinterest though; it allows her to come up with amazing design ideas and art ideas.


The main problem is really just that they show up for basically every search, a bunch of times, and when you click through, you can't get to the image without signing up/in.


Pinterest is one of those sites that I would block forever, if I could customise my search experience.


Add this to your browser's search engine list and make it default:

    {google:baseURL}search?q=%s+-site:pinterest.*
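
The {google:baseURL} placeholder is Chrome-specific. On browsers that take a plain URL template with %s (Firefox, for instance), something like this should behave the same way:

    https://www.google.com/search?q=%s+-site:pinterest.*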


If on Firefox, do you need to add the optional search bar (with the magnifying glass) in order to access "Change Search Settings"? Because I'm not finding it anywhere.


Doesn't help with reverse image searches, which account for probably 90% of the times I end up clicking on a Pinterest link.


That's a brilliant solution, thanks!


As much as I avoid giving google any of my money I would literally pay them to be able to block domains from results.


Why doesn't someone sue them then?


I am not sure how GOOG weighs what happens after you click on a result. It would be clever of them to notice how quickly you click on another result for the same search and slightly downgrade the first link (though what happens if you open the first three links in new tabs before you actually visit them, say?). My assumption was that they just count clicks as upvotes, so if these scammers can make it onto the first page of results, they will tend to stay.


This is not even the first or second time Google has rolled out changes that allowed SEO spam sites copying Stackoverflow or Wikipedia to rank higher than the original.

They did fix this at one point in time by figuring out which site posted the content first and penalizing the copycats, but it appears the fix is once again broken.


I figure they must just be monitoring the original content and republishing it before it's indexed by Google. The searches are so specific and niche that generally ranking isn't hard; it's beating the OG that's hard.

I just don’t know how they are managing to get indexed before the big name established sites. Perhaps they are succeeding on some small percentage and that is what we are seeing?

Perhaps they have an additional trick to make it look like they posted the content first, perhaps internal links or something.


I find that hard to believe; the SO questions are often years old, the GH ones months.


So how are they doing it?


For a long time now, Google has weighted behavioral signals similar to what you describe. "Bounce Rate" is the percentage of users who quickly leave your site after clicking. "Dwell Time" is the amount of time a user spends on a page.

There's even a cottage industry around gaming these signals. See SerpClix and the like.


So are you suggesting that GOOG is using bounce rate on the target site (assuming it uses gAnalytics, I take it; otherwise how would they know?) to alter the ranking of search results (like "90% of users bounce in the first two seconds on scam.copycat.com/seemed-useful, so we're demoting its rank by 20%")? That would be interesting. I was musing more about whether GOOG does any session tracking on their end to try to infer how happy/unhappy users are with a given set of results. Of course they could do both at the same time.


Stop using stock ticker names to refer to companies. It's cringe and not even easier to type.


I agree, but even more generally: avoid abbreviations. A reader has to spend time thinking about what you are talking about. It's distracting.


So how come pinterest is still on top for many searches?


Why not? Pinterest is a popular site, with lots of content all linked to each other. Many people probably spend a long time there after clicking a result.


The inordinate amount of time it takes to click away all of the dark-UI login screens just to see the content, before deciding it's not what you wanted, already pushes the dwell time longer than on other sites.


I've been using uBlacklist which adds a little block this site button to the google search results. Handy.

It is a browser extension and I haven't looked too deeply into it, so if that's important to you, perhaps have a browse over their repo etc. before installing.


If you scroll down on the OP, two of the source lists are actually uBlacklist block lists.


I use that too, great for blocking pinterest from flooding all of your google image search results.


I use Kagi as a search engine and can just block the site from the search results.


I started using Kagi recently, and so far haven't had to block a single site. Their filters are great!


Kagi filters are great for programming, but still evolving for others. I still see a lot of pinterest results. You can block domains by adding them to Kagi blocklist through Settings -> Personalized Results -> Blocked Domains.


There's also a boost/block button right on the search results page.


FYI whoever made this, you can create clickable links to import filters. For example: https://subscribe.adblockplus.org/?location=https://raw.gith...

Quick edit: I know the domain is ABP but ublock origin picks it up.
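
If you want to build such a link yourself, the general shape is (the list URL goes in percent-encoded; the title parameter is optional, if I remember right):

    https://subscribe.adblockplus.org/?location=<url-encoded list URL>&title=<list name>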


That's actually really cool, thanks! However, the link doesn't work from HN for me. For the link to work, users need to click from a trusted domain listed here https://github.com/gorhill/uBlock/blob/bba4732c6b47134c3f54e...


From 1.41.5b2 and above, it works on non-"trusted" sites: right-click the link and there will be an entry in the contextual menu to import the list.

For older versions of uBO, you can already use the old way:

    abp:subscribe?location=[...]


Thanks for your incredibly fast response and help. (Do you have a notification when your username is invoked in HN?) And of course, thanks for uBO!

In my case, the issue is that GitHub doesn't allow the abp|ubo protocol in links. However, no problem, I can use the method with "subscribe.adblockplus.org". Still, subscribing from the contextual menu is a great feature.


Hey, it’s the creator! Thanks for uBO.


Not unless you have both enabled... then ABP picks it up


Cool idea! I was surprised that Wikipedia mirrors aren't included, as I encounter them constantly and they drive me bonkers. I opened an issue: https://github.com/quenhus/uBlock-Origin-dev-filter/issues/2...


Yeah, it's really frustrating when I read a poorly sourced Wikipedia article and try to search for other sources on the claims in the article, but all I get is clones of the Wikipedia article.


I'm seeing a lot of Github clones on DDG lately. Some even stealing the GH favicon. I can't fathom what their goal is other than to inject exploits into devs who are high value targets. You'd think a responsible search engine would be culling these for the sake of general security.


I've been seeing that a ton on Google as well. It seems to be the GitHub topics.


Another useful extension like this is https://iorate.github.io/ublacklist/


Bless you. This took 30 seconds to put on my phone and laptop and has already improved my results so much.


Phone? Can you elaborate more on this? Are you on Android or iOS?


iOS. I use Safari and adguard. It supports custom block lists.


Which list or url are you using? I have adguard also. Thanks for the input. Did not know you could do that.


The one listed in the OP repo:

https://raw.githubusercontent.com/quenhus/uBlock-Origin-dev-...

Adguard picks up 588 rules.


I'm absolutely going to try this. The number of times I get blatant copies of Stack Overflow posts, only with the formatting stripped, is infuriating.


Missing from the title is:

> Specific to dev websites like StackOverflow or GitHub.

Before I noticed that, I had searched for pinterest and found nothing. Even marking the HN title with "dev" would be good.

If this were my list, I'd add w3schools because, to me, it's low quality, especially compared to Mozilla.


> I had searched for pinterest and found nothing

So it's working as intended and blocking low effort spam sites


Does anyone have an idea how to make this work in Brave without uBlock? I added the block list to the custom filters (brave://adblock/) but results for those spam sites are still shown in DuckDuckGo.


Two important questions I don't see answered in the readme:

1. What are the criteria for a GitHub copycat?

2. What is the process for having a website removed?

I ask #1 because many businesses use their own git-hosting solutions to host their code but also use github as a mirror. It would be very easy for a competitor to get the website of a rival business listed if the criteria is not strict and specific enough. I recommend never blocking the entire domain unless the website is a repeat offender (to avoid mistakenly harming innocent businesses and the liability that it may cause).

I ask #2 because many of these domains will likely be registered by honest people once the scammers are finished with them. There needs to be a way for the new owners to get their new domains delisted.


For #2, I assume a PR to remove the site could be sent and merged.

For #1, it seems to mostly be a list of SEO-gaming sites that I've personally found to be supremely irritating and that deserve to be on the list. They basically just mirror Stack Overflow and GitHub issues, and provide obscured links back to the original source to make sure people stay on their site. You can peruse the list in the code yourself; it's just a text file.


I don't like low effort clones either. My only concern is the susceptibility of this project to being weaponized by dishonest actors. Some of us have been on the receiving end of false-flag spamming campaigns by competitors and have been penalized unfairly for it. If these filter lists take off, and I hope they do, they must have a well-documented process for accepting patches and another for handling malicious actors.


Often enough I’ll see 5+ sites at the top of results for windows errors and they’re basically all trying to sell me golden hammer software to “fix it”.


How do I convert these filters into ABP-compatible syntax? The adblocker in Brave's browser only reads that, for some reason.


Hadn't thought about using uBlock Origin for this. I'll use the same technique to filter out Pinterest too then.


Making it this obvious how to set this up made me finally do it. No more junk results (well, less junk...). Thanks!


Anybody know of a way I could bulk import these into NextDNS?


NextDNS is for blocking the site when you access it. It won’t hide it from the search engine results.

You still need a browser extension adblocker (uBlock, AdGuard) that can modify the contents of any webpage for this to be effective.


Oh sweet, thank you. Been looking for something like this.


Because filters will fix the Broken Google problem?


???


Great, I will try that soon. These websites are infuriating


To be fair, they are quite nice when the official website is down or blocked...


Archival platforms not based on fraud or deception are far preferable.

Internet Archive or Archive Today.


... or DMCA'd.

Over the past year, I've noticed that quite a few repos that I used to track have disappeared. I keep a local bookmarks list now because if a "starred" project is removed or DMCA'd, Github does not tell you about it and they remove any mention of the repo from the "starred" list.


Google's cached version of pages can be another useful option: click the ellipsis to the right of a search result's address, then "Cached" in the bottom right-hand corner of the "About this result" box.


Came here to post a similar sentiment. I've rescued very useful content from mirroring sites that was gone from the original. You can filter them out if you want, but don't forget you're doing that or you may not find what you're after.



