YT-Spammer-Purge: Scan for and delete spam comments on YouTube (github.com/thiojoe)
137 points by zikohh on April 4, 2022 | hide | past | favorite | 56 comments



I watch a lot of financial videos on YT and noticed the sophistication of the spam has increased lately. Previously you'd have either:

- someone who has stolen the video owner's logo replying to your comments with WhatsApp links

- some MLM-esque praise for a particular person's crypto trading scheme.

For the second case, you now have spam bots replying to the top comment with a vaguely realistic-sounding dialog: the comments alternate between people asking for more information, praising the crypto scheme, and giving contact information.

The comments section has basically become useless at this point due to the level of spam. You still have content creators encouraging you to comment to trick the YT algorithm though.


Exactly this, sometimes up to 30 comments and likes, all by the same spam bot. If you call out the spammer below his comment the bot reports you and your comment will be removed.


I just report the top-level comment as spam, and it disappears for me. I feel I have done everything I can at that point; no need to engage spammers.


I've tried this; it does nothing. I visited the page 3 weeks later and nothing had changed. Nor does this script give instant results; reports still have to be actioned manually by a human. I think this tool is only useful if you own the channel and use it. Otherwise, it doesn't matter. YouTube just doesn't care.


This has gotten exponentially worse in the past month. In the past, I'd see maybe one or two videos get targeted with Telegram scams per month.

At this point, every one of my videos is targeted, and usually in waves two or three times before the spammers give up.

Sadly, there are at least 5-10 people _per week_ who now email me asking if the scam is legit, and I have to tell them no, I would never choose you to win something then ask you to send me $100 to send it to you.

The tool linked here works perfectly every time for me... I don't know how this problem can be so prevalent (MKBHD said it's like 6-10% of all comments on his channel now) while YouTube still ignores it (and does other pointless things like removing public dislike counts instead).


Engagement is engagement to them. Remember, this is Google we are talking about. To say they don't care implies that at some point they did.


I know they have team(s) of very smart people dedicated to solving this issue (at least at the individual level).

So assuming they care, I can think of two main reasons as to why it is not solved yet, both related to scale as Marques mentioned:

1.) Scale of the problem - It might be that they are already catching 99% of the stuff and we just see what falls through the cracks

2.) Scale of the solving - It could be that the teams and infrastructure are so large that they can't make the rapid adjustments needed to compete in such an arms race

On a separate note, I imagine a higher quality comment section would increase engagement more than any "appealing" scam.


> I know they have team(s) of very smart people dedicated to solving this issue (at least at the individual level).

Do you actually know, or are you being generous and still trying to assume good faith from a company that has disproved it several times?

I don't see a business reason for them to take action. The spam comments don't open them to any legal liability (they already get away with much worse), YouTube has a monopoly so no amounts of spam will drive users away, the spam contributes to engagement numbers and the advertisers don't seem to mind.


I happen to know someone in this case, and am not assuming good faith from the company by any means. I trust and respect the individual.

I'm also generally interested in the comment moderation problem and have been working on it myself for some time. I guess my judgement is clouded by my hope that there is a reasonable excuse for the team(s) at Google not to have solved it by now.

Perhaps it is naive of me to think this way; if it really is as simple as "this does not affect advertising revenue" then that would be quite nearsighted of Google. And, as I mentioned earlier, I am of the opinion that quality comment sections would increase engagement (and revenue as a result), so it doesn't make sense to me.


The spam comments usually contain hooks and symbols so that the other bots can latch onto them more easily. Querying for those signs to spot probable spam comment threads is trivial, especially given the existing libraries on the topic, for instance Bayesian spam filtering.

Sure, the most hardcore spammers would most likely change tack if thus attacked, but many would also quit entirely as spamming becomes unprofitable. If they also were to train one of their AIs or neural networks, they could catch even more spam by simply looking for post and sentence patterns. For instance, it's very common that a spam thread contains multiple references to a name; the name of the brand or investor, or whoever they are shilling. They're always giving some sort of advice in conjunction with that name. And at some point the posts almost certainly contain weird symbols to reference the WhatsApp number or Telegram channel. So no, I don't buy that this is hard to do. I think most of it is trivial.
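The "existing libraries" claim above can be illustrated in a few lines without any library at all; here is a minimal naive Bayes sketch of the idea, with the labeled training comments invented purely for illustration:

```python
# Minimal naive Bayes spam scorer over comment tokens -- a sketch of the
# approach described above. All training data here is made up for illustration.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(labeled):
    """labeled: list of (text, is_spam) pairs. Returns per-class token counts."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in labeled:
        for tok in tokenize(text):
            counts[is_spam][tok] += 1
            totals[is_spam] += 1
    return counts, totals

def spam_log_odds(text, counts, totals):
    """Log odds of spam vs. ham for a comment, with Laplace smoothing.
    Positive means 'more likely spam'."""
    vocab = len(set(counts[True]) | set(counts[False])) or 1
    score = 0.0
    for tok in tokenize(text):
        p_spam = (counts[True][tok] + 1) / (totals[True] + vocab)
        p_ham = (counts[False][tok] + 1) / (totals[False] + vocab)
        score += math.log(p_spam / p_ham)
    return score

# Invented examples mirroring the patterns described: a shilled name plus
# contact info vs. ordinary viewer comments.
training = [
    ("contact my advisor on whatsapp for guaranteed profits", True),
    ("thanks to expert trader jane i earned so much", True),
    ("great video very clear explanation", False),
    ("i disagree with the point about index funds", False),
]
counts, totals = train(training)
print(spam_log_odds("message whatsapp for profits", counts, totals) > 0)
```

At scale the vocabulary and smoothing would need far more care, but the shape of the computation is exactly this simple.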

So why aren't they fixing it? Well, I seriously doubt it's due to incompetence. The more likely scenario is that, through earnings and statistics, they already know it's not losing them any paying customers. As such, it's simply a matter of priority for them. And you're not it. Because you're the product, not the customer.


> Well, I seriously doubt it's due to incompetence.

I agree with you on that, as well as on taking an ML approach. Querying the hooks and symbols directly can lead to the false-positive vs. spam tradeoff that TheDong is referring to elsewhere in this comment section (to be fair, so can the ML approach, but it's more avoidable there). It is possible that the scale of it makes the minor shortcomings not so minor.


MKBHD just released a video where he used this and was happy with the results but hated that it was even needed.

https://youtu.be/1Cw-vODp-8Y


Linus (Tech Tips) in February:

> YouTube’s spam problem has gotten out of hand… And it’s up to the community once again to do what YouTube can’t. Thankfully, the community has delivered…

* https://www.youtube.com/watch?v=zo_uoFI1WXM


Wow, the tool found that 30% of his video's comments were spam.


Which part of that is surprising to you? That 30% of YT comments are spam, or that the software worked to that degree?


Not the OP, but I am personally surprised by many things.

- YouTube's comment system is horrible at dealing with spam.

- A tool made by a random person online is able to find and remove so many of these comments, meanwhile YouTube, with WAAAAYYYY more resources, data, people, etc., somehow is not.

- The fact that the spam problem is so bad that tech YouTubers are seeing tools like this remove around 30% of the comments on their videos is insane.

Overall I barely even look at the comment section on YouTube anymore because of how big of a mess it is. It's refreshing to look at the comments on Linus Tech Tip videos as they are free from most spam due to them making use of this tool. It's super surprising how long this problem has been going on, and just how much it keeps worsening. It just seems crazy how long Google has been letting this problem go unchecked.


It works precisely because it is a tool made by a random person, so spammers are not trying to avoid it.

Such is the curse of spam: they get approximately infinite tries to post spam. We don’t really know how many spam comments Google is preventing from being posted, but any anti-spam measure they introduce will quickly result in the spammers changing tactics, probably within minutes.


Yet some of the tactics of spammers/scammers are so obvious that it's surprising YouTube doesn't provide simple solutions for some of these cases.

For example a common example as MKBHD mentioned is scammers impersonating channel owners within their comments.

Why can't content creators set an option to auto-flag other users that use their name and profile picture (to some degree of similarity) in comments under their video?
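The impersonation check proposed here doesn't need anything exotic. A hedged sketch, using a plain string-similarity ratio and a made-up lookalike table (a real system would also have to compare profile pictures, which this skips):

```python
# Hypothetical check for the impersonation case described above: flag a
# commenter whose display name is suspiciously close to the channel owner's.
# The lookalike table and threshold are illustrative assumptions.
from difflib import SequenceMatcher

# Common substitutions spammers use to dodge exact-match filters.
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s", "\u0430": "a"})

def normalize(name):
    # Lowercase, fold lookalike characters, drop spacing tricks.
    return name.lower().translate(LOOKALIKES).replace(" ", "")

def looks_like_impersonation(channel_name, commenter_name, threshold=0.9):
    a, b = normalize(channel_name), normalize(commenter_name)
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(looks_like_impersonation("MKBHD", "MKBH D"))        # spacing trick
print(looks_like_impersonation("MKBHD", "RandomViewer"))  # unrelated name
```

Since the channel owner opts in for their own name only, false positives are largely limited to fans who deliberately mimic the creator's name, which is arguably worth flagging anyway.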


Most of the spam left over is super-obvious. It should not be hard to remove it. A filter based on Bayesian probability would wipe out most of them with a high degree of certainty (i.e. very few false positives).


> YouTube with WAAAAYYYY more resources, data, people

Alphabet, Google, and to a lesser extent YouTube, have a lot of resources, but they might not invest them in this direction. A friend recently worked on a team responsible for understanding creators, and it was alarming how few people were involved. He left because it was poorly managed. They barely scratched the surface on engagement, and weren't even able to measure the inconsistency of audience engagement (the biggest point of contention for creators).

They likely have fewer people dedicated to fighting spam overall, let alone in comments, than you’d expect. This is a delicate interaction to highlight because it involves both account creation (accounts that don’t upload videos, so they likely aren’t prioritised) and users editing their identity, something no one cares much about at YouTube (you get plates if you matter, that’s operations); all the bad things happen off YouTube…

The other comment about adversarial interactions with spammers is key (and likely more important), but we need to remind everyone (including people on HN who have likely experienced this at their jobs) that “it’s a large company” doesn’t translate to “they have large teams working on this” but to “they have many other problems prioritised over this”.


> “it’s a large company” doesn’t translate to “they have large teams working on this” but “they have many other problems prioritised over this”.

I don't think anyone needs to be reminded of that. "Company X with way more resources is failing at basic tasks" doesn't signal a misunderstanding of what those resources are used for, it is criticism of that company's priorities.

The fact is that a company whose profits grow by billions every year is not fixing a very obvious problem on one of their biggest platforms. They have more than enough resources to fix it.


Their profits would only be relevant if they underpaid the staff who could fix the problem (machine-learning specialists, because you can’t imagine addressing this by hand). They don’t. Whatever lack of resources they might have isn’t fixable with money alone.

They prioritise other issues; comments, and even more so replies to comments, are nowhere near the core feature of the platform. They have, seemingly, addressed copyright infringement, people gaming the recommendation system, and first-level comment spam: things that the same creators were loudly complaining about earlier. That sounds like a reasonable prioritisation system. The phenomenon described here is apparently new, so the issue is presumably that they aren’t able to react to new threats rapidly. That would be a new structural concern for them, and not something a large company would be expected to have fixed already.


One man's spam is another man's engagement metrics.


Google is spending its resources on more important things, like spying on everything you do! They ain't got no time for despamming!


There's also anniversary doodles.


What, you want Google's salaried visual designers (that they would have even with different priorities) to try their hand at solving YouTube spam-filter scaling?


There's coding to them too.


Leetcode caused all this!!!


Not really your point, but probably way more than 30% of comments are spam. These are just the comments that made it through the filter. I wouldn't be surprised if 90% or 95% of comments are spam but most of them get filtered out by Google.


This argument seems circular: it assumes that the 30% figure has good precision and recall.


This seems like it should be a one-week data science project at YouTube. Kind of embarrassing.


Googler, opinions are my own.

My guess is that YouTube could easily build the exact same tool. The problem is when you roll it out across your entire site, the spammers figure out the holes in it and start abusing it.

I'm guessing the same thing would happen with this tool. If some large percentage of channels started using it, spammers would find the holes in it.


I don't find this argument very convincing. The same argument could be made about security vulnerabilities, but I'm willing to bet (and hoping this is the case) that Google invests millions of dollars and man-hours per year in security patches and secure systems.

Sure, hackers will always find loopholes. But when that happens, we flag that version as vulnerable and release a patch to fix the vulnerability. This is the exact same technique Google could apply to YouTube spam. They just don't want to spend the money or time to do it.


There is a difference between the cat-and-mouse game of spam fighting and fixing vulnerabilities.

Typically, when you fix a vulnerability, things are strictly better. Attackers can no longer do X bad action, but all legitimate users can still do everything they wanted.

Spam fighting is different. If you make your spam classifier broader and broader, it will have more and more false positives as well, and legitimate comments will get deleted too. Without AGI, or at least very good language parsing, it really will be a case of tuning between "more false positives, less spam" and "fewer false positives, more spam".
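The tuning tradeoff described above can be shown with a toy threshold sweep; the spam scores below are invented, but the monotone tradeoff between false positives and missed spam is the general pattern:

```python
# Toy illustration of the precision/recall tradeoff described above: sweeping
# the spam-score threshold trades false positives against missed spam.
# The (score, actually_spam) pairs are invented for illustration.
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.40, False), (0.30, True), (0.10, False),
]

def tradeoff(threshold):
    """Return (false positives, missed spam) if everything >= threshold is deleted."""
    false_pos = sum(1 for s, spam in scored if s >= threshold and not spam)
    missed = sum(1 for s, spam in scored if s < threshold and spam)
    return false_pos, missed

for t in (0.2, 0.5, 0.85):
    print(t, tradeoff(t))
```

A broad threshold deletes legitimate comments; a narrow one lets spam through. No single threshold makes both counts zero unless the classifier is perfect, which is the point being made.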

There's also vastly more spam than there are security vulnerabilities since there are hundreds of thousands (millions?) of people intently creating spam for profit, while bugs are mostly accidental, and exploitable ones relatively rare.


Meanwhile Google deletes on-topic comments with links after 5 minutes, but lets the spambots through. Actually detecting spam can't be any worse than not doing so. If you don't want to detect spambots this way, why not send possibly-spammy comments to the channel owner's moderation queue instead of memory-holing them?


It's even worse than that: I had some comments deleted but others shadowbanned (!), merely from using timestamp links to the same video and/or posting multiple comments and/or writing long comments! (And I first noticed it about a year ago.)

P.S.: Even the deleted comments are deleted silently.


Right. They already have a system monitoring for spam comments. 99% of the time I make a valid, useful comment on a YouTube video it gets marked as spam and never shows up.

They basically need to do the exact opposite of whatever they are currently doing.


Between Gmail, Search and AdWords I'm sure you numbskulls could come up with something to deal with a bit of spam.


Sir he is a Googler show some respect.

Imagine actually referring to yourself as a Googler.


OK, Metamate.


So why does Google try to fight spam in Gmail but think it's a lost cause in YouTube comments?


Honestly, given the amount of obvious spam landing in my gmail inbox the last few months, I kinda think they gave up on fighting spam on gmail...


I built something similar with a basic classifier in TensorFlow: https://github.com/Savjee/yt-spam-classifier

Here's the training dataset that I've been using. It includes all comments from my channel from last year until now. And I've manually tagged about 3000 of them as spam. https://docs.google.com/spreadsheets/d/1QEQrLne1SDxwQVl5qpGQ...


https://news.ycombinator.com/item?id=30616055

A recent post I made about adult pics and usernames in YouTube comment sections.

Usually something like 'Click here, to view sexy videos', but the letters are replaced with lookalike characters so as not to get caught by YouTube's anti-adult-content algorithm.

Eg., https://www.youtube.com/watch?v=F-kvFACZ5yE (a random video from the homepage; proxy link: https://piped.kavin.rocks/watch?v=F-kvFACZ5yE)
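Because this obfuscation relies on lookalike Unicode characters (e.g. mathematical bold letters), Unicode compatibility normalization undoes much of it; a sketch, with the bait phrases assumed purely for illustration:

```python
# Sketch of catching bait spam written in lookalike Unicode: NFKC folds
# mathematical-bold and fullwidth letters back to plain ASCII, and stripping
# combining marks removes accents layered on for obfuscation.
# The BAIT_PHRASES list is an illustrative assumption.
import unicodedata

def skeleton(text):
    folded = unicodedata.normalize("NFKC", text)
    # Decompose, then drop combining marks (accents), then lowercase.
    return "".join(c for c in unicodedata.normalize("NFD", folded)
                   if not unicodedata.combining(c)).lower()

BAIT_PHRASES = ["click here", "sexy videos"]

def is_bait(comment):
    s = skeleton(comment)
    return any(phrase in s for phrase in BAIT_PHRASES)

# "Click" written in MATHEMATICAL BOLD letters, as the spam does:
print(is_bait("\U0001d402\U0001d425\U0001d422\U0001d41c\U0001d424 here to view sexy videos"))
```

Real confusable detection covers far more character mappings (Unicode publishes a confusables table for exactly this), but even this NFKC pass defeats the common bold/fullwidth tricks.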


Make this a Chrome extension, and you will have a great time selling it.



The simplest solution for YouTube, IMHO, is to allow channels to hide all usernames and profile pictures in the comment section.


...people actually read youtube comments?


The first few highest-rated comments are worth a glance, because if something's critically wrong it'll usually be mentioned there.


Now that we don't have a dislike button, the comments are the only option.


Not when it's sorted by most votes. Every comment you see is "this video is so good!" even if it's not. Basically useless.


I’m guilty of this. I follow some smaller channels and I want to encourage them and to juice the algorithm a bit by giving them some engagement beyond a “like”.


Definitely don't feel guilty! It's great to support content creators you like. I'm just arguing that the comments section is weighted toward those types of comments, so it's a poor indicator of whether you'll actually like the video or not.


except when critical comments are removed


I haven't seen many critical comments lately. Even on videos that are just completely wrong, the top comments are always full of praise.


I blocked them with ublock a few years ago. It was too tempting to engage in silly online arguments over things that really in the long run did not matter.



