I was in the business of fighting web spam for over 5 years (Defensio) and while these techniques help, they're not the definitive answer.
Spam bots are now extremely sophisticated and have been able to execute JavaScript and "read" and understand web pages for many years. They'll also post bogus comments that are somewhat related to your article but sneak a fishy URL in there. We had many false-positive reports that were actually real spam. It's just really hard for a human to detect. Of course, JavaScript-based techniques will eliminate some easy-to-catch spam, but nothing a 3rd party service couldn't catch.
Another huge problem is that people are paid next to nothing in China and India to manually spam websites and break captchas. The number of human spammers keeps increasing. When I left last year, it was becoming a huge problem. Definitely the biggest headache for us in ~5 years.
In my experience, the best protection against web spam is still Akismet/Mollom/Defensio. And for the record, I know we didn't like it when people used other mechanisms to stop some spam before it got to us, because we didn't get to see the full corpus, which was invaluable to us in helping all our users fight spam.
I think the kind of defense you need to use depends on what kind of website you have.
Based on my experience, if you have a small/medium website you won't find bots that execute JavaScript or understand a web page, or human spammers.
Those are reserved for the big ones; for all the others it is mostly general-purpose bots that try every form they can find on the internet. Where speed is more important than accuracy, spammers won't use the "heavy" bots.
Actually, the sophisticated bots typically target platforms, not websites. So if your website runs WordPress, you're much more likely to be spammed hard than if you custom-built a comment form.
It depends. I can tell you from experience that even a medium-sized website that ranks well on (legitimate) pharmaceutical terms is a huge target for spammers.
I wonder if eventually people will just stop allowing hyperlinks in comments altogether. It would, at a stroke, eliminate the biggest incentive for spam.
Yes, it's nice (I guess) when someone's name is a link to their personal website or they can post the URL of a relevant article in the comments, but it's not like commenting ceases to be valuable without those features.
It actually came pretty close in ~2007. We were working on something else spam-related and when we noticed that big bloggers were fed up with existing anti-spam solutions (false negatives/positives) and were about to just remove commenting altogether, we realized that it was a huge problem without a good solution, so we knew we had to do something about it.
A real spammer will take any link; it doesn't matter if the link won't be considered as some form of endorsement by search engines due to the use of the rel=nofollow attribute. A spammer will happily post a million links, because there will be some poor souls out there who click on some of them. Quantity over quality has always been one characteristic trait of spam.
Spam and link spam were already there before Google existed and the PageRank was invented. The index of the AltaVista search engine was huge and full of spam.
When the nofollow value for the rel attribute was introduced there were many claims that this would reduce the amount of link and comment spam. Critical remarks came often from people who were offering link building and SEO as a service.
We got hit with a huge wave recently, that sent over 40,000 visits a day to our site and nearly ground it to a halt.
The number 1 effective thing we have found is to not allow hyperlinks to be posted by users who are not trusted (not enough rep/points/score, whatever).
Overnight it basically stopped the spam wave. You're removing the one thing of value for them, a hyperlink. I'm a big fan of accessibility and this works well with it. The only other technique we use is honeypot form fields, which do catch a fair few, but nowadays a lot of spam, I suspect, is paid human spam.
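For anyone wanting to try the "no links unless trusted" rule, here is a minimal Python sketch. The function name, the URL regex, and the threshold of 50 are all assumptions made for illustration; the comment above doesn't say how trust is scored or where the cutoff sits.

```python
import re

# Hypothetical pattern: treats anything starting with http(s):// or www. as a link.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def may_post(comment_text: str, user_score: int, min_score_for_links: int = 50) -> bool:
    """Allow link-free comments from anyone; allow links only above the trust threshold."""
    has_link = URL_PATTERN.search(comment_text) is not None
    return (not has_link) or user_score >= min_score_for_links
```

A regex like this will miss obfuscated links ("example dot com"), so it is only a first filter.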
Blocking any spam that contains a link is helpful, for sure, but doesn't get everything. Every few months I see waves of comments like: "Really graet article. We need more people like you in the world."
Each comment has exactly one pair of transposed letters. There is no product being pitched, and no url (we don't display or link to email address either). It's baffling.
Some blog platforms whitelist comments from people who have had previous comments approved. I'm pretty sure these meaningless (but positive) comments are an attempt to get on that list.
Yes. Also, some platforms (mostly forums, less so blogs) allow editing of posts, so forum spammers sometimes post meaningless crap only to replace it later with spam.
We should be able to detect that by looking for large numbers of posts with small edit distances. That will contain false positives, but looking for very large numbers should mitigate that.
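A rough Python sketch of that idea, using plain Levenshtein distance and a brute-force pairwise comparison. The function names and the distance cutoff are assumptions; a real site would need something smarter than O(n²) comparisons over the whole comment table.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def near_duplicate_counts(posts, max_distance=4):
    """For each post, count how many other posts lie within max_distance edits."""
    counts = [0] * len(posts)
    for i in range(len(posts)):
        for j in range(i + 1, len(posts)):
            if edit_distance(posts[i], posts[j]) <= max_distance:
                counts[i] += 1
                counts[j] += 1
    return counts
```

Posts whose count is very large are the suspicious ones. Note that a transposed letter pair counts as two edits under plain Levenshtein.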
> Each comment has exactly one pair of transposed letters. There is no product being pitched, and no url (we don't display or link to email address either). It's baffling.
I would guess that the transposed letters are used to keep tabs on where their comments are live. It could look innocent, as if it were a human typo, but later the spammers could run a search and see which sites are trusting their comments in order to either edit them later or use their trusted account to post spam links.
Those comments could be used to make automated spam filters less effective. Spammers could post comments that would normally be labeled as spam, but do not contain any URLs. Over time a spam filter would have a harder time distinguishing between cut-and-dried spam and real comments (assuming the admin is marking those spam comments as ham).
Also, I'm a fan of not allowing brand new accounts to post URLs in their comments. It's a no-brainer.
Seems pretty effective indeed. If I look at the caught spam on my blog, all of it has hyperlinks in either the 'website' field or the comment text itself.
Pity it also targets normal users that simply want to post a hyperlink :(
Maybe the barrier to posting a comment should be low if you don't have a hyperlink, but if you include one then you have to jump through some hoops (captchas, etc).
To be clear, we're talking about an automated process that is looking for forms to fill out. It's far easier and faster to simply scrape the HTML of a page and look for the <form> and <input> tags to figure out what to submit than to actually fire up something that resembles a real browser.
Hmm. I wonder if spambots check back on their work to see if it's worth continuing to attack a target. I guess removing the hyperlink would make them decide to drop it and move on.
I would say almost certainly yes. In my own forum I found that a short period after I had implemented spam filtering, the moderation queue length (everything that gets rejected as spam goes in to the mod queue) had very little spam in it. I would surmise that spam bots are at least smart enough to check that their posts are getting through, and if not, don't waste time posting content (just try one every now and then to see if you can slip one through and then hammer it!).
Did you try, or do you know, whether merely adding nofollow was enough to have an impact? I've always assumed comment spam links were more about growing/stealing page rank than actually being clicked by anyone.
I may not have reverse engineered it fully, but something like this will allow me to post images around the internet that actually create comments on your site by the IP of the visitor.
My personal favourite quick-fix (which doesn't stand up to targeted attacks, but is a very effective band-aid) is to put the following:
<input type='text' name='website' style='display:none'>
Then disallow any form submissions server-side which contain a value for 'website'. Automated bots can't resist filling out that field.
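The server-side half of this can be tiny. A Python sketch, assuming the submitted form arrives as a dict of field names to strings (the function name and that shape are my assumptions, not any framework's API):

```python
def is_honeypot_tripped(form: dict, honeypot_field: str = "website") -> bool:
    """True when the hidden field came back non-empty, i.e. something filled it in."""
    value = form.get(honeypot_field, "")
    return bool(value.strip())
```

A real visitor never sees the field, so it comes back empty or missing and the comment passes.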
This happened to me recently with a WP blog. It happened quite by accident, however, since the client just didn't want the website field. When comments still came in with a URL, the client was concerned that I had screwed up - but it clicked right away for me that these must be bots. It might have been a little disheartening for the client, since a number of these spam messages were along the lines of "I have never read such a great article. I have bookmarked your blog and will come back every day to read more of your insightful posts." What unaware blog owner wouldn't want that on their comments? Crafty spammers.
Mine is the reverse of this idea. I have a hidden field that, when you click submit, I fill in with a token via JavaScript. If the correct token isn't present when submitting, I reject the comment.
It works very well unless you're big enough to merit individual attention from a spammer. It's not rocket science - it just raises the bar a little above the level of effort that people who spam everything, everywhere are willing to put in.
That might change.
The real merit of JavaScript used this way is that there are so many different possible approaches and ways to write the code that parsing has to be done on a site-by-site basis. It should even be possible to write something that auto-generates and mixes various combinations to make it annoying and costly for an individual to keep working at breaking the protection, thus increasing the size of community/site you could protect this way.
There is a WordPress plugin called Spam Free Wordpress that implements a variation of this and has effectively cut spam on my sites to zero.
The plugin improves on the method described by randomly generating the value of the additional token parameter, and keeping a list of all generated tokens. If the server receives a comment post request which does not contain one of the generated tokens, then that comment is guaranteed to be automated spam.
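To illustrate the general idea (this is not the plugin's code, which I haven't reproduced here), a Python sketch of issuing random tokens and checking them on submission. The class name, the in-memory set, and the single-use behaviour are assumptions of mine.

```python
import secrets


class TokenStore:
    """Keeps every token handed out with a form; unknown tokens are rejected."""

    def __init__(self):
        self._issued = set()  # assumption: in-memory; a real site would persist this

    def issue(self) -> str:
        token = secrets.token_urlsafe(16)
        self._issued.add(token)
        return token

    def redeem(self, token: str) -> bool:
        """Accept a token only if we generated it; consume it so it can't be replayed."""
        if token in self._issued:
            self._issued.discard(token)
            return True
        return False
```

A bot that fetches the page and echoes back the token would still pass, so this only stops bots that post without loading the form (or without running whatever script inserts the token).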
I recently set up a WP site and forum for a product my brothers are trying to sell.
We're not allowing commenting on WP, but obviously have to allow people to post on the forum. The forum software offered a couple of (unofficial) anti-spam plugins, but they were not effective at all.
Decided to try reCAPTCHA, but found that to be equally ineffective (hadn't read about just how broken reCAPTCHA is until this incident).
So I spent 10 minutes writing a little script that checks for mouse movement and clears a pre-populated field. If the field isn't empty, bot it is.
Wasn't sure it'd work, but so far, so good. I know it's not ideal and will be a problem for people without js enabled, but the site and product are targeting a demographic in which that's likely to be a rare occurrence so the benefit > risk.
That seems clever but is actually a bad idea. I often browse the web with vimium and no mouse movement. I wouldn't be able to comment on your blog. There are better ways of using JS to prevent spam.
It could also just put up a page that says "Because no mouse movement has been detected, there's a possibility you might be a bot; to show you're not, please move your mouse around a bit and then click <submit> again."
Well, I'm not checking when people are trying to post but when they try to register--sorry if I wasn't clear.
The forum requires registration (and verification) before posting, so once they're registered there aren't any restrictions. And one of the benefits of this check is that there aren't any "human verifications" visible to the user. In fact, I could probably do away with the email validation too.
Exactly right, and there is a threshold set. Though it's not used when people try to post but rather when they try to register, I'd imagine it'd work similarly well on an "open" comment page. For a while at least.
I use browser plugins that allow me to avoid using the mouse (Vimperator for Firefox, for example). It's not unusual for me to run a search query and view several sites using only the keyboard. I'm replying to your post now without ever touching the mouse. I think your approach is clever and the advantages may outweigh the disadvantages, but it may need some refinement to avoid false positives.
Eh. It depends upon the audience of your web site. If it's a web site with a programmer audience, there is probably going to be a non-trivial portion of your users that are using plugins like Vimperator or Vimium.
Yes, I'm not sure if this approach would pass the Accessibility test. (Think text-to-speech browsers, customized control setups, and so on. Some people really cannot use a mouse.)
Similarly you could capture keydown events (specifically arrows and tabs) and pare down the false positives from people using other accessibility devices/browsers.
From my experience, running popular open source applications seems to pretty much guarantee spam.
For example, we built a website with a forum some years back and used phpBB. Within days massive amounts of explicit porn had been posted all over it and we had a client threatening to sue.
We tried everything we could to get rid of it: stopping images/hyperlinks from being posted, adding captchas and anti-spam plugins, and doing stuff like adding sneaky hidden form fields.
At one point we even deleted the signup form and required administrators to create accounts by hand on request for users, yet the bots still somehow managed to create their own accounts on the forum.
None of it worked for more than a month at a time.
In the end I just built a super simple PHP forum by hand in a few hours with very rudimentary anti-spam, since it was a small forum and we weren't using many phpBB features anyway.
Took over a year for the bots to come back and at that point switching the HTML around and changing the form field names seems to have kept them away thus far.
Another technique I find to be working really well is the "honeypot" technique. I create a CSS-hidden input field with a delicious, attractive name "url" and then validate it to be empty.
I use both the hidden honeypot and a random javascript injection that has to be matched server-side. Both have to pass.
The "problem" with this kind of trick is that it works for small/medium websites and only if it is not adopted as part of a big library that everyone uses.
These tricks are not that hard to beat if someone wants to spam you intentionally or if they are implemented by a well-known plugin (for WordPress, Joomla, etc.).
I have used this technique for a long time and it seems to beat all bots (a medium website, about 100k unique visitors per day). It's easy, unobtrusive and just works :)
I sense a lot of naivety in this post. For starters, using GET just means that once one spammer creates a rule for your site, they can spam it until you change the variable names in the query string. Using JS to submit a form should be fine, but I STILL encounter people without JS, and personally I think that shipping it without a non-JS fallback is just bad coding.
A simple honeypot with some CSRF tokens would reduce spam. If you want to beat spam altogether, then invest some time in a captcha, but expect it to come at the user's expense.
With 5 lines of PHP I was able to block 94.94% of the spam on a WordPress blog. I simply checked how long it took to read my article and to write and submit a comment. Less than 10 seconds = block with a friendly message.
Code and more details here: http://www.jimwestergren.com/a-new-approach-to-block-web-spa...
> I simply checked how long it took to read my article and to write and submit a comment. Less than 10 seconds = block with a friendly message
Some of the bots simulate mouse movements, some of them even inject letters/words into textarea elements as if someone is typing. It's not that hard to make it look like someone is correcting typos.
Server-side, encrypt a token which, in addition to identifying the unique form instance, contains a tick count, and set a hidden input's value to it. Now, ensure that each form instance cannot be submitted more than once AND that the delta between the current tick count and the form's tick count is greater than or equal to the amount of time that would be needed for a human to fill out the form.
You MUST ensure client-side error detection is superb (as you want to catch all errors prior to submitting), handle back-button usage properly (browser caching directives, HTTP status codes, etc), and ensure you handle browsers which may auto-fill information for the user.
You would be surprised just how many bots come in and either use a cached form or immediately submit it. Assuming they are smart enough to bypass both of these, you have still dramatically reduced the number of times they could potentially spam you.
The tick count figure needs to be done on a form by form basis, as each one likely has a different minimum.
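A hedged Python sketch of the scheme described above. One deliberate swap: instead of encrypting the token I sign it with an HMAC, which keeps the example stdlib-only and still stops tampering (the timestamp is visible but not forgeable). The key, the in-memory set of used form IDs, and the 10-second/1-hour limits are all assumptions for illustration.

```python
import hashlib
import hmac
import secrets
import time

SECRET_KEY = secrets.token_bytes(32)  # assumption: a per-deployment secret
_used_form_ids = set()                # assumption: in-memory; persist this in practice


def _sign(payload: str) -> str:
    return hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()


def issue_form_token(now=None) -> str:
    """Value for the hidden input: form id, issue time, and a signature over both."""
    now = time.time() if now is None else now
    payload = f"{secrets.token_hex(8)}:{int(now)}"
    return f"{payload}:{_sign(payload)}"


def check_form_token(token: str, min_seconds=10, max_seconds=3600, now=None) -> bool:
    """Reject forged, too-fast, stale, or already-submitted form instances."""
    now = time.time() if now is None else now
    try:
        form_id, issued_at, sig = token.split(":")
        issued = int(issued_at)
    except ValueError:
        return False
    expected = _sign(f"{form_id}:{issued_at}")
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    elapsed = now - issued
    if elapsed < min_seconds or elapsed > max_seconds:
        return False
    if form_id in _used_form_ids:
        return False
    _used_form_ids.add(form_id)
    return True
```

A too-fast attempt does not burn the form ID here, so a human who was merely quick can resubmit a moment later; whether that is desirable is a design choice. `min_seconds` would be set per form, since each form has a different minimum.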
I added something similar to our framework where we do the encryption server side when a form is generated.
In our token we encrypt the form generation time and the captcha question and answer variables. This allows us to easily render a textual or graphical captcha on the form and pass the answer encrypted. The form processing simply decrypts the data and decides two things: first, whether the form was submitted too fast or is stale, based on the difference between the form generation time and the submit time; second, whether the captcha answer matches the one passed in the encrypted token.
This is a cat and mouse game. If enough websites out there use this technique then there will be bots that can circumvent this, although some non-trivial amount of work is needed to parse the javascript.
Just to respond to some of the comments I've seen. Basically, yes it's true that my sites aren't very high profile, and if someone were to target them directly it would be trivial to bypass this system. The point was more that the current, well used bots that send spam randomly, do not work against them.
Interestingly enough one of you, someone who saw the story here, decided to actually write one such bot and start spamming my blog post, but again they were pretty stupid and it was trivial to block. Still, pretty sad that someone would go to this length and actually try and send hundreds of spam posts just for the kick of it.
Also a lot of people mentioned captcha, and yes I guess I should have mentioned that, but the reason I never used one is because I didn't get any spam in the first place.
The idea is good against mass-market, popular spam bots. But, well, I've been there and I know that spam bots evolve as soon as there is someone who can tune the bot a little. Changing a bot from looking for a submit button to just submitting the form is quite easy. Also, technology moves forward, and today it's no big problem to write a bot that understands JavaScript.
And since I've brought this up: the best solution I've found for fighting spam bots is to create a hidden (or visible, what the hell) field with an initial value that's later changed by JavaScript. Checking for the value you'd expect a bot to leave in it works like a charm.
But as long as this solution works for you, it's worth mentioning and remembering.
It's important to note that it's extremely easy to "beat" comment spam if you have a relatively low-traffic site and some programming time to spend on a custom solution.
The per-message payoff for spam is horrendously low. Spammers only do it because they can post a huge number of messages. The big threats are necessarily automated, and that automation isn't going to bother with special cases for any site that isn't worth their while.
For the longest time, the anti-spam measure on my blog's comments was a field that literally said:
Type the word "elbow": _____
And it only accepted the comment if you typed the word "elbow". It wasn't even a dynamic word. It was literally hardcoded to be the word "elbow". This stopped almost all spam for years.
Somebody finally added this to their bot, so I modified it slightly, to:
Type the word "humour", but with American spelling: _____
Once again, this stopped almost all spam for years.
A few months ago, more for fun and curiosity than because I really needed it, I replaced that anti-spam field with a JavaScript hashcash-based solution. Basically, when the user wants to make a comment, the page fetches a problem from the server whose solution is difficult to compute but easy to verify. The page then computes the solution on the commenter's computer, and posts it along with the comment. I tuned it to take about 20-30 seconds on modern hardware/browsers.
For the curious, the problem I chose is a standard one you'll find if you search for "hashcash". The quick version is that the server generates some random data and gives it to the client. The client then searches for a salt that, when added to the data, produces a SHA-1 hash with a given number of leading zero bits. The number of leading zero bits required can be easily tuned, with each additional bit roughly doubling the amount of time it takes to find a solution. The client's solution can easily and quickly be verified by just combining the client's solution with the generated data and counting the number of leading zeroes in the SHA-1 hash.
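For readers who'd rather see it than search for it, here is a small Python sketch of the same kind of proof of work. It is not my actual JavaScript; the function names and the "challenge:salt" string format are made up for this example.

```python
import hashlib
import itertools
import secrets


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the front of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def make_challenge() -> str:
    """Server side: random data handed to the client."""
    return secrets.token_hex(16)


def verify(challenge: str, salt: int, difficulty: int) -> bool:
    """Server side: one SHA-1 and a bit count, so checking is cheap."""
    digest = hashlib.sha1(f"{challenge}:{salt}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force salts until the hash has enough leading zero bits."""
    for salt in itertools.count():
        if verify(challenge, salt, difficulty):
            return salt
```

Each extra bit of `difficulty` roughly doubles the expected number of salts `solve` has to try, while `verify` stays a single hash.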
Now, this would not stand up to a concerted effort. My JavaScript implementation is pretty slow, which means that the 30-second work required by my page could be reduced to <1s of CPU time for a program optimized to break my protection. But it doesn't matter, because it's not worth anybody's time to do this.
I occasionally get spam, still. From looking at the logs, I'm about 99.9% sure that these spam comments are being posted by actual human beings sitting at a browser. I have no idea how it could possibly be cost effective to do this, but the quantity is low enough that it's not a real problem.
My crazy hashcash solution has an additional benefit, which some might see as a liability. I only start the work when the user clicks on the comment form, in order not to burn up their battery unnecessarily if they don't plan to leave a comment. The user then has to wait until the proof of work is completed, typically 20-30 seconds, before they can post a comment. This strongly discourages short, off-the-cuff comments, which are almost invariably worthless anyway.
In short: spam prevention is easy if your site is small and you have the time to invest in a custom solution. Any custom solution will do. As long as it doesn't match whatever patterns spambots possess, it doesn't much matter what you do, as long as it's unusual.
Once your site gets big enough, you'll no doubt need more. But cutesy stuff like changing your form variable names won't save you then anyway. If you're at the level where the linked solution works, you're at a level where nearly anything custom-made will work.
I use a dummy field on one site - called something like "Last Name" - the contents of which are hidden and must not be changed. The field contents are clear they must not be changed - "Do not alter this field!" - so that it still works for a wanted user if CSS has been tampered with.
No spam yet. But it's quite a small site; this is probably over only about 6 million hits.
For all I know it's just because it's a hand-coded site. Trying this on a WP site is on my todo list.
I used this solution on a network of WP blogs with moderate traffic (maybe somewhere around 100 to 500k+ visits per month at best) but after a while some spammers took the time to script their way into the comments.
Regardless of spam protection, I like the idea of a 'deep breath and count to ten' being forced on a commenter before they can submit and I'd love to know what an impact that might have on comment quality somewhere like youtube.
I didn't think of that when I first wrote the thing. Only after I activated it did I have a reader point out that it would cause people with short comments to have to wait to reply, talking about it as a bad thing. My immediate reaction was, this is great!
Like many other bloggers, I've been a victim of blog comment spam for quite a while. On a few occasions, I've totally disabled comments on my blog.
However, isn't that something of the past?
I've totally outsourced my blog comments to Disqus (there are other alternatives) and I'd like to say, I'm very happy with my decision. Some manual spam still leaks through, but the amount is so minuscule that I don't really fret over it any more.
>However, isn't that something of the past?
Not even remotely. Adobe Business Catalyst users have been getting hammered with comment spam for months now, it shows up in waves on LiveJournal, and I catch it regularly in my Akismet queue in WordPress. I see it everywhere, still. If there's a form, something will try to post a link in it.
>I've totally outsourced my blog comments to Disqus
That's all well and good until someone writes a bot designed to target Disqus users because of the size of its userbase.
May we just let comments = referrer links? Comment on your own blog, twitter feed, etc. and traffic from those sources list automatically under the content.
Fighting these kinds of problems makes for interesting mental challenges, but a technical solution isn't necessarily the best one. Shouldn't the price of having space on my site to comment be that you do so from some kind of online identity of your own?
I just thought of this method: randomize the input names on each form load, and include the key to the hash in a hidden field. This way the bot would have to be smart enough to go off field order instead of name (you could even randomize field order using some clever CSS). Or are they already smart enough to deal with that?
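One way this could look, sketched in Python. Everything here is hypothetical: the field list, the "form_key" hidden field name, and the choice of an HMAC to derive the per-form names are my assumptions about how to fill in the details.

```python
import hashlib
import hmac
import secrets

SERVER_SECRET = secrets.token_bytes(32)         # assumption: kept private on the server
LOGICAL_FIELDS = ("name", "email", "comment")   # hypothetical comment form fields


def field_name(form_key: str, logical: str) -> str:
    """Derive the randomized input name for one logical field of one form load."""
    mac = hmac.new(SERVER_SECRET, f"{form_key}:{logical}".encode(), hashlib.sha256)
    return "f_" + mac.hexdigest()[:16]


def new_form():
    """Return the key for the hidden field plus the input names to render."""
    form_key = secrets.token_hex(8)
    return form_key, {logical: field_name(form_key, logical) for logical in LOGICAL_FIELDS}


def decode_submission(posted: dict):
    """Map randomized names back to logical ones; None if any expected name is absent."""
    form_key = posted.get("form_key", "")
    decoded = {}
    for logical in LOGICAL_FIELDS:
        name = field_name(form_key, logical)
        if name not in posted:
            return None
        decoded[logical] = posted[name]
    return decoded
```

A bot that posts to fixed names like "comment" gets `None` back. A bot that re-fetches the form and fills inputs by order or label would still get through, which is exactly the open question.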
Wait until you get the spam bots targeting your payment forms to validate stolen credit cards... a whole different set of challenges. We had 800 payments in one day from this type of attack.
Yup. You just have to make your system a bit unlike everyone else's. I just hid my normal comment field with css, and made a new, visible one with a different name. Any comment that came in with the old parameter name was chucked. Done and done.
I don't think JavaScript tricks work very well against motivated spammers. It is trivial to use a headless WebKit client to execute JavaScript and Ajax requests.
I'm reading this thread whilst running a full-stack test suite against my app - using a headless WebKit client. I expect spammers will do the same if and when the JavaScript-unaware methods stop yielding an acceptable return, but given their low costs that threshold may be a long way off.
I use something similar in my own site: a field in which the commenter is asked to fill a specific value. If they're running JavaScript, I fill it in for them and hide the element. So far, it works perfectly.
As other commenters have pointed out, however, this kind of defence only works against generic attacks, and defending against a targeted spam attack will always be difficult. But for the generic case, there will continue to be simple things you can do to thwart naive attacks. One that springs to mind is to introduce a scripted timing element. A spam bot won't wait a minute before submitting, but a user should at least have read the post they're commenting on.
Progressive enhancement for bot detection... I like your idea. This is much, much better than simply stopping anyone without JS enabled from using the form.
If by motivated you mean "want to spam your site specifically at any cost", you're right.
But running javascript multiplies their processing costs substantially and it also means that at that point their costs can be driven up far higher simply by making the computation required to post higher - it doesn't take much - say a few hundred milliseconds of hash calculation on posting - to suddenly tie up a lot of resources for someone trying to spam as many people as they can for as few resources as possible.
For any spammer that has softer targets it makes little to no sense to bother.
That's for bots customized to run on your site. There are generic spider-like bots that look for forms they can submit into, without knowing anything about the site's architecture.
I think that at least 90% of the bots are not made to work on specific sites. Unless your site has millions of visitors, I do not think someone will spend time making a bot just for you.
This is amazingly simple and brilliant. Thanks for sharing.
It is going to be too difficult for the bots to learn to get around that for quite some time, so now is the time to enjoy this defense mechanism. Once it becomes the ordinary thing, the bots will evolve too, for sure, but we're not quite there yet.
I suppose it's because the bot is not a browser but a service that gets the page, parses the input fields and possible tokens, and sends a request to the page that the form would post to.
Unfortunately (at least in the UK) this technique cannot be used on consumer facing sites as it breaks the accessibility of the form for some disabled users.
For personal sites it really comes down to your preferences. Personally I would prefer that everyone was able to comment, however if it stops you having to wade through thousands of spam messages every day I can see the point of using it.
Why would this be an accessibility problem? I don't see why screen readers would have a problem dealing with it - for them the form in the users browser will appear just the same as it otherwise would.
1.) Screenreaders have different modes of operation for different aspects of web content. For dealing with forms they have Forms mode, in which only form elements are announced. A link isn't a form element, so they won't see the submit button.
2.) Screenreader users have a shortcut key to submit the form - typically when under-qualified web developers create forms without submit buttons. This fires the form submit event, which without a JavaScript preventDefault will get the form contents sent to the URL mentioned in the action attribute on the form. So the screen reader user's comment is treated as spam.
In this case though, from an accessibility point of view there are a few issues with the use of a link tag rather than the standard form 'input submit' or 'button'.
Problems include:
- It goes against user expectations of how the form functions
- User would not be able to submit the form while focus is on one of the inputs (although this could be remedied with more js)
- User would have to realise that this form does not have a standard submit button and realise that the link tag is the submit button (difficult for screen readers because there are no alt tags).
There is also an issue with usability for the few who don't have js enabled, as they will not be able to submit this form.