"Most of the time they are oblivious to it. Some of the time they feign ignorance. The ones who are oblivious to it after a bit more questioning appear to have hired 'SEO Experts' to help improve their website rankings. These 'experts' then start up their various pieces of spam software and sit back often charging the site owners a lot of money for that service."
Back when I was too lazy to install Akismet and moderated comments on my blog by hand, I used to see annoying SEO comments from time to time - a fake profile with a genuine-sounding comment, backlinking to a website catalog. When I was especially bored, I would visit the catalog to find the SEO company that was spamming me.
On one occasion I contacted a private photo studio, asking if they were using SEO company XYZ, as links to them were all over the catalog. It turned out that they had simply hired an SEO company to 'position their website', without knowing anything about how it's done. I explained the situation (sending screenshots), and the studio immediately ditched that SEO company and hired another one. I was impressed by this attitude, featured a special advert-like post on my blog for them, and we've been in good contact ever since.
Good story! Being proactive like this is a great idea, and I think I will try the same when we get some more. A lot of it comes from this sort of thing: people who don't really understand what they are paying for, which is what I tried to highlight.
Yes, you're right. I currently have no better idea for dealing with those small, spammy SEO businesses than raising awareness of the subject and hitting their customers directly. It's sad, because the customers here are usually not to blame - like you said, they neither know nor understand what's going on. I was quite (pleasantly) surprised by the immediate reaction of that particular person, who 'out of respect for my space on the Internet' decided to take action that cost her money. I deeply respect such people. The "vote with your dollars" mindset should be encouraged, IMO.
A few ways I've handled the spam situation in the past:
1. Base64-encode your form field names and decode them server-side prior to processing, or...
2. Create one-time use field names using md5 hashes of random numbers, map them to their true fields and store them in a session. Then process against those on the server side post-submit and clear the slate. (I used this method more often than not.)
3. Control the visibility of the honeypot field with CSS rather than "type=hidden".
Using #1 or #2 in addition to #3, I've never had to use CAPTCHAs or human tests. A few paid spammers have come around from time to time, but since automated software isn't sophisticated enough to pick apart which field is which, it either doesn't even try, or it throws whatever it can at the fields and gets locked up in server-side validation. A sketch of method #2 is below.
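For the curious, here is roughly how method #2 could look in Python with Flask. This is just a sketch, not the code I actually used; the routes and the name/email/comment fields are made up for illustration.

    import secrets
    from hashlib import md5
    from flask import Flask, request, session

    app = Flask(__name__)
    app.secret_key = "change-me"  # sessions need a secret key

    REAL_FIELDS = ["name", "email", "comment"]

    @app.route("/form")
    def show_form():
        # Map each real field to a one-time md5 alias and remember the mapping.
        aliases = {md5(secrets.token_bytes(16)).hexdigest(): f for f in REAL_FIELDS}
        session["field_map"] = aliases
        inputs = "".join(f'<input name="{alias}" placeholder="{real}"><br>'
                         for alias, real in aliases.items())
        return f'<form method="post" action="/submit">{inputs}<button>Send</button></form>'

    @app.route("/submit", methods=["POST"])
    def submit():
        aliases = session.pop("field_map", None)  # clear the slate: one-time use
        if aliases is None:
            return "Stale or missing form", 400
        # Translate the submitted alias names back to their true fields.
        data = {aliases[k]: v for k, v in request.form.items() if k in aliases}
        if set(data) != set(REAL_FIELDS):
            return "Unexpected fields", 400  # bots guessing names end up here
        return f"Received: {data}"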
Good points. In regards to CSS, I think it's important to specify exactly what you mean, though. I've seen this implemented dangerously, where they simply position the form control off-screen (left: -4000px). This is vulnerable to browsers autofilling!
However, hiding it completely with display:none should be safe, and is what I think you mean.
In our case, just hiding it works fine. We can move up to a CSS solution if we ever need to, though. The main point is that if you're doing something, you're ahead of the herd enough for spammers to generally leave you alone.
On #3, you want to ensure that the CSS is externalized and the class name non-obvious.
The amount of additional processing power it'd take to surmount this (and the other methods) from a spammer's end at scale would be incredibly prohibitive, and that's kind of the point.
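To make that concrete, here's a hedged sketch of #3 with an externalized stylesheet. The class name ("fx-21"), field names, and framework are invented examples, not a prescription:

    from flask import Flask, request

    app = Flask(__name__)

    # The .fx-21 { display: none; } rule lives in an external stylesheet,
    # so the markup itself gives nothing away about which field is the trap.
    FORM = """
    <link rel="stylesheet" href="/static/site.css">
    <form method="post" action="/contact">
      <input name="message">
      <span class="fx-21"><input name="contact_extra"></span>
      <button>Send</button>
    </form>
    """

    @app.route("/contact", methods=["GET", "POST"])
    def contact():
        if request.method == "GET":
            return FORM
        # Humans never see the hidden field, so any value in it marks a bot.
        if request.form.get("contact_extra"):
            return "", 204  # silently drop; don't tell the bot it failed
        return "Thanks for your message!"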
On the forum I work on (custom software, forked from JForum) we used to get tens of thousands of spam posts a day. That stopped almost immediately with the requirement of a verified email address.
We had tried CAPTCHAs and honeypots, but the spammers broke through them (we were being actively targeted). Once we used email verification, it forced the email providers (Gmail, Yahoo) to implement better verification on their side and stop so many fake accounts from being created.
Spam is now a rounding error in my system (about 60 for every 56,000 daily posts).
I should also add that I subscribe to the broken window theory, so we also implemented Akismet to check all incoming posts for spamminess. We hide all posts that Akismet marks as spam, and it cleans up the place enough to (hopefully) ruin any spammer's SEO tactics. Once it becomes a futile effort to post spam that's just going to get hidden, they seem to stop aggressively targeting us.
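If it helps anyone, checking a post against Akismet boils down to one HTTP call to its comment-check endpoint. A minimal sketch in Python; the API key, blog URL, and the hide-rather-than-delete policy are placeholders for your own setup:

    import requests  # third-party HTTP library

    AKISMET_KEY = "your-api-key"
    BLOG_URL = "https://example.com"

    def is_spam(user_ip: str, user_agent: str, content: str) -> bool:
        resp = requests.post(
            f"https://{AKISMET_KEY}.rest.akismet.com/1.1/comment-check",
            data={
                "blog": BLOG_URL,
                "user_ip": user_ip,
                "user_agent": user_agent,
                "comment_type": "forum-post",
                "comment_content": content,
            },
            timeout=5,
        )
        # Akismet answers with the literal strings "true" (spam) or "false".
        return resp.text.strip() == "true"

    # Posts flagged here get hidden rather than deleted, which is what makes
    # spamming feel futile to the spammer.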
Interesting, thanks! We don't want users to have to verify email addresses, as we again see this as a barrier to entry that will put people off. However, I do see the necessity of it in your case, as this doesn't always scale very well.
We were worried about implementing this feature too, citing the same issue. What we did was explain to the users why we were making the change, bless some of the existing users known to be in good standing, and then provide a tiered view of the forum for accounts with an unverified email address. The people in that group could post only text. However, once their email address was verified, they were automatically put into another group and could post like normal users again.
So while it's a barrier to entry, we attempted to minimize it as much as possible.
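A rough sketch of what that tiering could look like in Python; the user dict shape and the strip_markup helper are assumptions for illustration, not our actual code:

    import re

    def strip_markup(text: str) -> str:
        """Reduce a post to plain text: no HTML, no clickable links."""
        text = re.sub(r"<[^>]+>", "", text)               # drop HTML tags
        return re.sub(r"https?://\S+", "[link removed]", text)

    def prepare_post(user: dict, body: str) -> str:
        if user.get("email_verified"):
            return body               # verified group: post like normal users
        return strip_markup(body)     # unverified tier: text only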
Important caveat for using honeypot hidden text fields: use a field name that is gibberish.
If you name the honeypot field something like "address" or "website", you get browser toolbars that will "helpfully" try to pre-fill the field for the user even though it's hidden. And then you're flagging legit users as spammers. I think an ideal system would simply require users who fail the honeypot field to submit a captcha rather than lock them out altogether.
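Something like the following sketch, say; the gibberish field name is the point, and the captcha step is only shown to submitters who trip the honeypot (all names here are hypothetical):

    from flask import Flask, request, session

    app = Flask(__name__)
    app.secret_key = "change-me"

    HONEYPOT = "qzx_k7f2"  # gibberish: no toolbar will try to prefill this

    @app.route("/register", methods=["POST"])
    def register():
        if request.form.get(HONEYPOT):
            # Could be a bot, could be an over-eager autofill: raise the bar
            # with a captcha instead of locking the user out entirely.
            session["needs_captcha"] = True
            return "Please complete the captcha to continue", 403
        return "Registered!"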
I've done some tests, and no browser appears to autofill an HTML hidden field. That's why we picked that method as opposed to CSS, as I'm not sure of the consequences there.
display:none on an input field would be OK, but there might be cases on some sites where the field loads before the CSS file. This could cause autofilling to happen. It's a lot harder to test this sort of thing with CSS rules; it's a lot easier just to use hidden fields.
I'd recommend using names such as "username" as honeypots, as the likelihood of their being filled is high.
I managed to completely eliminate bot spam on a fairly popular site I administer through a combination of a honeypot form field and a simple human-testing question. This worked flawlessly for years, but a recent problem is spam accounts that appear to be filled in by actual humans rather than bots.
We just don't like doing that, no matter how simple it is. A lot of our users are from all over the world (probably 50% from non-English-speaking countries), and it really makes it a lot harder for them to sign up. Also, anything that acts as a barrier, no matter how weak, WILL lose customers and signups!
So use math, the universal language. I had a site request that I try to stop spam on an evaluation/registration page. I added a randomized single-digit addition problem to the form and the spam ceased. A single input field where the answer is always 0 to 18 was all it took.
Of course, if someone took the time to target this specific site, that would be easily thwarted.
It still puzzles me just how this custom-made form got infiltrated by spam to begin with. Do people go around picking such forms and submitting form-specific bot code to some vast pool sold to bot operators? Or are bots far more intelligent than I give them credit for?
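For reference, the whole check fits in a few lines. This is a sketch in Python/Flask rather than the site's actual implementation, with made-up routes:

    import random
    from flask import Flask, request, session

    app = Flask(__name__)
    app.secret_key = "change-me"

    @app.route("/signup", methods=["GET", "POST"])
    def signup():
        if request.method == "GET":
            a, b = random.randint(0, 9), random.randint(0, 9)
            session["answer"] = a + b  # always between 0 and 18
            return (f'<form method="post">What is {a} + {b}? '
                    f'<input name="answer"><button>Sign up</button></form>')
        if request.form.get("answer", "").strip() != str(session.get("answer")):
            return "Wrong answer, try again", 400
        return "Signed up!"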
The complex captcha shown has been taken completely out of context; it's at http://random.irb.hr/signup.php and it's completely appropriate and funny.
If your site ranks highly enough for certain valuable keywords and the bad guys start specifically targeting you, a lot of these countermeasures are useless.
I get comment spam on some sites that I'm 99% sure is being posted by a human using a regular browser at an internet cafe in China.
If you're being targeted by humans, you're right, there's nothing you can really do. If they are using automated systems to target you, the best thing to do is start IP banning.
Ever since my forum got popular, I've had non-stop trouble with human spammers. Straight up IP banning doesn't work. Too explicit and obvious.
A better solution is to let them register, but check all registration/post IP addresses against the blacklist. If they match, unobtrusively move them into a usergroup that just seems like your site isn't working that well.
Reddit uses a similar system called shadow banning. When a user is shadow banned, they think their content and submissions make it through, but they are actually hidden from other users. On rare occasions, real users can accidentally get shadow banned; this is not ideal, but shadow banning is effective enough that the sacrifice is made.
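The core of either approach (the blacklist usergroup above, or a Reddit-style shadow ban) is a filter at render time. A toy sketch in Python, with the data shapes assumed for illustration:

    def visible_posts(posts: list[dict], viewer: str, banned: set[str]) -> list[dict]:
        # Banned authors still see their own posts, so the ban stays invisible.
        return [p for p in posts
                if p["author"] not in banned or p["author"] == viewer]

    posts = [{"author": "alice", "body": "hi"},
             {"author": "mallory", "body": "buy pills"}]
    print(visible_posts(posts, viewer="alice", banned={"mallory"}))    # alice sees 1 post
    print(visible_posts(posts, viewer="mallory", banned={"mallory"}))  # mallory sees both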
I've been looking at a lot of spam lately as I'm developing services based on text classification as a web service at classifyr.com, and spam is kind of the obvious test. Initially, the site was getting 8200 spam attempts a day with a captcha that was 99.9% effective. The classifier is 99.99% effective and has cut down the number of spam attempts to almost nothing.
The ones that still get through on rare occasions copy legitimate content and usually link to sites that aren't inherently spammy (like grey-market pharmacies) for SEO purposes. I suspect these are actually posted by humans; a post can be classified as suspect rather than spam or ham, and these always are when they get through. When that happens, the user is asked a trivia question related to the site's subject matter. I think it's extremely unlikely a spam bot would be able to answer correctly.
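The triage logic amounts to a three-way branch. A sketch with a stubbed-out classifier; the real model and the trivia mechanism live elsewhere:

    def classify(text: str) -> str:
        """Stand-in for the real classifier; returns 'ham', 'spam', or 'suspect'."""
        return "suspect"  # stubbed for the example

    def handle_post(text: str, ask_trivia) -> str:
        verdict = classify(text)
        if verdict == "ham":
            return "published"
        if verdict == "spam":
            return "hidden"
        # Suspect posts: a bot is very unlikely to answer a topical trivia question.
        return "published" if ask_trivia() else "hidden"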
As outlined in another comment, marketing/SEO is often the reason. That is, bots will spam links. Either they want direct clicks, or at least some SEO bonus.
Working out whether the links are nofollowed is probably comparable effort to just posting the spam link! Therefore, just posting the spam link is probably the best strategy for them.
As rplnt says, it's often for SEO reasons. This is misguided advice at best and is rampant in the SEO world. It's a bad practice that a lot of people are misinformed about and are willing to engage in at the cost of honest webmasters.
That's a very fundamental conflict when implementing these types of anti-spam measures. Unfortunately, anything that is screen-reader friendly is often spam-bot friendly (or, anything that is unfriendly to spam bots will also be unfriendly to screen readers), as both are programs that attempt to parse a web page.
Maybe include a link near the start of the form that says something like, "Screen-readers, follow this link."? The link would be to a one-time use page that is sane and accessible. I don't know...