Machine Learning and Link Spam: My Brush With Insanity (seomoz.org)
116 points by a5seo on April 25, 2013 | 38 comments



Oh my yes! Machine learning is some of the hardest programming there is. You only ever get an indirect measure of whether it is working correctly or not. It's hard to debug. My algorithm gets it right 80% of the time; have I made an implementation mistake? Who knows?

My general strategy is to invest in training set curation and evaluation. I also use quick scatter plots to check that I can separate the training sets into classes easily. If it's not easy to do by eye, then the machine is not magic and probably can't do it either. If I can't, then it's time to rethink the representation.

The author correctly underlines the importance of the training set, but it's equally critical to have the right representation (the features). If you project your data into the right space then pretty much any ML algorithm will be able to learn on it. I.e. it's more about what data you put in than about the processor. k-means and decision trees FTW.

EDIT: Oh, and maybe very relevant is the importance of data normalization. Naive Bayes classifiers require features to be conditionally independent, so you have to PCA or ICA your data first (or both); otherwise features get "counted" twice, e.g. every wedding-related word counting equally toward spam categorization. PCA identifies which variables are highly correlated and projects them onto a common measure of "weddingness". Very easy with sklearn: scale the data with its preprocessing module and turn whitening on in PCA.
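
A quick sketch of what that decorrelation step looks like with sklearn (the random matrix is just a stand-in for a real document-feature matrix, and the component count is arbitrary):

    # Decorrelate correlated count features with PCA whitening before
    # handing them to a classifier. The random data is a placeholder.
    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(200, 50)              # 200 documents, 50 correlated features
    pca = PCA(n_components=10, whiten=True)  # whiten=True rescales components to unit variance
    X_decorrelated = pca.fit_transform(X)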


Agreed about features and Bayesian filters. Words are just not very good features for filtering spam - but he could easily feed all the numeric data to the Bayesian filter by dividing the data ranges into compartments (like very-many-links-per-word, or over-100-links-per-word).
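
Roughly this kind of binning (thresholds and token names here are made up, not tuned values):

    # Turn a numeric links-per-word ratio into a discrete token that a
    # word-based Bayesian filter can consume alongside normal words.
    def links_per_word_token(ratio):
        if ratio > 100:
            return "links-per-word:over-100"
        if ratio > 10:
            return "links-per-word:very-many"
        if ratio > 1:
            return "links-per-word:many"
        return "links-per-word:few"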


I've wanted to build this for a while, I think an SVM-based spam solution could be amazing. Obviously, like the article mentions, when trying to categorize spam, a purely Bayesian approach is not great -- and neither is an ANN (although, with a large enough pool of hidden layers, it can get pretty decent). I think that the issue lies in the problem set. Spam cannot be treated like a linearly-separable model.

There are papers[1][2] that outline possible benefits of SVM-based spam filtering. Unfortunately, SVMs are still in their infancy and not many people know how to implement and use them. I do think they are the future, however.

[1] http://trec.nist.gov/pubs/trec15/papers/hit.spam.final.final...

[2] http://classes.soe.ucsc.edu/cmps290c/Spring12/lect/14/007886...


Assuming you mean Support Vector Machines with "SVM", you may be idealizing them a bit.

SVMs have been around for almost two decades now, which is an eternity in the ML world, rather than infancy.

SVMs don't require the problem set to be linearly separable.

Please note that there's a myriad of robust, scalable SVM implementations -- SVMlight, HeroSVM, LIBSVM, liblinear... (the latter two also have wrappers in scikit-learn, a Python library mentioned in the OP).
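
For instance, through the scikit-learn wrapper a non-linear SVM is a few lines; here is a toy sketch on an XOR-style layout (the data and hyperparameters are made up for illustration):

    # LIBSVM via scikit-learn with an RBF kernel, which handles data
    # that is not linearly separable.
    from sklearn.svm import SVC

    X = [[0, 0], [0, 1], [1, 0], [1, 1]]   # XOR-style layout
    y = [0, 1, 1, 0]

    clf = SVC(kernel="rbf", gamma=2.0, C=10.0).fit(X, y)
    print(clf.predict([[0, 1]]))           # correctly labels the XOR-style points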


I would only add that unless you're writing a PhD on SVMs, don't write your own implementation. As Radim wrote, there are quite a few to choose from.


Unfortunately the vast majority of scalable SVM implementations are only for non-commercial use.


That's exactly my point. Most real-world problems (see spam filtering) are NOT linearly separable. Using highly-granular kernels would yield better results when compared to ANNs and Bayesian methods, etc.

Like I said, I haven't worked (much) with SVMs but they really do seem like the future. Unfortunately, they are difficult to work with.

And finally, the first workable SVM algorithm was proposed in 1995 (not nearly the two decades you claim) and implemented several years later. SVMs are very much still in their infancy -- especially considering that not much work is being done on SVMs since ANNs are much (much) easier to work with.


If I were to build this, I would try deep belief networks. Deep learning can learn the features from a lot of text in an unsupervised manner and then correlate those features to a category with a relatively small training set, which allows it to generalize much better than traditional ANN techniques.

Geoff Hinton claims that DBNs outperform traditional (shallow) ANNs, random forests and SVMs for any of the tasks he has tried them on (including classifying textual documents), though I have not benchmarked how well they work myself (not yet anyway!).


What would you use as input for the deep network?

I have only worked with text classification methods where I chose the features myself. As I understand it, a deep network still has (like a 'traditional'/non-deep ANN) a fixed number of inputs in its input layer, i.e. one would have to process each input text somehow before feeding it into the network (to make the input sizes equal). Is there a usual way to do this without doing feature-extraction?


If using raw bag of words or n-grams, why not hash the strings to maybe 2^13-1 slots, with something like MurmurHash3 or with multiple hashes to prevent collisions, then use that sparse vector as input to a deep learning model?
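
Something like this, roughly (the slot count is the one suggested above; murmurhash3_32 ships with scikit-learn):

    # The hashing trick: map raw tokens into a fixed number of slots so
    # the model's input layer can have a fixed size.
    import numpy as np
    from sklearn.utils import murmurhash3_32

    N_SLOTS = 2 ** 13 - 1

    def hashed_bag_of_words(tokens):
        vec = np.zeros(N_SLOTS)
        for tok in tokens:
            vec[murmurhash3_32(tok, positive=True) % N_SLOTS] += 1
        return vec

    x = hashed_bag_of_words("buy cheap wedding links now".split())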


Thanks, that makes sense.

So the parameter would be the number of slots (== number of input units of the deep NN).

And the transformation of the text into bag-of-words / n-grams would not be considered feature-engineering - or at least only 'low level feature engineering' - the higher level features will be learned by the deep network.

I guess one could go lower level still and even do away with bag-of-words / n-grams: limit the text size to e.g. 20000 characters, represent each character with a numerical value (e.g. its ASCII code point when dealing with mainly English texts) and then simply feed this vector of code points to the input layer of the deep network. Given enough input data, it should learn location-invariant representations like bag-of-words / n-grams (or even better ones) itself, right?
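
In code, that lowest-level representation would be something like (20000 is the length suggested above; padding with zeros is just one choice):

    # Trim/pad to a fixed length and feed raw code points directly to
    # the network's fixed-size input layer.
    import numpy as np

    MAX_CHARS = 20000

    def char_vector(text):
        codes = [ord(c) for c in text[:MAX_CHARS]]
        codes += [0] * (MAX_CHARS - len(codes))   # zero-pad short texts
        return np.array(codes, dtype=np.float32)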


As Radim noted, there might be better approaches than using SVMs. One example may be Restricted Boltzmann Machines (see Hinton's Google tech talk [0]). Some folks have tried using them to detect spam (or at least using RBMs as a part of another architecture), achieving better results than SVMs (they actually did a proper comparison). [1] [2] Might be something worth looking at ;) At any rate, RBMs are rather fascinating; I also plan to try experimenting with them when I have time.

[0] https://www.youtube.com/watch?v=AyzOUbkUf3M

[1] For a general overview of the study, here are its slides (pptx): http://users.cs.uoi.gr/~gtzortzi/docs/publications/Deep%20Be...

[2] The same study in full (pdf): http://users.cs.uoi.gr/~gtzortzi/docs/publications/Deep%20Be...


I'm interested to know why ANNs wouldn't be a good fit for this? ANNs aren't my area but I have a basic grasp, and I thought they could be an interesting approach to try here?


ANNs excel in systems that are linearly separable, but have a much harder time with data that is not. Google the XOR problem, for example.


The problem now is that (imho) 99% of the links posted on the internet are spam.

Unless you have a baseline of "what was here first" and "exactly when every website went live with what links" like Google does (they have been indexing websites since the dawn of time as far as the internet and linking are concerned - heck, there wasn't even backlink spamming prior to Google, because Google was the first search engine to rank by number of backlinks!), you're going to have a really tough time determining what is spam and what isn't.

Which 1% do you decide to focus in on?


Just because a link is added to content after the content already exists doesn't immediately qualify it as spam. 99% of links being spam is a pretty massive assertion; is that anecdotal or backed by any actual data?


"doesn't immediately qualify it as spam", no. That's correct.

It does however make it very difficult to gauge who is a legitimate linker and who is not.

All I'm really saying is that if you're starting now, with all of the years of random linkspam backscatter, you are in for a rough ride.


Even if you had a perfect history of when links appeared, you're still in for a rough ride. Furthermore, the absence of that information doesn't invalidate the author's approach (but having it might improve its effectiveness).

As an aside, you can use Ahrefs.com to get pretty decent tracking of when links appeared since it started (I think ~18 months ago or so). Given that the rate of spammy pages is increasing extremely fast and old spam pages are dying off, I imagine that in the not too distant future you'll be able to get decent link history for many sites.


> 99% of the links posted on the internet are spam.

Oops, I just clicked Reply to your comment.


I've built an anti-spam system for Delicious.com using a Naive Bayes classifier with a really huge feature database - think tens of millions of features, mostly tokens from different parts of the page. Those features are given different weights which contribute to the final probability aggregation. The result was similar to what the OP achieved - around 80% accuracy. The piece of work was really interesting and satisfying.


Hmm, interesting... but how do you calculate the weights? Do you use the KL-divergence method?


I'm surprised this post doesn't mention Markov chains. The author seems to think that finding and implementing a grammar quality checker will help stop spam. Aside from providing endless hours of entertainment (viz. Dissociated Press), Markov chains are abused by spammers to generate grammatically correct nonsense. You can easily add meaning to the "nonsense" by adding formatting to certain words to carry a secondary message. Does anyone know of a way to stop this?


You can estimate the likelihood of a particular sentence by summing the log probabilities of the n-grams within it, with those probabilities obtained from a sufficiently general training set such as Google's n-gram data[1]; an unusually low likelihood suggests machine-generated nonsense. Using a trigram language model (n = 3), you could estimate the likelihood as follows:

Sentence = "This sentence is semantically and syntactically valid."

P(Sentence) = log(p(START,START,This)) + log(p(START,This,sentence)) + log(p(This,sentence,is)) + log(p(sentence,is,semantically)) + log(p(is,semantically,and)) + log(p(semantically,and,syntactically)) + log(p(and,syntactically,valid)) + log(p(syntactically,valid,.)) + log(p(valid,.,STOP)) + log(p(.,STOP,STOP))

where START and STOP are special symbols that aid in determining the proximity of a word to the beginning and end of a sentence.
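
In code, the scoring would look roughly like this (the probability table is a stand-in for counts derived from a real corpus; the defaultdict floor is a crude substitute for proper smoothing):

    # Sum trigram log-probabilities over the padded token sequence.
    import math
    from collections import defaultdict

    trigram_logprob = defaultdict(lambda: math.log(1e-9))            # floor for unseen trigrams
    trigram_logprob[("This", "sentence", "is")] = math.log(0.01)     # made-up example value

    def sentence_logprob(tokens):
        padded = ["START", "START"] + tokens + ["STOP", "STOP"]
        return sum(trigram_logprob[tuple(padded[i:i + 3])]
                   for i in range(len(padded) - 2))

    score = sentence_logprob("This sentence is semantically and syntactically valid .".split())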

If your training set fails to generalize sufficiently, you could instead use Bayesian inference to estimate the probability that the sentence is spam. Under this framework, you'd be calculating the posterior probability of the sentence being spam given the observed sequence of n-grams, which combines (i) the prior probability that any given sentence is spam with (ii) the likelihood of the observed n-gram sequence under the spam hypothesis, i.e. how compatible the observation is with (i).

[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2....


Your comment would be marked as spam using your logic. Was that intentional?


Note this is exactly how a smart spammer would generate text (sampling from a language model built on a publicly available data set like Google n-grams or Wikipedia). If you wanted to catch someone doing this, you're much better off using your own corpus to build a language model, as a spammer would have to scrape all your data to reconstruct the same thing.

Then, run the model over your data and start playing whack-a-mole (and refining the model).


Having done some link spam (in decent volume) in the past, I'd say your best bet in stopping it would be using more sophisticated captchas (such as the ones that have you identify a cartoon in an image, or something else non-textual) and identifying the system posting the comment/spam link to see if it's a real user with a real browser or just an automated spambot like XRumer or ScrapeBox.

You can also turn off links in comment bodies and the URL field of the comment form to try and prevent scrapers from even seeing you as a worthy target. Won't help identify spam though.

Finally, centralized spam identification systems like Akismet work really damn well because they are watching the whole site network at once and can use those heuristics to identify spammers rather than the actual spam content itself.


No need to go that far. Just download OpenNLP and it will do the labeling quite well with almost no effort.


This is the really hard way.

And it is going to fail A LOT.

Do this instead:

1. Contact a company that has a search engine and therefore access to all your links (http://samuru.com springs to mind).

2. Do keyword extraction on those pages. Assume that anything that doesn't have any of the keywords of the page being linked to is a Bad link (a rough sketch of this check follows below).

3. For the ones that remain, Google the keywords you extracted (say 10 of the words). If the linking page doesn't appear in the top 50 results, it is probably a Bad Neighbor according to Google.

This method doesn't require NLTK or grammar checking. You can do it yourself, and you are using Google to tell you if the site is on the Bad Neighbor list so you don't have to guess.
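
A rough sketch of the step 2 check (keyword extraction and page fetching are assumed to happen elsewhere):

    # Flag a linking page as a Bad link if it shares none of the
    # keywords extracted from the page being linked to.
    def is_bad_link(linking_page_text, target_keywords):
        text = linking_page_text.lower()
        return not any(kw.lower() in text for kw in target_keywords)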


Your approach is going to have a lot of problems.

One of the most linked-to pages on the internet is the download page for Adobe Reader. It is definitely not spam, but millions of those linking pages aren't going to have "the keyword" on them, so by your logic they are bad links. This is an extreme example, but it is not an uncommon scenario.

Furthermore, if you have millions of backlinks, it becomes quite difficult to scrape Google (but you can use services like Authority Labs).


Why are you doing Bad Neighbor link checks if you make something like Adobe Reader?

You don't have to scrape Google; they have a Search API that is about $10 per 1000 calls at volume.

I have done this for BILLIONS of links.


You think Adobe never built any bad links? I know many such large companies that are spending hundreds of thousands of dollars a year or more buying links.

Do you have a link to the API, please? Thanks!


Custom Search. Don't specify any rules which would interact with the results you are testing against (like adding a rule that favors a parked domain).

If they are buying those links then finding the bad ones is as easy as contacting the people they cut checks to.


I tried doing something similar with AI the other day. My approach was to look at money flow instead, as in theory spammers only spam to make money. I basically downloaded an ad-blocker list and ran it against a page's source. That, along with a couple of other factors, was fed into many attempts at machine learning fun. In the end, it all failed. I learned that it's just impossible without a data set like Google's, so I went and built that into the process, and voila, it worked.


> I went and built that into the process

Sorry, what do you mean by that?


My script basically looked up how high a link ranked on Google for various related keywords. Alexa page rankings and similar services were included as well. The AI then weighed the factors and tried to come up with an educated guess. After I included external numbers from people with big databases, it actually became very successful.


I tried doing this with reddit links. I downloaded a dataset of 100K submissions and their vote rank. I split them in two: those with fewer than 5 votes and those with more. It would seem that 5 falls right at the median value.

Then I collected the HTML from those pages and turned it into a feature vector, then tried to learn whether a page would have fewer or more than 5 votes. My prediction rate was as good as random. Fail.

Google has done this since Panda. They machine-predict whether a page will be successful with people and use that as a ranking factor. It makes SEO into a holistic art - you need to think of everything.


I have used maximum entropy classification for a quite similar task. It achieves better performance than Naive Bayes classifiers. But as the author remarked, the quality of the training set and the selection of features are very important aspects too.
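
For reference, maximum entropy classification is essentially multinomial logistic regression, so a minimal sketch with scikit-learn could look like this (the tiny corpus and labels are made up):

    # Maximum entropy (logistic regression) text classifier over
    # bag-of-words counts.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    docs = ["buy cheap wedding links now",
            "interesting write-up on link spam detection"]
    labels = [1, 0]   # 1 = spam, 0 = ham

    maxent = make_pipeline(CountVectorizer(), LogisticRegression())
    maxent.fit(docs, labels)
    print(maxent.predict(["cheap links here"]))   # should lean toward the spam class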


The Bayesian filter might have worked. There's no reason to use only content as a feature - you can use all the features you want regardless of which ML technique you apply. Bayesian poisoning is a real concern though.



