Naive Bayes and Text Classification I – Introduction and Theory

syllogism · on Oct 5, 2014

> Naive Bayes classifiers, a family of classifiers that are based on the popular Bayes’ probability theorem, are known for creating simple yet well performing models, especially in the fields of document classification and disease prediction.

But this is simply not true! They _don't_ perform well. There's really no reason to teach people Naive Bayes any more, except as a footnote when explaining log-linear/MaxEnt models.

MaxEnt is not so complicated, and it makes Naive Bayes fully obsolete. And if MaxEnt is in some way too complicated/expensive, Averaged Perceptron is generally much better than NB, can be implemented in 50 lines of Python, and has far fewer hyper-parameters.

A common way for machine learning courses to suck is to teach students about a bunch of crap, obsolete algorithms they should never use, simply for historical reasons --- they used to be in the course, so they stay in the course.

Dn_Ab · on Oct 5, 2014

As others have mentioned, when the assumptions of conditional independence are met, Gaussian NaiveBayes and MaxEnt will asymptotically learn the same classifier. Except NB will learn it in O(log dimensions) instead of O(dimensions) examples. So in those cases and where you have fewer examples you'll want to use NB.

Even when the independence assumptions are not met, NB will often produce a good classifier if you don't care about the accuracy of the probabilities. NB is also online, resistant to the curse of dimensionality and for categorical data will learn polynomial decision boundaries. So the gain in the trade-off is in ease of implementation (you can write one in a dozen lines or less) that you can throw something close to your raw data at and often get good enough results without any ahead of time training, kernels or regularization.

You are right though that MaxEnt (some sparsity capturing algorithms are as powerful as SVMs) will in general outperform naive bayes and that the averaged perceptron probably has, on average, the best performance/ease of implementation ratio. But for many implementors/problems, where clarity is paramount, those distinctions will not be worth the cost.

Also, you have a wonderful website.

syllogism · on Oct 5, 2014

Thanks for the kind compliment :). And I should probably be less shouty here.

Okay, it's not true that NB is strictly dominated by MaxEnt. But, look at the two example problems the author gave, where Naive Bayes was said to be a good choice. The parameters there definitely, definitely won't be conditionally independent. And probably you'll have enough data. Naive Bayes is a bad choice here, as it is in most other situations.

So, I think the caveats you've raised are all true...but, I still wouldn't be raising them in a class I was teaching. I think it's easy to have less useful discussion, made up of individually more true statements. I think a common problem in technical discussion is too much attention to every qualification, and every caveat.

People assume that there's some sort of proportionality between the importance of an idea/topic, and the airtime you give it. And I think they do this implicitly, in a way that's really hard to consciously over-rule. So I think it's really important to editorialise. I think a lot of technical discussion would be better off by making statements that are untrue at the edges, accepting the imprecision, and moving on.

That's why I'm dismayed to see someone's started writing a detailed tutorial on Naive Bayes, especially set to two problem domains it's a bad choice for. Even if it's composed of entirely true statements, I think its net effect is to miseducate people.

instakill · on Oct 5, 2014

What's his website? It's not in his bio.

andreasvc · on Oct 5, 2014

http://honnibal.wordpress.com/

syllogism · on Oct 5, 2014

Thanks, fixed.

Radim · on Oct 5, 2014

Now you're just being silly. Yes, other models may perform better, but Naive Bayes is a simple model that performs quite well, especially in the fields mentioned.

The statement is completely uncontroversial.

syllogism · on Oct 5, 2014

We can argue about what "well" means I guess. On many problems the gap will be small, but you should never expect NB to out-perform the better models. And on lots of problems, the difference will be substantial. NB is strictly worse than MaxEnt.

Why settle for "quite well"? Just use the better algorithm, always. It's important to understand when solutions have been superseded.

Radim · on Oct 5, 2014

That part I agree on.

I guess my point is that pragmatic progress in science is hardly ever as earth-shattering and all-changing as the academic circles flaunt it to be (understandably so, in these "publish or perish" times).

"Old stuff" doesn't suddenly become rubbish just because "new stuff" comes out.

Even when the new stuff offers a better, more predictive/complete model of data, there's typically a class of problems where the improvement is inconsequential (cf. gravity ala Newton vs. general relativity).

But yes, when you understand the implications of switching, and the cost is simply a different `pip install`, switch away.

MrQuincle · on Oct 5, 2014

If the features are independent, it is better to use NB, because then you don't need to find it out the hard way. So your hypothesis of maxent being always better is flawed.

Moreover NB can be seen as a joint maxent method, so don't compare it on problems with conditional structure. This website explains nicely the connection between them: http://www.denizyuret.com/2010/11/naive-bayes-is-joint-maxim...

andreasvc · on Oct 5, 2014

Assumptions such as independence are always dutifully mentioned when discussing statistical methods, but I never see a discussion about in which sort of practical problems such assumptions hold. In the case of independence and text classification, it seems to me that it doesn't hold, certainly not on the scale of a paragraph. As the blog post mentions, naive Bayes performs well despite violating the independence assumption; I'm curious to understand why that is.

rasbt · on Oct 6, 2014

In practice, I would say that it is a very challenging task to determine whether variables are totally independent or not: It is more about the strength of independence. Every time something is "estimated", you'd introduce an additional source of error, and in cases of weak independence the violation may be negligible. Modeling the conditional dependences is basically the idea of Bayesian Belief Networks, and you might find the paper "Comparing Bayesian Network Classifiers" interesting in this context (includes comparisons to Naive Bayes) http://link.springer.com/article/10.1023/A:1007465528199

xpe · on Oct 5, 2014

Re: "Empirical studies showed that the mutli-variate Bernoulli model is inferior to the multinomial Bayes model, and the latter has been shown to reduce the error by 27% on average [13]."

I remain skeptical of statements that claim that one kind of model is categorically inferior. Also, I have a hard time believing this is a fair summary of the literature. First, one study is cited, but the sentence implies many.

Second, the citation in the original post is not a fair summary of the cited research. The abstract of "A Comparison of Event Models for Naive Bayes Text Classification" says: "This paper aims to clarify the confusion by describing the differences and details of these two models, and by empirically comparing their classification performance on five text corpora. We find that the multi-variate Bernoulli performs well with small vocabulary sizes, but that the multinomial performs usually performs even better at larger vocabulary sizes—providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size."

rasbt · on Oct 5, 2014

Thanks for the comment, and I agree! I rephrased it a little bit to

"Empirical comparisons provide evidence that the multinomial model tends to outperform the multi-variate Bernoulli model if the vocabulary size is relatively large [13]. However, the performance of machine learning algorithms is highly dependent on the appropriate choice of features. In the case of naive Bayes classifiers and text classification, large differences in performance can be attributed to the choices of stop word removal, stemming, and token-length [14]. In practice, it is recommended that the choice between a multi-variate Bernoulli or multinomial model for text classification should precede comparative studies including different combinations of feature extraction and selection steps."

xpe · on Oct 6, 2014

Thanks for the updates.

MojoJolo · on Oct 5, 2014

I actually really like doing text classification using Naive Bayes. I'm still new to it and still learning a lot. But one thing I'm having a hard time answering is explaining Naive Bayes classification in simple terms.

If that was asked to you, how can you explain Naive Bayes classification in simple terms?

yen223 · on Oct 5, 2014

I know a lot of spam has the word "Viagra", and very few legitimate emails have the word "Viagra". So if an email contains the word "Viagra", it's more likely to be spam. Naive Bayes classification just applies that to all words in all your emails.

rasbt · on Oct 5, 2014

I like this simple and intuitive explanation. Hm, maybe a short mentioning of the "a priori" knowledge would be also worthwhile!

rasbt · on Oct 14, 2014

I posted this article some while ago, and I thought it may not be a bad idea to archive it on arXiv. What do you think? The (Latex) PDF version would look like this: http://sebastianraschka.com/PDFs/articles/naive_bayes_1.pdf I just signed up on arXiv and see that the catch is that I'd need 2 recommendations for the categories Computer Science -> Learning or Statistics -> Machine Learning. Would anyone here who would want to give me a recommendation? I would really appreciate it!

meaty_sausages · on Oct 5, 2014

Lexicon-based approaches are better than all these approaches. The difference between MaxEnt and NB is similar (the same?) as a binomial/multinomial regression. They have all of the drawbacks and only a few of the benefits. A decent word list would be about as accurate as a regression (70% or so). Account for syntax and it will get you higher.

xpe · on Oct 5, 2014

There are typos:

2.6.3. The correct term is "Additive Smoothing" https://en.wikipedia.org/wiki/Additive_smoothing not "smoothening".

3.3. Letters transposed. The author meant "Multi-variate".

rasbt · on Oct 5, 2014

Thanks a lot, I always called it "smoothening" for whatever reason, but it should be fixed now :)

en4bz · on Oct 5, 2014

I'm currently working on a school project to classify abstracts from papers in to 4 categories. Currently Bernoulli Naive Bayes scores 85% accuracy. For comparison kNN scores about the same.

rasbt · on Oct 5, 2014

Sounds cool. I think the most crucial part is really the data (or feature) set. The choice of classifiers makes a difference too, but having good data (salient, invariant, discriminatory) makes the largest difference.

One think about naive Bayes and kNN: Although the performance may be similar, I would consider Naive beneficial if the data set is large. kNN requires to re-evaluate the data. Naive Bayes makes the classification computationally cheaper and with "on-line" learning, you can still improve or adjust it over time