Machine Learning: Neural Network vs Support Vector Machine (stackoverflow.com)
91 points by wslh on Nov 24, 2012 | 8 comments



SVMs are good if you want high accuracy without much fiddling and don't have many training examples. It is pretty simple to get off-the-shelf results from SVMs. However, SVM training is at least quadratic in the number of examples, and you have to get really hacky to train on more than ~10K examples.
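To make "off-the-shelf" concrete, here's a minimal sketch with scikit-learn (my own example, not from the comment); the defaults alone usually give a decent baseline on a small dataset:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # ~1,800 examples: small enough that kernel SVM training is cheap
    X_train, X_test, y_train, y_test = train_test_split(
        *load_digits(return_X_y=True), random_state=0)

    clf = SVC()  # default RBF kernel, C=1.0, gamma="scale"
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # high accuracy with no tuning at all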

Neural networks are good if you have many training examples, and don't mind doing hyperparameter tuning. I have trained neural networks over 1B examples on a single core. (Took a month.) However, you have to tune your learning rate and regularization, and there don't yet exist good packages to do this automatically.

It is also much simpler with neural networks to learn over custom data, e.g. mixing supervised and unsupervised learning (labeled and unlabeled examples), transfer learning, etc., because you can change your evaluation criterion and minimize it.
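As a sketch of what "change your evaluation criterion and minimize it" can look like (my own illustration, with PyTorch and made-up tensors standing in for real data): a shared encoder trained with a supervised loss on labeled examples plus an unsupervised reconstruction loss on unlabeled ones, with the learning rate and L2 penalty exposed as exactly the knobs you end up tuning.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Hypothetical network: shared encoder, a classification head for
    # labeled data, and a reconstruction head for unlabeled data.
    class SemiSupervisedNet(nn.Module):
        def __init__(self, d_in=20, d_hidden=64, n_classes=2):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
            self.classifier = nn.Linear(d_hidden, n_classes)
            self.decoder = nn.Linear(d_hidden, d_in)
        def forward(self, x):
            h = self.encoder(x)
            return self.classifier(h), self.decoder(h)

    model = SemiSupervisedNet()
    # lr and weight_decay are the hyperparameters that need hand-tuning
    opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

    x_lab, y_lab = torch.randn(32, 20), torch.randint(0, 2, (32,))  # labeled batch
    x_unlab = torch.randn(128, 20)                                  # unlabeled batch

    logits, _ = model(x_lab)
    _, recon = model(x_unlab)
    # Mixed objective: supervised cross-entropy + unsupervised reconstruction
    loss = F.cross_entropy(logits, y_lab) + 0.1 * F.mse_loss(recon, x_unlab)
    loss.backward()
    opt.step()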

If you want to do deep learning, we have a much better understanding of how to do training using neural networks, particularly because we can train on such large datasets.


Just wanted to add that hyperparameter tuning is vital for SVMs as well.


SVMs require a linearly separable dataset. That would make them seem quite useless, but often that's not a huge problem, and when it is, you can sometimes fix it by adding basis functions as you wish, like with this data:

    1  1
    2  1
    3 -1
    4 -1
    5  1
    6  1
    7 -1
This data set becomes linearly separable with the basis expansion {X^2, X^3} because there's a cubic function whose sign matches the labels perfectly (i.e. f(x) > 0 on all the x's mapped to 1, and f(x) < 0 on all the -1's). This doubles the size of your data matrix (4 columns, for 1, X, X^2, and X^3) in comparison to the more typical basis.
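A quick way to check that (my own sketch; the weights below are just the hand-picked cubic -(X - 2.5)(X - 4.5)(X - 6.5) written out in the expanded basis):

    import numpy as np

    # The data above: x values and their +/-1 labels
    x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
    y = np.array([1, 1, -1, -1, 1, 1, -1])

    # Basis expansion {X, X^2, X^3}: the labels change sign three times,
    # which a cubic (linear in these features) can accommodate.
    X = np.column_stack([x, x**2, x**3])

    # -(X - 2.5)(X - 4.5)(X - 6.5) = -X^3 + 13.5 X^2 - 56.75 X + 73.125
    w = np.array([-56.75, 13.5, -1.0])
    b = 73.125
    print(np.sign(X @ w + b) == y)  # all True: separable in this basis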

The problem is that if your data are noisy, you can end up needing a lot of basis expansions to get separation, and then you have the typical problems of complexity.

As far as I understand it, SVMs are good when you know there aren't misclassifications in the data and believe the separating surfaces (which will separate the data perfectly) will be simple because you're not going to need to add a lot of basis expansions. It seems to expect a certain orderliness in the data.

Neural nets are extremely flexible on account of their large number of degrees of freedom, but they rarely achieve "perfect" classification, and it's often not desirable that they do (overfitting). What they seem to be strong at is finding a very good (but not perfect) answer, and unlike SVMs, which struggle in a noisy world (they can't achieve separation), neural nets seem to handle noise well. Neural nets, in my experience, will learn something in a lot of different types of environments. The major negative of neural nets is that they take a very long time to converge, and if you use stochastic gradient descent (which becomes necessary on a large data set) they will never truly converge. Also, neural net problems often require extensive cross-validation.


Your description of SVMs is a bit misleading. It matches with what are typically called "hard margin" SVMs, which do require linearly separable data. However, people talking about SVMs are typically talking about "soft margin" SVMs, which don't require linearly separable data. Soft margin SVMs are the kind implemented in practically any off-the-shelf machine learning library you might pick up online.

A concise description of the "mode of action" for soft margin SVMs would be: project the training data into an alternate space and then perform L2-regularized regression with hinge loss in that space. The trick is that, by using (valid Mercer) kernels, the solution to the regularized regression in the alternate space can be represented by a weighted sum of kernel functions evaluated at points in the training set (i.e. the support vectors). Thus, the solution can be learned without having to explicitly represent the points in the alternate space, which permits the use of very high, or even "infinite" dimensional spaces for the alternate representation.
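That "weighted sum of kernel functions evaluated at points in the training set" is easy to see with an off-the-shelf implementation (a sketch using scikit-learn's SVC with an RBF kernel on synthetic data; none of this is from the parent comment):

    import numpy as np
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    gamma = 0.5
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y)

    # Rebuild the decision function at one point by hand: a weighted sum of
    # RBF kernel evaluations against the support vectors, plus an intercept.
    x0 = X[:1]
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x0) ** 2, axis=1))
    manual = clf.dual_coef_ @ k + clf.intercept_
    print(manual, clf.decision_function(x0))  # agree to numerical precision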

In the context of the alternative space, the use of strong L2 regularization (formally effected by an L2 constraint on the implicitly learned parameters) dramatically reduces the risk of overfitting. Additionally, the combination of L2 regularization and hinge loss leads to a convex optimization problem which, to some people, constitutes one of the key advantages of SVMs.

I tried to avoid too much jargon, though the term "L2-regularized regression with hinge loss" may merit further expansion (if you're interested).
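If it helps, here's the expansion in miniature: for a linear kernel (so everything stays explicit), "L2-regularized regression with hinge loss" just means minimizing an objective like the one sketched below (my own toy example). Soft-margin solvers minimize this kind of objective, usually via its dual, which is where the kernel trick comes in.

    import numpy as np

    def soft_margin_objective(w, b, X, y, C=1.0):
        # Hinge loss: zero for points beyond the margin, linear otherwise
        margins = y * (X @ w + b)
        hinge = np.maximum(0.0, 1.0 - margins)
        # L2 penalty on w (the regularization) plus the total hinge loss
        return 0.5 * np.dot(w, w) + C * np.sum(hinge)

    # Tiny made-up data, just to evaluate the objective once
    X = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]])
    y = np.array([1, 1, -1])
    print(soft_margin_objective(np.array([1.0, 1.0]), 0.0, X, y))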


Thanks. That's really cool. I would be interested in getting some pointers regarding the soft-margin SVM.

I've used neural nets a fair bit, but I've never built an SVM, although I'll probably get to them soon in my ML study (I'm using Bishop's book and Hastie's, as well as the online course videos).


If you look through chapter 12 of the most recent (online) version of ESL by Hastie et al, you can get a good idea of how to think about what the soft-margin SVM is doing. In particular, equation 12.25 on page 426 gives a formal mathematical equivalent to my previous verbal description.

Personally, I find it helpful to think of most classification methods in terms of what loss function is being optimized and what sorts of regularization are being applied. Chapter 3 in ESL gives a nice introduction to the concepts required for such an approach, in addition to the info in chapter 12 that applies directly to SVMs.
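A tiny illustration of that way of thinking (my own, not from ESL): several familiar classifiers differ mainly in the loss they apply to the margin m = y * f(x).

    import numpy as np

    m = np.linspace(-2, 2, 9)              # margins y * f(x)
    hinge = np.maximum(0, 1 - m)           # SVM
    logistic = np.log1p(np.exp(-m))        # logistic regression
    squared = (1 - m) ** 2                 # least squares on +/-1 labels
    print(np.column_stack([m, hinge, logistic, squared]))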

Building an SVM-based classifier from scratch is pretty straightforward but, as with many ML methods, making it efficient requires a bag of tricks. SVMLight and LibSVM both provide well-tested implementations of a variety of algorithms that are worth looking at (though the source may be hard to digest due to heavy optimization). You could also check out LibLinear, from the authors of LibSVM, which focuses on linear SVMs for use with large and high-dimensional datasets.
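For what it's worth, scikit-learn wraps both libraries: sklearn.svm.SVC sits on top of LibSVM and sklearn.svm.LinearSVC on top of LibLinear, so you can try the large-scale linear case without leaving Python (synthetic data below, just to show the interface):

    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC  # LibLinear under the hood

    X, y = make_classification(n_samples=50_000, n_features=200, random_state=0)
    clf = LinearSVC(C=1.0, dual=False).fit(X, y)  # primal solver suits n_samples >> n_features
    print(clf.score(X, y))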


Read Andrew Ng's lecture notes (really more book-level quality) for the version of CS229 he wrote before he created the simplified/less-mathematical Coursera version. They are floating around online.

Also, Elements of Statistical Learning is available online for free (a previous edition, maybe?), which covers a lot more standard/traditional statistical curve/surface-fitting topics as well, all with high mathematical rigor.


Elements of Statistical Learning is Hastie's book. Between it and Bishop's book (i.e. Pattern Recognition and Machine Learning), I prefer Hastie's for clarity of exposition. In particular, I find that ESL better conveys the sort of intuitive understanding of _why_ a method works that facilitates practical applications and extensions to novel contexts. Though, there are topics for which the increased equations/explanations ratio of Bishop's book is useful.

I've TAed my university's graduate ML course for the past couple of years, so I've read most chapters of these books in some detail and have hands-on experience using them to help people who are looking closely at these topics for the first time. Interestingly, SVMs are actually a good example of when I'd suggest both books.



