If you're new to ML or data science, I would recommend working to build a strong basis in Bayesian statistics. It will help you understand how all of the "canonical" ML methods relate to one another, and will give you a basis for building off of them.
In particular, aspire to learn probabilistic graphical models + the libraries to train them (like pyro, tensorflow probability, Edward, Stan). They have a steep learning curve, especially if you're new to the game, but the reward is great.
All of these methods have their place. SVMs have their place, but they aren't great for probability calibration, and non-linear SVMs, like every kernel method, can scale absolutely terribly. Neural networks have their place, sometimes as a component of a larger statistical model, sometimes as a feature selector, sometimes in and of themselves. They're also very often the wrong choice for a problem.
Don't fall into the beginner trap: people tend to mistake 'what is the hottest research topic' for 'what is the right solution to my problem given my constraints (data limitations, time limitations, skill limitations, etc.)'. Be realistic, don't use magical thinking, and have a strong basis in statistics to weed out the beautiful non-bullshit from the bullshit that is frustratingly prevalent (everyone and their mother is an ML expert today).
EDIT: I want to also clarify: I don't mean to suggest the author is new to ML, I just mean this as general advice for anyone coming here who is new to DS/ML. The article looks great!
Personally I'd advise against both SVM's and Bayesian methods for a beginner. Bayesian statistics is very much the deep end of the pool. Graphical models and Bayesian methods generally may make a comeback but such approaches have been superseded by other methods for good reasons, i.e. scaling.
A strong basis in statistics is certainly a great thing, but that can be maximum likelihood plus Bayes' law (i.e. "MAP" estimation, which is more of a hack added to ML than an actual Bayesian method) and still provide the big picture for almost everything.
Meanwhile a strong basis in "deterministic methods", as an alternative way to spend that learning effort, has its own rewards. The training algorithms for deep learning are also the hottest algorithm research area in machine learning, and are certainly applicable beyond deep learning. For that matter a thorough understanding of SVM delves into convex optimization, an extremely powerful framework as well.
> Personally I'd advise against both SVM's and Bayesian methods for a beginner. Bayesian statistics is very much the deep end of the pool.
I don’t know, I think it depends on what you mean by Bayesian. I would say understanding loss functions and regularization requires some understanding of Bayesian stats (just knowing that it comes from log p(x|q) + log p(q) and what both of those terms mean).
> Graphical models and Bayesian methods generally may make a comeback but such approaches have been superseded by other methods for good reasons, i.e. scaling
Can you be more specific here? It sounds like you’re talking about a particular problem or class of methods. PGMs/Bayesian methods can mean basically anything from logistic regression to running HMC on some hierarchical model using 10,000 CPU hours. I just mean aspiring to learn PGMs will force you to quickly learn and gain a deeper understanding of and appreciation for Bayesian stats, and then you can build on that for years and years. But it depends on what you’re interested in doing: there’s a difference between model building and inference; you can spend your whole life using the same loss function and just focus on making your NN architecture better, and you don’t need much Bayesian stats to do that.
> i.e. "MAP" estimation which is more of a hack to ML than an actual Bayesian method
Huh? Maybe we mean different things by Bayesian — the mode of your posterior seems pretty Bayesian to me!
> Meanwhile a strong basis in "deterministic methods", as an alternative way to spend that learning effort, has its own rewards. The training algorithms for deep learning are also the hottest algorithm research area in machine learning, and are certainly applicable beyond deep learning. For that matter a thorough understanding of SVM delves into convex optimization, an extremely powerful framework as well.
Would agree that optimization is an important part of ML/DS, but since nowadays virtually all of the most popular optimization algorithms are available at our fingertips in e.g. pytorch, I would still think it's better to start with trying to build a fundamental understanding of how to frame problems. But that's colored by my own experience and background, and people's priorities should be different depending on what they want to do.
To elaborate on the point: when doing probabilistic modeling, whether one realizes it or not, there typically is an underlying Bayesian formulation which explains what one is doing. Now, that might be well-aligned with the problem of interest (or not), and being clear on the fundamentals helps one understand that, and also to compose distinct ideas which make sense together in the context of the problem. eg: see my comments below, in the context of linear regression from a Bayesian perspective.
Also, while "scaling" with data is a very hip thing these days, for most problems of interest it is very difficult/expensive to get lots of data (or afford compute). Further, humans often have very useful domain models which are worth encoding into the structure of the model. This also helps nicely mix conventional "software" modeling with probabilistic aspects (for those who weren't aware, this flows towards what is called "probabilistic programming", and recent developments have made significant progress towards methods which work for an "intermediate" dimensionality, if not "large" dimensionality).
--
@astrophysician: Feels nice to see expressed so clearly, a perspective that I share! Feel free to get in touch if you'd like to discuss ML.
Additionally, you should probably learn a (little) R (which you can get from this book). This is not because R is a wonderful language (though I'm pretty fond of it myself) but because it's a great tool for the communication and expression of statistical methods.
Most good stats (which will help you be actually good at ML) books tend to either be written in mathematics, or R (or both). Given that you're already a programmer, R will probably make it easier for you to learn a bunch of this stuff (and the docs for R functions tend to point towards useful literature).
I actually travelled the other way (i.e. from stats to code) and I found R very very helpful. Of course, your mileage may vary, but the link above is probably the best single book that you could read to start learning ML.
Thank you! Providence wills that I have RStudio installed to make a wordcloud, but I installed it without actually knowing what it is. (Just followed a tutorial to get my wordcloud :)
So thanks for the recommendation, looking forward to reading it!
By Bayesian methods I mean methods which also describe everything with statistical distributions including hyperparameters, and which solve for a distribution (or some hard-to-get feature of it like its mean or confidence intervals), instead of just a maximum probability estimate where you can make shortcuts.
For example Ridge regression where you put a gaussian prior on the predictor then guess at its variance as a hyperparameter for a quadratic penalty (with cross validation or something) is MAP but not actually "Bayesian". Bayesians also put a prior distribution on the hyperparameter itself making life extra difficult.
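To make that concrete, here's a minimal sketch (toy data, arbitrary numbers) of the correspondence: scikit-learn's Ridge and the closed-form MAP estimate under a Gaussian prior give the same coefficients once you identify the penalty weight with the noise-to-prior variance ratio.

    # Sketch: ridge regression as a MAP estimate with a Gaussian prior on the weights.
    # Toy data; alpha plays the role of sigma^2 / tau^2 (noise variance over prior variance).
    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

    alpha = 2.0  # the quadratic-penalty hyperparameter you'd normally cross-validate

    ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)

    # argmax_w  log N(y | Xw, sigma^2 I) + log N(w | 0, tau^2 I)
    # has the closed form (X^T X + alpha I)^{-1} X^T y  with alpha = sigma^2 / tau^2
    w_map = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

    print(ridge.coef_)
    print(w_map)  # same numbers, two vocabularies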
I agree it's an awkward definition but it does differentiate between the easy tricks and the tough approaches Bayesian researchers actually work on.
Even in deep learning it's simple to put a penalty term on the loss or weights for that matter, so you could call it a MAP estimate. That's implemented everywhere. For that matter you can claim any neural net classifier with a sigmoid or softmax at the final layer is doing logistic regression. With multilayer networks just adding a representation learning stage to it.
As I said cheers to statistics, just that stopping at MAP estimates will cover most everything. If one goes looking for a Bayesian methods text or review article it will be well beyond that.
By scaling problems I basically mean the fact that Monte Carlo doesn't work in large numbers of dimensions, where the parameters of the joint distribution of everything are vast.
> By Bayesian methods I mean methods which also describe everything with statistical distributions including hyperparameters, and which solve for a distribution (or some hard-to-get feature of it like its mean or confidence intervals), instead of just a maximum probability estimate where you can make shortcuts.
Ah I see yea, if you’re talking about hyperpriors and marginalization and MCMC I totally agree; these are really valuable in science, and this is the “full Bayesian solution” but with many caveats, one big one being that it’s very easy to be overly confident in the results of some unwieldy hierarchical model and ignore the fact that the priors (or hyperpriors) are often times fudge factors because it’s so damn hard to articulate your prior belief as a bona fide distribution. A lot of times you run into a “garbage in garbage out” problem and it might not be obvious right away since we are probably busy patting ourselves on the back and popping champagne bottles because we’ve gotten our MCMC chains to converge.
> For example Ridge regression where you put a gaussian prior on the predictor then guess at its variance as a hyperparameter for a quadratic penalty (with cross validation or something) is MAP but not actually "Bayesian". Bayesians also put a prior distribution on the hyperparameter itself making life extra difficult.
Yea right — nothing wrong with cross validation in a lot of cases. Yea, we could go with the “full Bayesian solution” and marginalize over our hyperprior, but that very well might not give you anything more than cross validation (or maybe will give you something worse if you have a bad hyperprior).
> Even in deep learning it's simple to put a penalty term on the loss or weights for that matter, so you could call it a MAP estimate. That's implemented everywhere.
Agreed, my only argument is that it’s helpful to know where all of these terms come from in a general sense rather than treat each one as some ad hoc solution.
> For that matter you can claim any neural net classifier with a sigmoid or softmax at the final layer is doing logistic regression. With multilayer networks just adding a representation learning stage to it.
Agree and that’s exactly the sort of valuable insight that’s helpful to have (though maybe this one isn’t really an insight from Bayesian stats).
> As I said cheers to statistics, just that stopping at MAP estimates will cover most everything. If one goes looking for a Bayesian methods text or review article it will be well beyond that.
Sure, if you know MAP and what it means, then you know what a posterior means and you know that every model you fit and loss function you use comes from that, and you know if you care about more than a mode or mean of the posterior you can use lots of tools to deal with that. I’ve never really encountered MCMC in the work I do and I don’t think it would really bring any additional insight into what we’re doing, but sometimes we care about uncertainty, and Laplace approximation or variational inference can do the job just fine.
> By scaling problems I basically mean the fact that Monte Carlo doesn't work in large numbers of dimensions, where the parameters of the joint distribution of everything are vast.
While I would agree MCMC and its variants are certainly not a good idea a lot of the time because they require a lot of time and computation, occasionally they are (when you want insights from your data to use internally or something, and you really want to explore the implications of many different assumptions), and HMC etc. can deal with very high dimensional settings (> millions of parameters). But generally I agree: if you’re talking about “scalable solutions” it usually means MCMC is not really what you want to be doing.
How does understanding loss functions and regularization require understanding Bayesian statistics? Those notions are literally part of linear regression theory.
Only if you use them as a black box. What is considered the l2 norm (of the residuals) in regression theory corresponds to a Gaussian likelihood, whose log-likelihood becomes quadratic. And one typically might not have any priors (which is subtly misleading, because the "flat" prior is highly dependent on the chosen representation for the underlying degree of freedom). Note that regularizations are just negative log-priors.
Without an understanding of the underlying Bayesian formulation, linear regression theory might look like a vast and somewhat ad-hoc collection of separate ideas (eg: robust loss functions, and the many different kinds of regularizations), but seen in the correct light, it is easy to start with a general formulation and specialize it nicely to your problem. Working that way, you can often design a good solution for your problem without searching through handbooks for possible pre-defined methods. You can also combine multiple ideas very easily. eg: A couple of weeks ago, working from first principles I rediscovered what is called "Lavrentyev_regularization": https://en.wikipedia.org/wiki/Tikhonov_regularization#Lavren...
Regularization is used by mathematicians and engineers with a different theoretical perspective. A regular solution is one that is smooth or well-behaved in some sense. The motivation to choose the penalty is based on desired properties of the solution. L2 regularization leads to the least-length solution, L1 is chosen if one wants the sparsest solution, and so on. It is a bit less of a deep motivation, but then the decision of which model to choose is always somewhat ad hoc.
A lot of tools in ML might be used in other places with other interpretations, and the intuition for L1 and L2 that you describe is not wrong at all, but (1) ML/DS is usually done in a statistical context so I would argue that it’s a good idea to understand that formalism, and (2) that intuition doesn’t sound like it would help you build more complex statistical models, whereas understanding where L1/L2 come from in a Bayesian context definitely would help you understand what you would need to do to form a regularization term for e.g. a probability, or how to edit your loss to learn uncertainty. It also helps you understand what not to do and why not.
All of this is opinion for the most part, and if you feel there is more to learn from alternative interpretations, fine, but the suggestion is to understand the fundamentals of what you’re doing, and you’re usually doing statistics in ML/DS (whether you know it or not). Also, understanding Bayesian stats will make your life easier and it will make it easier to understand lots of other ML concepts in a unified way rather than in an ad hoc way: “minimum length solution” or “sparse solution” is what I mean by ad hoc. Both of those things are true and important, but they’re ad hoc.
I'd say pulling a "prior belief" out of the air for your problem, especially if it is conveniently chosen because it is one that is easy to work with, fits your rather broad definition of ad hoc too.
I'd even say the deterministic view is dominant currently. So yes by thinking differently one can get intuition beyond the common knowledge. But it's a nice-to-have not a necessity.
And one can do statistics without being a Bayesian of course.
> I'd say pulling a "prior belief" out of the air for your problem, especially if it is conveniently chosen because it is one that is easy to work with, fits your rather broad definition of ad hoc too.
This is the number one misunderstanding when it comes to Bayesian stats. Priors are hard, priors are often bullshit, priors are often the source of a “garbage in garbage out” problem, absolutely. I don’t mean to suggest Bayesian stats as something magical (magical thinking will get you in trouble). But whenever a statement like this is made, the implication is that there is some alternative where we can solve the same problem but without priors. That’s just not true: priors are an unavoidable fact of life. If you’re not explicit about your prior, it means you’re still using one but not being upfront about it. So I would agree that priors are difficult and problematic, but they are also unavoidable, and I would not say they’re “ad hoc”. I would also say it’s important to understand what they are.
> I'd even say the deterministic view is dominant currently. So yes by thinking differently one can get intuition beyond the common knowledge. But it's a nice-to-have not a necessity.
I don’t know what you mean by “deterministic”...do you mean “frequentist”? If so I would disagree completely. Frequentist and Bayesian views are equivalent except for philosophy, and frequentist stats are taught at all levels of school until grad school (at least in my experience) and I think that’s a huge mistake. What do you mean by “nice to have not a necessity”? If you’re solving a statistical question, stats are a necessity. Other fields are the nice-to-have intuition. I would agree however that sometimes you’re solving a NON-stats problem in which case have at it with whatever field makes sense.
> And one can do statistics without being a Bayesian of course.
Again, fine, I agree you can use the same math with a different philosophy, the philosophy is up to you, but if you think somehow you can do inference without priors I’m sorry but that’s wrong. In my experience “Frequentist” usually has meant Bayesian but with a flat prior (please comment if you have a counter example).
In summary: study what you want, and lots of perspectives bring more understanding, absolutely. But I stand by the importance of understanding Bayesian stats for doing ML. Even if you don’t like Bayesian stats, it’s still important to understand what is going on. Also I should be clear by “Bayesian” I mean nothing more than understanding what posteriors, priors, and likelihoods are, not a hierarchical model with MCMC or something.
> But whenever a statement like this is made, the implication is that there is some alternative where we can solve the same problem but without priors.
I don't see the misunderstanding you mean. I said if you think a selection of the least-length or sparsest solution is ad hoc, then so is your choice of prior. Solving the same system without priors would be analogous (actually mathematical equivalent) to solving an inverse problem without regularization. Or failing to solve it in the ill-posed case.
As for deterministic, I mean not probabilistic. As in linear algebra and "curve fitting".
As for "nice-to-have", I mean you can do machine learning without having any of the statistical understanding we've talked about and instead making various choices simply "because they work".
As for statistics with out being a Bayesian, I did mean frequentist, though that may not cover everyone. You can even use a prior distribution that is estimated from data (people commonly do that with the naive Bayes method), whatever you want to call such a person. I wouldn't call them a Bayesian. You can simply view it as applying the chain rule of probability to get a more convenient form of your maximum likelihood equation.
> I don't see the misunderstanding you mean. I said if you think a selection of the least-length or sparsest solution is ad hoc, then so is your choice of prior. Solving the same system without priors would be analogous (actually mathematical equivalent) to solving an inverse problem without regularization. Or failing to solve it in the ill-posed case.
Oh totally, I agree with that.
> As for deterministic, I mean not probabilistic. As in linear algebra and "curve fitting".
> As for "nice-to-have", I mean you can do machine learning without having any of the statistical understanding we've talked about and instead making various choices simply "because they work".
Yea I agree with this too, at least in principle. No issue with solving a lot of problems from a non statistical perspective since many times statistics is not the clear “right” choice. E.g. understanding that L1 regularization corresponds to a “Laplace prior” doesn’t give you that much deeper of an understanding of what you’re doing, since most people use L1 regularization to encourage sparsity. Also, if you’re more comfortable with a non-stats perspective on things, no problem approaching problems in the way you prefer.
Summary: I agree with everything you’ve said here. All that’s left is I think a difference of opinion about how important it is to understand the Bayesian perspective and I think that likely comes down to (1) the types of problems you typically work on, and (2) personal preference. I find personally that understanding the Bayesian interpretation is extremely helpful for building a deeper understanding of a wide variety of ML algorithms but I totally concede this is not necessarily a hard truth. So I stand by my advice, but will definitely agree that there are alternatives. I took the route of understanding ML without Bayesian stats first — really didn’t understand or know Bayesian stuff for a decent amount of time after I got into ML. I’ve found the Bayesian perspective has helped tremendously but that’s just me.
I get it, understanding the Bayesian perspective on linear regression requires understanding Bayesian statistics. But linear regression is not solely a Bayesian technique.
You’re right, but in ML/DS (and I would argue, even in other contexts you might not consider ML/DS) you usually are doing things in a Bayesian context (“what can I say about my model parameters given my observations and assumptions”). If you’re only doing OLS with maybe a L1 or L2 regularization, it’s not critical you understand the Bayesian interpretation, but it helps, and as soon as you start venturing out into other ideas and messier/biased data and want to modify something, you’ll need a Bayesian understanding of what you’re doing.
>The training algorithms for deep learning are also the hottest algorithm research area in machine learning, and are certainly applicable beyond deep learning.
The lore I've heard is that most new deep learning training algorithms (optimization algorithms) only work better on particular special cases, and it is hard to do better than the established algorithms in general.
I'm also not sure why you're saying they're applicable beyond deep learning--how do you plan to train a PGM or SVM using Adam?
I'd more generally describe the area as first order optimization, including methods like acceleration, automatic differentiation, stochastic approaches. Adam is just one trick for determining a hyperparameter.
They are usable everywhere derivative-based optimization is usable. Which certainly means SVMs, though since it's a shallow method you don't need much data to train it, and hence don't need a scalable optimization method (it would just be unnecessarily slow). But you certainly could do it if you somehow needed to. Here's the first hit on google for 'sgd svm': https://scikit-learn.org/stable/modules/generated/sklearn.li...
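For concreteness, a minimal sketch of that route (toy data): SGDClassifier with hinge loss is the linear SVM objective trained with stochastic first-order updates, and on a small problem it just does what a batch solver does, more slowly/noisily.

    # Sketch: linear SVM objective (hinge loss + L2 penalty) optimized with SGD,
    # per the scikit-learn estimator linked above. Toy dataset for illustration.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    sgd_svm = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000).fit(X, y)
    batch_svm = LinearSVC(C=1.0).fit(X, y)  # batch solver for comparison

    print(sgd_svm.score(X, y), batch_svm.score(X, y))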
The fact that you can't use first order optimization methods for graphical models is one answer to the question of why everyone doesn't use them. Though for small models there are deep networks which model them and are trained as per usual for neural networks. I think this is still an active research area.
Nice yea I would agree with the vast majority of this, only thing I would add is that Adam/gradient methods are still useful in a graphical model e.g. to get a MAP estimate (and then you can get a rough posterior estimate using variational methods or Laplace approximation once you find the MAP). But I agree I wasn’t clear about what I mean when I say graphical models since I think most people would understand graphical models to mean a full MCMC sampling of the posterior and marginalization over hyperparameters. I would say it’s useful to understand why people do that and why that is useful, but many times that is (1) overkill and (2) inspires overconfidence in the result because once we marginalize over our prior distribution people tend to forget that our prior may have been a complete fudge. I just mean graphical models as a tool for model building, understanding how different models relate to one another, and as a recipe for deriving a loss function.
Adam, RMSProp, etc are just flavors of gradient descent so they’re useful on anything from ResNet to logistic regression. There are more flavors like natural gradient that are more useful for smaller problems since they require a Hessian matrix, but gradient descent is gradient descent. We use Adam in production for logistic regression, not for any particular reason really, just happens to work.
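As a minimal sketch of that (synthetic data, arbitrary hyperparameters): logistic regression is just a one-layer model with a Bernoulli negative log-likelihood, so Adam drops in the same way it would for a ResNet.

    # Sketch: logistic regression trained with Adam in PyTorch. Data is synthetic.
    import torch

    torch.manual_seed(0)
    X = torch.randn(500, 4)
    y = (X @ torch.tensor([1.0, -1.0, 2.0, 0.0]) + 0.2 * torch.randn(500) > 0).float()

    model = torch.nn.Linear(4, 1)            # logits = Xw + b
    loss_fn = torch.nn.BCEWithLogitsLoss()   # negative Bernoulli log-likelihood
    opt = torch.optim.Adam(model.parameters(), lr=0.05)

    for _ in range(500):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(1), y)
        loss.backward()
        opt.step()

    print(model.weight.data, loss.item())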
I'm not the OP, but personally I see NN's as being really really useful where the input data is unstructured (such as text or images). The deep approach (appears to) build better features than a human can, but I'm not convinced that they are _that_ much better (or indeed at all) than standard methods for tabular data.
Once upon a time, when I used to hire data people, I'd ask them to tell me about a recent data project. They'd normally mention some kind of complex model, and I'd ask them how much better it was than linear/logistic regression. A really large proportion of candidates (around 50%) couldn't answer this because they'd never compared their approach to anything simpler.
One person told me that linear regression wasn't in the top 10 Kaggle models, so they would never use it.
Oh so training time is virtually irrelevant to us and if it weren’t we would have to be a lot more careful about optimization methods and possibly which language to use. We also cannot use NN for the models we build (we are restricted to LR, but LR has as much model capacity as you need as long as you include more and more feature interaction terms).
NN’s are universal function approximators. They can have arbitrary model capacity, and you can sort of control that with architecture decisions, loss function/regularization choices, and early stopping, but depending on the problem they can cause more problems than they solve. Usually you don’t really know if your NN will generalize well outside of your train/test distributions, so many times it’s better to have a simpler, more predictable model that you can control the behavior of. This is all from my personal experience and is completely moot when we’re talking about e.g. NLP or vision tasks or situations where you’re drowning in data. NNs are super interesting and powerful, don’t mean to suggest otherwise but the mantra is: “what is the right solution to my problem”. Lots of great advantages to NN’s as well (you can get them to do anything with enough cajoling and they can be solutions to major headaches you would usually have in e.g. kernel methods).
Different people learn in different ways, but personally I’ve had more success with the opposite approach, ie “top-down”.
As in, rather than learning in depth all the low level parts then finally putting it together at the end, start with a surface high-level understanding of a working prototype then expand into the details of how everything works inside.
In the case of ML, this could mean starting with a 5 line SciKit-learn prototype of a random forest model, seeing some working predictions, then expanding knowledge from there - what data is going in and what is coming out? What’s a classifier? What’s a decision tree? Etc
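Something like this minimal sketch (built-in toy dataset) is enough to see working predictions and start asking those questions:

    # Sketch: a bare-bones random forest prototype, roughly the "5 lines" mentioned above.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), random_state=0)
    model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
    print(model.predict(X_test[:5]), model.score(X_test, y_test))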
I support this learning method. Having a map of the key concepts, as well as visceral feel for them via code, will keep up the motivation.
This would be in contrast to picking up one of the plethora of “ML” textbooks that mostly only describe the math behind all the algorithms. Which is not where you should begin, in my view (years of teaching experience). The use of such textbooks is as a reference to fill in details once you are curious about them.
And more than anything, the best way to learn practical ML is to “apprentice” to some experienced practitioners or team who are willing to act as mentors.
Agree — people should do what they want and have fun learning. It's just a suggestion that’s also colored by my own experience. I will argue that if you’re going for a job in data science it is a bit more of a serious suggestion, since you need to be able to know how to answer statistical questions and understand your assumptions, and you really do need to understand Bayesian stats for that (nothing state of the art here, don’t worry if you don’t know what a PGM is, I just mean the basics).
Starting with PGMs would kill 99.9% of aspiring ML practitioners. Classes related to PGM at Stanford and MIT are considered to be some of the most difficult ones. I'd rather recommend to start with something they are enthusiastic about and once they become sufficiently advanced, to naturally learn (H)PGM.
Exactly. I’m talking about how to orient yourself. PGMs are hard, sure, but they don’t have to be outrageously hard. I would argue that if you do understand naive Bayes (and its naivety) and you understand priors, then PGMs are just like rules to a game called “make a diagram of your posterior”. That’s not all there is obviously, and that’s kind of my point; you can do a lot with a little bit of knowledge, and then you can slowly climb that ladder for a long time, and the more you learn, the more you can apply. Starting with an ad hoc approach (here are all of these classes in scikit-learn with .fit() functions, let’s just memorize their docstrings) isn’t “bad” per se, that knowledge is important, but it will not take you very deep, and you won’t be able to stray very far from those methods without being out of your depth and running into trouble.
Great comments. I heartily agree and support the statement about probabilistic graphical models. Just to add a couple more facets to this perspective:
'State of the art' does not always mean 'best for your task', and in fact lately depending on your field SOTA sometimes simply means 'unaffordable' for anyone whose budget is under 1 million dollars.
Try linear methods first.
Ensembles of decent models are usually good models. The point above about probability calibration can be at least somewhat mitigated by using ensemble averages.
Don't just assume "the $MODEL will figure it out" if you give it shitloads of degrees of freedom. Machine learning efficiency all comes down to efficiency of representation, and feature engineering can achieve huge payoffs if/when you incorporate domain knowledge and expertise.
Once you gain a perspective into the "universality" of statistical methods, optimization, and Bayesian probability theory, your work will become a lot easier to reason about. As an example, try to see if you can explain why least-squares fit results from the assumption that model residuals are normally distributed (and what connections this may have to statistical physics!).
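If you want to check that last exercise numerically rather than on paper, here's a minimal sketch (toy data, fixed noise variance): the argmin of the sum of squared errors and the argmax of the Gaussian log-likelihood are the same point, because the log-likelihood is just a negatively scaled SSE plus a constant.

    # Sketch: least squares == maximum likelihood under Gaussian residuals.
    # log N(y | Xw, sigma^2 I) = -SSE(w) / (2 sigma^2) + const, so the argmax ignores sigma.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))
    y = X @ np.array([3.0, -1.0]) + 0.5 * rng.normal(size=100)

    sigma2 = 0.25
    sse = lambda w: np.sum((y - X @ w) ** 2)
    neg_loglik = lambda w: sse(w) / (2 * sigma2) + 0.5 * len(y) * np.log(2 * np.pi * sigma2)

    print(minimize(sse, np.zeros(2)).x)         # identical up to solver tolerance
    print(minimize(neg_loglik, np.zeros(2)).x)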
Thanks for this insight. Can you kindly also suggest a good book for someone to start with Bayesian Statistics? I could really use a suggestion for a first and second book on this.
About Probabilistic Graphical Models, is there a book other than Daphne Koller's that you would suggest?
I think PGM's are covered by a lot of "standard" ML texts -- someone else mentioned Murphy's book which is great and is humongous but is a good reference for pretty much every method under the sun.
"Machine Learning: a Probabilistic Perspective" is more an encyclopedia of algorithms I would say, and it has lots of typos. I personally would not recommend it (except for the amount of algorithms that it covers, many of which are usually not found in other books).
Are those really the best starts for "Bayesian statistics"?
Especially the first two are rather the standard "intro to ML" textbooks, with a frequentist focus (ISL may even have zero Bayesian content, given that Naive Bayes is not really "Bayesian", while ESL still has maybe 10% Bayesian content, if that).
You make a good point. It's been a while since I flipped through them, they just come up in lots of discussions on this topic. I agree that the series you link to is really great for PPL and Bayesian methods. You may find that the library upon which it's based (PyMC3) is built on top of Theano, which has been abandoned and deprecated. PyMC4 is around the corner and uses TensorFlow Probability. Early, informal reports say it's 10x faster.
The former is a much-recommended book since it's very comprehensive and builds everything from the ground up, and it was the basis for the entire course. The latter is a beast of its own and we simply covered what was effectively the first chapter as part of the course.
A common type of example involves relatively small or uninformative datasets. Say you flip a coin a few times and only get heads. Your maximum likelihood (frequentist) estimate is that the coin will always land heads. In a Bayesian setting, if you have a (say uniform) prior on the probability that the coin lands heads, your maximum a posteriori estimate of this probability will be non-zero, but will continue to get smaller if you continue only seeing heads.
The above example is contrived, but makes more sense in the case of language modelling. Since a bag-of-words vector, containing say counts of words seen in a document, is typically sparse (most documents only contain a limited portion of the full vocabulary), a frequentist estimate of word probability will say that certain words can never occur, just because it's never seen them. The Bayesian estimate will still assign some non-zero chance of seeing that word.
Practically speaking, this leads to the idea of "smoothing" in tf-idf (term frequency-inverse document frequency) vectors, by adding 1 to document frequencies. You don't need Bayesian statistics to do this, but maybe you never would have thought of it otherwise!
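A minimal sketch of the add-one idea on made-up counts: the smoothed estimate is exactly the posterior mean under a uniform Dirichlet prior, so words you've never seen keep a small non-zero probability.

    # Sketch: additive (Laplace) smoothing of word probabilities; counts are made up.
    import numpy as np

    counts = np.array([10, 4, 0, 0])   # word counts in a corpus; two words never observed
    mle = counts / counts.sum()        # frequentist estimate: unseen words get exactly 0

    alpha = 1.0                        # "add one" == uniform Dirichlet prior
    smoothed = (counts + alpha) / (counts.sum() + alpha * len(counts))

    print(mle)       # unseen words get probability exactly 0
    print(smoothed)  # every word keeps some probability mass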
>Say you flip a coin a few times and only get heads. Your maximum likelihood (frequentist) estimate is that the coin will always land heads. In a Bayesian setting, if you have a (say uniform) prior on the probability that the coin lands heads, your maximum a posteriori estimate of this probability will be non-zero, but will continue to get smaller if you continue only seeing heads.
Not quite. If you have a uniform prior, there will be no difference between MAP and MLE.
>From the vantage point of Bayesian inference, MLE is a special case of maximum a posteriori estimation (MAP) that assumes a uniform prior distribution of the parameters.
Help me see how this requires a _thorough_ study of Bayesian methods.
What you describe seems to be (additive/Laplace) smoothing [1], a fairly basic concept. And I don't see how this is specifically Bayesian.
Then, one common critique of "Bayesian stats work well with small samples" is that with small data (weak signal), you get your prior back, so you haven't learned anything, and the only thing you're left with is your "bias", erm, prior.
I am still keen to learn why one needs a thorough study of Bayesian methods to help in practical machine learning tasks (meaning, on top of a bit of statistical or data science curriculum plus on the job experience).
* understanding where loss functions and regularization terms come from allows you to reason about the right choice for your problem and possibly to extend/tweak them to suit your needs. Are you working directly with probabilities? Then maybe you don’t want to use L2 regularization (Gaussian prior) but a beta prior or something with the right support. Are you modeling a Poisson rate (e.g. how many people buy my product for every dollar I spend on advertising)? Then use a Poisson likelihood (the loss function would be the negative log-likelihood).
* do you want to have your NN model your uncertainty as well as the mean? How do you incorporate that into the loss function? Hint: loss = (yhat - y)^2 / sigma_hat^2 is missing a term, but you wouldn’t know that if you don’t come from Bayesian stats (see the sketch after this list).
* the rabbit hole goes as deep as you want. Understanding Bayesian stats removes a lot of the “ad hoc” and intuitive guesswork that goes into ML when you don’t have a solid statistical foundation for what you’re doing.
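To make the uncertainty bullet above concrete, here's a minimal sketch (in PyTorch, names are mine) of the Gaussian negative log-likelihood when the model predicts a variance too: the missing term is the log of the predicted variance, and without it the model can shrink the loss toward zero just by inflating sigma.

    # Sketch: heteroscedastic Gaussian NLL; the log-variance term is the one the
    # plain (yhat - y)^2 / sigma_hat^2 loss is missing. Names/shapes are illustrative.
    import torch

    def gaussian_nll(y, mu_hat, log_var_hat):
        # -log N(y | mu_hat, exp(log_var_hat)), dropping the constant 0.5*log(2*pi)
        return 0.5 * (log_var_hat + (y - mu_hat) ** 2 / torch.exp(log_var_hat)).mean()

    y = torch.randn(8)
    mu_hat = torch.zeros(8, requires_grad=True)
    log_var_hat = torch.zeros(8, requires_grad=True)  # predict log-variance so sigma stays > 0
    loss = gaussian_nll(y, mu_hat, log_var_hat)
    loss.backward()
    print(loss.item())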
The most recent example has been supply failure detection in sales time series data with intermittent demand. I ended up using the approach described in The Longest Run of Heads by Mark F. Schilling, which is an outstandingly well-written stats paper and a pleasure to read.
Requesting best book(s) on probability estimation: techniques, model accuracy, and strategies in applying them (e.g. markets, marketing, business operations)?
Do you have any learning resources to recommend for Bayesian ML? Especially interested in more applied stuff, and ideally temporal and spatial modeling.
Thanks for the advice. Will definitely try to follow that. I was trying to learn the basics of statistics and went through most of Intro to Statistical Learning. Will complete the rest in a few days.
I am more of a book person, if you have any other resource for probabilistic graphical models, please share here.
Nice! There are many books that cover this, even the docs for Pyro/other libraries are useful, it just depends on your preference for how material is presented + your background.
Murphy's "Machine Learning: A Probabilistic Perspective" is another behemoth that covers this stuff, but it's really just your preference.
I say "aspire" because (1) depending on your background, it will likely be something that takes a while to internalize and really understand, and you will probably realize many times over that you thought you understood something that you actually didn't, and (2) by learning PGM's, you learn a lot of Bayesian statistics as a side effect, hence why even learning a little bit about them is rewarding.
Once you learn a bit, I would use Pyro/other libraries and try to actually build PGM's for toy problems (or non-toy problems too..) because (1) it will force you to admit to yourself that you don't understand something, (2) the documentation for a lot of these libraries is also useful learning material, and (3) you will see once you learn these libraries that it is fairly easy to do something that would be astoundingly complex if you were to try and do it by hand.
You can basically build most standard ML algorithms as a PGM, so e.g. you can try to do logistic regression as a PGM and compare the results to scikit-learn.
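A minimal sketch of that comparison, writing the negative log-posterior by hand instead of with a PPL just to keep it short: L2-regularized logistic regression in scikit-learn is the MAP estimate under a Gaussian prior on the weights (toy data; C = 1/lambda in sklearn's parameterization).

    # Sketch: logistic regression as MAP inference, checked against scikit-learn.
    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import expit
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=300, n_features=5, random_state=0)
    lam = 1.0  # prior precision == L2 penalty strength

    def neg_log_posterior(w):
        p = expit(X @ w)
        log_lik = np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        log_prior = -0.5 * lam * np.sum(w ** 2)   # Gaussian prior, up to a constant
        return -(log_lik + log_prior)

    w_map = minimize(neg_log_posterior, np.zeros(5)).x
    sk = LogisticRegression(C=1.0 / lam, fit_intercept=False).fit(X, y)

    print(w_map)
    print(sk.coef_.ravel())  # should agree closely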
Stay away, in my opinion. I spent a year supporting a SVM in a production machine learning application, and it made me wish the ML research community hadn't been so in love with them for so long.
They're the perfect blend of theoretically elegant and practically impractical. Training scales as O(n^3), serialized models are heavyweight, prediction is slow. They're like Gaussian Processes, except warped and without any principled way of choosing the kernel function. Applying them to structured data (mix of categorical & continuous features, missing values) is difficult. The hyperparameters are non-intuitive and tuning them is a black art.
GBMs/Random Forests are a better default choice, and far more performant. Even simpler than that, linear models & generalized linear models are my go-to most of the time. And if you genuinely need the extra predictiveness, deep learning seems like better bang for your buck right now. Fast.ai is a good resource if that's interesting to you.
Linear models are simpler. GBMs are more powerful, more flexible, and faster.
Every ML course I took had 3 weeks of problem sets on VC dimension and convex quadratic optimization in Lagrangian dual-space, while decision tree ensembles were lucky to get a mention. Meanwhile GBMs continue to win almost all the competitions where neural nets don't dominate.
I suspect my professors just preferred the nice theoretical motivation and fancy math.
SVMs are, by default, linear models. The decision boundary in the SVM problem is linear, and since it's the max margin we may enjoy nice generalization properties (as you probably know).
You probably also know that decision tree boundaries are non-linear and piecewise. It's not so straightforward to find splits on continuous features.
I.e. if the data is linearly separable then why not. Even using hinge loss with NNs is not uncommon.
You probably see GBMs winning a lot of competitions compared to SVMs because a lot of competitions may have a lot of data and non-linear decision boundaries. Some problems don't have these characteristics.
The kernel function is simple - are you in a high-dimensional space? If so, choose a linear kernel. Else? Choose the most non-linear one you can (usually the Gaussian/RBF kernel). I suppose quadratic and the other kernels are useful if what you're modeling looks like that, but in practice that is rare.
Prediction is not that slow with linear SVMs, especially not compared to something like k-NN. The main hyperparameters which matter are the "C" value and maybe class weights if you have recall or precision requirements. The C value is something that should be grid-searched, but you might as well be grid-searching everything that matters on every ML algorithm, and in this regard SVMs are fast to iterate over (because the C value is all that matters).
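For reference, that grid search is a couple of lines (toy data, arbitrary C grid):

    # Sketch: grid-searching C for a linear SVM; the grid and data are placeholders.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5).fit(X, y)
    print(search.best_params_, search.best_score_)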
Applying categorical and continuous features is not difficult if you choose to do it in anything more sophisticated than sklearn. Also, pd.get_dummies() exists (though it may lead to that slow prediction you're concerned about)
You're most likely right with GBM or Random Forests - though they can have all sorts of issues with parallelism if you're not on the right kind of system. You talk about linear models, but SVMs are usually using linear kernels anyway and are a generalization of linear models (including lasso and ridge regression models).
Agreed -- text processing is the one area where linear SVMs are a natural fit. All their attributes complement the domain. Linear SVMs also have desirable performance characteristics.
But at that point, they also have a lot in common with linear models. Those also seem practical in that domain (though I have less experience here, tbh). And performant, when using SGD + feature hashing like e.g. vowpal wabbit.
My beef with non-linear kernels and structured data is a longer discussion, but I find kernel methods for structured data (which is usually high-dimension but low-rank -- lots of shared structure between features, shared structure between missingness of features) to be highly problematic.
> Prediction is not that slow with linear SVMs especially not compared to something like K-NN.
Provided your structural dimensionality is below about 10 (i.e. ~10 dominant eigenvalues for your features), k-NN can be O(log(N)) for prediction via a well-designed k-d tree.
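A minimal sketch of that lookup with scikit-learn's KDTree (sizes and dimensions made up): build the tree once, then each query is roughly logarithmic in N when the effective dimensionality is low.

    # Sketch: k-NN prediction via a k-d tree; sizes are arbitrary.
    import numpy as np
    from sklearn.neighbors import KDTree

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100_000, 5))
    y_train = (X_train[:, 0] > 0).astype(int)

    tree = KDTree(X_train)                                   # built once
    dist, idx = tree.query(rng.normal(size=(10, 5)), k=15)
    y_pred = (y_train[idx].mean(axis=1) > 0.5).astype(int)   # majority vote over neighbors
    print(y_pred)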
KNN is also really simple to understand, and to design features for. It also never really tends to throw up surprises, which for production is the kind of thing you want. Most importantly, the failures tend to 'make sense' to humans, so you stay out of the uncanny valley.
I’d agree on the training time but your serialized model should be small on disk since only the support vectors are needed for inference. At least with my experience that has been true.
ITT: Whether SVMs are still relevant in the deep learning era. Some junior researchers will say neural networks are all you need. Industry folks will talk about how they still use decision trees.
Personally, I'm quite bullish on the resurgence of SVMs as SOTA. What did it for me was Mikhail Belkin's talk at IAS.[1]
I mean NNs are still quite bad at low-n tabular data (and they may always be), which is honestly what a lot of real-life data looks like, so there is clearly a need for something that is not a neural network.
I feel like I've seem more tree ensembles in the wild than SVMs, though.
I've been an ML practitioner since 2009. I've used every method imaginable or popular, I think, with the exception of non-linear SVMs. Linear SVM => all good, just the hinge loss optimization. Non-linear SVM => a bit of overkill with the basis expansion. Just too slow, or too complex a model?
My impression: SVMs are more of theoretical interest than practical interest. Yeah, learn your statistics. Loss functions. Additive models. Neural nets. Linear models. Decision trees, kNNs etc. SVM is more of a special interest, imho.
We can definitely learn a piece from such an experienced practitioner. Thanks for sharing; I think your intuition matches with the other experienced ones in the comments.