> Personally I'd advise against both SVM's and Bayesian methods for a beginner. Bayesian statistics is very much the deep end of the pool.
I don’t know, I think it depends on what you mean by Bayesian. I would say understanding loss functions and regularization requires some understanding of Bayesian stats (just knowing that it comes from log p(x|q) + log p(q) and what both of those terms mean).
> Graphical models and Bayesian methods generally may make a comeback but such approaches have been superseded by other methods for good reasons, i.e. scaling
Can you be more specific here? It sounds like you’re talking about a particular problem or class of methods. PGMs/Bayesian methods can mean basically anything from logistic regression to running HMC on some hierarchical model using 10,000 CPU hours. I just mean aspiring to learn PGMs will force you to quickly learn and gain a deeper understanding of and appreciation for Bayesian stats, and then you can build on that for years and years. But it depends on what you’re interested in doing: there’s a difference between model building and inference. You can spend your whole life using the same loss function and just focus on making your NN architecture better, and you don’t need much Bayesian stats to do that.
> i.e. "MAP" estimation which is more of a hack to ML than an actual Bayesian method
Huh? Maybe we mean different things by Bayesian — the mode of your posterior seems pretty Bayesian to me!
> Meanwhile a strong basis in "deterministic methods", as an alternative way to spend that learning effort, has its own rewards. The training algorithms for deep learning are also the hottest algorithm research area in machine learning, and are certainly applicable beyond deep learning. For that matter a thorough understanding of SVM delves into convex optimization, an extremely powerful framework as well.
I would agree that optimization is an important part of ML/DS, but since nowadays virtually all of the most popular optimization algorithms are available at our fingertips in e.g. PyTorch, I would still think it’s better to start with trying to build a fundamental understanding of how to frame problems. But that’s colored by my own experience and background; people’s priorities should be different depending on what they want to do.
To elaborate on the point: when doing probabilistic modeling, whether one realizes it or not, there is typically an underlying Bayesian formulation which explains what one is doing. That formulation might or might not be well aligned with the problem of interest; being clear on the fundamentals helps you judge that, and also helps you compose distinct ideas that make sense together in the context of the problem. E.g., see my comments below on linear regression from a Bayesian perspective.
Also, while "scaling" with data is a very hip thing these days, for most problems of interest it is very difficult/expensive to get lots of data (or afford compute). Further, humans often have very useful domain-models which are worth encoding into the structure of the model. This also helps nicely mix together a conventional "software" modeling with probabilistic aspects (for those who weren't aware, this flows towards what is called "probabilistic programming", and recent developments have made significant progress towards methods which work for an "intermediate" dimensionality, if not "large" dimensionality).
--
@astrophysician: Feels nice to see expressed so clearly, a perspective that I share! Feel free to get in touch if you'd like to discuss ML.
Additionally, you should probably learn a (little) R (which you can get from this book). This is not because R is a wonderful language (though I'm pretty fond of it myself) but because it's a great tool for the communication and expression of statistical methods.
Most good stats (which will help you be actually good at ML) books tend to either be written in mathematics, or R (or both). Given that you're already a programmer, R will probably make it easier for you to learn a bunch of this stuff (and the docs for R functions tend to point towards useful literature).
I actually travelled the other way (i.e. from stats to code) and I found R very very helpful. Of course, your mileage may vary, but the link above is probably the best single book that you could read to start learning ML.
Thank you! Providence wills that I have R studio installed to make a wordcloud, but I installed it without actually knowing what it is. (Just followed a tutorial to get my wordcloud :)
So thanks for the recommendation, looking forward to reading it!
By Bayesian methods I mean methods which also describe everything with statistical distributions including hyperparameters, and which solve for a distribution (or some hard-to-get feature of it like its mean or confidence intervals), instead of just a maximum probability estimate where you can make shortcuts.
For example Ridge regression where you put a gaussian prior on the predictor then guess at its variance as a hyperparameter for a quadratic penalty (with cross validation or something) is MAP but not actually "Bayesian". Bayesians also put a prior distribution on the hyperparameter itself making life extra difficult.
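To make the ridge/MAP correspondence above concrete, here is a minimal sketch that is not from the thread. It assumes a linear model with Gaussian noise of variance `sigma2` and a zero-mean Gaussian prior of variance `tau2` on the weights; the function names and the synthetic data are illustrative only. Under those assumptions, minimizing the negative log-posterior should land on the same weights as ridge regression with penalty `lam = sigma2 / tau2`.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge solution: argmin_w ||y - Xw||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def neg_log_posterior_grad(w, X, y, sigma2, tau2):
    """Gradient of -log p(y | X, w) - log p(w) for a Gaussian likelihood and prior."""
    return -(X.T @ (y - X @ w)) / sigma2 + w / tau2

def map_by_gradient_descent(X, y, sigma2, tau2, steps=5000):
    """Minimize the negative log-posterior with plain gradient descent."""
    # Step size 1/L, where L is the largest eigenvalue of the (constant) Hessian.
    H = X.T @ X / sigma2 + np.eye(X.shape[1]) / tau2
    lr = 1.0 / np.linalg.eigvalsh(H).max()
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * neg_log_posterior_grad(w, X, y, sigma2, tau2)
    return w

# Synthetic data (assumed values, chosen only for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
sigma2, tau2 = 0.25, 4.0
y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=50)

w_ridge = ridge_closed_form(X, y, lam=sigma2 / tau2)
w_map = map_by_gradient_descent(X, y, sigma2, tau2)
```

Here `tau2` is fixed by hand, which is the "guess at its variance" step described above; a fully Bayesian treatment would instead put a prior on `tau2` and marginalize over it.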
I agree it's an awkward definition but it does differentiate between the easy tricks and the tough approaches Bayesian researchers actually work on.
Even in deep learning it's simple to put a penalty term on the loss or weights for that matter, so you could call it a MAP estimate. That's implemented everywhere. For that matter you can claim any neural net classifier with a sigmoid or softmax at the final layer is doing logistic regression. With multilayer networks just adding a representation learning stage to it.
As I said cheers to statistics, just that stopping at MAP estimates will cover most everything. If one goes looking for a Bayesian methods text or review article it will be well beyond that.
By scaling problems I basically mean the fact that Monte Carlo doesn't work in large numbers of dimensions, where the parameters of the joint distribution of everything are vast.
> By Bayesian methods I mean methods which also describe everything with statistical distributions including hyperparameters, and which solve for a distribution (or some hard-to-get feature of it like its mean or confidence intervals), instead of just a maximum probability estimate where you can make shortcuts.
Ah I see yea, if you’re talking about hyperpriors and marginalization and MCMC I totally agree; these are really valuable in science, and this is the “full Bayesian solution” but with many caveats, one big one being that it’s very easy to be overly confident in the results of some unwieldy hierarchical model and ignore the fact that the priors (or hyperpriors) are often times fudge factors because it’s so damn hard to articulate your prior belief as a bona fide distribution. A lot of times you run into a “garbage in garbage out” problem and it might not be obvious right away since we are probably busy patting ourselves on the back and popping champagne bottles because we’ve gotten our MCMC chains to converge.
> For example Ridge regression where you put a gaussian prior on the predictor then guess at its variance as a hyperparameter for a quadratic penalty (with cross validation or something) is MAP but not actually "Bayesian". Bayesians also put a prior distribution on the hyperparameter itself making life extra difficult.
Yea right — nothing wrong with cross validation in a lot of cases. Since yea we could go with the “full Bayesian solution” and marginalize over some distribution of our hyperprior, but that very well might not give you anything more than cross validation (or maybe will give you something worse if you have a bad hyperprior).
> Even in deep learning it's simple to put a penalty term on the loss or weights for that matter, so you could call it a MAP estimate. That's implemented everywhere.
Agreed, my only argument is that it’s helpful to know where all of these terms come from in a general sense rather than treat each one as some ad hoc solution.
> For that matter you can claim any neural net classifier with a sigmoid or softmax at the final layer is doing logistic regression. With multilayer networks just adding a representation learning stage to it.
Agree and that’s exactly the sort of valuable insight that’s helpful to have (though maybe this one isn’t really an insight from Bayesian stats).
> As I said cheers to statistics, just that stopping at MAP estimates will cover most everything. If one goes looking for a Bayesian methods text or review article it will be well beyond that.
Sure, if you know MAP and what it means, then you know what a posterior means and you know that every model you fit and loss function you use comes from that, and you know if you care about more than a mode or mean of the posterior you can use lots of tools to deal with that. I’ve never really encountered MCMC in the work I do and I don’t think it would really bring any additional insight into what we’re doing, but sometimes we care about uncertainty and Laplace approximation or variational inference can do the job just fine.
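As a rough illustration of how a Laplace approximation gets uncertainty out of a MAP fit, here is a minimal sketch that is not from the thread. It assumes logistic regression with a zero-mean Gaussian prior of variance `tau2` on the weights, finds the MAP weights with undamped Newton steps, and uses the inverse Hessian at the mode as an approximate posterior covariance. All names and the synthetic data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_approximation(X, y, tau2=1.0, iters=50):
    """Gaussian approximation N(w_map, H^{-1}) to a logistic-regression posterior."""
    d = X.shape[1]
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)
        # Gradient and Hessian of the negative log-posterior.
        grad = X.T @ (p - y) + w / tau2
        H = X.T @ (X * (p * (1 - p))[:, None]) + np.eye(d) / tau2
        w = w - np.linalg.solve(H, grad)  # Newton step (no line search)
    # Curvature at the mode gives the approximate posterior covariance.
    p = sigmoid(X @ w)
    H = X.T @ (X * (p * (1 - p))[:, None]) + np.eye(d) / tau2
    return w, np.linalg.inv(H)

# Synthetic, non-separable data (assumed values, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([1.0, -0.5])
y = (rng.random(200) < sigmoid(X @ w_true)).astype(float)

w_map, cov = laplace_approximation(X, y)
w_std = np.sqrt(np.diag(cov))  # approximate posterior standard deviations
```

The point is only that the extra cost over a MAP fit is one Hessian at the mode; whether a Gaussian is a good enough description of the posterior depends on the problem.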
> By scaling problems I basically mean the fact that Monte Carlo doesn't work in large numbers of dimensions, where the parameters of the joint distribution of everything are vast.
While I would agree MCMC and its variants are not a good idea a lot of the time because they require a lot of time and computation, very rarely they are the right tool (say you want insights from your data to use internally and you really want to explore the implications of many different assumptions), and HMC etc. can deal with very high-dimensional settings (> millions of parameters). But generally I agree: if you’re talking about “scalable solutions”, it usually means MCMC is not really what you want to be doing.
How does understanding loss functions and regularization require understanding Bayesian statistics? Those notions are literally part of linear regression theory.
Only if you use them as a black box. What is considered l2 norm in regression theory corresponds to a Gaussian likelihood, whose log-likelihood becomes quadratic. And one typically might not have any priors (which is subtly misleading, because the "flat" prior is highly dependent on the chosen representation for the underlying degree of freedom). Note that regularizations are just log-priors.
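Spelled out, the correspondence reads roughly as follows. This is a sketch in the thread's notation (data x, parameters q); the model function f_q, the inputs t_i, the noise scale \sigma and the prior scale \tau are symbols assumed here for illustration.

```latex
\begin{aligned}
\hat{q}_{\mathrm{MAP}}
  &= \arg\max_q \,\bigl[\log p(x \mid q) + \log p(q)\bigr] \\
  &= \arg\min_q \,\Bigl[\underbrace{-\log p(x \mid q)}_{\text{loss}}
     \;\underbrace{-\,\log p(q)}_{\text{regularizer}}\Bigr], \\[4pt]
x_i \sim \mathcal{N}\bigl(f_q(t_i), \sigma^2\bigr)
  &\;\Rightarrow\; -\log p(x \mid q)
   = \frac{1}{2\sigma^2}\sum_i \bigl(x_i - f_q(t_i)\bigr)^2 + \text{const}, \\
q \sim \mathcal{N}(0, \tau^2 I)
  &\;\Rightarrow\; -\log p(q)
   = \frac{1}{2\tau^2}\lVert q \rVert_2^2 + \text{const}.
\end{aligned}
```

With a flat prior the second term is constant and the MAP estimate reduces to ordinary least squares; with the Gaussian prior it is L2-regularized least squares with weight \sigma^2 / \tau^2 on the penalty.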
Without an understanding of the underlying Bayesian formulation, linear regression theory might look like a vast and somewhat ad-hoc collection of separate ideas (eg: robust loss functions, and the many different kinds of regularizations), but seen in the correct light, it is easy to start with a general formulation and specialize it nicely to your problem. Working that way, you can often design a good solution for your problem without searching through handbooks for possible pre-defined methods. You can also combine multiple ideas very easily. eg: A couple of weeks ago, working from first principles I rediscovered what is called "Lavrentyev_regularization": https://en.wikipedia.org/wiki/Tikhonov_regularization#Lavren...
Regularization is used by mathematicians and engineers with a different theoretical perspective. A regular solution is one that is smooth or well-behaved in some sense. The motivation to choose the penalty is based on desired properties of the solution. L2 regularization leads to the least-length solution, L1 is chosen if one wants the sparsest solution, and so on. It is a bit less of a deep motivation, but then the decision of which model to choose is always somewhat ad hoc.
A lot of tools in ML might be used in other places with other interpretations, and the intuition for L1 and L2 that you describe is not wrong at all, but (1) ML/DS is usually done in a statistical context so I would argue that it’s a good idea to understand that formalism, and (2) that intuition doesn’t sound like it would help you build more complex statistical models, whereas understanding where L1/L2 come from in a Bayesian context definitely would help you understand what you would need to do to form a regularization term for e.g. a probability, or how to edit your loss to learn uncertainty. It also helps you understand what not to do and why not.
All of this is opinion for the most part, and if you feel there is more to learn from alternative interpretations, fine, but the suggestion is to understand the fundamentals of what you’re doing, and you’re usually doing statistics in ML/DS (whether you know it or not). Also, understanding Bayesian stats will make your life easier and it will make it easier to understand lots of other ML concepts in a unified way rather than in an ad hoc way: “minimum length solution” or “sparse solution” is what I mean by ad hoc. Both of those things are true and important, but they’re ad hoc.
I'd say pulling a "prior belief" out of the air for your problem, especially if it is conveniently chosen because it is one that is easy to work with, fits your rather broad definition of ad hoc too.
I'd even say the deterministic view is dominant currently. So yes by thinking differently one can get intuition beyond the common knowledge. But it's a nice-to-have not a necessity.
And one can do statistics without being a Bayesian of course.
> I'd say pulling a "prior belief" out of the air for your problem, especially if it is conveniently chosen because it is one that is easy to work with, fits your rather broad definition of ad hoc too.
This is the number one misunderstanding when it comes to Bayesian stats. Priors are hard, priors are often bullshit, priors are often the source of a “garbage in garbage out” problem, absolutely. I don’t mean to suggest Bayesian stats as something magical (magical thinking will get you in trouble). But whenever a statement like this is made, the implication is that there is some alternative where we can solve the same problem but without priors. That’s just not true: priors are an unavoidable fact of life. If you’re not explicit about your prior, it means you’re still using one but not being upfront about it. So I would agree that priors are difficult and problematic, but they are also unavoidable, and I would not say they’re “ad hoc”. I would also say it’s important to understand what they are.
> I'd even say the deterministic view is dominant currently. So yes by thinking differently one can get intuition beyond the common knowledge. But it's a nice-to-have not a necessity.
I don’t know what you mean by “deterministic”...do you mean “frequentist”? If so I would disagree completely. Frequentist and Bayesian views are equivalent except for philosophy, and frequentist stats are taught at all levels of school until grad school (at least in my experience) and I think that’s a huge mistake. What do you mean by “nice to have not a necessity”? If you’re solving a statistical question, stats are a necessity. Other fields are the nice-to-have intuition. I would agree however that sometimes you’re solving a NON-stats problem in which case have at it with whatever field makes sense.
> And one can do statistics without being a Bayesian of course.
Again, fine, I agree you can use the same math with a different philosophy, the philosophy is up to you, but if you think somehow you can do inference without priors I’m sorry but that’s wrong. In my experience “Frequentist” usually has meant Bayesian but with a flat prior (please comment if you have a counter example).
In summary: study what you want, and lots of perspectives bring more understanding, absolutely. But I stand by the importance of understanding Bayesian stats for doing ML. Even if you don’t like Bayesian stats, it’s still important to understand what is going on. Also I should be clear by “Bayesian” I mean nothing more than understanding what posteriors, priors, and likelihoods are, not a hierarchical model with MCMC or something.
> But whenever a statement like this is made, the implication is that there is some alternative where we can solve the same problem but without priors.
I don't see the misunderstanding you mean. I said if you think a selection of the least-length or sparsest solution is ad hoc, then so is your choice of prior. Solving the same system without priors would be analogous (actually mathematically equivalent) to solving an inverse problem without regularization. Or failing to solve it in the ill-posed case.
As for deterministic, I mean not probabilistic. As in linear algebra and "curve fitting".
As for "nice-to-have", I mean you can do machine learning without having any of the statistical understanding we've talked about and instead making various choices simply "because they work".
As for statistics without being a Bayesian, I did mean frequentist, though that may not cover everyone. You can even use a prior distribution that is estimated from data (people commonly do that with the naive Bayes method), whatever you want to call such a person. I wouldn't call them a Bayesian. You can simply view it as applying the chain rule of probability to get a more convenient form of your maximum likelihood equation.
> I don't see the misunderstanding you mean. I said if you think a selection of the least-length or sparsest solution is ad hoc, then so is your choice of prior. Solving the same system without priors would be analogous (actually mathematically equivalent) to solving an inverse problem without regularization. Or failing to solve it in the ill-posed case.
Oh totally, I agree with that.
> As for deterministic, I mean not probabilistic. As in linear algebra and "curve fitting".
> As for "nice-to-have", I mean you can do machine learning without having any of the statistical understanding we've talked about and instead making various choices simply "because they work".
Yea I agree with this too, at least in principle. No issue with solving a lot of problems from a non statistical perspective since many times statistics is not the clear “right” choice. E.g. understanding that L1 regularization corresponds to a “Laplace prior” doesn’t give you that much deeper of an understanding of what you’re doing, since most people use L1 regularization to encourage sparsity. Also, if you’re more comfortable with a non-stats perspective on things, no problem approaching problems in the way you prefer.
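For reference, the L1/Laplace-prior correspondence mentioned above can be checked numerically. This is a minimal sketch, not from the thread, assuming independent zero-mean Laplace priors with scale `b` on each parameter; the function names and test vectors are hypothetical.

```python
import numpy as np

def laplace_neg_log_prior(q, b):
    """-log p(q) for independent Laplace(0, b) priors on each component of q."""
    return np.sum(np.abs(q) / b + np.log(2.0 * b))

def l1_penalty(q, lam):
    """The usual L1 regularization term."""
    return lam * np.sum(np.abs(q))

b = 0.25
q1 = np.array([0.3, -1.2, 0.0])
q2 = np.array([2.0, 0.1, -0.7])

# The two expressions differ only by a constant that does not depend on q,
# so adding either to a loss gives the same minimizer when lam = 1 / b.
gap1 = laplace_neg_log_prior(q1, b) - l1_penalty(q1, lam=1.0 / b)
gap2 = laplace_neg_log_prior(q2, b) - l1_penalty(q2, lam=1.0 / b)
```

Both gaps equal `len(q) * log(2 * b)`, the normalization constant of the prior.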
Summary: I agree with everything you’ve said here. All that’s left is I think a difference of opinion about how important it is to understand the Bayesian perspective and I think that likely comes down to (1) the types of problems you typically work on, and (2) personal preference. I find personally that understanding the Bayesian interpretation is extremely helpful for building a deeper understanding of a wide variety of ML algorithms but I totally concede this is not necessarily a hard truth. So I stand by my advice, but will definitely agree that there are alternatives. I took the route of understanding ML without Bayesian stats first — really didn’t understand or know Bayesian stuff for a decent amount of time after I got into ML. I’ve found the Bayesian perspective has helped tremendously but that’s just me.
I get it, understanding the Bayesian perspective on linear regression requires understanding Bayesian statistics. But linear regression is not solely a Bayesian technique.
You’re right, but in ML/DS (and I would argue, even in other contexts you might not consider ML/DS) you usually are doing things in a Bayesian context (“what can I say about my model parameters given my observations and assumptions”). If you’re only doing OLS with maybe a L1 or L2 regularization, it’s not critical you understand the Bayesian interpretation, but it helps, and as soon as you start venturing out into other ideas and messier/biased data and want to modify something, you’ll need a Bayesian understanding of what you’re doing.