
Regularization is used by mathematicians and engineers with a different theoretical perspective. A regular solution is one that is smooth or well-behaved in some sense. The motivation to choose the penalty is based on desired properties of the solution. L2 regularization leads to the least-length solution, L1 is chosen if one wants the sparsest solution, and so on. It is a bit less of a deep motivation, but then the decision of which model to choose is always somewhat ad hoc.
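To make that concrete, here's a rough sketch (toy data and arbitrary penalty strengths, using scikit-learn's Ridge and Lasso): the L2 penalty shrinks every coefficient but leaves them nonzero, while the L1 penalty drives most of them exactly to zero.

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    # Toy data: 10 features, only the first two actually matter.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty
    lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty

    print(ridge.coef_)  # all 10 coefficients shrunk, none exactly zero
    print(lasso.coef_)  # most coefficients exactly zero: the sparse solution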



A lot of tools in ML might be used in other places with other interpretations, and the intuition for L1 and L2 that you describe is not wrong at all, but (1) ML/DS is usually done in a statistical context so I would argue that it’s a good idea to understand that formalism, and (2) that intuition doesn’t sound like it would help you build more complex statistical models, whereas understanding where L1/L2 come from in a Bayesian context definitely would help you understand what you would need to do to form a regularization term for e.g. a probability, or how to edit your loss to learn uncertainty. It also helps you understand what not to do and why not.
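To sketch what I mean (standard MAP estimation, constants dropped):

    \hat{w}_{\mathrm{MAP}} = \arg\max_w p(w \mid D)
                           = \arg\min_w \big[ -\log p(D \mid w) - \log p(w) \big]

    % Gaussian prior:  p(w) \propto \exp(-\lambda \lVert w \rVert_2^2)  =>  L2 penalty
    % Laplace prior:   p(w) \propto \exp(-\lambda \lVert w \rVert_1)    =>  L1 penalty

So the -log p(w) term is the regularizer, and the same recipe tells you how to build one for something less standard, e.g. a Beta prior on a probability parameter.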

All of this is opinion for the most part, and if you feel there is more to learn from alternative interpretations, fine, but the suggestion is to understand the fundamentals of what you’re doing, and you’re usually doing statistics in ML/DS (whether you know it or not). Also, understanding Bayesian stats will make your life easier and will let you understand lots of other ML concepts in a unified way rather than in an ad hoc way: “minimum length solution” or “sparse solution” is what I mean by ad hoc. Both of those things are true and important, but they’re ad hoc.


I'd say pulling a "prior belief" out of the air for your problem, especially if it is conveniently chosen because it is one that is easy to work with, fits your rather broad definition of ad hoc too.

I'd even say the deterministic view is dominant currently. So yes, by thinking differently one can get intuition beyond the common knowledge. But it's a nice-to-have, not a necessity.

And one can do statistics without being a Bayesian of course.


> I'd say pulling a "prior belief" out of the air for your problem, especially if it is conveniently chosen because it is one that is easy to work with, fits your rather broad definition of ad hoc too.

This is the number one misunderstanding when it comes to Bayesian stats. Priors are hard, priors are often bullshit, priors are often the source of a “garbage in garbage out” problem, absolutely. I don’t mean to suggest Bayesian stats as something magical (magical thinking will get you in trouble). But whenever a statement like this is made, the implication is that there is some alternative where we can solve the same problem but without priors. That’s just not true: priors are an unavoidable fact of life. If you’re not explicit about your prior, it means you’re still using one but not being upfront about it. So I would agree that priors are difficult and problematic, but they are also unavoidable, and I would not say they’re “ad hoc”. I would also say it’s important to understand what they are.

> I'd even say the deterministic view is dominant currently. So yes, by thinking differently one can get intuition beyond the common knowledge. But it's a nice-to-have, not a necessity.

I don’t know what you mean by “deterministic”... do you mean “frequentist”? If so I would disagree completely. Frequentist and Bayesian views are equivalent except for philosophy, and frequentist stats are what’s taught at all levels until grad school (at least in my experience), and I think that’s a huge mistake. What do you mean by “nice-to-have, not a necessity”? If you’re solving a statistical question, stats are a necessity. Other fields are the nice-to-have intuition. I would agree however that sometimes you’re solving a NON-stats problem, in which case have at it with whatever field makes sense.

> And one can do statistics without being a Bayesian of course.

Again, fine, I agree you can use the same math with a different philosophy, the philosophy is up to you, but if you think somehow you can do inference without priors, I’m sorry, but that’s wrong. In my experience “Frequentist” has usually meant Bayesian but with a flat prior (please comment if you have a counterexample).
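Concretely (same MAP sketch as before, now with a flat prior): if p(w) is constant then -log p(w) is a constant, so

    \arg\min_w \big[ -\log p(D \mid w) - \log p(w) \big] = \arg\min_w \big[ -\log p(D \mid w) \big]

i.e. MAP collapses to plain maximum likelihood. The prior didn't disappear, it just stopped showing up in the loss.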

In summary: study what you want, and lots of perspectives bring more understanding, absolutely. But I stand by the importance of understanding Bayesian stats for doing ML. Even if you don’t like Bayesian stats, it’s still important to understand what is going on. Also I should be clear by “Bayesian” I mean nothing more than understanding what posteriors, priors, and likelihoods are, not a hierarchical model with MCMC or something.


> But whenever a statement like this is made, the implication is that there is some alternative where we can solve the same problem but without priors.

I don't see the misunderstanding you mean. I said that if you think selecting the least-length or sparsest solution is ad hoc, then so is your choice of prior. Solving the same system without priors would be analogous (actually mathematically equivalent) to solving an inverse problem without regularization. Or failing to solve it in the ill-posed case.
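To make that equivalence concrete, here's a toy numpy sketch (made-up sizes): an underdetermined linear inverse problem has infinitely many exact solutions, and Tikhonov/L2 regularization is what makes it well-posed by picking out (approximately) the least-length one.

    import numpy as np

    # Toy ill-posed inverse problem: 5 observations, 20 unknowns.
    rng = np.random.default_rng(1)
    A = rng.normal(size=(5, 20))
    x_true = np.zeros(20)
    x_true[:3] = [1.0, -2.0, 0.5]
    b = A @ x_true

    # Without regularization the normal-equations matrix A^T A has rank 5, not 20,
    # so there is no unique solution: the problem is ill-posed as stated.
    print(np.linalg.matrix_rank(A.T @ A))  # 5

    # Tikhonov (L2) regularization restores uniqueness and selects (approximately)
    # the least-length solution among the infinitely many that fit the data.
    lam = 1e-6
    x_l2 = np.linalg.solve(A.T @ A + lam * np.eye(20), A.T @ b)
    print(np.linalg.norm(A @ x_l2 - b))  # essentially zero: it still fits the data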

As for deterministic, I mean not probabilistic. As in linear algebra and "curve fitting".

As for "nice-to-have", I mean you can do machine learning without having any of the statistical understanding we've talked about and instead making various choices simply "because they work".

As for statistics without being a Bayesian, I did mean frequentist, though that may not cover everyone. You can even use a prior distribution that is estimated from data (people commonly do that with the naive Bayes method), whatever you want to call such a person. I wouldn't call them a Bayesian. You can simply view it as applying the chain rule of probability to get a more convenient form of your maximum likelihood equation.
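For example, a quick sketch with scikit-learn's GaussianNB (toy data): the class "prior" is nothing more than the empirical class frequency estimated from the training labels.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    # Toy data: 3 examples of class 0, 4 examples of class 1.
    X = np.array([[1.0], [1.2], [0.9], [3.0], [3.1], [2.8], [3.2]])
    y = np.array([0, 0, 0, 1, 1, 1, 1])

    clf = GaussianNB().fit(X, y)
    # The "prior" here is estimated from the data: just the class frequencies.
    print(clf.class_prior_)  # [3/7, 4/7] ~= [0.4286, 0.5714]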


> I don't see the misunderstanding you mean. I said that if you think selecting the least-length or sparsest solution is ad hoc, then so is your choice of prior. Solving the same system without priors would be analogous (actually mathematically equivalent) to solving an inverse problem without regularization. Or failing to solve it in the ill-posed case.

Oh totally, I agree with that.

> As for deterministic, I mean not probabilistic. As in linear algebra and "curve fitting". As for "nice-to-have", I mean you can do machine learning without having any of the statistical understanding we've talked about and instead making various choices simply "because they work".

Yea I agree with this too, at least in principle. No issue with solving a lot of problems from a non statistical perspective since many times statistics is not the clear “right” choice. E.g. understanding that L1 regularization corresponds to a “Laplace prior” doesn’t give you that much deeper of an understanding of what you’re doing, since most people use L1 regularization to encourage sparsity. Also, if you’re more comfortable with a non-stats perspective on things, no problem approaching problems in the way you prefer.

Summary: I agree with everything you’ve said here. All that’s left is I think a difference of opinion about how important it is to understand the Bayesian perspective and I think that likely comes down to (1) the types of problems you typically work on, and (2) personal preference. I find personally that understanding the Bayesian interpretation is extremely helpful for building a deeper understanding of a wide variety of ML algorithms but I totally concede this is not necessarily a hard truth. So I stand by my advice, but will definitely agree that there are alternatives. I took the route of understanding ML without Bayesian stats first — really didn’t understand or know Bayesian stuff for a decent amount of time after I got into ML. I’ve found the Bayesian perspective has helped tremendously but that’s just me.



