
On what kind of problems does Stan / Bayesian inference beat the much more hyped TensorFlow / deep learning approach?

Often you hear that deep learning is best for unstructured data (images, sound, and recently raw text) and boosted trees / XGBoost for tabular data.




Both Bayesian inference and deep learning can do function fitting: given a number of observations y and explanatory variables x, you try to find a function f so that y ~ f(x). The function f can have few parameters (e.g. f(x) = ax + b for linear regression) or millions of parameters (the usual case for deep learning). You can try to find the single best value for each of these parameters, or admit that each parameter has some uncertainty and try to infer a distribution for it.

The first approach uses optimization, and in the last decade that's been done via various flavors of gradient descent. The second uses Markov chain Monte Carlo (MCMC). When you have few parameters, gradient descent is smoking fast. Above a certain number of parameters (which is surprisingly small, let's say about 100), gradient descent fails to converge to the optimum, but in many cases gets to a place that is "good enough". Good enough to make the practical applications useful. In pretty much all cases, though, Bayesian inference via MCMC is painfully slow compared to gradient descent.
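
To make that concrete, here is a rough numpy sketch of the two approaches on the same toy linear model (my own made-up example, nothing specific to Stan or TensorFlow): gradient descent hunts for one best (a, b), while a crude Metropolis sampler produces a whole cloud of plausible (a, b) pairs, at the cost of many more likelihood evaluations.

    import numpy as np

    # toy data: y ~ Normal(a*x + b, 1) with true a=2, b=1
    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = 2.0 * x + 1.0 + rng.normal(size=50)

    def log_lik(a, b):
        return -0.5 * np.sum((y - (a * x + b)) ** 2)

    # optimization: plain gradient descent on the negative log-likelihood (i.e. MLE)
    a, b = 0.0, 0.0
    for _ in range(2000):
        r = y - (a * x + b)
        a += 1e-3 * np.sum(r * x)
        b += 1e-3 * np.sum(r)
    # (a, b) is now a single point estimate

    # inference: random-walk Metropolis (with flat priors) gives samples of (a, b) instead
    cur = np.array([0.0, 0.0])
    cur_ll = log_lik(*cur)
    samples = []
    for _ in range(20000):
        prop = cur + rng.normal(scale=0.1, size=2)
        prop_ll = log_lik(*prop)
        if np.log(rng.uniform()) < prop_ll - cur_ll:
            cur, cur_ll = prop, prop_ll
        samples.append(cur.copy())
    samples = np.array(samples)  # a distribution over (a, b), not a single value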

But there is a case where it makes sense: when you have reasonably few parameters and you can understand their meaning. And this is exactly the case for what are called "statistical models". That's why Stan is called a statistical modeling language.

How is that? Gradient descent for these small-ish models is just MLE (maximum likelihood estimation). People have been doing MLE for 100 years, and they understand its ins and outs. There are some models that are simply unsuited for MLE; their likelihood is called "singular": there are places in parameter space where the likelihood becomes infinite even though the fit is quite poor. One way to fix that is to "regularize" the problem, i.e. to add some artificial penalty that does not allow the likelihood to become infinite. But this regularization is often subjective: you never know whether the penalty you add is small enough not to alter the final fit. Another way is to do Bayesian inference. It's very slow, but you don't get pulled towards the singular parameters.
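
As an illustration of the "singular" problem (my own toy example, not the commenter's): in a two-component Gaussian mixture, shrinking one component's scale onto a single data point sends the likelihood to infinity even though the fit is terrible. A prior on the scale (which is roughly what the regularization penalty amounts to) keeps the estimate away from that spike.

    import numpy as np
    from scipy.stats import norm

    y = np.array([0.0, 1.5, 2.3, 3.1])

    def mix_loglik(sigma):
        # one component pinned exactly on y[0] with spread sigma, one broad component
        comp1 = 0.5 * norm.pdf(y, loc=y[0], scale=sigma)
        comp2 = 0.5 * norm.pdf(y, loc=y.mean(), scale=1.0)
        return np.log(comp1 + comp2).sum()

    print(mix_loglik(1.0))    # a sensible value
    print(mix_loglik(1e-8))   # much larger: the likelihood diverges as sigma -> 0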


Stan: Predict the values of parameters in a model

Deep Learning: Predict an outcome variable

For example, if I want to know what effect household income has on a student's chance of getting into college, Stan would let me estimate that effect given a proposed model.

If instead I wanted to predict a given student's chance of getting into college, I might use machine learning.

Of course, those two problems are linked, but it's a fundamental difference of focus.
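
As a hedged sketch of the difference, with made-up data (and using PyMC3 rather than Stan just to keep the example in Python): the same fitted Bayesian logistic regression can answer both questions. The posterior for the income coefficient is the "effect" question; pushing a new student's income through the posterior is the "prediction" question.

    import numpy as np
    import pymc3 as pm

    # made-up data: standardized household income and admission outcomes
    rng = np.random.default_rng(1)
    income = rng.normal(size=200)
    admit = (rng.uniform(size=200) < 1 / (1 + np.exp(-(0.5 * income - 1.0)))).astype(int)

    with pm.Model():
        alpha = pm.Normal("alpha", 0.0, 2.0)
        beta = pm.Normal("beta", 0.0, 1.0)   # the "effect of income" parameter
        pm.Bernoulli("admit", logit_p=alpha + beta * income, observed=admit)
        trace = pm.sample(1000, tune=1000, chains=2)

    # inference: what do we believe about the effect of income?
    print(trace["beta"].mean(), np.percentile(trace["beta"], [2.5, 97.5]))

    # prediction: chance of admission for one student with income = 1.5
    # (a distribution of probabilities, not a single point)
    p = 1 / (1 + np.exp(-(trace["alpha"] + trace["beta"] * 1.5)))
    print(p.mean(), p.std())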


While it is true that Bayesian inference is very powerful in that it allows one to introspect and view the effects of the model's parameters on the outcome, it is equally good at predicting the outcome variable. It just depends on what you want to get out of it. In fact you get more information about your outcome variable from Bayesian inference, since the prediction is a full distribution.

I'm not saying it is better than DL by any means, as DL can scale much better. Just that I don't think it's necessary to pigeonhole Bayesian inference into only estimating parameters. In my opinion the "fundamental difference of focus" is just a personal decision, not something inherent to the method.


The focus of statistical models (including Bayesian models) is on inference and uncertainty (both for parameter values and for predictions); the focus of ML models (including DL models) is on prediction, and it is rarely possible to obtain any quantification of uncertainty.


> rarely possible to obtain any quantification of uncertainty.

Can't this be estimated via bootstrapping?
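
It can, at least roughly. A hedged sketch (my own toy setup, using scikit-learn's gradient boosting as a stand-in black box): refit on resampled rows and look at the spread of predictions. Whether those intervals have the right coverage is another matter, as the reply below notes.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)
    x_new = np.array([[0.5, -0.2, 1.0]])

    preds = []
    for _ in range(100):
        idx = rng.integers(0, len(y), size=len(y))   # resample rows with replacement
        model = GradientBoostingRegressor().fit(X[idx], y[idx])
        preds.append(model.predict(x_new)[0])

    print(np.mean(preds), np.std(preds))   # spread across bootstrap refits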


For complex models it is very challenging to know what coverage those bootstrap intervals actually have (and it can also be extremely computationally intensive).


I guess Bayesian models will tend to underfit while DL may tend to overfit.


Stan gives you the ability to do probabilistic reasoning. There is actually Tensorflow Probability (https://www.tensorflow.org/probability) which has a lot of overlapping algorithms, but isn't as mature and approaches some things differently.

The main difference is that with Stan you think in terms of random variables and distributions (and their transformations), while with Tensorflow/DL you think in terms of predicting directly from data. Stan lets you model a problem with probabilities and do arbitrary inference, generally asking any question you want about your model.

There are many other interesting alternatives, e.g. http://pyro.ai/ which takes yet another approach, merging DL and probabilistic programming with variational inference. (Stan and TFP can do variational inference too, but I guess it's like Python vs JavaScript vs Ruby vs Java: all of them can be used for programming, but not in the same way.)
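
For a flavour of what "thinking in random variables" looks like on the TFP side, here is a hedged sketch (toy data, just a single unknown mean): you write down a log joint density and hand it to TFP's Hamiltonian Monte Carlo, and any question about the model becomes a computation on the samples. The Stan version would express the same model in Stan's own modeling language.

    import tensorflow as tf
    import tensorflow_probability as tfp

    tfd = tfp.distributions
    y = tf.constant([1.2, 0.8, 1.1, 0.9, 1.3])  # made-up observations

    def target_log_prob(mu):
        # prior on mu plus the Normal likelihood of the data
        prior = tfd.Normal(0., 10.).log_prob(mu)
        lik = tf.reduce_sum(tfd.Normal(mu, 1.).log_prob(y))
        return prior + lik

    kernel = tfp.mcmc.HamiltonianMonteCarlo(
        target_log_prob_fn=target_log_prob, step_size=0.1, num_leapfrog_steps=5)

    samples, _ = tfp.mcmc.sample_chain(
        num_results=1000, num_burnin_steps=500,
        current_state=tf.constant(0.), kernel=kernel,
        trace_fn=lambda _, pkr: pkr.is_accepted)

    # e.g. the posterior probability that mu exceeds 1.0
    print(tf.reduce_mean(tf.cast(samples > 1.0, tf.float32)))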


The next cut of Stan will likely use TFP as a backend. I think that PyMC4 will also. The Stan team wrote everything from scratch in C++, including their own autodiff code, which many regard as quite a stretch in terms of long-term maintenance. Since TFP executes on top of Tensorflow, things like autodiff and many of the other performance concerns that take up so much Stan-dev time are already taken care of.


PyMC4 on TFP was the plan, but they made a recent announcement [1] indicating those efforts would stop, and instead, they would develop PyMC3+JAX+Theano.

[1] https://pymc-devs.medium.com/the-future-of-pymc3-or-theano-i...


Woah. Thanks for the link. As a PyMC3 user I was not looking forward to the transition to 4, expecting to have to relearn the API like in the transition from 2 to 3. I was debating whether I should learn 4 or switch to a different library, when all I really wanted to do was stick with 3.

Looks like I get the best of both worlds now.


Please no, we don't need Stan to be rebuilt with a Python backend. That it's built in C++ and can be called from higher-level APIs is part of the appeal.


Bayesian modeling has a somewhat distinct feel from both (typical) deep learning algorithms and boosting/bagging classifiers.

Most particularly, Bayesian modeling tends to be generative modeling as opposed to discriminative. This means that you construct your model by describing a process which generates your observed data from a set of latent/unknown quantities.

For instance, we might observe n[u, d] clicks from user u on day d, for various choices of u and d. We could build a variety of generative stories here: that n[u, d] is independent of u and d, just being a random draw from a Normal(mu, sigma) distribution; that n[u, d] incorporates another unknown parameter p[u], the user's propensity to click, and is then a random draw from Normal(mu + b p[u], sigma); or that we also add seasonal trends sm[d] and ss[d] to the mean and spread of n[u, d], saying it's Normal(mu + b p[u] + sm[d], sigma * ss[d]).

In these examples, the unknown latents are parameters like mu, sigma, and b as well as any latent data needed to give shape to p[-], sm[-], and ss[-]. Once we've posited the structure of this generative model, we'd like to infer what values those latents might take as informed by the data.

This is the bread and butter of Stan modeling. It lets you describe these generative models as a "forward" process where we sample latents in a simple forward program. Similar to Tensorflow etc., Stan extracts a DAG from this forward program and computes derivatives, but instead of simply maximizing an objective function through backprop, Stan uses these derivatives to drive a sampling algorithm over the latents (mu, sigma, b).

Ultimately, this gives you a distribution of plausible latent configurations given the data you've observed. This distribution is a key point of Bayesian modeling and can provide a lot of information beyond what the objective-maximizing value would. As a simple example, it's trivial from a Bayesian output distribution to make statements like "we're 95% confident that mu > 0.1".
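
Here is a hedged PyMC3 sketch of the third generative story above, with made-up index arrays and click counts (the real thing could be a Stan program; PyMC3 is used here just to keep the example in Python). The final line is exactly the kind of "95% confident" statement mentioned above, read straight off the samples.

    import numpy as np
    import pymc3 as pm

    rng = np.random.default_rng(0)
    n_users, n_days, n_obs = 20, 7, 500
    user = rng.integers(0, n_users, size=n_obs)
    day = rng.integers(0, n_days, size=n_obs)
    clicks = rng.normal(3.0, 1.0, size=n_obs)   # stand-in for observed click counts

    with pm.Model():
        mu = pm.Normal("mu", 0.0, 5.0)
        b = pm.Normal("b", 0.0, 1.0)
        sigma = pm.HalfNormal("sigma", 1.0)
        p = pm.Normal("p", 0.0, 1.0, shape=n_users)     # per-user click propensity
        sm = pm.Normal("sm", 0.0, 1.0, shape=n_days)    # seasonal shift in the mean
        ss = pm.HalfNormal("ss", 1.0, shape=n_days)     # seasonal scaling of the spread
        pm.Normal("n", mu=mu + b * p[user] + sm[day],
                  sigma=sigma * ss[day], observed=clicks)
        trace = pm.sample(1000, tune=1000, chains=2)

    print((trace["mu"] > 0.1).mean())   # Pr(mu > 0.1 | data)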


Stan is exceptional if what you need is a hierarchical Bayesian model, and if what you want is a rigorous way of quantifying the uncertainty associated with the parameter estimates in your model.

Stan users are more often R users than Python users and mostly come from science backgrounds. They often use Stan via a package called brms, which stands for "Bayesian Regression Models using Stan", which should give you some idea of its core use case.

You wouldn't use Stan if you weren't trying to model your problem as a distribution based probabilistic model.


1) You have too little data for Deep Learning

2) You want to do statistical modelling, not fit a black box. You already have a statistical model in mind; you just want to fit its parameters.

Stan is a probabilistic programming system. You describe the data-producing mechanism (the model of reality) and the level and form of approximations used in the estimation. The compiler generates code for the estimators.


It's used a lot for things like analyzing clinical trials, e.g. making futility or early-stopping calls at interim analyses, or for meta-analysis. JAGS may still be the most popular, at least in some companies, but Stan is starting to catch on thanks to its greater flexibility in most respects.


Other comments point out that Bayesian inference is good for modelling an uncertain outcome, while deep learning is good for prediction.

However, Bayesian inference is a good choice for prediction when you have few data points (deep learning is sample-size hungry). And it is especially good when you have high uncertainty in your labelled training data (i.e. large variance in the response variable for a given input). Here a Bayesian regression (or even classification) model wouldn't magically remove the uncertainty, but you'd be able to account for the predictive variance (instead of being none the wiser using just good ole deep learning). You can then take it from there and decide how to treat the predictions, given the predictive variance as well.
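
A minimal sketch of what "accounting for the predictive variance" can look like in practice, assuming you already have posterior draws for a simple regression y ~ Normal(a*x + b, sigma) (the draws here are faked for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    a_s = rng.normal(2.0, 0.3, size=4000)             # stand-ins for real posterior draws
    b_s = rng.normal(1.0, 0.2, size=4000)
    sigma_s = np.abs(rng.normal(0.8, 0.1, size=4000))

    x_new = 1.5
    y_pred = rng.normal(a_s * x_new + b_s, sigma_s)   # one predictive draw per posterior draw

    print(y_pred.mean(), y_pred.std())                # predictive mean and spread
    print(np.percentile(y_pred, [5, 95]))             # a 90% predictive interval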


The choice is not between Bayesian methods and Deep Learning, but between statistical models and machine learning models (say, from random forests to GBM to xgboost and then maybe Deep Learning). There is overlap between statistical models and machine learning models (it is sometimes a matter of focus), and Bayesian methods can also be applied to what are typically considered ML approaches (see for example Bayesian hierarchical random forests).


But are machine learning models not statistical models? There is sample data, which is statistics, and the objective function is also statistical, e.g. mean squared error, negative log-likelihood, or the ELBO. And if you're using stochastic gradient descent or a variant of it, that too has statistical properties.

I don’t see any clear distinction between machine learning and statistics. Machine learning is a type of statistical model which relies on iterative optimization.

Bayesian inference on the other hand is a specific type of statistical model where the aim is to model distributions, not just the output variable (which is what ML is traditionally focused on).

And yes there is overlap, you can take a Bayesian approach to machine learning and that can make total sense sometimes.


"Bayesian inference on the other hand is a specific type of statistical model where the aim is to model distributions, not just the output variable (which is what ML is traditionally focused on)." - What distribution are you referring to? One of the advantages of the Bayesian approach (in the context of models and not of probability, it is not a model, it is a way of estimating the values of the parameters of a model) is that it provides a proper statistical distribution—and not a distribution based on theoretical formulas that require certain assumption to be true to have certain properties—of parameters and model predictions.

You can read more at https://www.fharrell.com/post/stat-ml/ (Frank Harrell is a top statistician who was once a frequentist and is now a Bayesian. He also writes about the differences between ML and SM and how to choose between the two.)


I am still learning about Bayesian inference so this might be off-base, but isn't the point to compute the full posterior distribution (or an approximation thereof) of the underlying parameters? Whether this is done in the context of a linear model or a deep neural network is a question of tractability.

The other distinction is between discriminative and generative models. In a discriminative model, the output/label is predicted from the input features: p(y|x, theta). For example, the probability of an image containing a dog, y, based on its pixels, x. Theta here refers to the parameters one needs to discover.

In a generative model, one instead models the distribution p(x|y, beta), i.e. given the label, say dog, the distribution over the images themselves.

Neural networks with backpropagation can be used for both discriminative and generative models. Bayesian methods can be applied to both discriminative and generative models to compute the full posterior distribution of the parameters, theta and beta.

Edit for clarity: The claim is that the choice of the model vs the choice of inferential methodology (Bayesian vs max likelihood for example) are orthogonal choices.

A neural network doing (discriminative) binary classification based on cross-entropy is maximizing likelihood instead of maximizing the posterior. Most Bayesian examples seem to specify a generative model (a Hidden Markov Model for example) and then infer the posterior. But there's nothing preventing one from using Bayesian methods with discriminative models (generalized linear models) or max likelihood with generative models.
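
A small sketch of that orthogonality with off-the-shelf pieces (toy data; scikit-learn used purely for illustration): a discriminative model and a generative model, both fit by maximum likelihood, and either could instead be given priors and estimated in a Bayesian way.

    import numpy as np
    from sklearn.linear_model import LogisticRegression  # discriminative: p(y|x), (regularized) max likelihood
    from sklearn.naive_bayes import GaussianNB            # generative: p(x|y)p(y), also max likelihood

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    disc = LogisticRegression().fit(X, y)
    gen = GaussianNB().fit(X, y)
    print(disc.predict_proba(X[:3]), gen.predict_proba(X[:3]))
    # swapping in Bayesian estimation (a posterior over the parameters rather than a
    # point fit) changes the inference method, not the discriminative/generative choice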


This question would be super bizarre to anyone coming from a stats background.

Others have commented on the role of inference/estimation, and prediction in small data or non-black-box contexts, so I’ll just add that there are deep theoretical reasons to do Bayesian inference. It’s a framework grounded firmly in decision theory, and provides a coherent way to reason about the world. You can prove, under sensible axioms, that beliefs can be described in terms of probability distributions, and that we should update beliefs based on Bayes’ Rule.
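
For what it's worth, the updating rule itself fits in a few lines; a tiny worked example with made-up numbers:

    # Bayes' Rule: P(H|E) = P(E|H) * P(H) / P(E)
    prior = 0.01              # initial belief that hypothesis H is true
    p_e_given_h = 0.95        # chance of seeing evidence E if H is true
    p_e_given_not_h = 0.10    # chance of seeing E anyway if H is false

    posterior = p_e_given_h * prior / (
        p_e_given_h * prior + p_e_given_not_h * (1 - prior))
    print(posterior)          # about 0.088: the evidence shifts the belief, but not to certainty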


I like many of the answers to your question. But a refinement of your question is when do we really have to choose between Bayesian inference and deep learning? Under what conditions should one pick Stan over Edward or Pyro?


Managing uncertainty with distributions instead of point estimates.


I asked essentially the same question as you 5 minutes before you did. Have an upvote anyway.

[Edit] I don't understand these downvotes.


Anonymous passive aggressive downvoting cowards go to hell.



