Ideas in statistics that have powered AI (columbia.edu)
204 points by MAXPOOL on July 7, 2021 | 78 comments

> Generative adversarial networks, or GANs, are a conceptual advance that allow reinforcement learning problems to be solved automatically. They mark a step toward the longstanding goal of artificial general intelligence while also harnessing the power of parallel processing so that a program can train itself by playing millions of games against itself. At a conceptual level, GANs link prediction with generative models.

What? Every sentence here is so wrong I have a hard time seeing what kind of misunderstanding would lead to this.

GANs are a conceptual advance of generative models (i.e. models that can generate more, similar data). Reinforcement learning is a separate field. Parallel processing is ubiquitous, and has nothing to do with GANs or reinforcement learning (they are both usually pretty parallelized). Self-play sounds like they wanted to talk about the AlphaGo/AlphaZero papers? And GANs are infamously not really predictive/discriminative. If anything, they thoroughly disconnected prediction from generative models.

You're right that this is spectacularly wrong.

I dare not even read the rest of the page just in case my brain accidentally absorbs other bad information like that paragraph about GANs.

> GANs are a conceptual advance of generative models (i.e. models that can generate more, similar data).

This is something I've long had confusion with, coming from a probabilistic perspective.

How does a GAN model the joint probability of the data? My understanding was that was what a generative model does. There doesn't seem to be a clear probabilistic interpretation of a GAN whatsoever.

Part of the cleverness of GANs was to have found a way to train a neural network that generates data without explicitly modeling the probability density.

In a stats textbook, when you know that your training data comes from a normal distribution, you can maximize the likelihood wrt the parameters, and then use the fitted distribution for sampling. That's basic theory.
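To make the textbook case concrete, here is a minimal Python sketch, assuming one-dimensional data that really is normal. The function name fit_normal_mle and the toy "training set" are made up for illustration. For the normal distribution the maximum likelihood estimates have a closed form, so no optimizer is needed.

```python
import random
import statistics

def fit_normal_mle(data):
    """Closed-form maximum likelihood estimates for a normal distribution."""
    mu_hat = statistics.fmean(data)
    # pstdev divides by n (not n - 1), which is the MLE of the standard deviation.
    sigma_hat = statistics.pstdev(data, mu=mu_hat)
    return mu_hat, sigma_hat

rng = random.Random(0)
# Hypothetical training set drawn from N(mean=3, sd=2).
training_data = [rng.gauss(3.0, 2.0) for _ in range(10_000)]

mu_hat, sigma_hat = fit_normal_mle(training_data)

# "Generation" is then just sampling from the fitted density.
new_samples = [rng.gauss(mu_hat, sigma_hat) for _ in range(5)]
```

The point of the contrast with GANs is that this recipe needs an explicit density to maximize, which is the part that was hard to get right for images.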

In practice, it was very hard to learn a good pdf for experimental data when you had a training set of images. GANs provided a way to bypass this.

Of course, people could have said "hey let's generate samples without maximizing a loglikelihood first", but they didn't know how to do it properly, how to train the network in any other way besides minimizing cross-entropy (which is equivalent to maximizing loglikelihood).

Then GANs actually provided a new loss function that could be trained. Total paradigm shift!

I'm on board with all of this. I think even before GANs it was becoming popular to optimize a loss that wasn't necessarily a log likelihood.

But I'm confused by the usage of the phrase generative model, which I took to always mean a probabilistic model of the joint that can be sampled over. I get that GANs generate data samples, but it seems different.

This is the problem when people use technical terms loosely and interchangeably with their English definitions. Generative model classifiers are precisely as you describe. They model a joint distribution that one can sample.

A GAN cannot even fit this definition because it is not a classifier. It is composed of a generator and a discriminator. The discriminator is a discriminative classifier. The generator is, well, a generator. It has nothing to do with generative model classifiers. Then you get some variation of neural network generator > model that generates > generative model. This leads to confusion.

I find https://openai.com/blog/generative-models/ pretty good on this. Reading from "More general formulation" we see:

Now, our model also describes a distribution p̂_θ(x) (green) that is defined implicitly by taking points from a unit Gaussian distribution (red) and mapping them through a (deterministic) neural network, our generative model (yellow). Our network is a function with parameters θ, and tweaking these parameters will tweak the generated distribution of images. Our goal then is to find parameters θ that produce a distribution that closely matches the true data distribution (for example, by having a small KL divergence loss). Therefore, you can imagine the green distribution starting out random and then the training process iteratively changing the parameters θ to stretch and squeeze it to better match the blue distribution.

This is precisely a generative model in the probabilistic sense. The section on VAEs spells this out even more explicitly:

For example, Variational Autoencoders allow us to perform both learning and efficient Bayesian inference in sophisticated probabilistic graphical models with latent variables (e.g. see DRAW, or Attend Infer Repeat for hints of recent relatively complex models).

The issue with GANs is that - while they model the joint probability of the input space - they aren't (easily) inspectable in the sense you can't get any understanding of how inputs relate to outputs. This means they appear different to traditional generative models where this is usually a goal.

For people who want a more stats-grounded approach, VAEs are more or less state of the art these days: https://en.wikipedia.org/wiki/Variational_autoencoder

They are reasonably competitive with GANs. I haven't kept up on the latest models on either side, but VAEs have historically tended to be a little blurrier than GANs.

I think VAEs haven't been the state of the art since around 2016-2017? They have been squeezed from both directions: autoregressive models on the compression side, GANs on the generation side.

They are still fairly competitive on both sides though.

Yeah, I guess I was thinking of VQVAE as a state-of-the-art example, but it was indeed 2017. Time flies! It's still pretty influential on newer systems though, e.g. OpenAI's DALL-E that made waves earlier this year has a VAE component (in addition to a Transformer component).

The generator implicitly models a joint probability of data by being a generative process that one can draw samples from. GAN training (at least under certain simplifying assumptions) minimizes the JS divergence between the generator distribution and the data distribution.
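For reference, the JS divergence statement can be written out as follows. This is a sketch of the standard minimax formulation from the original GAN paper, and the "simplifying assumption" is that the discriminator is optimal for the current generator. The symbols p_data (data distribution), p_g (distribution of generator outputs), p_z (noise distribution) and D^* (optimal discriminator) are notation chosen here, not taken from the thread.

```latex
\min_G \max_D V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]

D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}

V(D^*, G) = -\log 4 + 2\,\mathrm{JSD}\left(p_{\text{data}} \,\|\, p_g\right)
```

So with D held at its optimum, minimizing over G minimizes the JS divergence, and the minimum is reached when p_g = p_data.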

"Generalized adversarial networks, or GANs, are a conceptual advance that allow reinforcement learning problems to be solved automatically." -

"Generalized" :D Also the description is nonsense. This has nothing to do with reinforcement learning. Makes me wonder about the rest.

Now, if a press release from a top univ is so wrong on something that is easily checkable, how accurate are other forms of news?

Think of "press release wrongness" with a probability distribution. Some press releases are really good, some are really bad. A sensible prior would be somewhere in the middle. If you start to see a lot of bad press releases, then you can update your posterior towards "I can't trust any of these."
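A toy version of that updating, as a sketch only: assume each press release is simply "bad" or "good", and put a Beta prior on the fraction that are bad. The prior Beta(2, 2) and the tally of 8 bad and 2 good releases are invented numbers for illustration.

```python
def update_beta(alpha, beta, bad, good):
    """Conjugate update of a Beta(alpha, beta) belief about the fraction of bad releases."""
    return alpha + bad, beta + good

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (2.0, 2.0)  # assumed prior, centred "somewhere in the middle" at 0.5
posterior = update_beta(*prior, bad=8, good=2)  # hypothetical observations

prior_mean = beta_mean(*prior)          # 0.5
posterior_mean = beta_mean(*posterior)  # 10 / 14, i.e. shifted towards "can't trust these"
```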

Is that a good prior? I expect due to Dunning–Kruger that the willingness to produce an article on a topic would follow a pretty intense bimodal distribution.

The paper has it right, at least.

I’m sorely missing Maximum Likelihood Estimation (MLE). It’s a statistical technique that goes back to Gauss and Laplace but was popularized by Fisher. In AI/ML it’s often referred to as “minimizing cross-entropy loss”, but this is just a misappropriation / reinvention of the wheel. The math is the same and MLE is a much more sane theoretical framework.

“Cross entropy” specifically refers to the log-likelihood function of a binary random variable, and is only used as the cost function for binary classifiers. It does not refer to likelihood functions in general.

Do people not google terms before trying to speak authoritatively on a topic they aren't familiar with? The original commenter is correct, cross entropy is a generic measure of two probability distributions - in the case of maximum likelihood estimation, these are the data distribution and the distribution of the learned model.

You are incorrect.

For a given probability distribution parameterized by θ with probability mass/density p(x|θ), the likelihood of θ given a set of data X = {x_1, …, x_n} (assuming X is independently/identically distributed) is simply the product of independent probabilities,

L(θ|X) = Π_i=1^n p(x_i|θ)

Maximizing this product with respect to θ yields the maximum likelihood estimate of θ. Since sums are generally easier to work with than products, and log is a monotonic function, we generally work with the log-likelihood function

log L(θ|X) = Σ_i=1^n log p(x_i|θ)

since the log-likelihood will achieve its maximum for the same value of θ as the likelihood.

The cross entropy of two discrete probability distributions p and q is

-Σ_i=1^n p_i log q_i

(For continuous distributions, replace the sum with an integral.)

This is completely unrelated to the generic log-likelihood function defined above. The two are only related if p happens to be the probability distribution of a binary random variable x ∈ {0,1}, with probability π of equalling 1:

p(x|π) = π^x(1-π)^(1-x)

Its log-likelihood is therefore

x log π + (1-x) log(1-π)

which, for this particular case, happens to be a cross entropy (up to sign). Note that this is the log-likelihood of a single observation in a single class; for multiple observations/multiple classes, we sum across them, e.g.

Σ_i=1^n [x_i log π_i + (1-x_i) log(1-π_i)]

for a single observation across n total classes.

But again, the relationship to cross entropy only holds for this particular choice of p. It is not generally the case that the generic log-likelihood function,

log L(θ|X) = Σ_i=1^n log p(x_i|θ)

is a cross entropy!

You can take the cross entropy between the probability distribution and the Dirac delta distribution for the actual data. Up to sign and a factor of 1/n, this will equal the log-likelihood.

Things get a little iffy with continuous probability distributions, but that's just because both your cross-entropy and your MLE estimate will depend on your choice of variables if you don't pick a prior. Just as for MLE you can blindly plug in the probability density and it'll work just fine.
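Sticking to the discrete case to sidestep the continuous subtleties, the identity is easy to check numerically. The sketch below uses made-up observations and a hypothetical categorical model. It compares the cross entropy between the empirical distribution and the model with the average log-likelihood of the same data.

```python
import math
from collections import Counter

def cross_entropy(p, q):
    """H(p, q) = -sum_k p[k] * log q[k], skipping outcomes with p[k] == 0."""
    return -sum(pk * math.log(q[k]) for k, pk in p.items() if pk > 0)

data = ["a", "b", "a", "c", "a", "b"]   # made-up observations
model = {"a": 0.5, "b": 0.3, "c": 0.2}  # hypothetical model q(x | theta)

# Empirical distribution: point masses of weight 1/n on each observation.
counts = Counter(data)
empirical = {k: c / len(data) for k, c in counts.items()}

avg_log_lik = sum(math.log(model[x]) for x in data) / len(data)
h = cross_entropy(empirical, model)

# h equals -avg_log_lik, so minimizing one maximizes the other.
```

Nothing here depends on the model being binary or one-hot. Any model that assigns a probability to each observation works the same way.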

True! Given a ~Dirac comb~ mixture of Dirac distributions

c(x) = 1/n Σ_i=1^n δ(x - x_i)

and some function f, you can express the average of f over the x_i as

1/n Σ_i=1^n f(x_i) = ∫_-∞^∞ dx f(x) c(x)

If f were a log probability, this would indeed be a (continuous) cross entropy, up to sign convention:

1/n Σ_i=1^n log p(x_i|θ) = ∫_-∞^∞ dx log p(x|θ) c(x)

However, this isn’t generally how we think about likelihood functions, since there is nothing gained from expressing a simple sum of log probability densities in terms of a Dirac comb. Indeed, every ML text/paper I’ve read only ever refers to “cross entropy” in the context of the cost function for one-hot categorical random variables, since the formula for cross entropy is immediately present in the likelihood function. Cost functions involving other random variables are simply called “cost functions” or just “likelihoods” if the author comes from a stats background.

> Given a Dirac comb

> c(x) = 1/n Σ_i=1^n δ(x - x_i)

Sorry for the pedantry, but a mixture of Dirac distributions is almost always not a Dirac comb. Notice that a mixture of Dirac distributions is a Dirac comb only if you have an infinite number of equally-separated samples (and empirical distributions tend to have a finite number of samples).

> your MLE estimate will depend on your choice of variables if you don't pick a prior.

If you're doing MLE, then you don't have a prior (or rather, you have a uniform prior over the parameter(s) of interest).

Yeah I suppose in the context of MLE it makes more sense to talk about your choice of variables. Which does matter unfortunately (which is kind of obvious when you note that you can convert any distribution into any other by changing its coordinates [*]).

Using a prior gives you an 'out' by picking the Radon-Nikodym derivative w.r.t. that prior, since this definition of probability density is independent of your choice of variables. In most applications of MLE people implicitly use the (improper) uniform prior in which case you end up with the usual density. However this is usually done without justification, which is a bit dangerous.

[*]: For a rather extreme example, consider that if X is exponentially distributed with mean 1, then so is -log(1 - e^-X), which you get by applying the CDF, flipping the distribution, and then applying the inverse CDF. This transformation swaps 0 (the mode) with positive infinity.
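The footnote's claim is easy to sanity-check by simulation. This is only a Monte Carlo sketch with an arbitrary seed and sample size. It checks the mean and median of the transformed sample against those of an exponential with mean 1 (mean 1, median log 2), which is suggestive rather than a proof.

```python
import math
import random
import statistics

rng = random.Random(0)
xs = [rng.expovariate(1.0) for _ in range(100_000)]

# -log(1 - e^-x), written with expm1 for accuracy when x is close to 0.
ys = [-math.log(-math.expm1(-x)) for x in xs]

mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
median_y = statistics.median(ys)

# Small x maps to large y and vice versa: the map is decreasing.
smallest_x = min(xs)
image_of_smallest_x = -math.log(-math.expm1(-smallest_x))
```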

No. Maximum likelihood estimation does not involve a prior over parameters, and the MLE does not depend on how the parameters are represented, nor does it depend on the way the variables are represented.

For example, the maximum likelihood estimate for a variance parameter is just the square of the maximum likelihood estimate for the corresponding standard deviation parameter, as you'd expect. And the maximum likelihood estimate for the parameters of a log-normal distribution for positive data are related in the obvious way to the maximum likelihood estimate for the parameters of a normal distribution for the logs of those data points.

Doing a 1-to-1 continuous transformation of the data causes the probability density (for any given parameter values) to be multiplied by a Jacobian factor, but this factor depends only on the data values, NOT on the probability density (determined by the parameters), so the MLE for the transformed data is the same as for the untransformed data.
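The Jacobian argument can be checked directly on the log-normal example. The sketch below uses made-up data and arbitrary parameter values. It evaluates the log-normal log-likelihood of the raw data and the normal log-likelihood of the logged data at several (mu, sigma) pairs. The gap between the two should be the same constant, -Σ log y_i, at every parameter value, so both functions peak at the same parameters.

```python
import math
import random

def normal_loglik(zs, mu, sigma):
    """Log-likelihood of data zs under Normal(mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (z - mu) ** 2 / (2 * sigma ** 2) for z in zs)

def lognormal_loglik(ys, mu, sigma):
    """Log-likelihood of positive data ys under LogNormal(mu, sigma^2)."""
    return sum(-math.log(y) - 0.5 * math.log(2 * math.pi * sigma ** 2)
               - (math.log(y) - mu) ** 2 / (2 * sigma ** 2) for y in ys)

rng = random.Random(0)
ys = [rng.lognormvariate(0.5, 1.2) for _ in range(200)]  # made-up positive data
logs = [math.log(y) for y in ys]

# Log of the Jacobian factor: depends on the data only, not on (mu, sigma).
jacobian_term = -sum(logs)

param_grid = [(0.0, 1.0), (0.5, 1.2), (-1.0, 3.0)]  # arbitrary test points
gaps = [lognormal_loglik(ys, mu, s) - normal_loglik(logs, mu, s)
        for mu, s in param_grid]
```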

Agreed - great reply.

You've correctly shown that maximizing the likelihood is equivalent to minimizing cross entropy in the discrete case, but frankly that is unrelated to your claim that the equivalency doesn't hold in the general case. As noted in the sibling comment, the generalization to the continuous case is evident when viewing the empirical data distribution as a mixture of dirac densities.

Hm, I never really thought about it this way - but I guess it does generalize to continuous space in a pretty natural way.

What are you missing about it? MLE is the bread and butter of any deep learning architecture - it is how you train the network!

e: Ah I'm a dunce - missing from the article.

> 2. John Tukey (1977). Exploratory Data Analysis.

> This book has been hugely influential and is a fun read that can be digested in one sitting.

Wow. The PDF is over 700 pages. That seems fairly impressive for single-sitting digestion.

Out of the 10 papers I am able to download 3 of them freely.

- For the papers I am quoted 26EUR - 39EUR

- For the books I am quoted 129EUR - 133EUR

This is audacious. Some of these papers are from the 70s. And I highly doubt that the authors get any royalties from those sales.

Authors never get _any_ royalties from paper sales as far as I know :) (for books maybe).

> (for books maybe).

We do. I don't know if it's the general rule, but for the one I partook in, we get ~20 euro cents per sale per author.

<donates to sci-hub>

I'm not sure what you mean. I went through the list myself, and while the books are obviously only on Libgen, the only one I didn't find readily available in Google Scholar was the AIC paper (https://www.gwern.net/docs/statistics/decision/1998-akaike.p...), and you can safely assume any paper in GS is in SH too.

sci-hub FTW

why would you want to feed the parasites?

I'm surprised they didn't mention support vector machines and the kernel trick, which were discovered by statisticians.

Although Vapnik's treatise is called Statistical Learning Theory, neither statisticians nor Vapnik himself would identify him as a statistician. In fact his proposals were quite radically different from the established norm in contemporary statistics. The same holds for Corinna Cortes.

Kernel 'trick', representer theorem etc are far older and have their origins in functional analysis

I highly doubt that a person with a PhD in statistics doesn't identify as a statistician.

Look at Vapnik’s publications and affiliations around this time (1990 to 1997). He was working at a research group at ATT Labs with people like Corinna Cortes, León Bottou, and Yann LeCun. He had just come over from Russia, as did many technically proficient Russian Jews in these years.

None of the people in this ATT Labs milieu were associated with the Statistics community. They didn’t go to Stats conferences, and they didn’t publish in Stats journals. I don’t believe that any of them were formally trained in what you might call conventional statistics approaches of the time, i.e., no Stanford PhD, no JASA or Ann. Stat. publications.

They knew about that stuff! But they were approaching problems, like digit recognition (which are not amenable to model based statistics) from a more applied math/physics point of view. Vapnik’s work that appeared in English at this time introduces learning as a form of Tikhonov regularization applied to a loss function. Not as a type of maximum likelihood, not as a riff on Bayesian inference. And his SVM work introduces the kernel trick as an interpretation of Mercer’s theorem — a very applied math motivation.

I knew Vladimir around this time, but I can’t really say if he would have described himself as a “statistician” — that would be hard for anyone to know. But I can say that he and his closest colleagues were not part of the Stats community of the time.

On the other side, very few Stats authorities were deeply interested in this stuff. Leo Breiman, of course, Andrew Barron, Art Owen, Trevor Hastie, Rob Tibshirani. It’s the Stats community’s loss that so few recognized the value of these approaches.

Beautifully summarized!

The only way to dispute that would be an appeal to authority, so I will avoid it. Perhaps there was a time when he did identify as a mathematical/theoretical statistician, but his contributions were quite a dramatic break from what was the norm in statistical practice at the time.

I would argue that his contributions were central to the birth of rigorous machine learning (the non deep learning kind) as a field of its own with a focus that's different from that of statistics.

One quantitative test I can suggest -- one can count the number of occurrences of the word 'statistics' in the journals and conferences he has published and compare that with the number of publications he has authored. My sense is that it will be close to 0 and getting closer (if not already there yet).

> One quantitative test I can suggest -- one can count the number of occurrences of the word 'statistics' in the journals and conferences he has published and compare that with the number of publications he has authored. My sense is that it will be close to 0 and getting closer (if not already there yet).

I'm not sure this "quantitative test" is the best approach... after all, I'm pretty sure "Biometrika" and "Econometrica" don't have "statistics" in their names.

Fair point.

Some may identify as mathematicians.

Putting aside discovery, I've always considered SVMs to be in the realm of optimization rather than ML or statistics (though I suppose you could then also put modern deep learning under optimization too).

Why? No one uses SVM as a solver/optimization method (though you do need a solver/optimization method to train a SVM).

Same with "modern deep learning" (whatever that may be): just because you need to optimize something doesn't make the field "optimization". Just because I'm using stochastic gradient descent (or some other optimization method) in the course of my work, doesn't mean that I'm working in the field of Optimization.

To really understand it (primal, dual formulations) you need tools from convex optimization. So it doesn't really feel appropriate to teach it in a standard machine learning class (unless you just toss out the details). In optimization classes, you go through tons of different applications of the methods you learn about: SVM slots in perfectly there. It hits duality, quadratic programming, even gradient descent (Pegasos).
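Since Pegasos came up: the primal side at least fits in a few lines. The following is a rough sketch of a Pegasos-style stochastic subgradient method on the primal SVM objective, assuming a linear kernel, no bias term, and single-example updates (it leaves out variants such as mini-batches). The toy data, lam, n_iters and the seeds are arbitrary choices for illustration.

```python
import random

def pegasos(points, labels, lam=0.01, n_iters=2000, seed=0):
    """Stochastic subgradient descent on the primal SVM objective
    lam/2 * ||w||^2 + mean(max(0, 1 - y * <w, x>)), without a bias term."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    for t in range(1, n_iters + 1):
        i = rng.randrange(len(points))
        x, y = points[i], labels[i]
        eta = 1.0 / (lam * t)  # Pegasos step size schedule
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        w = [(1.0 - eta * lam) * wj for wj in w]  # shrinkage from the regularizer
        if margin < 1.0:  # hinge loss is active for this example
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

# Toy, well-separated 2-D data: two clusters around (2, 2) and (-2, -2).
data_rng = random.Random(1)
points, labels = [], []
for _ in range(20):
    points.append([data_rng.gauss(2.0, 0.3), data_rng.gauss(2.0, 0.3)])
    labels.append(1)
    points.append([data_rng.gauss(-2.0, 0.3), data_rng.gauss(-2.0, 0.3)])
    labels.append(-1)

w = pegasos(points, labels)
predictions = [1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1 for x in points]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

Note that nothing here touches the dual. The duality and quadratic programming machinery only shows up once you want kernels or the support-vector interpretation.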

Re deep learning: right, I was bringing up deep learning as a clear example of why you might not want to classify it under optimization. No one considers applications of deep learning to be optimization. However, work on the various optimizers (Adam, adagrad, second order methods, etc) which are all fundamental to doing any deep learning work would be firmly in the field of optimization.

> To really understand it (primal, dual formulations) you need tools from convex optimization. So it doesn't really feel appropriate to teach it in a standard machine learning class (unless you just toss out the details).

Sure. But to understand it, you probably also need to know a bit about arithmetic, algebra, geometry, etc. Still, you wouldn't say that SVMs belong to these fields, even though these fields are probably a requirement if you want to understand SVMs.

> So it doesn't really feel appropriate to teach it in a standard machine learning class (unless you just toss out the details).

If the people you are talking about already had an optimization class (including convex optimization), then it should be appropriate to teach it using those formalisms, no?

Another example: you're not going far in understanding Schroedinger's equation if you don't have the necessary linear algebra basics. Does that make Schroedinger's equation part of linear algebra?

> It hits duality, quadratic programming, even gradient descent (Pegasos).

Sure... then it's a subject of machine learning that is good to refresh your knowledge of optimization and linear algebra, sure. It still feels kinda weird if you're going to introduce people to SVM in the context of an Optimization class (other than possibly as an example of a specific optimization problem, or as an application of specific optimization methods).

> However, work on the various optimizers (Adam, adagrad, second order methods, etc) which are all fundamental to doing any deep learning work would be firmly in the field of optimization.

Exactly. If you're doing that, then you are doing research in Optimization, and not research in "deep learning", as far as I'm concerned. But, let's face it... those types of papers are a minority in the field.

> … you need tools from convex optimization. So it doesn't feel appropriate to teach it in a standard machine learning class.

Those topics were prerequisites to taking ML at the grad level when I took them. You either had to have relevant courses in your bag or convince the prof that you could handle it.

Yea grad level absolutely (pretty much anything can fly at the grad level). Undergrad? Maybe we should teach it b/c of the historical importance it has to the field and how the community developed but I really do think most ML classes would be better off without it b/c of the extra background you'd have to use precious time on. Kernel PCA, kernel regression are better for demonstrating the power of kernels.

I suppose the idea of a maximum-margin separating hyperplane is kind of unique to SVMs and if you just teach SVMs through the primal and leave it at that, you don't need to spend all that much time motivating the dual.

How have they attributed GANs and Deep Learning to Statistics? I thought Goodfellow was doing an AI PhD and that Hinton is a biologically inspired / neuroscience fellow?

See the table in this link for a humorous comparison by Tibshirani:


Deep learning models are statistical and probabilistic models. You can categorize deep learning under both computer science and statistics: for example, stat.ML and cs.LG on arXiv.

Machine learning and statistics are closely related fields, both historically and in current practice and methodology.

My working definition is that statisticians choose and engineer models while machine learning searches a vast space of models.

That doesn't seem right to me. In both cases you have a model and are just searching for optimal parameters considering the bias/variance tradeoff. There may be a few instances of a neural network or other ML model being set up to dynamically change its architecture during training but that seems to me like it would not work out well at all.

If anything in specific cases the statistical model (if Bayesian) is more comprehensive in that it doesn't try to find a point estimate of the parameters but instead forms a full distribution around the plausibility of the parameters.

So I was thinking like this - if you have a data set with 200000 variables and 70,000 examples, how would you find a model without using a search process? On the other hand, if you have 20 variables, or if you understand which of the 200000 variables are the ones to worry about, then you can build a model by hand. I guess that also statisticians are working to summarise or create insight about data while ML is working to create a prediction (although statistical models can be used to do that too).

Why do you think that statisticians only build models by hand? Dimension reduction techniques are employed in statistical learning as well.

You can have Bayesian DNNs.

Not easily. I don't think the literature is very incredible on this - how do you define a prior over all of the parameters of an NN?

There’s a vast literature you can read.

I have read some of the literature - ie. the bayes by backprop method, etc.

It doesn't seem like getting a posterior over the parameter space of a neural network is tractable as of now.

It ain't tractable in general either.

Another way I've heard it framed is that statistics cares about the parameters, and ML cares about the answers.

That framing seems a wee bit dismissive of statistics, esp. applied stats -- I'll hazard a guess that this came from the ML camp. :)

There is nothing dismissive about it, they are just different things.

If we have strong theoretical understanding of the physics/model of a problem, but are unsure about some parameters, it makes sense to develop methods to find those unknown parameters accurately. This ability is a big deal in traditional statistics. People working in this field try and prove results that their proposed method can actually do this. If the model happens to be largely correct and the method robust, this even allows prediction.

Often, however, we do not know the model and the 'parameter' is a piece of fiction anyway. If we are interested in prediction alone, its fine to let go of an ability to accurately estimate the parameters as long as predictions are accurate. Think epicycle models of planetary motion. ML folks try to prove that their methods have good prediction properties and are happy to sacrifice on parameter recovery.

Nonparametric statistics and prequential statistics are somewhere in the middle.

Sometimes it does seem though that people are trying to outdo each other, coming up with methods that estimate the spectrum of a unicorn's rainbow more accurately than the best known result in research literature. This may look odd because the unicorn and his rainbow are pieces of fiction.

Fair points made, and your unicorn analogy is wonderful. :)

As someone with degrees in statistics who now works in ML, all of these simplistic definitions are pithy, but useless. The reality is that the overlap between the two fields is so enormous that trying to come up with a clean separation for the two is a losing battle. To say that statisticians are uninterested in “the answers” (i.e. predictive accuracy) is foolish and rarely true. To say that ML doesn’t care about parameters (i.e. inference) ignores the vast amount of effort in ML researching topics like causal inference, feature importance, and yes, inference.

At this point I’m convinced the debate regarding what is statistics vs what is ML is largely just tribal warfare. The two parties are fighting over shared territory, and while both may have had some claim to uniquely identifying techniques and research, the lines have long since eroded.

ML cares about predictive generalization, stats cares about understanding and interpreting drivers

The only easy real division between stats and ML is in universities, where it's just a question of which department. If it's CS, then it's ML or AI. If it's stats, it's stats.

If it's industry, it's whatever the marketing department decides, inevitably AI :(

Or data science.

ML, DL, etc have nothing to do with biology or neurology. Neurons and brains work completely differently.

<sarcasm> Psssh.. it's all math. </sarcasm>

Half of these are relevant to small-data problems, which is not exactly what we mean when we say AI.

What do you mean when you say AI? I'm curious.

As far as I can tell, most people (e.g., whoever wrote this article) seem to use AI as a synonym for "machine learning", basically.

AI is a very old term, but "new AI" is all based on big data. Most of the small-data algorithms have always been used, and are still being used, in medicine etc. But when you talk about "new AI" you mean deep learning, which is very data hungry, and except for the last 2, none of the algorithms have anything to do with deep learning.

