When I was a student in the 1990s, I was taught about hypothesis testing (and all the hassle of p-fishing etc.), and about Bayesian inference (which is lovely, until you have to invent priors over the model space -- e.g. a prior over neural network architectures). These are both systems that tie themselves in epistemological knots when trying to answer the simple question "What model shall I use?"
Holdout set validation is such a clean simple idea, and so easy to use (as long as you have big data), and it does away with all the frequentist and Bayesian tangle, which is why it's so widespread in ML nowadays.
It also aligns statistical inference with Popper's idea of scientific falsifiability -- scientists test their models against new experimental data; data scientists can test their models against qualitatively different holdout sets. (Just make sure you don't get your holdout set by shuffling, since that's not what Popper would call a "genuine risky validation".)
The article mentions Breiman's "alternative view of the foundations of statistics based on prediction rather than modeling". Breiman does make a big deal of evaluation on holdout sets; but his "prediction" idea isn't general enough, since it doesn't accommodate generative modelling (e.g. GPT, GANs). I think it's better to frame ML in terms of "evaluating model fit on a holdout set", since that accommodates both predictive and generative modelling.
As you often want to predict generalization to future unseen data, it's important to consider how the unseen data will differ from your current dataset. All the theory operates on the simplifying assumption that your dataset is an IID random sample from the "true" distribution but, of course, that's usually not really the case, and you know and expect that future data will have a different distribution than what you have: economic and political trends will see new major events; future textual data will contain new names, event names, terms and concepts that did not exist in 2020; etc., etc. So if you want to estimate actual generalization, you have to at least try to isolate some of these aspects.
For text, if you take a random sample of shuffled sentences, you'll get a different outcome than if you split by whole documents, because it'll be "cheating": your test sentences will always have some relevant context in your training set, which won't be the case for real unseen data, which will sometimes introduce totally new things. And if you train something on data gathered 2015-2020 and evaluate on data from 2021, you'll likely get worse (but more informative!) measurements than training on a random sample chosen from the whole 2015-2021 range, simply because there are major world events and 'distribution shifts' over time.
IMHO a true test of generalization for many approaches would be to train on data from before 2020 and test on data from 2020, to see how well they generalize given a major event like the Covid pandemic, which changes all kinds of aspects of the society that generates the new data.
Ideally, there's some field in our data that we can use -- e.g. train on data from cities in several countries, test on data from cities in a different country; train on one corpus, test on another; train on climate data at one level of CO2, test on data at another. If there's no such field, I do a simple 1d dimension reduction, and pick off 20% of the data at the extremes.
If we know exactly the domain where our model will be used, and it matches exactly the distribution of the training data, then shuffling is fine. But if we want our model to work well in new domains, then we're like scientists looking for generalizable laws, and the best way to test this is by testing in unseen domains. It's often hard to get hold of data with this domain diversity, which is why a lot of ML models are fragile.
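If it helps, here's a minimal sketch of both variants of that split, assuming a pandas DataFrame; the column names, the PCA choice for the 1-D reduction, and the 20% figure are all just placeholders following the comments above, not a prescribed recipe:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def domain_holdout(df, feature_cols, group_col=None, holdout_frac=0.2, seed=0):
    """Split into train/holdout so the holdout is qualitatively different,
    rather than an IID shuffle of the same distribution."""
    if group_col is not None:
        # Hold out entire groups (e.g. whole countries), not random rows.
        rng = np.random.default_rng(seed)
        groups = df[group_col].unique()
        n_hold = max(1, int(len(groups) * holdout_frac))
        held_groups = rng.choice(groups, size=n_hold, replace=False)
        mask = df[group_col].isin(held_groups)
    else:
        # No grouping field: project onto the first principal component
        # (feature_cols must be numeric) and hold out the rows at the
        # extremes of that 1-D score.
        scores = PCA(n_components=1).fit_transform(df[feature_cols]).ravel()
        dist = np.abs(scores - np.median(scores))
        mask = dist >= np.quantile(dist, 1.0 - holdout_frac)
    return df[~mask], df[mask]

# e.g. train, test = domain_holdout(df, ["x1", "x2"], group_col="country")
```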
For time series data, you don't want to shuffle, because now you are predicting the past based on the future.
I think the best way is to choose a minimum timespan you want to be able to predict on, train on that and predict on the future, then retrain including that future window and predict on the next one.
If your time series allows it (e.g. it contains a bunch of independent target groups), you can train/test on a subset and leave another subset for validation.
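A rough sketch of that walk-forward scheme (expanding training window, fixed test window), assuming a DataFrame with a datetime column; the column name and the spans are placeholders, not anything prescribed above:

```python
import pandas as pd

def walk_forward_splits(df, date_col, min_train_span, test_span):
    """Yield (train, test) pairs: train on everything up to a cutoff,
    test on the next window, then roll the cutoff forward."""
    df = df.sort_values(date_col)
    cutoff = df[date_col].min() + min_train_span
    end = df[date_col].max()
    while cutoff < end:
        train = df[df[date_col] < cutoff]
        test = df[(df[date_col] >= cutoff) & (df[date_col] < cutoff + test_span)]
        if len(test):
            yield train, test
        cutoff += test_span

# e.g. monthly evaluation windows after at least a year of training data:
# for train, test in walk_forward_splits(df, "date",
#                                        min_train_span=pd.Timedelta(days=365),
#                                        test_span=pd.Timedelta(days=30)):
#     ...fit on train, evaluate on test, never shuffle across the cutoff...
```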
Very much agree with the simplicity and power of separation of training, validation and test sets. Is this really a 'big data' era notion though? This was fairly standard in 90s era language and speech work.
Big enough data that you can afford not to use some of it for training! Different disciplines hit this threshold at different times -- language and speech much earlier, as you say; clinical trials not there yet.
Maybe we could talk about two cardinalities of "big" data. The first is when you can afford not to use all of your data for training. The second is when you can usefully fit highly overparameterized models.
To be fair, there's a psychology paper from the late fifties that suggests this approach. Much like the early days of double descent, this didn't attract the attention it deserved at the time.
Off-the-cuff, i.e. without digging deeply into a set of history of statistics books:
Tied 1st place:
• Markov chain Monte Carlo (MCMC) and the Metropolis–Hastings algorithm
• Hidden Markov Models and the Viterbi algorithm for most probable sequence in linear time
• Vapnik–Chervonenkis theory of statistical learning (Vladimir Naumovich Vapnik & Alexey Chervonenkis) and SVMs
4th place:
• Edwin Jaynes: maximum entropy for constructing priors (borderline: 1957)
Honorable mentions:
• Breiman et al.'s CART (Classification and Regression Trees) algorithm (and Quinlan's C5.0 extension)
• Box–Jenkins method (autoregressive moving average (ARMA) / autoregressive integrated moving average (ARIMA) to find the best fit of a time-series model to past values of a time series)
(The beginning of the 20th century was much more fertile in comparison - Kolmogorov, Fisher, Gosset, Aitken, Cox, de Finetti, Kullback, the Pearsons, Spearman etc.)
This is a side issue, but I wonder what changes could be made to the teaching of statistics, most importantly "stats for scientists" taught in college and even high school. What occurs to me right off the bat is that I learned stats by the use of formulas to boil data sets down to answers. For example, null hypothesis significance testing and the dreaded p-value. We had to do it this way, as widespread availability of computers was still on the horizon.
Today, computation is easy and cheap. I wonder if we could learn stats in a different way, perhaps starting by just playing with data, simulated random numbers, graphing, and so forth. Can something like null hypothesis testing be taught primarily through bootstrapping, with the formulas introduced as an aside?
Yes I know that statistics is a formal branch of math, with theorems and proofs. I took that class. But the students who take "stats for scientists" don't ever see that side of it. Understanding the formulas without seeing the proofs is like trying to learn freshman physics without calculus.
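For what it's worth, the resampling version of a basic two-group test is short enough to show in an intro class. Here's a minimal permutation-test sketch, a close cousin of the bootstrap approach mentioned above; the measurements are made up:

```python
import numpy as np

rng = np.random.default_rng(42)
group_a = np.array([12.1, 9.8, 11.4, 10.9, 12.6, 11.1])   # made-up measurements
group_b = np.array([10.2, 9.5, 10.8, 9.9, 10.4, 10.1])

observed = group_a.mean() - group_b.mean()
pooled = np.concatenate([group_a, group_b])

# Null hypothesis: the group labels are arbitrary. Shuffle labels many times
# and count how often a difference at least this large shows up by chance.
n_perm = 100_000
count = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    diff = pooled[:len(group_a)].mean() - pooled[len(group_a):].mean()
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / n_perm
print(f"observed difference = {observed:.2f}, permutation p-value ~ {p_value:.3f}")
```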
That is essentially how Allen Downey approaches statistical education: the analytical solutions came first because we lacked the computational power. Now that we have cheap computation, we should exploit that to develop better intuition. His Bayesian book[0] is available as Jupyter notebooks.
That's pretty cool. I actually would like to see K-12 math education use more computation and data, with less of a focus on algebra (expression manipulation). This would be more reflective of how regular people, including STEM people, use math outside of school.
This is an interesting thread of thought. I personally find many concepts in statistics much easier to understand through a resampling/bootstrapping lens, and I plan to try it on my children once they're old enough.
Kalman published his filter in 1960... a little over 50 years ago, but I'd say it's worth mentioning given its huge impact. The idea that we can use multiple noisy sensors to get a more accurate reading than any one of them could provide enables all kinds of autonomous systems. Basically everything that uses sensors (which is essentially every device these days) is better because of it.
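To make that concrete, here's a toy one-dimensional Kalman filter fusing two imaginary noisy sensors measuring a constant value (no process model; all the numbers are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 5.0
sensor_noise = [0.8, 1.5]          # std devs of two imaginary sensors

# State estimate and its variance; start with a vague prior.
x_hat, p = 0.0, 100.0

for step in range(50):
    for sigma in sensor_noise:
        z = true_value + rng.normal(0.0, sigma)   # noisy reading
        r = sigma ** 2                            # measurement variance
        k = p / (p + r)                           # Kalman gain
        x_hat = x_hat + k * (z - x_hat)           # update estimate
        p = (1.0 - k) * p                         # update uncertainty

print(f"fused estimate {x_hat:.3f} (true {true_value}), variance {p:.5f}")
```

The fused estimate ends up tighter than either sensor alone could give, which is the point the comment above is making.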
IMO, kernel-based computational methods are by far the most important overlooked advances in the statistical sciences.
Kernel methods are linear methods on data projected into very-high-dimensional spaces, and you get basically all the benefits of linear methods (convexity, access to analytical techniques/manipulations, etc.) while being much more computationally tractable and data-efficient than a naive approach. Maximum mean discrepancy (MMD) is a particularly shiny result from the last few years.
The tradeoff is that you must use an adequate kernel for whatever procedure you intend, and these can sometimes have sneaky pitfalls. A crass example would be the relative failure of tSNE and similar kernel-based visualization tools: in the case of tSNE, the Cauchy kernel's tails are extremely fat, which ends up degrading the representation of intra- vs inter-cluster distances.
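For the curious, the standard (biased, quadratic-time) estimator of squared MMD with an RBF kernel fits in a few lines of numpy; the median-heuristic bandwidth below is a common default I'm assuming, not something from the comment above:

```python
import numpy as np

def mmd2_rbf(x, y, bandwidth=None):
    """Biased estimator of squared maximum mean discrepancy with an RBF kernel.
    x, y: arrays of shape (n, d) and (m, d)."""
    xy = np.concatenate([x, y])
    d2 = np.sum((xy[:, None, :] - xy[None, :, :]) ** 2, axis=-1)  # pairwise sq. dists
    if bandwidth is None:
        bandwidth = np.sqrt(np.median(d2[d2 > 0]) / 2)            # median heuristic
    k = np.exp(-d2 / (2 * bandwidth ** 2))
    n = len(x)
    kxx, kyy, kxy = k[:n, :n], k[n:, n:], k[:n, n:]
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()

rng = np.random.default_rng(1)
same = mmd2_rbf(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
diff = mmd2_rbf(rng.normal(size=(200, 2)), rng.normal(1.0, 1.0, size=(200, 2)))
print(f"same distribution: {same:.4f}, shifted distribution: {diff:.4f}")
```

Significance is usually assessed by a permutation test: pool the two samples, reshuffle the labels, and recompute the statistic many times.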
My experience with MMD is that unless you've been using it and are familiar with other kernel methods, you probably won't know what to do with it : what kernel do I use ? how can I test for significance (in any sense of the word) ?
Add the (last I checked) horrendous complexity, and to me it looks like a less usable mutual information (or KL divergence), without all the nice information theory around it.
Shap values and other methods to parse out the inner workings of "black box" machine learning models. They're good enough that I've grown fond of just throwing a light gbm model at everything and calling it a day, for that sweet spot of predictive power and ease of implementation.
Strongly disagree. Shapley values and LIME give a very crude and extremely limited understanding of the model. They basically amount to a random selection of local slopes. For instance, if I tell you the (randomly selected) slopes of a function are [0.5, 0.2, 12.4, 1.1, 2.6], whose average is 3.3, can you guess anything? You might notice it is monotonic (maybe), but you certainly won't guess that it is e^x.
If you have a relatively small number of features and your features have relatively stable but not necessarily linear partial dependence, then shap can unpack them pretty well.
E.g. the value of a house might not increase linearly with square footage, distance from the city centre, and other factors, so a random forest/GBT might do better than a linear model without careful feature engineering. Shap is a brilliant tool to unpack the dependencies this model has learned.
There will be other cases where a linear model is adequate, so you can get the explainability just from the coefficients, or where there are very large numbers of features with complicated nested structure. In either case shap is not too useful.
But 'fit lightgbm and explain it with shap' is a decently pragmatic approach for lots of tabular machine-learning problems.
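A minimal sketch of that 'fit lightgbm, explain with shap' workflow on made-up housing-style data; the column names, coefficients, and model settings are invented, and it assumes the lightgbm and shap packages:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
import shap

# Fake tabular data with a non-linear floor-area effect and a distance effect.
rng = np.random.default_rng(0)
n = 5_000
X = pd.DataFrame({
    "sqft": rng.uniform(400, 4000, n),
    "dist_centre_km": rng.uniform(0, 30, n),
    "n_rooms": rng.integers(1, 8, n),
})
y = (50_000 + 200 * np.sqrt(X["sqft"])
     - 3_000 * np.log1p(X["dist_centre_km"])
     + 5_000 * X["n_rooms"]
     + rng.normal(0, 10_000, n))

model = lgb.LGBMRegressor(n_estimators=300, learning_rate=0.05).fit(X, y)

# TreeExplainer gives SHAP values for tree ensembles efficiently.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# These open matplotlib figures: overall feature importance, and the
# (possibly non-linear) learned dependence on one feature.
shap.summary_plot(shap_values, X)
shap.dependence_plot("sqft", shap_values, X)
```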
I would say that our ML models are not yet predictable enough in the local neighbourhood to really trust LIME. Adversarial examples prove that you just can't select a small enough range, since you can always find them even at super tiny distances.
Maybe. But I don’t care about estimating the slopes. I just need to know that big values of feature X pull the model prediction up and small values do nothing.
If you really need to unpack it, cluster your shap values to see what kinds of predictions are being made.
> I’ve grown fond of just throwing a light gbm model at everything and calling it a day
It is not always a good idea to do that. Always try different methods; there is no ultimate method. At the very least, OLS should be tried, along with some other fully explainable methods, even a simple CART-like method.
OLS is my default go-to. It outperforms a random forest so often in small-data, real-world applications over nonstationary data, and model explainability is built into it. If I'm working in a domain with stationary data, then I'd tilt more toward the forest (due to not having to engineer features, and the built-in ability to detect non-linear relationships and interactions between features).
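In that spirit, putting both baselines through cross-validation takes only a few lines with scikit-learn. This is a toy check on synthetic, mostly linear data rather than the nonstationary real-world case described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Small, mostly-linear synthetic data standing in for a "small real-world" problem.
rng = np.random.default_rng(3)
X = rng.normal(size=(150, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1.0, 150)

for name, model in [("OLS", LinearRegression()),
                    ("Random forest", RandomForestRegressor(n_estimators=200,
                                                            random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"{name}: mean CV R^2 = {scores.mean():.3f}")
```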
Seconding Chris Molnar's excellent writeup. I also find the readme & example notebooks in Scott Lundberg's github repo to be a great way to get started. There are also references there for the original papers, which are surprisingly readable, imo. https://github.com/slundberg/shap
There is a nice treatment of resampling, i.e., permutation tests, in (from my TeX format bibliography)
Sidney Siegel,
{\it Nonparametric Statistics for
the Behavioral Sciences,\/}
McGraw-Hill, New York,
1956.\ \
Right, there was already a good book on such tests over 50 years ago.
You can also justify it with an independent, identically distributed assumption. But a weaker assumption of exchangeability also works -- I published a paper on that.
The broad idea of such a statistical hypothesis test is to decide on the null hypothesis, null as in no effect (if looking for an effect, then you want to reject the null hypothesis of no effect), and to make assumptions that permit calculating the probability of what you observe. If that probability is way too small, then reject the null hypothesis and conclude that there was an effect. Right, it's fishy.
1) It’s not — there are lots of procedures called “the bootstrap” that act differently.
2) The fact that “substitute the data for the population distribution” both works and is sometimes provably better than other more sensible approaches is a little mind blowing.
Most things called the bootstrap feel like cheating, ie “this part seems hard, let’s do the easiest thing possible instead and hope it works.”
I think it's brilliant as an idea but not particularly mysterious after the fact or something. I love it, but think it's brilliant in part because it's so transparent and simple.
I agree "bootstrap" has expanded a bit in meaning but I think it's basically the same idea.
I used to think it was cheating but have realized there is a cost to it, which is replication. For many things it's just impractical in terms of computation time. So although it's great, it requires a lot.
Yes. It only tells you about variability within the sampled values. If you don't sample outliers, or get unlucky and sample many non-representative values, it can't tell you what you're missing out on (obviously).
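A bare-bones example of "substitute the data for the population": bootstrapping the standard error of a median, which has no tidy closed-form formula. The sample here is made up, and the caveat above applies: it can only resample values you actually observed.

```python
import numpy as np

rng = np.random.default_rng(7)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # a skewed made-up sample

# Treat the sample as the population: resample with replacement many times
# and watch how the statistic varies.
n_boot = 10_000
medians = np.array([
    np.median(rng.choice(sample, size=len(sample), replace=True))
    for _ in range(n_boot)
])

print(f"sample median = {np.median(sample):.3f}")
print(f"bootstrap SE  = {medians.std(ddof=1):.3f}")
print(f"95% percentile interval = ({np.quantile(medians, 0.025):.3f}, "
      f"{np.quantile(medians, 0.975):.3f})")
```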
Solomonoff Induction. Although proven to be uncomputable (there are people working on formalizing efficient approximations) it is such a mind-blowing idea. It brings together Occam's razor, Epicurus' Principle of multiple explanations, Bayes' theorem, Algorithmic Information Theory and Universal Turing machines in a theory of universal induction. The mathematical proof and details are way above my head but I cannot help but feel like it is very underrated.
Statistics is an applied science, and Solomonoff induction has had zero practical impact. So I feel it's not underrated at all, and perhaps overrated among a certain crowd.
My impression, for what it's worth or not worth, is that this area is sort of what Bayesian statistics was 50 years ago. Something that is fundamental but is difficult to implement so doesn't have much impact practically speaking at a given moment in time.
Good review article. I always enjoy browsing the references, and found "Computer Age Statistical Inference" among them. Looks like a good read, with a pdf available online.
Surprised that nobody mentioned the first one in the list:
Counterfactual Causal Inference
The formalisation of causal inference by Pearl is in my opinion a development comparable to the invention of calculus in mathematics: suddenly the problem space solvable with statistics became at least an order of magnitude larger, and we're only starting to see the benefits in other sciences.
I would love to believe this is true, and I think that theoretically it's super important, but I've never had success with this in practical applications.
It’s a tough field to become proficient in because you rarely get to validate your process. But I’m starting to get a better understanding of double ML, and it really does work in a provable way.
My favourite is Mandelbrot's heuristic converse of the central limit theorem: the only variables that are normal are those that are sums of many variables of finite variance.
I have not heard that before, but that seems like a very strong statement to know! How does it interact with sample size, though? Seems hard to formulate precisely…
The idea is that if you ever find a distribution that seems to be normal, you look for the finite variance variables that are lurking behind it. If you do not find these variables, then your distribution was not likely normal.
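A quick, purely illustrative simulation of that contrast: normalized sums of finite-variance terms have vanishing far tails, while sums of Cauchy terms (which have no finite variance) stay just as heavy-tailed no matter how many you add.

```python
import numpy as np

rng = np.random.default_rng(11)
n_sums, n_terms = 20_000, 200

# Sums of finite-variance (uniform) terms, scaled by their standard deviation:
# the far tails all but disappear, as the CLT says.
finite = rng.uniform(-1, 1, size=(n_sums, n_terms)).sum(axis=1)
finite /= finite.std()

# Sums of standard Cauchy terms: there is no std to scale by, so scale by n.
# The result is again standard Cauchy, and the fat tails never thin out.
heavy = rng.standard_cauchy(size=(n_sums, n_terms)).sum(axis=1) / n_terms

for name, x in [("finite-variance sums", finite), ("Cauchy sums", heavy)]:
    print(f"{name}: fraction beyond 5 units = {np.mean(np.abs(x) > 5):.5f}")
```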
Not in the past 50 years, but more like the past 80 years, nonparametric statistics in general are pretty amazing, though underused. Look at the Mann-Whitney U test and the Wilcoxon Rank Sum. Those tests require very few assumptions.
p-hacking is a research "dark pattern" where a researcher fits several similar models and reports only the one that has the significant p-value for the relationship of interest.
This strategy is possible because p-values are themselves stochastic, and a researcher will find one significant p-value for every 20 models that they run, at least on average (a quick simulation of this is sketched below).
p-hacking could also refer to pushing a p-value that sits close to the significance cut-off (usually 0.05) over the line by modifying the statistical model slightly until the desired result is achieved. This process usually involves the inclusion of control variables that are not really related to the outcome but that will change the standard errors/p-values.
Another way to p-hack is to drop specific observations until the desired p-value is reached. This process usually involves removing participants from a sample for a seemingly legitimate reason until the desired p-value is achieved. Usually identifying and eliminating a few high leverage observations is enough to change the significance level of a point estimate.
Multiple strategies to address p-hacking have been proposed and discussed. One of the most popular ones is pre-registration of research designs and models. The idea here is that a researcher would publish their research design and models before conducting the experiment and they will report only the results from the pre-registered models. This process eliminates the "fishing expedition" nature of p-hacking.
Other strategies involve better research designs that are not sensitive to model respecification. These are usually experimental and quasi-experimental methods that leverage an external source of variation (external to both the researcher and the studied system, like random assignment to conditions) to isolate the relationship between two variables.
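To put a number on the one-in-twenty point above, here's a small simulation fitting 20 "models" (independent two-sample t-tests, as a stand-in) on pure noise; scipy is assumed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
n_datasets, n_per_group, n_models = 1_000, 30, 20

sig_counts, any_sig = [], 0
for _ in range(n_datasets):
    # Pure noise: there is no real effect in any of the 20 "models".
    y = rng.normal(size=(n_models, 2, n_per_group))
    pvals = np.array([stats.ttest_ind(a, b).pvalue for a, b in y])
    sig_counts.append((pvals < 0.05).sum())
    any_sig += pvals.min() < 0.05

print(f"average 'significant' results per {n_models} null models: "
      f"{np.mean(sig_counts):.2f}")            # about 1, i.e. 20 * 0.05
print(f"chance of at least one: {any_sig / n_datasets:.2f}")  # about 1 - 0.95**20
```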
I saw this firsthand as an undergrad research assistant in a neuroscience lab. How did it go when I brought it up? Swept under the rug and published in a high-impact journal.
The experience helped me realize that a non-trivial amount of work done in these labs is worthless and probably even harmful. They don't seem to care as much about the science as they care about publications (on the part of the PI) and being published (the PhD students).
Moreover, the lab pushed me out and decided to use my work anyway. Specifically, they requested that anybody that used the software I wrote give them authorship on their papers.
I would agree with you but they are speaking to a different audience. This is in a journal for statistics researchers and theorists. These would all be things that would inform the creation of pragmatic tools like SPC.
Meta-analysis is an application of idea #4 (Bayesian Multilevel Models) in the article.
What makes meta-analysis special within a multilevel framework is that you know the level 1 variance. This creates a special case of a generalized multilevel model where you leverage your knowledge of L1 mean and variance (from each individual study's results) to estimate the possible mean and variance of the population effect.
The population mean and variance are usually presented in funnel plots, where you can see the expected distribution of effect sizes/point estimates given a sample size/standard error.
Researchers have also started to plot actual point estimates from published papers in this plot, showing that most of the published results are "inside the funnel", a result that is usually cited as evidence of publication bias. In other words, the missing studies end up in researchers' file drawers instead of being published somewhere.
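To make the "known level-1 variance" point concrete, here's a minimal random-effects meta-analysis sketch using the DerSimonian–Laird estimator; the per-study effect sizes and standard errors are invented:

```python
import numpy as np

# Made-up per-study effect sizes and their (known) standard errors.
effects = np.array([0.30, 0.12, 0.45, 0.20, 0.05, 0.38])
se = np.array([0.10, 0.15, 0.20, 0.08, 0.12, 0.18])
v = se ** 2                      # known level-1 (within-study) variances

# Fixed-effect estimate: inverse-variance weighting.
w = 1 / v
fixed = np.sum(w * effects) / np.sum(w)

# DerSimonian-Laird estimate of the between-study variance tau^2.
k = len(effects)
Q = np.sum(w * (effects - fixed) ** 2)
tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)))

# Random-effects estimate: re-weight using within- plus between-study variance.
w_star = 1 / (v + tau2)
pooled = np.sum(w_star * effects) / np.sum(w_star)
pooled_se = np.sqrt(1 / np.sum(w_star))

print(f"tau^2 = {tau2:.4f}, pooled effect = {pooled:.3f} +/- {1.96 * pooled_se:.3f}")
```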
The most important statistical idea of the past 50 years is the same as the most important statistical idea of the 50 years before that:
"Due to reduced superstition, better education, and general awareness & progress, humans who are neither meticulous statistics experts, nor working in very constrained & repetitive circumstances, will understand and apply statistics more objectively and correctly than they generally have in the past."
We barely have any concept of how much more deeply education has penetrated our society now than fifty years ago.
We as a population are FAR more educated, far less ignorant, and generally speaking extraordinarily better off in our daily lives than in the past.
It's only because of the phenomenon of social media, which has crowdsourced ignorance and hurled it in front of our eyes, that we perceive we're in an unenlightened age/trajectory.
I've heard that Google and Baidu essentially started at the same time, with the same algorithm discovery (PageRank). Maybe someone can comment on if there was idea sharing or if both teams derived it independently.
PageRank actually had a predecessor called HITS (according to some sources HITS was developed before PageRank, according to others they were contemporaries), an algorithm developed by Jon Kleinberg for ranking hypertext documents. https://en.wikipedia.org/wiki/HITS_algorithm However, Kleinberg stayed in academia and never attempted to commercialize his research like Page and Brin did. HITS was more complex than PageRank and context-sensitive, so queries required much more computing resources than PageRank. PageRank is kind of what you get if you take HITS and remove the slow parts.
What I find very interesting about PageRank is how you can trade accuracy for performance. The traditional way of calculating PageRank by means of squaring a matrix iteratively until it reaches convergence gives you correct results but is sloooooow. For a modestly sized graph it could take days. But if accuracy isn't that important you can use Monte Carlo simulation and get most of the PageRank correct in a fraction of the time of the iterative method. It's also easy to parallelize.
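For reference, the iterative computation described above is essentially power iteration on the damped transition matrix. Here's a tiny dense version, fine for toy graphs and hopeless at web scale (the Monte Carlo alternative just simulates many short random walks with restarts and counts visit frequencies):

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10):
    """Power iteration on a dense link matrix: adj[i, j] = 1 if page i links to page j."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-normalise; dangling pages (no out-links) jump uniformly at random.
    transition = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n)
    rank = np.full(n, 1.0 / n)
    while True:
        new = (1 - damping) / n + damping * (transition.T @ rank)
        if np.abs(new - rank).sum() < tol:
            return new
        rank = new

# Tiny example graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [1, 0, 0]], dtype=float)
print(pagerank(adj))   # roughly [0.39, 0.21, 0.40]
```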
Jon M. Kleinberg, "Authoritative sources in a hyperlinked environment," 1998, Proc. Of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 668-677.
> In 1996, while at IDD, Li created the Rankdex site-scoring algorithm for search engine page ranking, which was awarded a U.S. patent. It was the first search engine that used hyperlinks to measure the quality of websites it was indexing, predating the very similar algorithm patent filed by Google two years later in 1998.
Given that PageRank was literally invented by and named after Larry Page, I would think that Google had a head start.
That being said, PageRank is more a stellar example of adapting an academic idea into practice than a statistical idea in and of itself.
After all, it is 'merely' the stationary distribution of a random walk over a directed graph. I say 'merely' with a lot of respect, because the best ideas often feel simple in hindsight. But it is that simplicity that makes them even more impressive.
> Given that PageRank was literally invented by and named after Larry Page, I would think that Google had a head start.
I don't think this means much. The history of science and technology is full of examples of results named after someone other than the first person to find them.
The sort of methods 'PageRank' uses already existed. It reminds me of Apple 'inventing' (air quotes) the mp3 player. It didn't; it applied existing technology, refined it, and publicized it. They did not invent it, but maybe 'inventing' something is only a very small part of making something useful for many people.
The basic concept behind PageRank is pretty obvious. If you stare at a graph for a while and try to imagine centrality calculations, it'll probably be your big idea too.
Implementing it and catching edge cases isn’t trivial
I heard that Google's first approach was to use/adapt a published algorithm for ranking scientific publications based on the network of citations. Not sure if this is the algorithm you mentioned, though.
The ranking of scientific publications based on citations you’re describing is impact factor [1]. I haven’t heard that as an inspiration for Larry Page’s PageRank [2] but that is plausible.
(I do not want to link directly to the pdf shown in the search result).
Section 2.1 deals with related work: "There has been a great deal of work on academic citation analysis [Gar95]. Goffman [Gof71] has published an interesting theory of how information flow in a scientific community is an epidemic process..." (and more)
The most important to me is "There are three kinds of lies in this world. Lies, damn lies, and statistics."
Not attacking the mathematical field of statistics, just pointing out that lots of people abuse statistics in an attempt to get people to behave as they would prefer.
I would argue the problem lies in our complete lack of teaching around statistical epistemology. It's such a critical concept in my eyes, and looking back at my education it seems to have been totally glossed over.