Introduction to Modern Statistics (openintro-ims2.netlify.app)
740 points by noelwelsh 11 months ago | 132 comments



Statistics education is undergoing a bit of a revolution, driven by the accessibility of computers. For example, hypothesis testing is introduced by randomization[1], using a randomized permutation test[2] (a minimal sketch follows the references below). I find this really easy to understand, compared to how I learned statistics using a more traditional approach. The traditional approach taught me a cookbook of hypothesis tests: use the t-test in this situation, use the chi-squared in this situation, and so on. From that cookbook approach I never gained any understanding of why I should use these different tests, or where they came from.

For the same approach in a slightly different context see [3].

[1]: https://openintro-ims2.netlify.app/11-foundations-randomizat...

[2]: https://en.wikipedia.org/wiki/Permutation_test

[3]: https://inferentialthinking.com/chapters/11/1/Assessing_a_Mo...
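
To make the idea concrete, here's a minimal sketch of a randomized permutation test in R (the data and the 10,000-shuffle count are invented for illustration, not taken from the book):

    # Hypothetical data: outcomes for a treatment and a control group.
    treatment <- c(5.2, 6.1, 5.8, 7.0, 6.5)
    control   <- c(4.9, 5.0, 5.5, 5.1, 4.7)
    observed  <- mean(treatment) - mean(control)

    # Null hypothesis: the group labels don't matter, so shuffle them.
    pooled <- c(treatment, control)
    n_t    <- length(treatment)
    diffs  <- replicate(10000, {
      shuffled <- sample(pooled)
      mean(shuffled[1:n_t]) - mean(shuffled[-(1:n_t)])
    })

    # p-value: how often does random labelling beat what we observed?
    mean(abs(diffs) >= abs(observed))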


The difficulty of teaching statistics is that the maths you need to prove the methods are correct, and to gain an intuitive understanding of them, is far more advanced than what is presented in a basic stats course. Gosset came up with the t-test and proved to the world it made sense, yet we teach students to apply it in a black-box way without a fundamental understanding of why it's right. That's not great pedagogy.

IMO, this is where Bayesian statistics is far superior. There's a Curry-Howard isomorphism to logic which runs extremely deep, and it's possible to introduce it using conjugate distributions with nice closed-form analytical solutions. Anything more complex, well, that's what computers are for, and there are great tools (Stan) for fitting models far more intricate than frequentist methods can handle.


> There's a Curry-Howard isomorphism [between] logic [and Bayesian statistical inference].

This is an odd way of putting it. I think it's better to say that, given some mostly uncontroversial assumptions, if one is willing to assign real-number degrees of belief to uncertain claims, then Bayesian statistical inference is the only way of reasoning about those claims that's compatible with classical propositional logic.


The willingness to assign real numbers to degrees of belief is the controversial assumption. Converted Bayesians tend to gloss over this fact. Many, as in a sibling comment, state that MLE is Bayesian statistics with a uniform prior, but this isn't true of most if not all frequentist inference, which is based on NHST and confidence intervals, not MAP. Modeling uncertainty with uniform priors (or even more sophisticated non-informative priors a la Jaynes) is a recipe for paradoxes, and there is no alternative practical proposal that I know of. I have no issue with Bayesian modeling in an ML context of model selection and validation based on resampling methods, but IMO it doesn't live up to the foundational claims its proponents often make.


Maximum likelihood (which underpins many frequentist methods) basically amounts to Bayesian statistics with a uniform prior on your parameters. And the "shape" of your prior actually depends on the chosen parametrization, so in principle you can account for non-flat priors as well.
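
As a toy illustration of this point (the binomial example and numbers are mine): for a binomial likelihood with a uniform Beta(1, 1) prior, the posterior mode and the maximum likelihood estimate coincide.

    # 7 successes out of 10 trials.
    k <- 7; n <- 10

    # Maximum likelihood: maximize the binomial log-likelihood numerically.
    mle <- optimize(function(p) dbinom(k, n, p, log = TRUE),
                    interval = c(0, 1), maximum = TRUE)$maximum

    # Uniform Beta(1, 1) prior: the posterior is Beta(k + 1, n - k + 1),
    # whose mode (a - 1) / (a + b - 2) works out to k / n -- exactly the MLE.
    map <- (k + 1 - 1) / (n + 2 - 2)

    c(mle = mle, map = map)  # both ~0.7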


IMHO, the discussion should not be so much whether to teach Bayesian or maximum likelihood, but instead whether to teach generative models or to keep going with hypothesis tests, which are generally presented to students as a bag of tricks.

Generative models (implemented in e.g. Stan, PyMC, Pyro, Turing, etc.) split the model from the inference, so one can switch from maximum likelihood to variational inference or MCMC quite easily (a sketch of the idea follows below).

Generative models, beginning from regression, make a lot more sense to students and yield much more robust inference. Most people I know who publish research articles on a frequent basis do not know that p-values are not a measure of effect size. This demonstrates that current education has failed.
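
Here is a rough sketch of what "split models from inference" buys you, in base R rather than any of the PPLs named above (the simulated data and the crude Metropolis tuning are invented for illustration): write the model down once as a log-likelihood, then hand it unchanged to either an optimizer or a sampler.

    set.seed(1)
    # Generative model: y ~ Normal(a + b*x, 1), with simulated data.
    x <- runif(50); y <- 2 + 3 * x + rnorm(50)

    # The model, written once, as a log-likelihood over theta = (a, b).
    loglik <- function(theta) sum(dnorm(y, theta[1] + theta[2] * x, 1, log = TRUE))

    # Inference 1: maximum likelihood via a generic optimizer.
    ml <- optim(c(0, 0), function(th) -loglik(th))$par

    # Inference 2: MCMC via a crude random-walk Metropolis sampler
    # (implicitly a flat prior, so it targets the same surface).
    theta <- c(0, 0); draws <- matrix(NA, 5000, 2)
    for (i in 1:5000) {
      prop <- theta + rnorm(2, sd = 0.1)
      if (log(runif(1)) < loglik(prop) - loglik(theta)) theta <- prop
      draws[i, ] <- theta
    }
    rbind(ml = ml, mcmc = colMeans(draws[-(1:1000), ]))  # similar estimates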


Maximum likelihood corresponds to Bayesian statistics with MAP estimation (under a flat prior), which is not the typical way to use the posterior.


I've had similar thoughts, but I think it's more to do with what is in your head at the time you hear about it. I found permutation tests satisfying to learn about because they somehow helped consolidate what I knew from distribution theory. If I hadn't known any distribution theory beforehand, I'm not sure they could have had that effect.

If you study mathematical statistics, it is not taught as a cookbook. At the elementary level you learn probability theory and distribution theory; all the different distributions, hypothesis tests, regression, ANOVA and so on proceed from there. Meanwhile, I think research scientists are often taught statistics as a set of recipes because it's usually a short course for a specific discipline, e.g. statistics for biologists.


I think those short courses would be more effective if they didn't bother with ANOVA and instead taught intro probability and distributions and then jumped straight to regression. ANOVA is just a really specific way of doing a regression.

In R and Python's statsmodels, you get the answer to (essentially) an ANOVA any time you run an LM or GLM; it's the F-statistic for your whole model.

I know there is more nuance to this, but teaching students that they can use regression for most of the problems they would have used seemingly arcane tests for is going to be much more useful for the students.

Here is a lovely page demonstrating how to do this in R: https://lindeloev.github.io/tests-as-linear/
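
As a one-screen taste of that page's idea, here's a sketch in R with simulated data (my own toy example, assuming equal variances): a two-sample t-test and a one-predictor lm report the same p-value.

    set.seed(42)
    g <- factor(rep(c("a", "b"), each = 20))
    y <- c(rnorm(20, mean = 0), rnorm(20, mean = 1))

    t.test(y ~ g, var.equal = TRUE)$p.value     # classic two-sample t-test
    coef(summary(lm(y ~ g)))["gb", "Pr(>|t|)"]  # same p-value, from regression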


I agree with the sentiment, although I'm not sure there is the time for all of it. At least when I took them, probability theory and distribution theory were separate semester-long courses, and the former was a prerequisite for the latter.


Statsmodels and that GitHub page are the only reason I have some understanding of statistical tests.


Principles of Statistics by M.G. Bulmer is a nice introduction to the mathematical side of things. It's part of Dover's classic textbook series, so it's inexpensive compared to newer textbooks, and also concise and well-written.

It does assume you already have a solid understanding of calculus and combinatorics, though. Which I think is fair. Discrete statistics is arguably just applied combinatorics, and continuous statistics applied calculus, so if you have a strong foundation in those two subjects then you're already 90% of the way there. (And, if you don't, stop the cart and let the horse catch up.)


There is also Brilliant, which has a very polished interactive course:

https://brilliant.org/courses/statistics/


These things are great if they add value for you, but I would be very skeptical of any non-mathematical approach to statistics. I think statistics is only made clear by mathematics, much the same as physics. One cannot grasp statistics without being able to understand the maths.

I still think the best way to understand statistics is to start with the mathematical theory and to grind 1000+ textbook problems.


> I still think the best way to understand statistics is to start with the mathematical theory and to grind 1000+ textbook problems.

Are there any books you'd recommend for this approach?


My grind was "Mathematical Statistics with Applications" by Wackerly et al. There are PDF versions if you Google for it. I can't say it was quick, easy, or intuitive, but it works.

I also liked "In all Likelihood" by Pawitan for a "likelihoodist" foundational approach.


I was introduced to statistics with the mathematical approach, and it failed to help me understand why things were the way they were, due to the lack of context and application.

Later I re-learned it in an applied way, and things made much more sense to me.

(That's fine if you liked learning it your way, but that's no reason to be skeptical about other approaches.)


I think a common way to teach classical statistics is to start with a binomial test. The binomial distribution is quite concrete, so students can hopefully grasp the concept of a p-value. Then things become more abstract with the normal distribution, the t distribution, etc., and a cookbook of when to use what...
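
The concreteness is real: the whole calculation fits in a couple of lines of R (the coin-flip numbers are just an invented example).

    # Observed: 60 heads in 100 flips. Null hypothesis: the coin is fair.
    binom.test(60, 100, p = 0.5)$p.value

    # The same number by hand: the p-value is just a sum over the binomial
    # distribution of all outcomes at least as extreme as the one observed.
    sum(dbinom(c(0:40, 60:100), 100, 0.5))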


Hey, I recognized you from the Scala doodle project! That and the other stuff from Underscore were a really cool way to learn Scala & FP, thank you :)


That's very random but very cool! I'm glad the books were useful to you :-) This post was actually motivated by my wanting to write some material on data science, wondering what the modern approach to teaching it is, and falling down a rabbit hole.


Do you know of any validation studies with Advanced Data Analysis (formerly Code Interpreter) in ChatGPT? I think it can be excellent as a teaching tool.


As much as I appreciate and love all pedagogical endeavours in the field, especially in the form of open texts, I really, really, really dislike this overall approach to teaching introductory statistics.

I'm hoping to see, over time, a shift away from ad-hoc null hypothesis testing in favour of linear models (yes, in introductory courses, from the start -- see link below) and Bayesian-by-default approaches.

https://lindeloev.github.io/tests-as-linear/#:~:text=Most%20....


I am partway through McElreath's "Statistical Rethinking" and I fully agree with this.


If you're starting out on Statistical Rethinking, be sure to check out recoded [0]: a reimplementation of Statistical Rethinking's exercises using tidyverse, ggplot and brms instead of McElreath's own package. These tools are standard and well worth learning on their own, and they're also better documented than the package made by McElreath.

[0] https://bookdown.org/ajkurz/Statistical_Rethinking_recoded/


That's a great textbook!


It's been recommended on this topic several times, so I'm looking at it. Quite expensive! I see there is a series of lectures which seems to cover the same material as the book. Is it the same? Or is the book still worth buying?


The lectures are good, and I've been told the book can be found online by the intrepid. I guess that Anna's Archive or Library Genesis has it.


I've found the book indeed, although it seems to be the first edition.

It's here: https://civil.colorado.edu/~balajir/CVEN6833/bayes-resources...


With all due respect, how hard did you look?

This is the third result from a query I ran:

https://github.com/Booleans/statistical-rethinking/blob/mast...

Main repo w/ some notebooks: https://github.com/Booleans/statistical-rethinking


Thanks!


I agree about teaching from a unified GLM basis. The 'Bayesian-by-default' approach seems to be going out on a more tenuous limb, IMO.


It only appears tenuous because the subjective choices you have to make when using frequentist methods are made for you by the developer of the method.

It's less comfortable to use Bayesian methods because you have to be explicit about your assumptions as the user, which opens your assumptions up for easier inspection. There's also way less specific information implied by priors than most people think. Informative priors should try to make distinctions between something that's reasonable-ish and something that's essentially infinity (take pharmacokinetics, for example: the diffusion velocity of a molecule in your bloodstream shouldn't be anywhere near the speed of light in a vacuum, should it?). They should not be forcing your model to achieve a particular result. Luckily, because of the need to state them explicitly in a Bayesian analysis, it's much easier to determine whether they were properly set.

Prior specification is essentially problem-domain-informed regularization where you can actually hope to understand whether the hyperparameter is going to work or not.
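
One concrete way to audit this, sketched in R (the distributions are invented): a prior predictive check, i.e. simulate from the prior alone and ask whether the implied quantities are even physically plausible.

    set.seed(1)
    vague <- exp(rnorm(1000, 0, 10))  # "uninformative" prior on a log-scale rate
    sane  <- exp(rnorm(1000, 0, 1))   # weakly informative alternative
    quantile(vague, c(0.5, 0.99))     # spans absurd orders of magnitude
    quantile(sane,  c(0.5, 0.99))     # stays within reasonable-ish bounds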


I understand where you're coming from, and I like the idea for a certain kind of person: those who are very good at handling abstractions. Software engineers do have this skill, but the majority of statistics users do not. Trying to explain the similarities between these linear methods, and how all is one [1], to a social scientist who doesn't like numbers or formulas to begin with would only lead to more confusion.

But if you ever do a randomized test with a suitable linear model to estimate the efficacy of these two methods, do let us know; that would be 10/10 :)

[1]: https://lindeloev.github.io/tests-as-linear/#41_one_sample_t...


> I'm hoping to see, over time, a shift away from ad-hoc null hypothesis testing in favour of linear models (yes, in introductory courses, from the start -- see link below) and Bayesian-by-default approaches.

Is there anything where I can start today, as a guinea pig? My statistics education is basically zero.


There's this great series of lectures I watched during my intro to statistics and probability course: https://youtube.com/playlist?list=PLQfiOKXnQpw_l0rbiV_QW8lwl...

It goes over Bayes' Theorem early on, which I assume is Bayesian-by-default. I didn't realize this isn't universal.

I watched it because my professor seemed to be teaching from the same textbook, so it followed the same general course structure. The textbook we used was "Probability and Statistics for Engineering and the Sciences — 9th edition". You can find a PDF of the eighth edition just by googling the name.

You could probably follow along just by taking notes during the video lectures, but if you want to give yourself homework, the textbook provides a lot of practice problems.


Bayes' theorem is universal. It's one of the most fundamental results in probability.

Bayesian statistics is not. Most basic intro courses go the frequentist route for historical reasons, but really both methods have their pros and cons.


There are other comments here that suggest a number of books at varying levels. "Introduction to Modern Statistics" is very approachable in its presentation.


See my sibling comment; I can recommend this: https://xcelab.net/rm/statistical-rethinking/


Very excited to see Mine Çetinkaya-Rundel is an author here! Many might be familiar with “R for Data Science” (https://r4ds.had.co.nz/), to which she is a contributor, but she’s also published a lot of great papers around teaching data science.


She also has some online courses on Coursera (https://www.coursera.org/instructor/minecetinkayarundel). Hands down one of the best instructors I have seen.


One of my favourite books on statistics and probability is "Regression and Other Stories", by Andrew Gelman, Jennifer Hill and Aki Vehtari. You can access the book for free here: https://users.aalto.fi/~ave/ROS.pdf


+1, this is a great textbook, and not just for social sciences as the second header would suggest.


Is there a "pre-statistics" book that teaches the thinking skills and concepts needed to understand statistics?


An undergrad (non-measure theoretic) probability book with basic linear algebra and calculus is the best preparation. If you solidly understand the basic tools of probability, you can learn statistics quite easily in my experience. Without that understanding, statistics will likely seem like a bunch of recipes. However, if you are comfortable with probability, you can write down the actual problem you have and find the statistical tool you’re after with a little googling.


Thank you for the guide! It's very helpful. This comment [1] describes my background with math.

Something related: I often feel like I'm doing math when I work with information, especially when drawing out concepts and their relationships. Like 3D Tetris, but with recursion. There are also patterns in categorization. One of my purposes in learning math is to be able to quantify relationships between concepts, and create models, etc. What would I need to know about in this case?

[1] https://news.ycombinator.com/item?id=37857050


I'm not sure what the best advice is for someone with a brain injury, but I feel for you. I didn't realize I had an attention disorder until I was 25, and it explained why I would screw up signs and struggle to understand formulas.

However, I will say that, as you learn more math, the intuition and big-picture thinking are way more important than the details. I normally forget the specifics of how something works, but I know what I'm trying to do and what thing I need to look up to do it. There is very little need to memorize things besides having enough "RAM" for the moment. You always have pen and paper to write stuff down!

Regarding modeling, the sky is kind of the limit. The more math you know, the more abstractions you learn and start to see in the world. Dynamical systems is a good field to look at, but I'm biased because it's my main topic. Despite its usual applications in physical modeling, an algorithm is basically a time-discrete dynamical system: you recursively apply a function to a state and that gives you a new state. I have certainly seen algorithms analyzed from that perspective before.

Another area you might find interesting is algebra. This isn't like the algebra you took in school, but more about looking at a space of objects that interact via some operation and characterizing what you know about the space given that operation. A classic example is that rotation operations form an algebraic structure known as a "group", due to the fact that any two rotations compose to give a third, rotations are invertible (you can cancel a rotation by rotating in the opposite manner), and they follow the associative property (a small numeric sketch follows below). There's a book I've been meaning to read about how to use this kind of algebra to design safer and more intuitive APIs by considering the data structures of the APIs as objects and functions/methods as operators on those objects.
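
Here's that sketch in R (my own construction, just to make the three group properties tangible):

    # A 2D rotation by angle a, as a matrix.
    rot <- function(a) matrix(c(cos(a), sin(a), -sin(a), cos(a)), 2, 2)

    # Closure: composing two rotations gives a third rotation.
    all.equal(rot(0.3) %*% rot(0.5), rot(0.8))

    # Inverses: rotating back by the opposite angle gives the identity.
    all.equal(rot(0.3) %*% rot(-0.3), diag(2))

    # Associativity: how you group the compositions doesn't matter.
    all.equal((rot(0.1) %*% rot(0.2)) %*% rot(0.3),
              rot(0.1) %*% (rot(0.2) %*% rot(0.3)))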


Any advice is good, I just have to start at the beginning in a way that most people don't. Beyond that, normal advice is fine. I can adjust that advice to the quirks of my systems.

Dynamical systems is a great new keyword! Algorithms are a perfect example. I am exploring system dynamics from a metacybernetic point of view (via the viable system model). My focus is the dynamics of humans in complex systems, how systems change human behavior and vice versa.

Does big-picture thinking in math involve understanding/intuition of the implications of formulas, and how they interact with other formulas (or other mathematical entities; I don't know if "formula" is a general enough term for "group of math actions")?

Graph dynamical systems looks promising, because I draw similar pictures when showing relationships over time. Fractals are always good, drawing fractals taught me to think recursively.

Is this the definition of group in the algebra you mentioned? https://mathworld.wolfram.com/Group.html

I am also interested in the way models create understanding, and how technical models can be visually altered to aid in understanding. I have a background in visual psychology, and see many mistakes that cloud the meaning of what is being communicated.


This book seems to start where you need it to start.

You don't need much beyond basic calculus. Most suffer from some mental block installed at a young age, akin to those who say "I'm bad at math" because their teacher sucked. Dive in and you won't regret it.


I have been a math teacher, and although I can't guarantee that I didn't suck, I can say that most kids don't develop this attitude because of teachers, but because of their parents. "My mum says that she sucked at math/music/whatever as well, so do I!" is far too common. As a teacher I just didn't have the resources to influence this attitude either.


Yes, parents can be horrible too. Unfortunately it's somehow socially acceptable, and even a point of pride in some circles, to be "bad at math". It seems very rare for someone to openly say "I'm bad at [my native language]" or "writing".

I feel stats has a somewhat similar status even among those with a math education. Several friends who have a degree in math recoil at the first mention of stats concepts.


> It seems very rare for someone to openly say "I'm bad at [my native language]" or "writing".

It is actually even fashionable in non-English countries. Declaring "I'm bad at [my native language], I only use English anyway" makes you a better person somehow. And it's not rare in other areas either – in a post-truth world it's trendy not to know things.


In non-English countries? All of them? Source? I, as a person from one of said non-English countries, disagree.


I didn't mean all of them, or everyone in any of them. But it has always been the case everywhere in the world. It's the mechanism by which languages die: gravitation toward the bigger languages. Hundreds of languages are dying in Russia not only because of limited education in regional languages, but because it's more fashionable to be Russian than a representative of a smaller nation. At the world level this gravitation is mainly towards English.

In my country it's especially fashionable for people in tech not to know the native language. "It's impossible to talk about tech in native languages anyway, so we should all use English in the future" is very common. I tried to fight it by localizing/translating software for many years, but I've given up for now.


Not OP, but I recently had a manager in Germany (I'm in the US) say something like that.

Not that he was bad at German, necessarily, but that it was overly complex vs English, so he and his significant other used English at home even though they didn't have to.

This is just one example, of course.

I was actually surprised to hear him say that (though it may have been to thwart my attempts at speaking some German when we had a meeting; I don't know any German, lol).


My mental block is a brain injury that went undiagnosed until I was 30. I can't really hold more than two numbers in my head at a time. I struggled through math in school because it was lecture-based, and the books were written to accompany a lecture.

I can learn math fairly well if I have the right written material and the right direction. However, I do not retain math skills: without active practice, I revert back to "how do fractions work?"

For example, I did extremely well in a college algebra course that was partially online (combined with Khan Academy to catch up). I could do my tests perfectly in pen, much to the amusement of the assistants. I could make connections and see the implications and applications of the math. Roughly three to six months later, I was back to forgetting fractions.

I can't learn these things over time, but I can learn them all at once. I'm collecting resources for my next math adventure.


> My mental block is a brain injury that went undiagnosed until I was 30.

Would you be OK with elaborating on this a bit more?


Yes. What would you like to know? I am better at answering specific questions.

The injury was diagnosed via SPECT scan. Two of them: one after mental activity, one after physical relaxation. This also revealed that my brain relaxes with mental activity, and "lights up" with physical relaxation.


Do you know how/when you got the injury? What prompted you to get the scan?


No, I don't know how. The clinic also took a medical history and said, based on my symptoms, it probably happened in early childhood, as a baby.

I always knew there was "something" wrong with me, so I jumped at the opportunity to get a brain scan. The clinic was a little suspect in that they claimed to be able to diagnose mental health issues via the scan, but they combined the scan with more traditional evaluations.


Lots of people don't make it past basic algebra, let alone basic calculus.


Many of them don't make it because they've had some early bad experience, rather than because they're actually unable to. And there is a parallel for those who have some basic mastery of algebra and calculus but have a mental block towards stats.

To quote pg, many conversations on the Internet:

Person 1: ∃x P(x)

Person 2: ¬(∀x P(x))!


I think Ronald Fisher may not have used the bootstrap to calculate confidence intervals, but it looks to me like he invented most of the rest of the syllabus... in the early 1900s :-)


For studying statistics, I put together a comprehensive cheat sheet: https://github.com/mavam/stat-cookbook


I like the inclusion of randomization and bootstrapping. It's unfortunate that the hypothesis framework is still NHST -- I wouldn't consider that 'modern' by any means.


I don't see widespread agreement in the statistics community as to what should replace NHST. If you go Bayesian you need to completely rewrite the course. I've seen confidence intervals suggested as an alternative, but there are arguments against. I've also seen arguments that hypothesis tests shouldn't be used at all. Given that NHST is still widely used and there isn't a clear alternative, I think it's a disservice to students not to introduce them.


I probably should have been clearer. I didn't say hypothesis testing, I said NHST (the binary null/alt hypothesis approach), which is an approach to hypothesis testing particularly prevalent in certain disciplines such as psychology.

And in that context, there is a lot of agreement that this approach is fundamentally flawed and outdated. If you are interested, I can provide references when I get to the office, but off the top of my head consider Gigerenzer and Cummings.


For those following along at home, Gigerenzer is, I think, "Mindless Statistics"[1] and Cummings is "The New Statistics"[2].

[1]: https://pure.mpg.de/rest/items/item_2101336/component/file_2...

[2]: Sample at https://tandfbis.s3.amazonaws.com/rt-media/pp/common/sample-...


Yes, those are appropriate (although Gigerenzer and Cummings both have other relevant publications on the topic).

As for an undergraduate text that 'teaches the difference', you can look at 'An Introduction to Statistics' by Carlson & Winquist.


What is a good book on statistics that one can use for self-learning?


Depends where you are starting from and what you want to learn. The linked book is a first year introduction, and does a good job of that. If you want to go further there are many other options:

* Statistical Inference by Casella and Berger. This book has a very good reputation for building statistics from first principles. I won't link to them, but you can find full PDF scans online with a simple search. Amazon reviews: https://www.amazon.com/Statistical-Inference-Roger-Berger/dp...

* Statistics by Freedman, Pisani, and Purves has similarly very good reviews and can be easily found online. Amazon reviews: https://www.amazon.com/Statistics-Fourth-David-Freedman-eboo...

* The majority of the Berkeley data science core curriculum books are online. This is not purely statistics but 1) is taught in a modern style that makes use of computation and randomization and 2) uses tools that may be useful to learn about.

1. https://inferentialthinking.com/chapters/intro.html (Data 8)

2. https://learningds.org/intro.html (Data 100)

3. http://prob140.org/textbook/content/README.html (Data 140)

4. https://data102.org/fa23/resources/#textbooks-from-previous-... (Data 102; this gets into machine learning and pure statistics)

The Berkeley curriculum is not the only one; there are tens, possibly hundreds, of online courses. The Berkeley curriculum is just 1) quite extensive and 2) the one I happened to read the most about when I was recently researching how data science is currently taught.


I like Statistical Rethinking. It's targeted at science PhD students, so the focus is "how can you use statistics for testing your scientific hypotheses and trying to tease out causation". It doesn't go deep into the mathematics of things (though it expects readers to be decently numerate and comfortable analysing data without statistics). It only really talks about Bayesian models and how to fit them by computer, so it won't cover much of the frequentist side of things at all.


ISLR/ISLP is free, was used in my master's, and is excellent (and has an accompanying video series).

https://www.statlearning.com/


I particularly like Statistical Inference by George Casella and Roger Lee Berger.

You could also look at Introduction to Probability by Joseph K. Blitzstein and Jessica Hwang (available for free here: http://probabilitybook.net (redirects to drive)).


It should be noted that Casella's book is… well… really great if you thought Spivak's Calculus and Rudin's analysis were fun books, especially the exercises.

Casella’s exercises are absolutely brutal.


A couple of more introductory books that come at it from the point of view of "someone who can code" are:

- https://greenteapress.com/wp/think-stats-2e/ (and the similar Think Bayes if you enjoy this one)

- https://nostarch.com/learnbayes

Can second Statistical Rethinking though if you have the basics of stats and want to learn it again from a very different, more causal/bayesian point of view.


What is your background and what field will you be applying your knowledge to?

There can be a rather wide gap between a theoretical approach that you might encounter as taught by a statistician and an applied approach you might encounter in a business statistics or social science statistics course.

Depending on your math background and the area of intended application, in my opinion, it would sway recommendations for a first 'book' on statistics for self-learning.


A good introductory book is this one: "The Statistical Sleuth. A Course in Methods of Data Analysis" (3rd Edition) by Fred Ramsey and Daniel Schafer <http://www.statisticalsleuth.com/>.

Another good book at a more advanced level is this one: "Regression and Other Stories" by Andrew Gelman, Jennifer Hill and Aki Vehtari <http://www.stat.columbia.edu/~gelman/arm/>.

Another good book at an even more advanced level is this one: "Data Analysis Using Regression and Multilevel/Hierarchical Models" by Andrew Gelman and Jennifer Hill <http://www.stat.columbia.edu/~gelman/arm/>.

These books are all very polished productions. What makes them special is that they emphasize teaching the statistical way of thinking.


No one ever mentions Concrete Mathematics (Graham, Knuth, et al), but it's a classic text for math and stats.

https://pubs.aip.org/aip/cip/article-abstract/3/5/106/136800...


No it’s not. You can argue about number theory and combinatorics, but the stats/prob part of the book is very inadequate.


Good video lecture series:

https://www.thegreatcourses.com/courses/learning-statistics-...

Might be available for free via your local library, too.


Seems to be mostly Creative Commons BY-SA 3.0 but there's a lot of "yes, but" language in that file: https://github.com/OpenIntroStat/ims/blob/main/LICENSE.md


I'm looking for help with distilling 'truth' from folk belief systems by formalizing them under a Bayesian network framework, in case anyone is looking for a project through which to sharpen their statistical saw.


The epub is apparently too big to send to a Kindle, but I can't see the option to download it, only the PDF. Any ideas?


Too big to email? I use Calibre to convert any file type I have and send it directly to my Kindle.


Me too, but the epub -> pdf conversion is never perfect in my experience, with random headers through the text or weird paragraph splits.

The "too big" error was from openintro.com, where they have a "send to Kindle" option, which I was hoping would provide a proper epub.


Thanks to the author for the book and making it open access. I always admire these efforts.


This looks to be the 2nd edition. Can anyone comment on how the 1st edition was?


Can I download this as a PDF? I'd like to read it offline.



This is the first version, not the 2nd?


Hmmm... must be because the 2nd edition is still in progress. The best option might be to follow the immortal words of Obi-Wan Kenobi and "use the source": https://github.com/OpenIntroStat/ims

Otherwise you can try building a PDF from the very similar Data 8 book[1] using [2]

[1]: https://github.com/data-8/textbook

[2]: https://jupyterbook.org/en/stable/advanced/pdf.html


Anyone looking to apply and compare frequentist and Bayesian methods within a unified GUI (which is essentially an elegant wrapper around R and selected/custom statistical packages) should check out JASP, developed by the University of Amsterdam [0]. It's free to use, and the graphs + captions generated during each step are publication quality right out of the box.

Using it truly feels like a 'fresh way' to do statistics. Its main website provides ample use cases, guides and tutorials, and I often return to the blog for the well-documented deep dives into how traditional frequentist methods and their Bayesian counterparts compare (the animated explainers are especially helpful, and I appreciate the devs reflecting on each release and future directions).

[0]: https://jasp-stats.org/


Even better than just being "free to use", it's F/OSS (under the AGPL):

https://github.com/jasp-stats/jasp-desktop


There was an interview with one of the JASP people (the creator or a maintainer, I can't remember) on the "Learn Bayesian Stats" podcast; it was very interesting.


I think the referenced episode is https://learnbayesstats.com/episode/61-why-we-still-use-non-... Thanks for pointing it out!


To me, it's academic software done right, both in terms of accessibility and maintenance. I'd love to hear more about their governance and funding structure and how this might be applied elsewhere, and learn about academic software of similar grade and utility.


How does this compare to other stat libraries?


I'd say in the free-to-use/FOSS category there are a couple of contenders here: RStudio & co., Jamovi, Stenci.la, KNIME, PSPP and Quarto.

RStudio for the uncompromised R language integration.

Jamovi for the clean interface and simple to use analyses and workflows.

Stenci.la for the multiple language notebook paradigm.

KNIME for the node based interface and workflow productionisation.

PSPP for the 'same but different' SPSS interface and feature set.

Quarto for the 'analysis as publication' approach, reactive notebooks and custom integrations.

Myself, I often opt for BigQuery + JavaScript stat libraries, which is free insofar as you remain within the sandbox mode.


What's often missing from these introductions is when statistics will not work, and what it even means when it "works". The amount of data needed to tell between two normal distributions is about 30 data points; between two power-law distributions, more than a trillion. (And this basically scuppers the central limit theorem, on which a lot of cargo-cult stats is justified.)

Stats, in my view, should be taught simulation-first: code up your hypotheses and see if they're even testable (a sketch closes this comment). Many, many projects would immediately fail at the research stage.

Next, know that predictions are almost never a good goal. Almost everything is practically unpredictable -- with a near-infinite number of relevant causes, uncontrollable.

At best, in ideal cases, you can use stats to model a distribution of predictions and then determine a risk/value across that range. I.e., the goal isn't to predict anything but to prescribe some action (or inference) according to a risk tolerance (risk of error, or financial risk, etc.).

It seems a generation of people have half-learned bits of stats, glued them together, and created widespread 'statistical cargo-cultism'.

The lesson of stats isn't hypothesis testing, but that almost no hypotheses are testable -- and then, what do you do?
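
Here's the kind of pre-study simulation I mean, sketched in R (the effect size and sample size are invented): simulate the world in which your hypothesis is true, then check whether your test could even see it.

    set.seed(1)
    # Hypothesis: a small effect (mean shift of 0.2 sd), n = 20 per group.
    pwr <- mean(replicate(2000, {
      a <- rnorm(20, 0); b <- rnorm(20, 0.2)
      t.test(a, b)$p.value < 0.05
    }))
    pwr  # ~0.09: the study is hopeless before any data is collected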


"Simulation first" is how I did things when I worked in data science and bioinformatics. Define the simulation that represents "random", then see how far off the actual data is using either information theory or just a visual examination of the data and summary statistic checks. That's a fast and easy way to gut check any observation to see if there is an underlying effect, which you can then "prove" using a more sophisticated analysis.

Raw hypothesis testing is just too easy to juke by overwhelming it with trials. Lots of research papers have "statistically significant" results, but give no mention of how many experiments it took to get them, or any indication of negative results. Given enough effort, there will always eventually be the analysis where you incorrectly reject the null hypothesis.


> The amount of data needed to tell between two normal distributions is about 30 data points

What are you trying to say here? If there are two normal distributions, both with variance one, one having mean 0 and the other having mean 100, and I get a single sample from one of the distributions, I can guess which distribution it came from with very high confidence. Where did the number 30 come from?


> Where did the number 30 come from?

Yeah, I've also heard 30 for normal distributions over and over in the ~7 stats courses that I've taken.

This SE stats answer sounds reasonable enough: https://stats.stackexchange.com/a/2542


This really resonates with me. I've attempted self-study of statistics many times, each time wanting to understand the fundamental assumptions that underlie popular statistical methods. When I read the result of a poll or a scientific study, how rigorous are the claimed results, and what false assumptions could undermine them?

I want to build intuitions for how these statistical methods even work, at a high level, before getting drowned in math about all the details. And like you say, I want to understand the boundaries: when statistics will not work, and what it even means when it "works".

I imagine that different methodologies exist on a spectrum, where some give more reliable results, and others are more likely to be noise. I want to understand how to roughly tell the good from the bad, and how to spot common problems.


It's ironic that this ... rant? ... is basically unreadable without knowledge of basic statistical methods.

How do you teach any of this to someone who hasn't already taken introductory statistics? How do you learn anything if you first have to learn the myriad ways something you don't even have a basic working knowledge of can fail before you learn it?


The comment is addressed to the informed reader who is the only one with a hope of being persuaded on this point.

To teach this from scratch is, I think, fairly easy -- but there are few with any incentive to do it. Many in academia wouldn't know how, and if they did, they would discover that much of their research can be shown a priori to not be worthwhile (rather than after a decade of 'debate').

All you really need is to start with establishing an intuitive understanding of randomness, how apparently highly patterned it is, and so on. Then ask: how easy is it to reproduce an observed pattern with (simulated) randomness?

That question alone, properly supported via basic programming simulations, will take you extremely far. Indeed, the answer to it is often obvious -- a trivial program (one is sketched at the end of this comment).

That few ever write such programs shows how the whole edifice of stats education is geared towards confirmation bias.

Before computers, stats was either an extremely mathematical discipline seeking (empirically useless) formulas for toy models, or one using heuristic empirical formulas that rarely applied.

Computers basically obviate all of that. Stats is mostly about counting things and making comparisons -- perfect tasks for machines. With only a few high-school mathematical formulas, most people could derive most useful statistical techniques as simple computer programs.
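
As an example of such a trivial program, a sketch in R (the "pattern" and all the numbers are invented): suppose you're struck by a streak of 7 heads somewhere in 100 coin flips. How often does pure randomness produce one?

    set.seed(1)
    has_streak <- function(flips, len) {
      r <- rle(flips)
      any(r$lengths[r$values == 1] >= len)
    }
    mean(replicate(10000, has_streak(sample(0:1, 100, replace = TRUE), 7)))
    # ~0.3: the "striking" streak shows up in about a third of purely
    # random sequences, no effect required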


The modern approach, of which this textbook is an example, does start with simulation. In fact there is very little classical statistics (distributions, analytic tests) in the book. The Berkeley Data 8 book, which I link to in another comment, takes the same approach. I imagine there is still too much classical material for your tastes, but there is definitely change happening.


> that much of their research can be shown a priori to not be worthwhile

Bingo. Cargo-cult stats all the way down. It's not just personal interest, it's the entire field, their colleagues, mentors, and students. Good luck getting somebody to see the light when not just their own income depends on not seeing it; their whole world depends on the "stat recipes" handed down from granny.


I think the egotistical aspect is the most powerful: many researchers have built an identity based on the fact that they “know” something, so to propose better alternatives to their pet theories is tantamount to proposing their life is a lie. To change their mind they need to admit they didn’t “know”.

The better the alternatives, the more fierce the passion with which they will be rejected by the mainstream.


I now think it's best explained by simple economics. Academia and academics are the product of economic forces, by and large. It's not quirky personalities or uniquely talented minds that make up academia today. It's droves of conscientious (in the Big Five sense) conformists, with either high IQ or mere socio-economic privilege, who have been trained by our society to feel that financial security means college, and even more financial security means even more college. Credentials are like alpha = .05: they solve a scale problem in a way that alters the quality/quantity ratio. If you want more researchers/research/science output, credentials and alpha = .05 cargo-cult stats are your levers to get more quantity at lower quality.


It seems like a reasonable critique. The suggestion is to include such ideas while people are taking introductory statistics, which isn't inappropriate. I wouldn't suggest forcing students to code up their own simulations from scratch, but creating a framework where students can plug in various formulas for each population, attach a statistical test, and then run various simulations could do quite a bit. However, what kinds of formulas students are told to plug in are important.

If every formula produces bell curves, then that's a failure to educate people. 50d6 vs 50d6 + 1 is easy enough; you can include 1d2 * 50 + 50d6 for a bimodal distribution, but also significantly different distributions which then fail various tests, etc. (a sketch appears at the end of this comment).

I've seen people correctly recall the formulas for statistical tests from memory and then wildly misapply them. That seems like focusing on the wrong things in an age when such information is at everyone's fingertips, but an understanding of what that information means isn't.
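
A sketch of that dice setup in R (simulating the rolls directly; only the dice expressions come from the comment above):

    set.seed(1)
    roll <- function(n, sides) sum(sample(sides, n, replace = TRUE))

    a <- replicate(5000, roll(50, 6))                        # 50d6: bell-shaped
    b <- replicate(5000, roll(50, 6) + 1)                    # 50d6 + 1: shifted
    m <- replicate(5000, sample(1:2, 1) * 50 + roll(50, 6))  # bimodal

    t.test(a, b)$p.value  # easily detects the +1 shift at this sample size
    hist(m, breaks = 40)  # two humps: comparing means says nothing about shape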


I work in applied ML and stats. Whenever a client gets pushy about getting a prediction and doesn't care about quantifying the uncertainty around it, I take it as a signal to disengage and look for better pastures. It is really not worth the time, even more so if you value integrity.

Competent stakeholders and decision makers use the uncertainty around predictions, i.e. the chances of an outcome different from the point-predicted one, to come to a decision, and the plan includes what the course of action should be should the outcome differ from the prediction.


Model building, at large, is the thing I regret being bad at. Model your problem and then throw inputs at it and see what you can see.

It sucks, as we seem to have taught everyone that statistical models are somehow unique models that can only be made to get a prediction, to the point that we have hard delineations between "predictive" models and other "models".

I suspect there are some decent ontologies there. But, at large, I regret that so many won't try to build a model.


> The amount of data needed to tell between ... two power-law distributions, >trillion.

I don't agree with this as a statement of fact (except in the obvious case of two power-law distributions with extremely close parameters). Supposing it were true, that would mean that you would almost never have to actually worry about the parameter, because unless your dataset is that large, one power law is about as good as any other for describing your data.


>> between two power-law distributions, >trillion

Do you have anywhere I can read more about this? I would have assumed that a trillion data points would be sufficient to compare any two real-world distributions


I am a noob and I've always gotten stuck on comparing two independent means. Assumption: normality. Yeah, data is never normal in my bakery.


The sample means approach the normal distribution as the sample size grows, even if the underlying distributions are not normal. That's the central limit theorem.

(Requires some very lax assumptions like finite variance on the underlying distribution)
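
A quick way to see this in R, with an exponential distribution standing in for decidedly non-normal data (the choice of distribution and the sizes are mine):

    set.seed(1)
    # The raw data is strongly skewed, nothing like a normal.
    hist(rexp(1000), breaks = 40)

    # But the means of samples of size 30 are already close to bell-shaped.
    means <- replicate(5000, mean(rexp(30)))
    hist(means, breaks = 40)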


They should remove "modern" from the title, because who the hell uses the "R programming language" these days?


A lot of people... in fact a huge portion of statisticians, epidemiologists, and econometricians use it as their primary language.

I do genetic epidemiology (which is considerably more compute-intensive than regular epidemiology), and R is still the most common language, with the most libraries and packages being written for it, compared to Python for example.

I think maybe you should consider being less forthcoming with your opinions on topics which you are not well informed on.


I worked in data science for a few start-ups, and even though I know Python (it's my LeetCode language of choice), R just dominates when it comes to accessing academic methods and computational analysis. If you are going to push the boundaries of what you can and can't analyze for statistical effects, and leverage academic learnings, it's R.


Before I knew the command line, I tried to install Python and spent the next 3 days resolving an installation issue with 'wheel'.

By contrast, from first downloading R to running my first R script took about 1 hour (the most difficult part was opening the 'script' pane in RStudio IDE, which doesn't open by default on new installations, for some reason).

There's huge demand out there for statistical software that's accessible to people whose primary pursuit is not programming/CS, but genetics, bioinformatics, economics, ecology and other disciplines that necessitate tooling much more powerful than Excel, but with barriers to entry not much greater than Excel. R is a fairly amazing fit for those folks.


R and CRAN really get package management right. Even as a very infrequent R user, there are no surprises, it "just works". Compare that to my daily Python usage where I am continually flummoxed by dependency issues.


Strong disagree, there's a reason RStudio/Posit are spending so much time trying to develop 3rd party alternatives to install.packages() and CRAN.

Try installing an older version of a package without it pulling in the most recent incompatible dependencies; it's a whole adventure.


Everyone in my branch of toxicology? Tons of people in the biological sciences. Just because you have a bias against the tool and don't run in the same circles doesn't mean that R isn't used and loved by a subset of devs.


Respectfully, I'm going to ask, "what what?". I can't swing a cat without hitting dplyr. It's probably industry-dependent though - I could see a dataset that's 99% text having absolutely no reason to even look at R at all.


Probably most people who do statistics.

R sucks as a language, but it excels at that specific application, just because of its tremendous ecosystem (putting even Python to shame in some niche areas).


R is fine; it's no more absurd than other non-typed languages like JavaScript. Most languages are very good at one or two things, then not so good or appropriate for other tasks. For R, that's statistics, modeling, and exploratory analysis, which it absolutely crushes due to ecosystem effects.


Well… I also consider JavaScript to be a horrible language. Python is horrible as well, but better than R. IMO Python and JavaScript are in the same ballpark.

Not all non-typed languages are bad. Clojure, for example, is one of the most elegant languages I've worked with (despite my dislike of the JVM).


Statisticians do. The Berkeley curriculum, which I've linked to in another comment, uses Python.


FYI, many state-of-the-art statistical libraries exist (or are properly maintained) in R only.


I find it depends on what you want. There is no canonical GAM (generalized additive model) library in Python, but there are a few options, which are not easy to use. The statsmodels GAM implementation appears to be broken. R, of course, has a stupidly easy-to-use GAM library that is pretty fast.

On the other hand, R has too many obscure options for what I can find in scipy or sklearn. So I find it easier to just jump into sklearn, use the very nice unified interface "pipelines" to churn through a whole bunch of different estimators without having to do any munging on my data.

So I think it just depends on your field. But R seems to stick more with academia.
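
For what it's worth, the R library in question is presumably mgcv (a recommended package that ships with R); a minimal sketch with simulated data (mine):

    library(mgcv)
    set.seed(1)
    d <- data.frame(x = runif(200))
    d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.3)

    fit <- gam(y ~ s(x), data = d)  # smoothness is chosen automatically
    summary(fit)
    plot(fit)                       # recovers the sine shape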


Everyone but you. Check any statistics journal. Only a few people developing methods switched to Python or Julia.


Most people in bioinformatics.



