Most commonly used statistical tests and implementation in R

IndianAstronaut · on Jan 1, 2016

Shapiro Wilk isn't all that useful with practical data unless your sample sizes are fairly small. Once you deal with anything above 5000 values, you are better off with QQ plots.

ekianjo · on Jan 1, 2016

> If the p-Value is less than significance level (ideally 0.05),

Erm, no. P=0.05 is borderline meaningless, there could as much as 30% chance you are wrong about the actual difference being there depending on the true probability of the initial hypothesis.

P-values should be used with strong caution.

snydly · on Jan 1, 2016

> P-values should be used with strong caution.

FiveThirtyEight (and Scientific American, and others) did some pretty interesting articles about this recently if you haven't seen it:

http://fivethirtyeight.com/features/science-isnt-broken/

Just from personal experience, the use of p-values is really broken in biology/chemistry. The things I've heard principal investigators say...

cetacea · on Jan 1, 2016

Even better, p-values should not be used at all. If I have data in hand, I want to use it to find out the probability that my hypothesis is true. But p-value analysis requires me to instead ask a different question that I don't really care about, involving whether my data are consistent with the null hypothesis.

Everything is just so much more sensible if you allow yourself to assign probabilities to hypotheses, rather than assuming a hypothesis from the outset and computing opaque statistics relating to your data.

healer · on Jan 1, 2016

There is in fact a probability attached to p-values. A p-value of 0.05 for instance means your conclusions will be wrong 5 out of 100 times. You can reduce the p-value to e.g. 0.001 or any other value you want.

cwyers · on Jan 1, 2016

No, it means that the probability of seeing an effect of that magnitude on a dataset of that size when the null hypothesis is true will happen due to random chance 5 out of 100 times. It says NOTHING about your hypothesis, it is entirely a statement about the null hypothesis.

GFK_of_xmaspast · on Jan 1, 2016

> could as much as 30% chance you are wrong about the actual difference being there depending on the true probability of the initial hypothesis.

I'm having trouble parsing this, are you talking about the power of the test?

capnrefsmmat · on Jan 1, 2016

No, the base rate. If you're doing a significance test to detect a rare disease that only occurs in a tiny fraction of people, for example, the majority of statistically significant results you obtain will be false, despite having p < 0.05.

This is a common problem in many fields, and you can use false discovery rate control methods to account for it: http://www.statisticsdonewrong.com/p-value.html

minimaxir · on Jan 1, 2016

It's also worth looking at the documentation in R for each of the functions too. (can invoke with console with ?chisq.test for example).

For example, the chisq.test has optional built-in Monte Carlo testing, and none of the other functions do, oddly.

cloakanddagger · on Jan 1, 2016

This is a great post! Bookmarking this for future reference.

hackaflocka · on Jan 1, 2016

This is a good resource for those new to R.

R has some really good GUI layers now. I struggled and struggled for years trying to learn the command line methods, but it was too much for me. The following do a great job (these are alternatives)

- Deducer

- R Commander

- RKWard

earino · on Jan 1, 2016

It seems like this list is incomplete without mentioning that both RStudio[1] and Jupyter[2] notebooks now have really first class support for R. There are also two upstatrs, Rodeo[3] and Beaker[4] are doing cool stuff as well.

The company I work for, Domino Data Lab[5], let's you fire up a lot of these notebooks in a nice hosted environment on big cloud servers with minimal cost and effort. It's a fun way to learn how all these new environments can work together. From RStudio for exploratory analysis, to Jupyter notebooks for presenting a topic. The other two I haven't really found the superior use-case. The tools in this space are just getting better and better.

1. https://www.rstudio.com/ 2. http://jupyter.org/ 3. http://blog.yhat.com/posts/introducing-rodeo.html 4. http://beakernotebook.com/ 5. https://www.dominodatalab.com/

minimaxir · on Jan 1, 2016

> Jupyter[2] notebooks now have really first class support for R.

Jupyter and R is a bit iffy since the R kernel is not native. Although the kernel works fine, setting it up has a ton of manually-installed dependencies, and in-line plots flat-out give unexpected output. (I've had to cheat by embeding charts via Markdown. Although that has the benefit of having the charts be responsive)

The important perk is that Jupyter notebooks are now rendered natively on GitHub, which I've made considerable use of: https://github.com/minimaxir/sf-arrests-when-where/blob/mast...

earino · on Jan 1, 2016

> Jupyter and R is a bit iffy since the R kernel is not native. Although the kernel works fine, setting it up has a ton of manually-installed dependencies, and in-line plots flat-out give unexpected output. (I've had to cheat by embeding charts via Markdown. Although that has the benefit of having the charts be responsive)

You know, to be completely honest, I've never used it directly. I've always used it on our platform. It's very possible that our engineers already did all that setup so it "just works." I took the original post: http://r-statistics.co/Statistical-Tests-in-R.html and reimplemented it in an R notebook with some simple plots at the end, but yeah, the plotting just sort of works for me. I didn't realize I had an incomplete view of the complexity of getting that working :(

https://app.dominodatalab.com/earino/statistical_tests/view/...

We also render the notebooks. The difference is that we also let you run them :)

stared · on Jan 1, 2016

Manually setting it is hard (on OS X + Homebrew Python I did it after a long fight; main problem: rmzq library). But... it is super easy with Anaconda: https://www.continuum.io/blog/developer/jupyter-and-conda-r

minimaxir · on Jan 1, 2016

Huh, I thought conda was Python only. I'll definitely take a look!

IndianAstronaut · on Jan 1, 2016

Package installation can be a bit of an issue as well, especially if you accidentally install a package twice. But overall, I still prefer notebooks to Rstudio. They are transparent and you can really trace your progress and share the info with others.

TheLogothete · on Jan 1, 2016

And you can't with Rstudio? There's a freaking notebook feature built right in.

IndianAstronaut · on Jan 1, 2016

Much easier to turn the Jupyter noteboks into a clean html format. Also, it's sequential, with Rstudio it's nice, but doesn't have the cell format.

hackaflocka · on Jan 1, 2016

Does R Studio have a bunch of GUI plugins for doing the various common statistics tasks? Because the base R Studio doesn't do much (it's nice for running R, but I don't think that one can do linear regressions etc. via a GUI -- correct me if I'm wrong).

jupiter90000 · on Jan 1, 2016

No that's not really what it does. You can edit code separately from the REPL(which is also directly available), view plots, examine some data objects, view help/command history/etc. Essentially it's like an IDE for the R language, it doesn't turn R into like an SPSS GUI type interface.

Edit: the closest thing I can think of in RStudio to that is installing the manipulate package which allows adding sliders and such to plots for some custom plotting controls.

nafizh · on Jan 1, 2016

For custom plotting controls, you can just use shiny. It is awesome. Check out the shiny gallery and you will see that.

sandGorgon · on Jan 1, 2016

What do you recommend to build business dashboards in R? By pulling data from an api or sb for example

earino · on Jan 1, 2016

Two approaches:

1. If you're interested in running a shiny server, use http://rstudio.github.io/shinydashboard/! I have used it to build professional high quality dashboards VERY quickly.

2. You can use an API server like Domino's API end points or OpenCPU to expose R APIs and build the interface using JavaScript at plot.ly! This really can be incredibly elegant and you can do really neat dynamic dashboards.

sandGorgon · on Jan 1, 2016

That is great to know!

Is opencpu something you recommend for production? I'm just starting to work with analysts who work in R, and I have struggled with the question whether we should wrap existing R code as an api...or port to python.

Low volume right now, so not really concerned with performance... But rather that can R deployment play nicely with things like supervisord,etc in production

earino · on Jan 1, 2016

Well, I really don't want to turn this into a sales pitch, but that's the exact use case for Domino's API endpoints. Check out http://support.dominodatalab.com/hc/en-us/articles/204173149... for an explanation. If you're interested, drop me an email and I can work up an example project for you. Exposing R algorithms as REST endpoints is exactly what it does quite well.

As for OpenCPU, I know that the guy who wrote it, Jeroen Ooms is genuinely quite brilliant. I know it was his project during his PhD, and I don't know what his plans are for continuing to support it. It's up to you to determine what that means for your "production" needs.

sandGorgon · on Jan 1, 2016

Interesting. Did you post about this on /r/rprogramming

There are lots of people who have rolled their own solutions for production deployment. Including nodejs !

earino · on Jan 1, 2016

Greetings!

So I made a small demo for you. If you go to https://app.dominodatalab.com/earino/d3_dashboard_demo/raw/6... you will see a very simplistic, d3 powered, R backed dashboard. It just draws a pie chart, a line chart, and a bar chart. Every 10 seconds, it polls an R API endpoint to get new data.

The code for the R endpoint is https://app.dominodatalab.com/earino/d3_dashboard_demo/view/.... It's a simple R function which generates synthetic data. It could be used to pull much more complex data, generate predictions from an ML model, etc...

Please don't hesitate to reach out if you have any questions!

sandGorgon · on Jan 1, 2016

this is awesome! you should put stuff like this on the front page of your site. It makes it incredibly simple to understand what you are about.

phillc73 · on Jan 1, 2016

I've also used Shiny to construct fairly complicated web apps, with dashboards (using shinydashboard). It's very good, up to a point.

I keep banging my head against issues around persistent data storage and app customisation at a user level. Unless one pays for Shiny Server Pro, the free Shiny Server doesn't support user authentication. Hosting on shinyapps.io doesn't really support persistent user data, unless it's offloaded elsewhere such as Dropbox or a remote SQL database, which brings into play a bunch of security questions.

Shiny is good, but not quite yet outstanding.

I was really interested in an article I read recently about using jQuery.ui widgets and R to build interactive web app.[1] I'm keen to explore this as a potential way forward.

[1] http://www.r-bloggers.com/creating-multi-tab-reports-with-r-...

sandGorgon · on Jan 1, 2016

That's interesting to know. The reason shinydashboard or something else a nice is because the data scientist can work on his own. Expecting him to learn js would be a dead end.

I am willing to bet that he would rather use R api and excel to build a dashboard, rather than anything else.

phillc73 · on Jan 1, 2016

If you just want a personal or internal dashboard, then Shiny is a good solution, but you will need to deal with setting up and managing Shiny Server. If you don't care about the world viewing the dashboard, or can afford a small fee each month, then shinyapps.io is a decent solution.

I just wanted to point out that there are limitations with this route, when the apps start to become more complex, with multiple users, various access permissions and personalisation requirements.