Shapiro-Wilk isn't all that useful with practical data unless your sample sizes are fairly small. Once you're dealing with anything above 5000 values, you're better off with QQ plots.
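To make that concrete, a rough sketch (my own example, not from the article): R's shapiro.test() only accepts up to 5000 observations anyway, and at that size it flags departures you'd never care about, while a Q-Q plot scales to any n.

    set.seed(42)
    x <- rnorm(1e4)

    # shapiro.test() only accepts 3..5000 observations, so you have to subsample,
    # and at n = 5000 it calls tiny, practically irrelevant deviations "significant"
    shapiro.test(sample(x, 5000))

    # A normal Q-Q plot works for any sample size and shows *where* the tails deviate
    qqnorm(x); qqline(x)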
> If the p-Value is less than significance level (ideally 0.05),
Erm, no. P=0.05 is borderline meaningless: depending on the prior probability of the initial hypothesis, there could be as much as a 30% chance you are wrong about the actual difference being there.
Even better, p-values should not be used at all. If I have data in hand, I want to use it to find out the probability that my hypothesis is true. But p-value analysis requires me to instead ask a different question that I don't really care about, involving whether my data are consistent with the null hypothesis.
Everything is just so much more sensible if you allow yourself to assign probabilities to hypotheses, rather than assuming a hypothesis from the outset and computing opaque statistics relating to your data.
There is in fact a probability attached to p-values. A p-value of 0.05 for instance means your conclusions will be wrong 5 out of 100 times. You can reduce the p-value to e.g. 0.001 or any other value you want.
No, it means that when the null hypothesis is true, an effect of that magnitude on a dataset of that size will happen due to random chance 5 out of 100 times. It says NOTHING about your hypothesis; it is entirely a statement about the null hypothesis.
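You can see this with a quick simulation (my sketch, using a two-sample t-test as a stand-in): when the null is true, about 5% of experiments still come out "significant" at p < 0.05.

    set.seed(1)

    # 10,000 experiments where the null hypothesis is true:
    # both groups are drawn from the same distribution
    p_values <- replicate(10000, t.test(rnorm(30), rnorm(30))$p.value)

    # Fraction of "significant" results obtained purely by chance -- roughly 0.05
    mean(p_values < 0.05)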
No, it depends on the base rate. If you're doing a significance test to detect a rare disease that only occurs in a tiny fraction of people, for example, the majority of statistically significant results you obtain will be false, despite having p < 0.05.
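Back-of-the-envelope version (numbers are illustrative, assuming 1% prevalence, 80% power, alpha = 0.05):

    prevalence <- 0.01  # 1% of people tested actually have the disease
    power      <- 0.80  # P(significant result | disease present)
    alpha      <- 0.05  # P(significant result | disease absent)

    true_pos  <- power * prevalence        # 0.008
    false_pos <- alpha * (1 - prevalence)  # 0.0495

    # Share of significant results that are false positives: ~0.86
    false_pos / (true_pos + false_pos)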
R has some really good GUI layers now. I struggled and struggled for years trying to learn the command line methods, but it was too much for me. The following do a great job (these are alternatives)
It seems like this list is incomplete without mentioning that both RStudio[1] and Jupyter[2] notebooks now have really first-class support for R. There are also two upstarts, Rodeo[3] and Beaker[4], doing cool stuff as well.
The company I work for, Domino Data Lab[5], lets you fire up a lot of these notebooks in a nice hosted environment on big cloud servers with minimal cost and effort. It's a fun way to learn how all these new environments can work together, from RStudio for exploratory analysis to Jupyter notebooks for presenting a topic. For the other two I haven't yet found the superior use case. The tools in this space are just getting better and better.
> Jupyter[2] notebooks now have really first-class support for R.
Jupyter and R is a bit iffy since the R kernel is not native. Although the kernel works fine, setting it up involves a ton of manually installed dependencies, and in-line plots flat-out give unexpected output. (I've had to cheat by embedding charts via Markdown, although that has the benefit of making the charts responsive.)
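For context, the basic setup I had to stitch together looked roughly like this (package names from memory -- IRkernel may have moved to CRAN since, so treat this as a sketch rather than the official instructions):

    # Install the R kernel for Jupyter (was GitHub-only for a long time)
    install.packages("devtools")
    devtools::install_github("IRkernel/IRkernel")

    # Register the kernel so Jupyter can find it for the current user
    IRkernel::installspec(user = TRUE)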
> Jupyter and R is a bit iffy since the R kernel is not native. Although the kernel works fine, setting it up involves a ton of manually installed dependencies, and in-line plots flat-out give unexpected output. (I've had to cheat by embedding charts via Markdown, although that has the benefit of making the charts responsive.)
You know, to be completely honest, I've never used it directly. I've always used it on our platform. It's very possible that our engineers already did all that setup so it "just works." I took the original post: http://r-statistics.co/Statistical-Tests-in-R.html and reimplemented it in an R notebook with some simple plots at the end, but yeah, the plotting just sort of works for me. I didn't realize I had an incomplete view of the complexity of getting that working :(
Package installation can be a bit of an issue as well, especially if you accidentally install a package twice. But overall, I still prefer notebooks to RStudio. They are transparent, and you can really trace your progress and share the info with others.
Does RStudio have a bunch of GUI plugins for doing the various common statistics tasks? Because base RStudio doesn't do much (it's nice for running R, but I don't think one can do linear regressions etc. via a GUI -- correct me if I'm wrong).
No, that's not really what it does. You can edit code separately from the REPL (which is also directly available), view plots, examine some data objects, view help/command history/etc. Essentially it's an IDE for the R language; it doesn't turn R into an SPSS-style GUI interface.
Edit: the closest thing I can think of in RStudio is installing the manipulate package, which allows adding sliders and such to plots for some custom plotting controls.
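The canonical example from the manipulate docs is only a few lines (it only works inside RStudio, which supplies the slider UI):

    library(manipulate)

    # Re-draws the plot every time the slider moves
    manipulate(
      plot(cars, xlim = c(0, x.max)),
      x.max = slider(15, 30)
    )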
1. If you're interested in running a Shiny server, use http://rstudio.github.io/shinydashboard/! I have used it to build professional, high-quality dashboards VERY quickly (minimal sketch after this list).
2. You can use an API server like Domino's API endpoints or OpenCPU to expose R APIs and build the interface with JavaScript and plot.ly! This really can be incredibly elegant, and you can do really neat dynamic dashboards.
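For a taste of option 1, a minimal shinydashboard app is only a dozen lines or so (my own toy example, not one of the dashboards I mentioned):

    library(shiny)
    library(shinydashboard)

    ui <- dashboardPage(
      dashboardHeader(title = "Demo dashboard"),
      dashboardSidebar(sliderInput("n", "Observations", min = 10, max = 500, value = 100)),
      dashboardBody(fluidRow(box(plotOutput("hist"), width = 12)))
    )

    server <- function(input, output) {
      output$hist <- renderPlot(hist(rnorm(input$n)))
    }

    shinyApp(ui, server)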
Is OpenCPU something you recommend for production? I'm just starting to work with analysts who work in R, and I have struggled with the question of whether we should wrap existing R code as an API... or port to Python.
Low volume right now, so I'm not really concerned with performance... but rather with whether an R deployment can play nicely with things like supervisord, etc., in production.
Well, I really don't want to turn this into a sales pitch, but that's the exact use case for Domino's API endpoints. Check out http://support.dominodatalab.com/hc/en-us/articles/204173149... for an explanation. If you're interested, drop me an email and I can work up an example project for you. Exposing R algorithms as REST endpoints is exactly what it does well.
As for OpenCPU, I know that the guy who wrote it, Jeroen Ooms, is genuinely quite brilliant. I know it was his project during his PhD, but I don't know what his plans are for continuing to support it. It's up to you to determine what that means for your "production" needs.
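If it helps, the OpenCPU calling convention is simple enough to try from R itself, roughly like this against their public demo server (sketch from memory of the docs, so double-check the URL scheme):

    library(httr)

    # POSTing to /ocpu/library/<pkg>/R/<function>/json runs the function
    # with the form parameters and returns the result as JSON
    res <- POST("https://cloud.opencpu.org/ocpu/library/stats/R/rnorm/json",
                body = list(n = 5, mean = 10), encode = "form")

    content(res)  # parsed list of 5 draws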
So I made a small demo for you. If you go to https://app.dominodatalab.com/earino/d3_dashboard_demo/raw/6... you will see a very simplistic, d3-powered, R-backed dashboard. It just draws a pie chart, a line chart, and a bar chart. Every 10 seconds, it polls an R API endpoint to get new data.
I've also used Shiny to construct fairly complicated web apps, with dashboards (using shinydashboard). It's very good, up to a point.
I keep banging my head against issues around persistent data storage and app customisation at a user level. Unless one pays for Shiny Server Pro, the free Shiny Server doesn't support user authentication. Hosting on shinyapps.io doesn't really support persistent user data, unless it's offloaded elsewhere such as Dropbox or a remote SQL database, which brings into play a bunch of security questions.
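The workaround I keep circling back to is offloading persistence to a remote database and treating the Shiny app as stateless, roughly along these lines (hypothetical host and credentials, assuming the DBI/RPostgres packages):

    library(DBI)

    # Remote Postgres acts as the persistent store that outlives each shinyapps.io session
    con <- dbConnect(RPostgres::Postgres(),
                     host     = "db.example.com",          # hypothetical host
                     dbname   = "appdata",
                     user     = "shinyapp",
                     password = Sys.getenv("DB_PASSWORD")) # kept out of the app code

    save_responses <- function(df) dbWriteTable(con, "responses", df, append = TRUE)
    load_responses <- function()   dbReadTable(con, "responses")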
Shiny is good, but not quite yet outstanding.
I was really interested in an article I read recently about using jQuery.ui widgets and R to build interactive web apps.[1] I'm keen to explore this as a potential way forward.
That's interesting to know. The reason shinydashboard or something like it is nice is that the data scientist can work on his own. Expecting him to learn JS would be a dead end.
I am willing to bet that he would rather use an R API and Excel to build a dashboard than anything else.
If you just want a personal or internal dashboard, then Shiny is a good solution, but you will need to deal with setting up and managing Shiny Server. If you don't care about the world viewing the dashboard, or can afford a small fee each month, then shinyapps.io is a decent solution.
I just wanted to point out that there are limitations with this route, when the apps start to become more complex, with multiple users, various access permissions and personalisation requirements.