How R Took the World of Statistics by Storm (statisticsviews.com)
98 points by mindcrime on Nov 22, 2015 | 48 comments



I've regretted that Octave hasn't done for Matlab what R did for S. I understand some of the circumstances that made this happen, but I'm deeply saddened by the entrenchment that Matlab has in scientific computing. It's getting chipped away little by little at the edges by Python, and to a lesser extent by Julia, but Matlab is still strong. And yes, some uses of Matlab can be replaced by R, but overall the two packages target different problem domains.

I'm taking a break from Octave, but I plan to come back to it and take Matlab head-on, not chip away at the edges.


I think the main problem Octave faces is Matlab's libraries (such as Simulink), which have no comparable open-source or even free alternatives. That, and writing Matlab code sucks for anything bigger than 50 or 100 lines.

Anyone who isn't using those could probably switch to Octave with few issues, but last time I checked, some functions were starting to differ (see eig).

Octave is nice when you really need Matlab-compatible files but don't have a Matlab license. But now, everyone I know in the sciences who isn't using Matlab and doesn't need Fortran/C/C++ is using Python, because why keep the Matlab syntax when you can gain so much by switching to another language?


I used OpenModelica a number of years ago as a simulink replacement on a project which couldn't afford MATLAB.

https://openmodelica.org/

EDIT: Also, although it's not opensource, I had used Mentor SystemVision eons ago in college. It now has a free web interface. I haven't tried it, but honestly, what could be worse than DxDesigner ;-)

https://www.systemvision.com/


I think Julia stands a really good chance of "chipping away" at Matlab to a significant degree once it gets to a stable version 1.0. The syntax is close enough to Matlab's to make the switch fairly easy, it's got the performance, and the community has already developed a lot of cool scientific computing libraries even though each new version has breaking changes.


I think when Julia gets to 1.0, Matlab will begin to decline in use. I also think MathWorks is well aware of this "threat". The next few years will be interesting.


I'm optimistic about Julia also as a general purpose programming language. It seems to have taken a lot of good features from a lot of different languages, and managed to do it in a way that feels natural to me.


As someone who had to use a ton of Matlab in grad school, I think Julia's (and Python's) biggest issue in overtaking Matlab is replicating the large collection of battle-tested add-ons that Matlab offers.

Matlab is a crappy language, but a productive environment.


> I understand some of the circumstances that made this happen

What are they?


The story has fallen off the front page, so there's no point in writing a detailed response to this. If you're still interested, email me at jordigh@octave.org for my analysis.


Compared to Matlab, I liked Octave a lot. The Koctave GUI was great.


> I've regretted that Octave hasn't done for Matlab what R did for S

R was much more programmable than the other systems (except S) at the time -- while the R language is not pretty, the scripting languages in SAS, Stata, etc. were much worse. So R provided an improvement in people's workflow. Octave, by contrast, is just a free version of the same language as Matlab (with some small improvements).

So, switching from SAS, Stata, SPSS etc. to R provided an improvement in productivity, but switching from Matlab to Octave does not.


You can make the same comparison between Fortran and Octave as you do between SAS and R, and Octave at first did not even aim to be Matlab-compatible. Matlab compatibility came years later, as people using Matlab requested it.

This is not the reason why Octave has not overtaken Matlab. Maybe I should write a blog post about it.


What are the problem domains? Do you think something like Python's pandas, and especially Jupyter/IPython, can replace it?


Python has a steeper learning curve and is not as tailored to simple data analysis. Many use RStudio (an IDE) with its data-import and other tools, which lowers the skill barrier to entry even further.

Also, mathematicians and statisticians think functionally. The general attitude in Python is to do object-oriented programming, while R is primarily functional programming with a little bit of object-oriented programming.
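
For instance, idiomatic R applies functions over whole data structures rather than writing explicit loops (a minimal sketch using base R; the data here is made up):

    # functional style: map a function over a list instead of writing a loop
    scores <- list(a = rnorm(10), b = rnorm(10))
    means  <- sapply(scores, mean)   # named numeric vector of per-group means
    # same idea with an anonymous function
    z <- lapply(scores, function(x) (x - mean(x)) / sd(x))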


I'm a little at odds on this: for production-quality analyses (and only for analyses), R is excellent.

However, in my experience, for the data munging required as a preliminary to the analyses, R is worse than bad. It's as if Satan himself designed a language.

I find that what then happens is this: data scientists/statisticians/[your favorite word here] become reliant on programmers to clean/format the data to do the analyses.

This is all fine, but those same scientists are then put off learning Python, where they could do all of their own munging, and probably 95% of the analysis they need to do, and where they could further add value by writing programs that are easier to productionize.

Job security for those who know how to write production code, I guess.


Is your assessment based on using recent R packages? I learned about dplyr, magrittr and rvest in a couple of recent data science courses, and it seems to me that data munging is a pleasure with R. For example, I had a rough time scraping Wikipedia using Python/BeautifulSoup (I might be a little weak with them, tbh) but knocked it out with rvest and magrittr. I never wrote it up, but this guy[1] did something similar and wrote a nice post about it.

[1] http://opiateforthemass.es/articles/james-bond-film-ratings/
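
A minimal sketch of that kind of scrape (the URL and CSS selector here are illustrative, not taken from the linked post):

    library(rvest)
    library(magrittr)
    # grab the first wikitable on a Wikipedia page and parse it as a data frame
    page  <- read_html("https://en.wikipedia.org/wiki/List_of_James_Bond_films")
    films <- page %>%
      html_node("table.wikitable") %>%
      html_table(fill = TRUE)
    head(films)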


I couldn't disagree more. R is great at munging pretty much everything but unstructured textual data. The tools are definitely behind Python if you're dealing with literal written documents.

I don't know anyone who considers themselves a "data scientist" of any sort that doesn't view their job as 80% or more data wrangling/munging/cleaning.

I write production ETL processes in R at my current job. AMA.


May I ask what tools you favor in the R environment? I just haven't found anything as performant for operations on irregular and poorly formatted time series as the pandas library, and in fact I just finished an ETL in pandas for my current job.

I'm always interested in learning a new tool, though.


I don't work much with data that would benefit from being very tight about datetimes as a dimension. I'd have to know a bit more about what was challenging before I could confidently make a recommendation for your particular case. My email is on my profile and I'd be happy to chat there if that would be helpful.

I have largely avoided ts, zoo, etc where possible. Time series stuff seems to have a lot of specialized tooling all of which tends to be much more strict about data structure than I'm comfortable with for my flow.


I might be misremembering, but I think it was assumed for a while that Perl would be used for data munging, so there wasn't much effort put into that part of the language. That's being addressed now by packages, and a lot of the uptake in R seems to coincide with work by R developers to make the language less hostile to new users. (Have you used the bundled IDE? Satan's work again.)

BUT Perl + R was a really nice combination for a while.


Depends on your coworkers. When I was in school, my professors couldn't do anything if it wasn't in a nice CSV file. But us data scientists / statistical programmers are well versed in digesting data in almost any form, especially a database. When on a new project, I just get handed new IP addresses and login information and I am off.


What do you mean by data munging? Things like extracting data from XML files for instance?


It can mean anything, and that is why it is hard. Many old-school statisticians cannot work with anything other than CSV files, Excel spreadsheets or basic SQL queries. Munging is the conversion to a nice format that can then be used for analysis.
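
With the newer packages, a typical cleanup pass might look something like this (a minimal sketch; the file and column names are hypothetical):

    library(dplyr)
    raw <- read.csv("survey.csv", stringsAsFactors = FALSE)
    clean <- raw %>%
      filter(!is.na(age)) %>%                                   # drop bad rows
      mutate(age_group = cut(age, breaks = c(0, 18, 65, Inf))) %>%
      group_by(age_group) %>%
      summarise(n = n())                                        # tidy summary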


Understood. Yeah, this sounds like a job for awk or some such specialised tool.


I wouldn't be surprised if it did. Lots of universities are slowly switching to Jupyter as the primary computational environment.


Numpy/Scipy (and Julia, maybe) are taking this role. There's really no reason to use Matlab these days.


There are a lot of reasons to use Matlab these days.

A lot of packages are built for Matlab and not for Python. If you are a PhD student, you want to use those packages, not write your own; and if you really, really need to write some software, you want to write it on top of something that already exists...

It is really sad, but it is the reality...


If only that were true. There are many, many reasons to keep using Matlab, mostly toolboxes and decades of software written on top of it. As I said in another comment, Matlab is a bad language, but a productive environment.

In grad school, if I wanted to do my own EEG connectivity analyses, I could just include the Signal Processing Toolbox, the Stats Toolbox, and crunch my own numbers. Or, if I wanted to do a more standard analysis of my fMRI or EEG data, I would turn to the world's most popular open-source toolkits (SPM and Fieldtrip), both of which require... you guessed it, Matlab.

The only place I ever found Matlab's libraries deficient for my needs was in machine learning. (I ended up doing an SVM-based searchlight fMRI analysis in Python.)

There's a lot of lock-in and a lot of quality toolboxes around Matlab; Python/Julia won't knock it over yet, though I wish them the best of luck.


I program in R 80% of my day. I have experience with all the major alternatives but keep returning to R. It has one huge flaw, being slow, but otherwise it is fantastic to work with and has a vibrant community.

The bigger issue is that while R is liked by statisticians, it lacks many of the features needed for software development. We run into difficulties with logging, version control of packages, speed, size of Docker images, build time, etc. But even with these drawbacks I keep coming back, because I develop faster and better in R.


I agree, but this stuff is getting better. I'm actually considering breaking away from the Rocker-derived images because they get so big; I'm pretty sure I could maintain a faster build myself. The problem is I haven't used R locally on Linux for a long time, and the split between packages from the OS package manager and packages from R can make dependency management a bit tricky.

packrat has helped a lot with version control of packages, but it still doesn't quite feel like the right solution.
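
For anyone who hasn't tried it, the basic packrat workflow is roughly (a sketch; the project path is hypothetical):

    library(packrat)
    packrat::init("~/projects/my_app")   # private per-project package library
    # ...install or upgrade packages as usual...
    packrat::snapshot()                  # record exact versions in packrat.lock
    packrat::restore()                   # rebuild that library on another machine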

I've been really impressed by how far R has come in these areas in the last 5 years though, so like you I keep coming back. By the time I start getting over the learning curve elsewhere, R seems to have developed better tooling for what I want to accomplish anyway, and I can come back and write cleaner, clearer, better software faster in R.


I agree with your assertion that R is slow, yet quick to develop in.

I recently had to loop through 1.3 GB of data (5000 files) and merge just one column from each file into a new dataset. It did so in ~2 hours, yet the loop was just ~5 lines of code.


This task sounds almost uniquely ill-suited to R, though this has gotten better. For example, adding a column (did you append to the right or do an actual merge/join?) used to require copying the previous table, but doesn't any more.

I wonder if you tried doing things like:

* Preallocate a list, then do.call(cbind, your_data) (sketched below)

* Same as above, but with some of the faster alternatives to cbind, like dplyr::bind_cols or data.table::cbind

* Use data.table, which has far faster joins than base R (so does dplyr), if you were doing a true merge/join
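
A minimal sketch of the first option, assuming every file is a CSV with the column of interest named value and the same number of rows (directory and column name are hypothetical):

    files <- list.files("data", full.names = TRUE)  # hypothetical directory
    cols  <- vector("list", length(files))          # preallocate the list
    for (i in seq_along(files)) {
      cols[[i]] <- read.csv(files[i])[["value"]]
    }
    merged <- do.call(cbind, cols)                  # one bind instead of 5000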

If it was truly just pasting a column from each file together into one file, these kinds of tasks are much better done with UNIX tools, in my experience.


It is slow, and that is OK. R will very rarely beat any other language. Usually it is not off by much, but code written by a novice using for loops instead of apply functions can be 100-1000x slower.

Another example is the immutable structure that causes R to be a memory hog, creating copies of data everywhere. But again, if you plan well and execute the 'best' solutions, you can avoid the giant pitfalls, though you will rarely beat an equally well-written Python equivalent.
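
A minimal illustration of both points:

    x <- runif(1e4)
    # novice pattern: growing a vector copies it on every iteration
    out <- c()
    for (v in x) out <- c(out, v^2)
    # idiomatic pattern: a single vectorized allocation
    out2 <- x^2
    all.equal(out, out2)  # TRUE, but the loop is dramatically slower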


Post R 3.1 there are far fewer deep copies (e.g. modifying a list or adding a column to a data.frame no longer copies the whole thing like it used to).
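
You can watch this yourself with base R's tracemem, which prints whenever an object is duplicated (a minimal sketch):

    df <- data.frame(a = 1:1e6)
    tracemem(df$a)       # watch the existing column
    df$b <- rnorm(1e6)   # on R >= 3.1 this duplicates only the cheap list of
                         # column pointers; column 'a' itself is not copied,
                         # so tracemem stays quiet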


What would be a good start to learn this?

I have some programming background and really would like to get into statistics. Should I do some R tutorial and throw my weblogs at it to see what I can do? Or is there some awesome learning resource you could share?


I'm the organizer of the Dallas R Users Group. I've compiled a list of helpful resources for beginners.

http://www.meetup.com/Dallas-R-Users-Group/pages/R_Helpful_L...


If you come from a programming background then Hadley Wickham's book is probably the best place to start.

http://adv-r.had.co.nz/


There is a nice book by Brett Lantz called Machine Learning with R. The first edition (which I have) built machine learners in R; I assume the new second edition does the same.


Statistics does not mean machine learning, just FYI.


Discovering Statistics Using R by Andy Field.


R replaced SPSS.

Octave replaced Matlab.

Python based libraries are somewhere in between.

Julia with Jupyter will probably replace Mathematica, LabVIEW and Mathcad (and unify all of the above) with a powerful native language and environment.


???

R replaced S and S-PLUS, not SPSS. SPSS is still around as a light, user-friendly stats tool.

And Octave is nowhere near replacing Matlab, not by a long shot. It's the complete opposite of the R/S-PLUS story.

Source: was in grad school for cognitive neuroscience. Saw Matlab everywhere. Saw SPSS here and there. Saw Octave nowhere. Briefly looked at Octave and stopped as soon as I realized all of the packages everyone used required MathWorks toolboxes.


I don't really see how Julia replaces LabVIEW, and I say this as someone who greatly dislikes LabVIEW after having done a lot of work in it. But then I'm only passingly familiar with Julia.

What does it offer in terms of control systems/realtime, GUI building, and data flow?


Mathematica is quite a different beast. It will be a while until Julia has a native CAS (computer algebra system), and even then it will likely not match Mathematica.


Still early days, but see http://www.nemocas.org/


I have high hopes for Julia, but it's not really competing with Mathematica AFAIK, so I'm not really sure what's behind these claims.


Not Julia alone, but Julia in combination with Jupyter notebook technology -- at least for common usage. (Mathematica of course has a lot of features.)




