I think open source eventually replaces commercial products, in the same way that proprietary products become commoditized. The response for commercial products is also the same: continual differentiation, adding new features, benefits, support, documentation, etc. The exceptions are also the same: natural monopolies (e.g. strong network effects).
Open source is great at hill-climbing, by tapping the collective intelligence of its users: it excels where there are clear directions for improvement, and especially at features that users obviously need (provided the structure of the project is sufficiently modular to facilitate it).
It's not great at "hill-hopping": originating radically different products.
Counter-examples abound. Can you name even one open source app that has displaced a mature, user-facing desktop app with a non-trivial UI, other than a web browser?
Open source only seems to win in domains in which it makes sense for companies to share work in order to compete at a higher tier of functionality.
It has by no means "displaced" its proprietary equivalent, but Inkscape is one of the most user-friendly open source apps I've ever used. I find it far more intuitive than Illustrator. An incredible amount of power and complexity is presented in a way that makes it quite intuitive and a joy to use. It's also easily extensible if you're a programmer.
You skipped the bit about differentiation: if, for example, Photoshop didn't keep improving, do you think the GIMP would never catch up? I think you could find lots of examples where today's open source version is better than an x-year-old proprietary product.
The only way it can realistically happen is if the commercial product is not being improved (i.e. differentiated) any more. Your example is one of these: standardization is related to commoditization.
I suspect also that user-facing apps are easier to keep improving, because the user is right there and always has more needs that could be served (e.g. text editors will evolve until they can read mail; those that can't will be replaced by those that can). Non-user-facing apps tend to be defined by their environment rather than by users, although any component that creates a benefit the user wants more of will keep being improved (from Clayton Christensen), e.g. databases, CPUs.
You probably don't remember this, but Emacs did that in the 1980s. And then of course there's Android, but I guess you might not consider it a "desktop app". And then there's Wikipedia, which has completely displaced Encarta.
I don't think it makes sense to make generalizations about where "open source seems to win". Things are changing too fast; the circumstances that made it possible for Mozilla to beat IE in the mid-2000s no longer exist, for example.
I don't think it's obvious that open source displaces commercial for scientific computing. For every example like R, which has in many places displaced S-Plus, there are counterexamples like Matlab, for which the open source clone Octave is a bad joke (at least the last time I tried using it: missing functions, slowness, extreme difficulty installing), or Mathematica, or EViews, or GAUSS, or Maple.
One other potential factor: a lot of this software is driven by academic use, either because academics use it or because that's where people are first exposed to it, and academics often receive large discounts.
Today's Octave installs need no more than click-click-OK-done, or apt-get install octave.
> for which the open source clone Octave is a bad joke
I think you're not giving Octave enough credit. Considering that they have only a few part-time developers and that nobody sponsors them, they have accomplished a respectable amount of functionality over the last 20 years, and it is extremely unfair to call them a "bad joke".
Of course, with those limited resources they are not able to match the output of Mathworks, but what they can already do is usually more than universities teach, and _still_ many departments are essentially married to Matlab: they only mention Matlab to students, and give only Matlab examples, Matlab labs, Matlab exercises, etc. Even very respected people like Gilbert Strang, who taught MIT's introductory Linear Algebra and Computational Science and Engineering classes, seem to have enough of a vested interest in Mathworks not to mention Octave to students even briefly as something they can download and work with at home. Octave is extremely powerful and capable for what you pay for it, and deserves at least a mention.
It is probably no different at other universities and in other departments. Several professors I had to deal with were similar: either not even aware that open source packages like Octave, Scilab, Maxima, and SciPy exist at all, or extremely faithfully married to the companies behind proprietary packages like Matlab/Maple/Mathematica.
This is not a new issue. Institutionalised education mostly produces "knowledge workers", as MS put it a while back. But without the knowledge, of course.
When did you last try Octave? Professor Andrew Ng recommended using Octave (probably because it's free) for the online Stanford machine learning class (http://ml-class.org/).
Probably not for a while. At this point Octave has a TON of Matlab compatibility:
The grammar is pretty spot-on, although there is usually some release latency when Mathworks changes it (obviously, since their plans are not made known ahead of time).
Octave even has Matlab source-level compatibility for MEX files, although they are slower than Octave's own C interface.
If you start drifting away from Matlab core needs into the specialized add-ins Mathworks provides (Simulink, financial packages, etc.), then Octave can't help. If you need those, then I find that Matlab is rarely the tool for the job either (you just don't know it yet ;))
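For the core language, though, here's a rough, made-up illustration of the compatibility (the function name and data are invented for this comment): a plain .m file like the one below runs unchanged under both interpreters.

    % moving_average.m -- runs unchanged in Matlab and Octave
    % (illustrative only; nothing here is toolbox-specific)
    function y = moving_average(x, k)
      % smooth x with a length-k box filter, keeping the original length
      y = conv(x, ones(1, k) / k, 'same');
    end

    % at either prompt:
    % >> x = sin(linspace(0, 2*pi, 100)) + 0.1 * randn(1, 100);
    % >> y = moving_average(x, 5);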
You can't even compare R, a statistical engine, to Matlab (Matrix Laboratory), a numerical matrix-manipulation engine. They are different software packages designed for different purposes; e.g. plugging R into a high-grade NMR magnet and processing signals is probably not a good idea.
R is sick, though, and I am always pleasantly surprised at what clever people are doing with it. Octave, on the other hand, shouldn't be used until someone writes a proper interface and decent graphing. Matlab is a million light-years ahead of Octave in that regard.
I still hope there will be some syntax improvements to Python, then. Matlab's way of working with matrices is simply excellent: formulas on paper map almost one-to-one to the code. Apart from that, I'd take Python over Matlab any day, but I am forced to work with Matlab for some of my classes.
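To make the formulas-map-to-code point concrete, a tiny made-up example (the variable names are mine): ordinary least squares on paper is beta = inv(X'X) X'y, and the Matlab/Octave version is essentially a transliteration.

    % X is an n-by-p design matrix, y an n-by-1 response (made-up names)
    beta = (X' * X) \ (X' * y);   % literal transcription of the normal equations
    % or, numerically nicer, let backslash do the whole least-squares solve:
    beta = X \ y;

NumPy can do the same, of course (numpy.linalg.lstsq), just with a bit more ceremony, which is exactly the kind of thing the parent is wishing away.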
Anecdotally, NumPy (Python) has some traction. Similarly they don't consider SQL libraries. And I'm sure there are statistical analysis libraries for Java. According to the bar chart below R is mentioned by 45%, SQL by 32%, Python by 25%, Java by 24%. This seems a more reasonable comparison to me than the graphs earlier (higher up) in the post.
I use R as my primary data-analysis tool for almost all of my work, with occasional recourse to SAS for certain specialized models (e.g., PROC GLIMMIX for generalized mixed models).
My only complaints are the awful default IDE, which can be mitigated to a large extent by scripting elsewhere and source()ing the script, and some odd edge behaviors: the mystifying row names of data frames, the difficulty of dropping unused factor levels from aggregated or sliced data (another data frame issue), and the perhaps unnecessary obscurity of some of the plotting functions (although holding R responsible for the lattice library is unfair).
All that said, for a free tool, it's extraordinary, and the authors of the base language and the many packages that I use have my gratitude.
I love R, but I end up using Stata more often because it is easier to produce vector graphics that can be imported into Illustrator. I wish the R community would start to focus on graphics.
I've had some success with output from lattice using Cairo's SVG option, although you're right that it's never easy. Self-citing, the plots in these pubs were generated as above (JoCN may be behind a paywall):
> Robert A. Muenchen is the author of R for SAS and SPSS Users and, with Joseph M. Hilbe, R for Stata Users. He is also the creator of r4stats.com, a popular web site devoted to helping people learn R. Bob is a consulting statistician with 30 years of experience
Disclaimer: I hate R's syntax, but my company's analytics group uses R for just about everything.
Unfortunately, it's almost impossible to work with very large datasets in R because of the speed limitations. Many researchers I know use Matlab because of this.
My recollection is that Octave is significantly slower than Matlab, and some quick googling on benchmarks [1] suggests that it is (was?) as slow or slower than R.
I've complained before that Octave is the wrong solution to the Matlab problem, and if you aren't attached to one of the many fine Matlab toolkits, you're likely better served translating to a more expressive language, like Python+NumPy+SciPy.
Octave is a Matlab clone; in fact, the Octave developers openly say that, except for some special cases, any difference between Octave and Matlab is a bug.
The biggest difference between Matlab and Octave is the JIT compiler in Matlab, which does an incredibly good job at vectorizing simple (or sometimes even not-so-simple) loops.
I think it's fair to say that Octave's performance is very close to Matlab's in the pre-JIT era.
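As a minimal, made-up sketch of why that matters (exact timings will vary with versions and hardware): both snippets below compute the same thing, but without a JIT the loop pays interpreter overhead on every iteration, while the vectorized form stays inside compiled built-ins.

    n = 1e6;
    x = rand(n, 1);

    % loop version: every iteration goes through the interpreter
    tic
    y1 = zeros(n, 1);
    for i = 1:n
      y1(i) = x(i)^2 + 2*x(i) + 1;
    end
    toc

    % vectorized version: a couple of calls into compiled built-ins
    tic
    y2 = x.^2 + 2*x + 1;
    toc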
There's also a huge difference in toolboxes, profiling, sparse matrix operations, parallel computing, and many, many other things. In these areas, I'm afraid, Octave is light-years behind Matlab.
However, you can still do a lot of useful simple stuff with Octave, and it's free! Matlab-like syntax is really, really cool when it comes to vectorized operations. So probably these two reasons determined Andrew Ng's choice of Octave as the main environment for ml-class. A huge win for Octave, I guess. This might spur some interest in the development and attract new people to the product. I think it's a well-deserved success for John W. Eaton and the other people who have developed Octave over all these years.
I agree with your take on Octave performance relative to Matlab. The Matlab parallel toolbox is getting more and more useful in a multicore world.
As you note, the Matlab profiler is very nice. You can zero in on the 80% of the 80/20 tradeoff very fast, during your usual development cycle. It's as simple as:
>> profile on
>> do_something
>> profile report
and you get a nice graphical/textual report on time usage in everything do_something called.
This is not true. They strive for Matlab language compatibility, but none of them refers to Octave as a "Matlab clone", nor are they working on cloning Matlab, nor was the project started to become a Matlab clone. It is like calling Linux a "Unix clone".
It's probably more an issue of easily pre-filtering/aggregating the data before analysing it with R. I like this approach of moving the calculation to the data, but we must be very late on the adoption curve if Oracle are doing it already.
For statistical genetics at least, it's common to process much of the data in parallel, so the RAM limitations on one R instance are not the gating factor.
Having seen and heard about what Bioconductor had to do to process genetic data, memory is a huge issue. It is even more so with next-generation sequencing data.
Yes, I guess I've always operated under the assumption that I've needed to parallelize dramatically. I usually operate on data from families of ~40 people with next-gen sequencing data, and the tools that I use generally finish within about an hour.
I use R every day for my research (social simulations, sometimes based on sample surveys). An additional limitation of R is memory: R cannot use virtual memory, so the maximum amount of data it can handle is limited.
There are two ways to deal with that: one is to load datasets through a SQL database (using a SQL library), which IMHO is a "dirty hack". The other (what I usually do) is to load the huge datasets in Stata (or any other stats package) and filter the data down to a set that is small enough to work with in R.
Other than that, the available libraries in R are crazy good. For example, stuff like Approximate Bayesian Computation or survey analysis (taking weight factors into account) is straightforward with the available libraries.
The core libraries available in R are some of the most well-reviewed, carefully written, and correct code available.
There is a huge number of available libraries (thousands!) of variable quality, thanks to the open nature of the project. But commercial software has problems too, especially with new and niche products. And when something goes wrong in those cases, you can't see why for yourself; worse, independent experts wouldn't have the chance to either.
He is probably comparing R to SAS (which are the two most popular statistical programming languages). SAS doesn't really have libraries; instead, you buy additional packages from SAS, which are very reliable and well supported, but expensive.
My company shuns R (although I personally like it), primarily because of this issue. If we need to run a rare or uncommon statistical procedure, it is a lot easier to trust the SAS procedure, rather than an open source R package written by some grad student.
True, though if you need to run a rare or uncommon stat procedure, SAS is not likely to have it in the core, and then you are back to using what "some grad student" wrote.
I am shunning SciPy, and to a lesser extent NumPy, for the same reason. I have reason to believe the developers are not experts in numerical linear algebra, and some of the documentation also does not inspire confidence.
Yes, but for less-adopted or emerging platforms, you have to be more conscious of the source of the library, and you should look at the source to verify its functionality.
Having worked on and off with SAS in recent years I'm aware it has its limitations, but round here we like constructive contributions please. Would you like to expand upon your remarks?