Hacker News
The future of R - pessimistic thoughts by R founder Ross Ihaka (r-bloggers.com)
76 points by TalGalili on Sept 13, 2010 | 75 comments



> The license will need to do a better job of protecting work donated to the commons than GPL2 seems to have done. I’m not willing to have any more of my work purloined by the likes of Revolution Analytics, so I’ll be looking for better protection from the license (and being a lot more careful about who I work with).

Not having used R or being more than passingly familiar with it, I'm wondering if anyone could shed some light on what this is about? I notice that on the Revolution Analytics website, they are selling an "enterprise" version of R which they claim has a number of advantages over the mainstream form of R.

How exactly is a proprietary form of R even able to exist if R's codebase is GPL'd?


Revolution Analytics has released some extensions that are proprietary, for instance their doSMP library that provides multicore support under Windows. They have also released several of their libraries into the "commons" -- doMC and foreach are the two that immediately come to mind.

I'm sort of torn on this issue because, setting aside the terms under which RA releases their libraries, what they produce comes with great documentation and tends to be pretty useful. Using foreach and doMC I was able to cut a calculation that normally takes 8 hours down to 3 hours on a 6-core machine.
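
For anyone who hasn't seen it, the pattern looks roughly like this (a minimal sketch; expensive_step is just a stand-in for your own per-iteration work):

  library(foreach)
  library(doMC)
  registerDoMC(cores = 6)    # register a parallel backend with 6 workers

  # run the loop body in parallel and combine the per-iteration results
  results <- foreach(i = 1:100, .combine = c) %dopar% {
    expensive_step(i)        # hypothetical placeholder for the real work
  }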

On the other hand, I strongly believe that proprietary code should be avoided when doing scientific research because it inhibits peer review and makes it harder for others to replicate your work. As Warren DeLano said: "The only way to publish software in a scientifically robust manner is to share source code, and that means publishing via the internet in an open-access/open-source fashion."


And Robert Gentleman has always been one of the strongest proponents of reproducible research, which includes being able to create packages that can be shared and freely distributed.


It appears they release the sources for the GPL parts and their changes to the GPL parts:

http://www.revolutionanalytics.com/downloads/gpl-sources.php

And from their FAQ:

Yes. CRAN R is licensed under the GNU General Public License (version 2). As permitted under that license, Revolution R is based on a modified version of CRAN R. The source code to that modified version (including a list of changes) is available for download when you download the binary version of Revolution R, as required by the GPL. As an open-source company, Revolution Analytics respects the Free and Open Source Software philosophy and respects all free-software licenses. Revolution Analytics is also a direct supporter of the open-source R project: financially (as a benefactor of the R Foundation), organizationally (by sponsoring events and user group meetings), and technically (by contributing modifications to CRAN R back to the community).


Hmm, that seems to interpret the GPL as something closer to the LGPL's intent. I assume they're able to do this because their other tools (e.g. the IDE) are careful not to actually link to R, but only call it via things like shared files and pipes?


IANAL. This is discussed at http://www.gnu.org/licenses/gpl-faq.html#GPLInProprietarySys....

RA claims to have adopted this "arm's length" approach, releasing their direct extensions to the community. They ask me for identifying information to let me download the software; don't know if that's against the rules. Critically, their enterprise product is pitched as a collection of separate tools: IDE, debugger, analysis tool, etc.

There are a few other ways that people have made GPL'd code proprietary: (1) providing a hosted service instead of distributing software, or (2) distributing only a hardware appliance that contains inextricable software. The latter is controversial with respect to GPL2; GPL3 more clearly prohibits it.


It's interesting that he saw that as a violation of the commons. I don't know the details, but it seems like a feature rather than a bug that businesses can be built around open-source code.


R has been surprisingly shoddy about its license enforcement on dynamically loadable extensions (especially for an official GNU project). Crucially, they declared an extremely useful subset of their API as usable for non-GPL works (similar to how the two flavors of drivers in the Linux kernel work), and people now drive trucks through it all the time. Early on this benefited R, though.


To me the problem with R isn't performance, which I've never had trouble with myself, but rather the complicated and confusing semantics of its data types.

R's aggregate data types are: vector, matrix, array, dataframe, and list. The semantics of these types and the relationships between them are extremely confusing. I wish I had gathered examples of this so I could be more specific, but I have basically come to the conclusion that I will never get familiar enough with them to do any better than random guessing until it works right. And I've written somewhat in-depth analyses in R.


I may be wrong but I think that:

list => basically a hash, or an array that can hold objects of mixed types

vector, matrix, array => essentially the same thing. They are what most languages call arrays, and can hold only one type. The difference between the three is just the number of dimensions (vector: 1, matrix: 2, array: 3+).

dataframe, I will concede, is a little more complex, and I still have some trouble with it. But I basically think of it as a table where each row represents a quantity (say temperature) and each column a different measurement. So, for example:

  rows    => temperature, humidity, hours of light, peak UV
  columns => Day1, Day2, Day3, Day4, ...
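
In case concrete code helps, here's a small sketch of those types at the R prompt:

  l  <- list(1, "a", TRUE)               # list: elements of mixed types
  v  <- c(1.5, 2.5, 3.5)                 # vector: 1d, single type
  m  <- matrix(1:6, nrow = 2)            # matrix: 2d, single type
  a  <- array(1:24, dim = c(2, 3, 4))    # array: 3 or more dimensions
  df <- data.frame(day  = 1:3,           # data frame: columns may differ in type,
                   temp = c(21.5, 23.1, 19.8))  # but each column is homogeneous
  str(df)                                # shows one line per column with its type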

Hope that helps.


Lists are hashes on acid. The default return value of indexing into one is NULL. That's a weird semantic for a list, though maybe not so much for a hash. The weirder part is that if you explicitly assign NULL to an element of the list, it deletes the element. That's particularly odd because the list knows all the elements in it, so it's not as if it can't tell the difference between a value that has been set to NULL and a value that has never been set. See http://gist.github.com/578110 for a little transcript.
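
A small transcript showing the behaviour described above (a sketch, not the linked gist itself):

  x <- list(a = 1, b = 2)
  x$c                    # NULL -- indexing a non-existent element just returns NULL
  x$a <- NULL            # assigning NULL *deletes* the element
  names(x)               # "b"
  x["a"] <- list(NULL)   # the workaround if you really want to store a NULL
  names(x)               # "b" "a"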


Vectors are not the same as 1d matrices/arrays because a vector does not have direction (i.e. it's not a row vector or a column vector, but will work as either depending on circumstance).

  > all.equal(1:10, matrix(1:10, ncol = 1))
  [1] "Attributes: < target is NULL, current is list >"
  [2] "target is numeric, current is matrix"   
  > all.equal(matrix(1:10, ncol = 1), array(1:10, c(10, 1)))
  [1] TRUE


The semantics are fairly straightforward once you figure out the two main differences.

Vectors, matrices and arrays are atomic/homogeneous objects, and differ only in their dimensionality. Vectors are 1d, matrices 2d, and arrays are 3d or higher. Calling a 2d homogeneous structure a matrix is just a convention: a matrix is identical to a 2d array in every important way.

Lists and data frames are heterogeneous/recursive. Lists are 1d, and data frames are (essentially) 2d: each column is homogeneous, but different columns can have different types.
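
A quick way to see the atomic/recursive split at the prompt (a minimal sketch):

  is.atomic(1:10)                   # TRUE  -- vectors, matrices, and arrays are atomic
  is.atomic(matrix(1:10, ncol = 2)) # TRUE
  is.recursive(list(1, "a"))        # TRUE  -- lists are recursive
  typeof(data.frame(x = 1:3))       # "list" -- a data frame is a list of columns
  dim(matrix(1:6, nrow = 2))        # 2 3
  dim(1:6)                          # NULL  -- a plain vector carries no dim attribute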


I would suggest checking out Octave, a Matlab-compatible language. It doesn't have R's rich libraries of statistical procedures, but it is far easier to use for building your own matrix-based algorithms.


To be honest, I've been using R a bit lately for my work, and while I like it I don't find it at all innovative. That's not a criticism of R: the libraries it has are amazing, as is the mindshare among people who care about statistics.

But I wonder why R actually needs to exist as its own language. It seems it could be recast in Ruby for example or one of the latest functional languages.

So I am kind of pleased by this news... if there's going to be a need for R to have its own language, speed seems to be the most important distinguishing feature. A bit like Fortran, which is still used in science.

(incidentally... don't let people tell you otherwise: Fortran (90+) is a very nice language... much more pleasant to use than C, and it gives you better performance (unless you know a lot about compilers and compiler flags... but most scientists don't ^_^))


But I wonder why R actually needs to exist as its own language. It seems it could be recast in Ruby for example or one of the latest functional languages.

I agree if we're talking about defining a new R (like the blog post discusses), but the existing R makes sense to me to exist as its own language. It wasn't really invented from scratch gratuitously, but began as an open-source reimplementation of the Bell Labs "S" language, which had already become close to a de-facto standard in the statistics community. After 30 years of writing S and R code, I think there's going to be a big uphill effort if you want to convince statisticians to read and write Ruby code instead. You'd also lose the ability to run snippets of code from the thousands of existing papers that include R/S code in an appendix.

One in-between possibility could be to retain the standard syntax/semantics but target an existing VM with a bigger development community. Ihaka seems to think that's impossible (he briefly discusses attempts to compile R as futile), but lots of weird and highly dynamic languages now have more efficient implementations than most people would've thought possible 10 years ago.


R and S already have completely different scoping rules (S is dynamically scoped, while R has static scope, although a rather broken version of it according to the article). So it might well be possible to develop a third language with mostly the same syntax (so that snippets from papers that don't explore the darker corners of the existing semantics can be reused or easily adapted) and, this time, consistent scoping rules. Whether it would be worthwhile, I don't know.
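
For a concrete flavour of what R's static (lexical) scoping gives you, here's a small sketch of the usual closure example, the sort of code that depends on which scoping rules are in force:

  make.counter <- function() {
    count <- 0
    function() {
      count <<- count + 1   # <<- assigns into the enclosing (defining) environment
      count
    }
  }
  counter <- make.counter()
  counter()   # 1
  counter()   # 2 -- the closure remembers the environment it was created in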


"One in-between possibility could be to retain the standard syntax/semantics but target an existing VM with a bigger development community"

I think that's a great move for a number of languages as they lose popularity over time. The Scheme on Lisp machines was outpaced by the versions on compiled machines, and those are the versions we use today.


The problem is a huge number of R packages use compiled code that won't be portable to this new VM.


But I wonder why R actually needs to exist as its own language. It seems it could be recast in Ruby for example or one of the latest functional languages.

Indeed - it would be a shame for them to start over from scratch and end up with a brand new language, brand new syntax, brand new quirks, brand new performance problems, etc., when they could have simply looked around a bit for something that's already mature, reasonably optimized, and suited to their needs.

If they wanted to add to or modify an existing language (for instance, to provide more concise syntax for the things that matter more in statistics than in general-purpose programming), that would be just fine; it could become a dialect of some other language. But starting fresh seems like an awful waste of energy...

Something that ran on the JVM would be awesome, they'd have no trouble at all rebuilding the massive library of contributions.


For the JVM, there is Incanter, a statistics library written in Clojure. It is backed by Parallel Colt for the heavy number crunching. Note that I'm not trying to say Clojure would be a scientist-friendly language :)


A lot of people blame Lisp for having too many parentheses, but after reading Lisp for a while I generally feel they're not the issue. What may make it scientist-unfriendly is having to write math formulas in prefix rather than infix notation.


Incanter has the $= macro for writing infix expressions, so I don't think that need be an issue.


How does Incanter work on HPC? R is pretty awful from that point of view, and if there's ever going to be room for a specialised statistical language, it has to be able to do massive number crunching.

I've heard bad things about the JVM for tightly coupled jobs on HPC (though I know there's been some improvement: e.g. a lot of work done by EPCC in Edinburgh). Does Clojure manage to offer a good parallel implementation on top of the JVM, or has no work been done in this area?


According to the website, Parallel Colt supports multicore machines. There's no mention of MPI, though. As far as I know, Incanter only wraps it, so it does not influence the performance that much.


Perhaps I'm wrong, but scipy and numpy seem to do a lot of what they want, and if they started there a lot more of their effort could go into porting the libraries.


Yes, but (to my shock) their speed makes even R look fast, which I didn't think was possible. My problem was taking a Matrix Market formatted matrix, loading it, turning it into a sparse column vector representation, then computing the norms of the columns. It was running for roughly 8 hours on 12MM columns.

R of all things was running faster.

I reimplemented in java and it takes < 1 second.


Given that numpy/scipy is basically a collection of C/C++/Fortran primitives, chances are that you managed to write a program that spent very little time actually computing things, and a lot of time doing something else.

Not sure what that could be, though; even low-level algorithms usually run 10-100x slower than C speed if naively coded in plain Python, so a 30000x slowdown using a specialized library sounds rather odd. I assume you checked for memory leaks and swapping. Did you do any profiling?


To be fair, the sparse matrix package is still very rough. I am highly skeptical of the 1sec in java vs 8 hours in scipy, though.

In general, numpy/scipy is quite a bit faster than R: it does not have pass-by-value semantics, for one. I am also skeptical about writing a "new" R: the main value of R is in the R packages. Any new language would throw that away.



Thanks, I will take a look at it at home (where I have enough memory to deal with that kind of matrix). As I said, scipy.sparse is pretty rough and too low-level, but 3 sec vs 8 hours is far too big a gap to be normal.

(I forked your gist so you know how to contact me)


Here's my code -- I would have contacted you directly, but I can't find your email. It's as simple as this:

http://gist.github.com/578226#file_gistfile1.py (sorry, that's an editable link, so please be nice)

Anyway -- the machine is not swapping -- python grabs ~20GB of ram, there is another ~100GB available. It pegs one of the cores. Nothing else was running during this test so there was no competition for the fsb. This is on a recent-gen 16 core xeon server. The bit with the pipe and the popen is just me writing the header for the matrix market file in a separate file because it's simpler to dump it from hadoop that way. My time measurements above did not include file reading time -- just the experience of running the norms code (which is just a dot product of each column with itself, essentially). The matrix file is 2GB uncompressed on disk; dimensions are [5e5, 13e7] with 1.2e8 nz.

I rewrote as a sparse, column major matrix in java, running on the same machine, with a sparse dot product implementation. I was off a little before -- the time to stripe the entire dataset is ~2.7 seconds, averaged over 1000 tries. I was confused because I'm spawning 5 threads which gets the time to compute a dot product of one column against every other column down to an average of 1 pass through the data per second. If you contact me, I'm happy to share the code, but it's rather more lengthy.

One more edit -- the java code is not particularly optimized and doesn't batch columns. It just spawns 5 threads that bang on my giant array, where the data is stored as a column-major array of sparse vectors, and each sparse vector is an array of integer indices and an array of integer values.


Such large differences imply a difference in the algorithm.

Indeed, you write

  c = mat2.getcol(j)
  norms[0, j] = scipy.linalg.norm(c.A)

which means (i) extract a sparse column vector, (ii) convert it to a dense vector, and (iii) compute the norm. Now, this should explain the speed difference. Looking at the nnz, a dense norm can take up to a factor 5e5/(1.2e8/1.3e7) ~ 54000 longer :)

The main issue here is that the linear algebra stuff under `scipy.linalg` doesn't know about sparse matrices, and tends to convert everything to dense first. You'd need to muck around with `m2.data` to go faster.


Thanks a bunch for the help. It would be nice if this were documented somewhere other than the source code, though :( And none of my googling turned up m2.data.

I'd actually guessed that it might be making the columns dense, but I'd expected to see a step-ladder memory pattern as vectors were allocated, gc was triggered, more vectors were allocated, etc. I didn't observe such a pattern; memory usage was almost constant.

Anyway, thanks again for your help -- I'd offer via email to buy you a beer if you're ever in SF, but no email, so...


Is there any chance that SciPy was compiled without the ATLAS/BLAS/LAPACK libraries? (You can verify this with the "numpy.show_config()" and "scipy.show_config()" functions in an interactive python session.) In that case, all the numerical math would be done in straight Python rather than optimized C/FORTRAN.


ATLAS/BLAS/LAPACK is mostly useless for basic sparse operations, at least in scipy. It becomes useful for more advanced operations like SVD, etc... Unless you can use dot (scalar product), which is very fast in numpy if you use atlas (e.g. time the difference between np.sum(x * x) vs np.dot(x, x) - sum does not use atlas, dot does).

For sparse SVD as I implemented in scipy, even fast dot does not seem to make that much of a difference (would be interesting to do real benchmarks, though).


Typo in above -- it should be 1.3e7 columns. Sorry.


Fear not, some of us are working on this very hard.


Oh? What is it that you're doing?


Building a next generation of statistical computing.


I was hoping you'd be at least the slightest bit more specific :) are you working in a proprietary startup? are you working in academia? presumably you're working on a codebase with the intent to distribute it to users at some point?


Would you be willing to be more specific? That's not very helpful.


I can't be yet, unfortunately. But there are definitely several different groups of very talented people working on this problem (think MIT and Harvard PhDs). I suspect some very interesting work will come out in the next year or two. Whether they are open source remains another question.


Austin Heap? Is that you?


I'd be interested to learn how to add missing value support into an existing language. It's pervasive in R, so important for statistics, and it seems like it would be hard to patch onto an existing language.
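
A tiny illustration of how pervasive NA is in R (a minimal sketch):

  x <- c(2.1, NA, 3.7)
  mean(x)                 # NA -- missing values propagate by default
  mean(x, na.rm = TRUE)   # 2.9
  is.na(x)                # FALSE TRUE FALSE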


How important is the distinction between NaN versus NA? I agree this is a subtle issue.


It is important: otherwise you cannot know whether a NaN is coming from a computation or is really a missing value. In NumPy, we have the MaskedArray implementation for this.
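
For comparison, a small sketch of how R keeps the two apart:

  0 / 0          # NaN -- the result of an undefined computation
  is.na(NaN)     # TRUE  -- NaN is treated as missing...
  is.nan(NA)     # FALSE -- ...but NA is not a NaN
  typeof(NA)     # "logical"; typed variants like NA_real_ and NA_integer_ exist too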


What does a NA value become when you extract it to a float? i.e. What is the behavior of X[0]?

It is somewhat confusing that python base types and numpy differ in behavior, for instance when dealing with inf or divide by zero exceptions. I think this gets to hadley's point that it will be hard to bolt on R to an existing language.


If X[0] is masked, it will return the value mask, of type MaskedConstant.

As for python vs numpy differences: yes, those can be confusing, and that's inherent to the fact that we use a "real" language with a library on top of it instead of a language designed around the domain. If you want to do numerical computation, you do want the behavior of numpy in most cases, I think. There is the issue of "easiness" vs what scientists need. You regularly have people complaining about various float issues, and people with little numerical-computation knowledge advising them to use decimal, etc., unaware of the issues. Also, python wants to "hide" the complexities, whereas numpy is much less forgiving.

As for the special case of divide by 0 or inf, note that you can get a behavior quite similar to python float. You can control how FPU exceptions are raised with numpy.seterr:

  import numpy as np
  a = np.random.randn(4)
  a / 0                      # prints a warning, gives an array of +/- inf
  np.seterr(divide="raise")
  a / 0                      # now raises a divide-by-zero exception


I agree. It's pretty important. Missing values can be coded as funny things like minus infinity in some languages. So then, if you want to assign a subset of a vector to a new group (say the number of people who have had an event, e.g. "heart attack") and you use the condition x <= 1, "missing values" would be incorrectly categorized as "heart attack". You can see how this could really affect your analysis. With R, the usefulness of NA is that if you're careful when you import data, this never happens.

Of course, if you're always careful about how NA values are stored as numbers in other languages, you can avoid this error. But R's approach to NA makes it easy.
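
In R the comparison itself stays missing, so the miscategorization can't happen silently. A small sketch:

  x <- c(0.4, NA, 2.3)
  x <= 1                     # TRUE NA FALSE -- unknown stays unknown
  sum(x <= 1)                # NA, instead of silently counting the missing case
  sum(x <= 1, na.rm = TRUE)  # 1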


And this isn't just a hypothetical example - it has happened and has affected the results of important medical studies.


Indeed, R "the language" has been the main barrier to my adoption of R.


If you want to do scientific computing in a modern general purpose language, NumPy/SciPy are pretty popular. Haven't used them much myself, but they look pretty decent.


R needs its own language because one of the most important factors in its widespread use is the language itself. First, it is very similar to an older language that was one of the first widely used statistical packages. Second, the syntax is very easy to use at a simple level. For example, to read a csv file and compute a linear regression with betas, p-values, the works, all you have to do is:

  # read a csv file with headers into ram
  data <- read.csv(file='blah.csv', header=TRUE)

  # compute a linear model with dependent variable income and explanatory variables
  # gender, education, ethnicity, and age, with automatic creation of (n-1)
  # indicator variables as appropriate for categorical data
  model <- lm(income ~ gender + education + ethnicity + age, data=data)

  # if instead I want to use a glm family of models, this is also trivial
  # (statistically nonsensical for this data, but it demonstrates how easy
  # it is to do sophisticated things in R)
  model.glm <- glm(income ~ gender + education + ethnicity + age,
                   family=binomial(link="cloglog"), data=data)

That's it. So you can see why stats people love it -- it's easy to pick up and use to get productive work done. It's easy to show students. For what it's meant to do originally -- desktop statistics -- the language works quite well. It lacks as a programming language in many regards, don't get me wrong, but simplicity and ease of use, particularly in the beginning, are critical.

Also, the data frame is the single best data structure I've ever used for manipulating tabular data. Finally, the excellent REPL makes working in R and exploratory analysis absolutely awesome.
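
To give a flavour of why (a small sketch with made-up column names):

  df <- data.frame(gender = c("m", "f", "f", "m"),
                   income = c(40, 52, 61, 38))
  subset(df, income > 45)               # row-wise filtering by a condition
  aggregate(income ~ gender, df, mean)  # grouped summaries
  df$log_income <- log(df$income)       # derived columns are one assignment away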


I'm pretty sure you could do a lot of this in Python as well.


It seems that way on the surface, but there are a lot of subtleties to R (lazy evaluation of function arguments, access to both static and dynamic scope) that would make implementation in another language difficult. R has many features that are not that desirable from a programming-language standpoint (and that do make optimisation difficult), but they do make it a very expressive, flexible language for statistical computing.
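
Lazy argument evaluation alone is easy to demonstrate but hard to graft onto most host languages. A small sketch:

  f <- function(x, y) x            # y is a promise and is never forced
  f(42, stop("never evaluated"))   # returns 42; the error is never raised

  g <- function(x) substitute(x)   # capture the argument as an unevaluated expression
  g(a + b)                         # returns the expression a + b, even if a and b don't exist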


Could you tell me a bit more about why you like data.frame? Is it not just a 2D matrix?


I think the interviewee has, elsewhere, been pretty explicit about his desire for the R committee to get away from maintaining their own language and to build on a platform maintained elsewhere. Basically, it seems like he thinks R should be a package and some libraries, but not in the runtime/compiler business.


I would welcome a replacement for R just because the new language's name might be easier to google. I hope they don't call it Q.


BTW, that was my motivation for starting the www.r-bloggers.com website. It now aggregates over 110 bloggers who write about R. When I started and looked for them on Google, all I could find was bloggers who wrote about pirates :D


Google thought R = Arrrgghh? Lol...


Yes, please! It seems that nowadays one of the most important features of a name is to be unique from Google's point of view. :)

(incidentally, if you are an MPAA operative, give your blockbusters just one very generic name... a lot easier than suing half the world, and just as annoying ;) )


There is actually rseek.org just for this purpose.


The top five links for "R" are all the right ones. Google seems to have you covered.



google: r statistics foo


It almost sounds like they want to switch to Haskell.


Or perhaps Incanter, with all its Lispy goodness.

"Incanter is a Clojure-based, R-like platform for statistical computing and graphics."

http://incanter.org/


Ihaka himself seems to be eyeing Common Lisp as the next best statistical environment:

http://www.stat.auckland.ac.nz/%7Eihaka/downloads/Compstat-2...


Lisp-Stat used to be a big deal in the late-80s/early-90s, and had a shot at becoming the dominant open-source statistics package inspired by Bell Labs S. It eventually lost mindshare to R, though, and was pretty solidly eclipsed by the late 1990s. It'd be interesting if the R developer responds by trying to build something like Lisp-Stat again. ;-)


Wouldn't get you pervasive laziness, though I don't know whether R is lazy by default or like Clojure in that you have to ask for it.


If you use Clojure's primitives you pretty much end up with laziness by default. for, map and most of the other list-processing functions all return lazy sequences. So while Clojure does technically require you to "turn on" laziness, in most cases you'll find it's turned on already.


In R, arguments are passed around lazily, akin to Scala's =>. Otherwise, evaluation is not lazy (as in Haskell) unless you ask for it explicitly (as with force/delay in Scheme).

I think essentially no one using R professionally would accept Haskell or Clojure as a substitute. Having spoken to a few: Lisp-like syntax is "unreadable", and Haskell is too much effort.


A translator from R to Maxima would be another idea. Maxima is Lisp-based, uses an Algol-like language, is used in education, and is free. But developers are scarce, so more hands are needed. Maxima can use BLAS and other libraries for numeric computation, and it supports many implementations of Lisp (SBCL, GCL, CCL, ECL, ...); some of its developers are working on a Java-based Lisp (Armed Bear?). With so many tools available, it is something to consider.


Python has RPy2, so you can still access R's statistics libraries, and there are also http://scikits.appspot.com/statsmodels and http://pandas.sourceforge.net/ (data frames).

Plus, scipy has a decent stats library as well (random variables, etc).

Still rough around the edges, but a good solution in my opinion.



