Don't Enforce R as a Standard (timotheepoisot.fr)
166 points by doctoboggan on April 3, 2015 | 88 comments



To give the other side of the argument:

Most scientists have little or no quality training in software development, but scientific research is increasingly reliant on software. At present, software is a glaring black box in a great deal of research, because very few reviewers have the skills to thoroughly scrutinise it.

A software monoculture always has negative consequences, but fragmentation can be equally problematic in many cases. A reviewer saying "This would be better in R" usually means "I have no chance of understanding your code, because it's not in R". For better or worse, R is currently the statistical computing lingua franca in most fields.

I believe that the scientific method is in real trouble, due largely to the immense complexity of much modern-day research. Scientists have access to immensely powerful analytical and statistical tools, but most lack the training and infrastructural support to use them in a rigorous manner. Bad practices in software and statistics are the norm, rather than the exception; I'm sure most of this is just an honest shortcoming, but I'm equally sure that the lack of CS and stats experience amongst reviewers is a gift to would-be Bogdanovs and Obokatas.

Science has always been a collaborative effort, but I think most fields are in desperate need of greater support from computer scientists and statisticians. Ultimately I would like to see those professions become deeply integrated into all scientific fields, with the expectation that all papers should credit a statistician and (where applicable) a computer scientist. Likewise, editors and reviewers need far closer ties to CS and statistics professionals. Of course, these issues are intertwined with many other problems in funding and peer review.

Until then, the hegemony of R may simply be a price we have to pay for better research in the short-term. A software monoculture at least gives reviewers a fighting chance of spotting issues with software that might affect the validity of results.


I think it's more like, "I have no chance of understanding your code, BECAUSE IT'S IN R." The vast majority of R users, and I bet 95% of ecologists, can't read and understand R modules. It's the PHP of Data Science.

The point, then, is to supply code with simple instructions so that anyone can run it. Secondary goal: write it in a language a human can actually read.

If you're doing science and you want people to understand your code, you use Python, not R.


> If you're doing science and you want people to understand your code, you use Python, not R.

Wouldn't that depend on which language "people" know?

> Its the PHP of Data Science.

Some might disagree with that statement. Some might even argue that languages where for loops are idiomatic and strongly promoted by the BDFL are the PHP of Data Science.


If so, you're the first "people" I've ever heard this argument from :)


Python is an ugly language for scientific work; any of R, Matlab, or Julia is far preferable for not massively obscuring the intent of your code with the programmatic hoops you have to jump through. Programmers, I think, tend to wildly underestimate the effort it takes non-programmers to learn and read Python.

Further, the vast majority of R modules are at least partially implemented in R and tend to be very readable.


What hoops are you referring to? My PhD work was all Matlab, and I vastly prefer Python for readability and practicality.


Any linear algebra looks ugly --

   python: X.dot(Y)
   matlab: X*Y
   R: X %*% Y
It makes less difference when it's only 2 matrices, but with 3 or more the Matlab and R forms are far more readable.
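
A minimal numpy sketch of the three-matrix case (matrices hypothetical):

    import numpy as np

    X = np.random.rand(4, 3)
    Y = np.random.rand(3, 5)
    Z = np.random.rand(5, 2)

    # python: the algebra hides inside chained method calls
    P = X.dot(Y).dot(Z)

    # matlab spells the same product X*Y*Z, and R spells it X %*% Y %*% Z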

pandas has a lot of functionality, but the interface is far inferior to R's data frame, where you have

   df[row predicate, column predicate]
particularly the highly usable intermingling of column access by index or name. reshape2 and plyr are, imo, far more elegant and less wordy APIs.
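
For comparison, a rough pandas sketch (data frame and column names hypothetical) of the closest analogue, which goes through .loc:

    import pandas as pd

    df = pd.DataFrame({"height": [1.2, 5.1, 3.3],
                       "species": ["a", "b", "a"]})

    # R:      df[df$height > 2, c("height", "species")]
    # pandas: row predicate and column selection both go through .loc
    tall = df.loc[df["height"] > 2, ["height", "species"]]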

The API to sklearn, while definitely more consistent than R's APIs, has a lot more programmer nonsense interjected: imports of random packages, the difference between pandas and numpy that still peeks through the second you step outside of statsmodels, etc.

numpy has a serialization format, while pandas uses pickle or HDF5. R just uses save/load, and unlike pickle, in my experience it reliably works.
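
A small sketch of the split being described (file names hypothetical):

    import numpy as np
    import pandas as pd

    arr = np.arange(10)
    np.save("arr.npy", arr)             # numpy's own binary format
    arr_back = np.load("arr.npy")

    df = pd.DataFrame({"x": [1, 2, 3]})
    df.to_pickle("df.pkl")              # pandas falls back on pickle
    df_back = pd.read_pickle("df.pkl")

In R the same round trip is save(obj, file = "obj.RData") followed by load("obj.RData"), whatever the object is.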

matplotlib is an ugly API, particularly compared to ggplot.


A few things that do not completely address your points, but may help:

- Python 3.5 is adding the @ operator for matrix multiply (see the sketch after this list). Of course this only helps you if your company uses Python 3

- There are two packages (seaborn, and a ggplot clone) which sit on top of matplotlib and provide a much nicer interface for statistical plots
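
To illustrate the first point, here is roughly what PEP 465 buys you (matrices hypothetical; needs Python 3.5+ and a numpy build that implements __matmul__):

    import numpy as np

    A = np.random.rand(2, 3)
    B = np.random.rand(3, 4)
    C = np.random.rand(4, 2)

    P_old = A.dot(B).dot(C)   # the pre-3.5 spelling
    P_new = A @ B @ C         # PEP 465 infix matrix multiply

    assert np.allclose(P_old, P_new)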


Too bad Python 3 equals Perl 6. :(


If you want people to understand your code, you need to treat it like writing: rewrite, rewrite, rewrite. The language itself is almost irrelevant: you can write horrible code in Python and beautiful code in R.


True, but if you write decent or so-so Python code, it is readable to anyone who knows a C-like language. So-so R code is still impenetrable.

The language matters. One is optimized for ... honestly, I have no idea what R is optimized for. I don't think R works that way. It just is. Python, though, is optimized for readability.


Badly written R is impenetrable. So is badly written Python. I suspect your standards for so-so R code are vastly different to mine.

R and Python really are very similar as languages. R is more functional, but that shouldn't impede your ability to understand code (once you master some of the basic idioms of FP).


I can't actually comment on what is good/bad R code, because I've never seen any R code beyond a few lines that I could follow.


python is optimized for readability to C programmers

R is optimized for statisticians


Actually, no. I have written several modules and packages in R, both native and C++, and it is impossible to write efficient R that is beautiful or even comprehensible.


Hopefully you know the parent was Hadley Wickham, who has very much contributed to the readability and beauty of R. So much so that even Python data scientists are lamenting that their code is not quite as beautiful as the "hadley stack" [1].

FWIW, I've done both Python and R and they both have their pluses and minuses, but beauty is in the eye of the beholder and I've seen some R packages that do, indeed, make R beautiful, case in point, the dplyr package from Hadley Wickham. I'd love to see the future of R be inspired by dplyr and ggplot.

[1] -- http://technology.stitchfix.com/blog/2015/03/17/grammar-of-d...


I don't disagree, but PHP is not the canonical reference for code which cannot be understood; that would be Perl.

PHP is the canonical reference for code which can be understood but not interpreted, because you don't know the values of various obscure global flags.


See, Perl seems easy to read to me, and functions work like I expect, whereas PHP has a bunch of functions where argument order is not consistent.

R isn't horrible, and if you are in the REPL, you can print any function's code you can access, which is awesome.
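
For comparison, Python can do something similar for pure-Python functions via the standard library's inspect module:

    import inspect
    import json

    # print a function's source, much like typing its name at the R prompt
    print(inspect.getsource(json.dumps))

though inspect.getsource raises a TypeError for functions implemented in C, much as R's printed source bottoms out at .Call or .Primitive for compiled code.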


Also, many R packages end up being written in cpp.


Many Python packages are written in C. So long as there's a clean interface, I don't see it as a problem.


There's no problem here; the issue is that the journal reviewers were complaining the code wasn't written in R, which seems strange when non-vectorizable algorithms essentially force code to be written in another language (the same is true for python/numpy - sometimes you need to drop to Cython).
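
A hypothetical sketch of the kind of non-vectorizable algorithm meant here - a recurrence where each output depends on the previous one, so there is no elementwise numpy expression for it, and the Python-level loop is exactly what you'd hand to Cython:

    import numpy as np

    def ewma(x, alpha):
        # exponentially weighted moving average: out[i] depends on out[i-1]
        out = np.empty(len(x))
        out[0] = x[0]
        for i in range(1, len(x)):      # pure-Python loop: slow for large x
            out[i] = alpha * x[i] + (1.0 - alpha) * out[i - 1]
        return out

    y = ewma(np.random.rand(1000000), alpha=0.1)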


Do you mean C++?

(It's a pet peeve of mine. The language is called "C++". "cpp" is commonly used as the suffix for C++ source file names, but the name "cpp" also refers to the C preprocessor.)


The common R package for integrating C++ is called "Rcpp", so calling it "cpp" isn't that strange in context.


Option 2: Make an API with a simple and well-documented interface. Forget about the language BS.


Well, the point here is reproducibility, which requires reading and understanding the code, right?


Thank you.


"A software monoculture at least gives reviewers a fighting chance of spotting issues with software that might affect the validity of results."

If I understand it correctly, this observation shows just what a short distance we've come in programming languages and software development since the 1960s. Verifying the correctness of a program shouldn't depend on good naming or being able to execute it in your head.


> Science has always been a collaborative effort, but I think most fields are in desperate need of greater support from computer scientists and statisticians. Ultimately I would like to see those professions become deeply integrated into all scientific fields, with the expectation that all papers should credit a statistician and (where applicable) a computer scientist.

This is very true, but it will be difficult to realise within the university centred system of research composed of many small labs, none of which can really offer any sort of career path for non-academic professionals.


R is an amazing data language and ecosystem. There just is no parallel in terms of diversity of capability. Python is a second but not a close second.

Our battle should be more with SAS.


I'm torn on this. I've spent nearly the last 20 years in some form or another in academic/research computing.

On the one hand, attitudes like the third reviewer's are a primary reason for the state of HPC today, where new advances in research are just as often finding a way to run 30 year old code on a modern supercomputer as they are writing new software to take advantage of the fantastic array of new hardware. The number of times you hear "sorry, we can't change that. Livermore wrote that 15 years ago, and nobody knows how to change it anymore" is enough to drive a rational person over the edge.

On the other hand, the state of academic peer review being what it is, I don't fault the reviewer for suggesting the paper be resubmitted using more common methodology. A conscientious reviewer has a lot of papers to review, and spends a good amount of time on the ones that aren't written by crackpots. While the author was of course fully within their rights (and may even have advanced the field) to use their own software written in a relatively uncommon language, expecting a reviewer to learn that language well enough to understand the results and methodology is asking a lot.

I'm genuinely unsure as to how I would have responded to the review request. I have much sympathy for both people involved.


The correct response from the reviewer would have been "I am not capable of reviewing code written in Julia, therefore I must decline this review request" rather than "I am not capable of reviewing code written in Julia, therefore I must recommend rejecting the paper".


But if the paper is not about the code and the code was just supplemental (as the article states), why should the language of the code have any relevance? If the author had declined to submit code, the author's paper might have sailed through just fine.


Have you submitted to academic journals before?

Reviewer comments are not a list of absolute requirements; they're thoughts and suggestions that help both the authors and editors improve the manuscript. They will include a recommendation for publishing, and they can make that recommendation contingent on particular issues, but that's somewhat unusual.

It's not completely clear what's happening in this case, but I'm strongly inclined to think this was a 'minor comment' from a reviewer.

   The three reviews were helpful and constructive, but these two comments infuriated me.
I think the author is simply taking exception to these comments because they reflect prevalent attitudes, not because they significantly contributed to the editorial decision.

It's very common to address reviewer comments without actually changing anything. Basically, you say you disagree for reasons X, Y, and Z. The editor can disagree, ask for further clarification from the reviewer, or simply accept the argument. Nothing is set in stone.

As for the implication that there are other reviewers waiting in the wings with suitable experience, good luck.


Yes, I have some publications, and am familiar with the process.

I was attacking a theoretical review to make a point I wanted to make. The blog post doesn't say the reviewer recommended rejection on the basis of implementation in Julia, or that the editor's decision cited the language choice. My point was that if those things were true (which is not clear from the post), that would be bad. I make that point because in my experience, it's not uncommon for reviewers to take similarly unreasonable positions.


>Reviewer comments are not a list of absolute requirements;

Where are you publishing? I have had reviewers block publication if certain experiments were not done or things were not done a certain way.


Are you some sort of supreme authority on such matters?


You are being downvoted because the charitable interpretation of the parent comment is: "Based on my experience and understanding, I think the best response from the reviewer would have been ...".


No, but he does sound right. Rejecting a paper because of a made up standard based on brand preference is stupid.


Where does it say that this is the issue that caused the rejection? It sounds like a minor comment among other helpful comments, as the author stated.

I got the strong impression that the editorial decision was based on other factors, with these comments only being minor ones.


I just find it very, let's say, funny that he (or she!) simply declares what the correct procedure is. One must wonder if he/she is some sort of an arbiter in such cases.

Also, it was rejected with an invitation to resubmit, which makes quite a bit of difference. Especially when you take into consideration that the reviewers explained in the rejection why they couldn't publish it.


These were the same types of people that said to me "don't use R, no one uses it" several years ago.

I know there must be a balance between the bleeding edge and stability, but if researchers cannot use new tools, then there will not be any progress at all.


That's true. In my field, there's plenty of Matlab, and it's arguably not even the best tool for the job: someone started using it, people improved on the work, and now no one cares about rewriting it.


Why did R take over the world of statistics anyway? I remember about 8 years ago I had an interest in statistics. Everything was SAS, SPSS, etc., which I just didn't have the budget for... I happened to find R as it was freely available for Linux; the impression I got was that no one used it, aside from some limited use in academia, and certainly no professional usage.

Fast forward, and it seems like the most widely used tool in computer data analysis. Did something fundamentally change?


The biggest boon for R was the rise of "data science". Data scientists have actually existed for decades, but they had different titles (quants, actuaries, quantitative researchers, etc) and came from fields that preferred enterprise software for various reasons.

Data scientists, on the other hand, largely rose from tech companies that hired software engineers and preferred open-source software. When I started in the field ~10 years ago there were very few reasonable open-source options outside of R. Matlab and SAS were prohibitively expensive, Octave was too immature, Python didn't have basic functionality like data frames. In short, it was our only viable option, so we used it. A lot of people started using it for the same reasons, so R's library support became the best. This is what keeps me semi-locked into R - I'll probably eventually move to Python or Julia, but R hit the critical mass first and in a huge way.


In the world of Bioinformatics, R has been the tool of choice for a long time due to packages like Bioconductor [1], which has been around since 2001.

----

1. http://www.bioconductor.org/


R's library support predates your timeline.


The most impactful, important packages I use were all developed after 2005: ggplot2, plyr, glmnet, gbm, reshape, knitr. Out of curiosity, which ones are you referring to?

edit: I may be using the word library imprecisely here, which may be causing miscommunication. I mean 'packages', to be clear.


Prior to those packages, R was the language in which most new statistical methods would be published. So it has had very good library support for statistics for a long time. Only more recently has it also gained excellent data manipulation support.


Which kinda violates the principles expressed by the writers of said open source project, but I'm all for it, on the basis that never again having to use base reshape, stack, or gsub would be a good thing.

I believe An Introduction to R specifically notes that one should use other software to provide R with appropriate data.


glmnet, plyr, knitr, and ggplot2 were all developed by academic statisticians (some as grad students who have subsequently moved out of academia.) They didn't choose R because of the reasons you listed, but because it was already popular among statisticians, along with S-plus. (This is a bit of an educated guess, I haven't directly asked any of these people why they chose R. Maybe one of them will chime in!) ggplot2 and knitr are extremely well-executed extensions of ideas that had existed in R before: lattice and sweave; and lattice and ggplot2 are both fairly direct implementations of ideas developed in quasi-academic statistics by Cleveland and Wilkinson. I don't think I'm selling anyone's contributions short, just pointing out the packages you're talking about came out of academia and built on ideas and implementations that existed for at least a decade before your timeline.

But the bigger point I was making is that CRAN (and, as pointed out by bbgm, Bioconductor) and R's package system had been around for a long time before that. And, unfortunately, they were built with _very_ little input from software engineers. :)

edit: incidentally, this is a case where the opensource ideas really help a project. SAS and Stata don't exactly encourage any motivated grad student to rewrite their graphics interface or report generation tools.


At the time I wrote ggplot2, R really was the only game in town. But I strongly believe that R has excellent features to support data science: non-standard evaluation, NAs at a very low level, data frames, first class functions, ...


Thanks for sharing that. Much more informative than my speculation :)


Packages.

They can be built by a single prof/grad student.

They can be used by anyone based on R's package-manager GUI.

Research in most fields is going to move more quickly than any software team can add functionality to monolithic programs like SAS. Therefore, most new functionality becomes available only in R. Users - people who want to run specialized-domain-function-X on their dataset - are best served by learning R, then using their colleagues' packages.


I don't know much about SAS or SPSS, but do they really not have a usable package/module system for code reuse? Or is it just not as good as R's?


I don't know about the package system for SAS or SPSS (though I am aware that papers get published with code for said platforms), but I think the bigger barrier is availability; R is free and open.


SAS a bit, but as another comment notes, nothing like CRAN. SPSS not at all; at best you could publish some sort of script.


Agree entirely. Academics led and industry followed.


R won because of

1 - very high usability, particularly (in contrast to Python) for users who aren't developers, don't think like developers, and don't want to be developers

2 - very high expressibility

3 - it's free, so you can install it on lab computers and your laptop and everywhere else without forking out tons of money (see the price of matlab)

4 - a generation of grad students used it and it, in turn, was taught to a generation of undergrads

5 - academic statisticians tend to like it and tend to implement stats techniques in it, so there's a wide body of shared code on CRAN implementing most analyses you can think of


It's funny that you mention usability and expressibility (expressiveness?) as R's upside compared to Python. I've used both extensively, and in my experience, R is horrendously bad at both, and Python blows it out of the water.


1) It's free.

2) CRAN, nothing like it for SAS & SPSS.

3) Early adoption by book authors. A whole generation of data scientists were taught in R.


I remember back in college that Microsoft was pretty much giving away student editions of all their software. I think they realized that if students, who are invariably on a budget, couldn't get access to MS software, they'd learn to use the FOSS replacements instead, and by the time they graduated and entered the workforce they would know the free stuff better and start spreading it. I guess SAS and SPSS didn't have any similar kind of strategy.


You are guessing wrong.


Fair enough, I wasn't heavily into stats programming at the time, so I wouldn't necessarily have heard of such a program from SAS or SPSS.


To expand on the free point - it's not just that SAS and SPSS have a price tag. You're then expected to pay for a load of bolt-on modules. The extreme case: if you want to do decision trees in SAS, they've made a commercial decision that you must be doing some kind of 'Enterprise-Grade Data Mining', and therefore need to pay their 5-figure price tag rather than their 4-figure one.

I think this has been a big barrier to any sort of package ecosystem rivalling CRAN. Even if your entire target audience has SAS or SPSS, a new package that depends on one or more of those modules is going to cost some of your prospective users hard cash to run. Whereas R just installs the dependencies for you.


I think you answered your own question: "...SAS, SPSS etc which i just didnt have the budget for... I happened to find R as it was a freely available..."


1) compatibility with S-Plus, which was widely used by statisticians long before R

2) developing new statistical methods and other innovations (knitr, etc.) is much easier in R or S-Plus than in SAS, SPSS, etc., so R improved a lot faster than the others.


People's attitudes towards open source changed, no doubt thanks to vendors like IBM, Oracle and others, who embraced R (and open source as a whole).

Also, RStudio has made the entry barrier to R very low...


Agreed on RStudio being a factor. For me it's all about packages and RStudio. I totally believe some other languages are better in general (Python), but RStudio makes a difference at work, as I can quickly show co-workers charts, data views, and Shiny apps.


The author writes that the software wasn't the focus of the manuscript and that the reviewers should not have commented on the software at all, but that is actually not true. If you look at the preprint manuscript, you'll find that the software is prominently advertised in the abstract of the paper. In this situation, it is perfectly reasonable for the reviewer to comment on the implementation of the software. And as much as I like Julia, I think he is right when he says that an R package would be more useful.


"...it can be written in lisp or cobold for all I care"

The typo is too good. Cobold is the sprite that plays its tricks on us even 56 years after its first appearance.


I assumed it was intentional.


That makes it even better.


The only reason I still use R is the hadleyverse packages (namely dplyr and ggplot2). R's base packages are incredibly terrible and counterintuitive, especially compared to Python's pandas and scipy.


My non-academic opinion is that the biggest problem in science is that there is little incentive for scientists to reproduce each other's results. Anything that makes it easier for other labs to rerun an experiment is a good thing for your field overall. The goal of a paper should be to communicate as plainly as possible how to reproduce the experiment. That probably means a good-enough standard programming language that everyone learns in school and uses in their work.


Well, I think it should communicate how to reproduce the experiment in a lab with the same or a similar setup, not how to reproduce it with all kinds of different setups. If a lab does not have the equipment to reproduce a certain experiment, that's not the problem of the original authors.

Also, I think you should choose equipment which best suits the experiment, not other people.


>If a lab does not have the equipment to reproduce a certain experiment, it's not the problem of the original authors.

While I think that's true, a paper can potentially be much more impactful if its results can be easily understood by others, and isn't that really the point of science? In that sense, it's better for the author to stay away from niche methods and tools unless absolutely necessary.


I agree, unless it's something that can't be done well in the current language. Or if, like me, you're an economist and the default language used by academics is horrible, horrible Stata, which needs to be replaced by R.


For me, both are horrible. I use Stata a lot and have thought a lot about switching to R (there is even a nice tutorial by Matthieu Gomez at http://www.princeton.edu/~mattg/statar/ ).

However, I believe that moving to R is not the answer, as you will still be riddled with a lot of the problems you faced in Stata, such as a quirky syntax (except backticks. I hate those backticks). Maybe Python will catch up (I doubt it; what we need is a DSL, not something so general that it requires long commands to do a simple regress). My best bet would be Julia, but it still has a long way to go regarding things like missing values.


If you're a macroeconomist, the default is Matlab. I'm not sure which is worse.


I agree, but I still think the idea/algorithm/etc. should be conveyed in sentences and math in the body of the paper. The code should be purely optional and not associated with the paper. I think the problem here is that the author is referencing code in the paper, which puts the onus on the journal to evaluate the code. Just put the code on your website. If people want it enough, they will google you.


Look at the Reproducibility Project: https://osf.io/ezcuj/


This is really just an example of bike-shedding.

Reviewing research papers is really, really hard work to do well. It takes a massive time commitment to truly digest and understand novel work to the level where you can provide a quality critique and review.

It's much easier to throw out facile critiques like the ones these reviewers provided. It shows that you know your stuff and did your job as a reviewer, without requiring you to actually understand the authors' innovations or lack thereof.


You are implying that the rest of the review, which we haven't seen, was useless and shallow. However, the author himself said that the reviews were actually "helpful and constructive". I agree that cases of bike shedding exist, but there is no evidence that this was the case here.


I think the author is being overly sensitive here, and the reviewers are actually giving good advice.

Most methods papers in most fields are ignored by the vast majority of researchers. Just like the author doesn't have the time to rewrite his code in R, his prospective audience doesn't have the time to learn a new software package just to try out his proposed method.

If the author actually wants to change the way research is conducted in his field, he needs to make it easy for others to try his method out and possibly change the way they do research. As of today, that means an R package.

Or, the author can dig in his heels, refuse to write an R package, and be ignored. Maybe that isn't fair, but as I've learned through my own experiences, that's the way it is.


I think it's just that R has become the de-facto standard since the previous tools used were either terrible, or closed source and expensive.

And it's probably taken them so long to adjust to R, that they don't want to change again any time soon.

And you're right, the tool shouldn't matter, and I definitely agree with your stance on open source, but keep in mind those attitudes do exist. Speaking of attitudes concerning open source, here's one of my favourite blog posts about the subject: http://www.catuhe.com/post/Le-syndrome-du-Puppy.aspx


Even the language we use to talk about arithmetic, algebra, and calculus had to--at one time--be agreed to and standardized, so that mathematicians could have a common ground on which to speak to each other about their ideas and how they reached them.

The program is not just "a tool". It's your proof. You may have written a proof in the paper, but that was just a practice run. It's the one you wrote in the code that is the real proof, because there is little guaranteeing that even seasoned software developers write code that matches the spec, to say nothing of "amateur" programmers in the sciences.

Now, that does not preclude innovation. People still invent new mathematical notation systems, usually with the express purpose of solving problems that can't easily (or at all) be solved in current systems. It is the bleeding edge of the science of math. But physicists are still expected to use the math their colleagues will understand, or spend a lot of time explaining their new system (QM, anyone?).

There is, of course, the issue that math is significantly less rigorous and specific than code. Math doesn't compile and doesn't run. Or rather, it gets compiled and run by people. Part of avoiding new, arbitrary notations is so that your work is accessible for verification. That gives running code, especially code with a comprehensive set of unit tests, a distinct advantage over math. But that doesn't obviate the need for verification.

If you are using any nontrivial piece of custom software in your science, it needs to be scrutinized just as much as the rest of the math in your paper--if not more so, since it does the actual work.

The first reviewer is almost certainly right. If they can't understand your code, they aren't the equivalent of lay users complaining about open source software. First of all, it is still an issue of much debate whether releasing source code for software products to users is necessary. But if you are writing code for your science, you have an absolute duty to open source that code (even if in a non-extendable way; we still need to see the code). You wouldn't make a claim without providing the math. And you wouldn't provide math other people couldn't understand, arguing "some math is better than no math".

I don't think that standard should be R, or that it necessarily has to always be the same thing. It can adapt over time. But until you manage to release a few papers on how Julia improves over R, in explicit detail... "when in Rome, do as the Romans."


Does this story tell us more about R as a standard, or more about the refereeing process? I've dealt with referees for a long time and I'd have to go with the latter. This is to me yet another case of a referee viewing his/her preferences as the only correct way to do things.


Peer review is a huge part of our culture here at Climate. It's often unreasonable to expect a single reviewer to review a complex paper, so we sometimes address this by assigning each reviewer a domain.

Something like:

Reviewer 1: Review scientific theory (domain expert)

Reviewer 2: Review scientific methods and conclusions (senior scientist)

Reviewer 3: Review code for scientific accuracy (senior programmer)



