What's Next for R? (qz.com)
189 points by carlosgg on Dec 27, 2019 | 66 comments



I would highly recommend the package data.table over tibble or the basic data.frame if you are doing any kind of modeling in R with larger datasets. Yes, R has many data structures, but knowing how to use data.table will blow your mind in terms of efficiency. Matt and the other contributors have built something extremely fast and flexible.
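
For anyone who hasn't used it, a minimal sketch of the data.table idiom (the file and column names here are made up):

  library(data.table)

  # fread() is data.table's very fast file reader;
  # "policies.csv" and the columns below are hypothetical
  policies <- fread("policies.csv")

  setkey(policies, state, vehicle_class)             # sort/index in place
  policies[policy_year == 2019,                      # i: filter rows
           .(avg_premium = mean(premium), n = .N),   # j: aggregate
           by = .(state, vehicle_class)]             # by: group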

I get that R is not for everyone but used correctly it is a beast.

Now this is anecdotal, but in the insurance industry we have what we call on-level premium calculators. It is basically a program that will rerate all policies with the current set of rates.

Our current R program can rate 41,000 policies a second, fully vectorized, on a user laptop with an i5 from 2015.

In contrast, the previous SAS program could do 231 policies a minute on a 64-core Xeon processor from 2017.

For our workload and type of work, R has been a godsend.

Bonus: we can put what our data scientists develop in R directly into production (after peer review, testing, etc., no different from any other production code).

Back when I started in 2005, we modeled in some proprietary software like Emblem, used Excel to build a first-draft premium calculator, rebuilt the computation in SAS for the on-level program, and sent specs to IT to rebuild the program yet again for production. All three had to produce the same results.

I've tried Python, Go, Rust, and Julia. I'd say Python could be a good alternative, but the speed of data.table, the RStudio IDE, and the ease of package management make R an obvious choice for us. I believe Julia to be the future, but so far the adoption rate in-house has been low.


As someone "fully fluent" in both, for many workflows that can be properly implemented in SAS, you would expect, on a technical level, that the SAS program could be faster. It's a fully compiled language, it's a "simple" compilation model (compared to R), and the interaction between incremental compilation and the macro system allows you to do some really good blurring between run time and compilation when performance matters. Add to that the fact that you can define both SQL and data step views to further minimise disk reads/writes, use database pass-through on certain procedures, and allow for in-memory operations (like R) with the sasfile command, and from a purely technical point of view an experienced user of both should be able to beat R in SAS.

But... and here's the big but... I almost never meet anyone these days capable of putting all these steps together in SAS who actually understands the SAS computation model end to end.

And SAS's strength, a computation model not limited by memory by default, becomes a performance weakness when everyone reads/writes every step out to disk and programs without understanding all those little intricacies. SAS hasn't helped any of this by trying to move its ecosystem away from "programmers" toward "application users", so now "programmers" can pick up an interpreted language like R with in-memory, vectorised operations by default and beat SAS.

Course, I'd still recommend places move to Python/R these days because of the broader ecosystems, the university talent pool, and avoiding the extensive lock-in of proprietary software, but I still feel I have to reflexively respond to "R faster than SAS" claims :p


Believe me, I know. The code just becomes unreadable when you put all the execution inside the same data step and use hash tables to do fast small-to-big merging. And that's not to mention debugging that mess when you have a macro layer on top of it. Not having access to function source code, the installation process being what it was... I do not miss it.

And yes, technically SAS is faster than R, but part of the equation is how many people can actually make SAS code faster than R/Python. I had maybe 1-2 people who could write efficient SAS code.

One version we had was a bunch of macros producing hash merges, plus the whole "how can I do this without having to get out of the data step" exercise. Just horrible. The limit on the number of characters in a line of code? You forgot your quote somewhere and now you have to run the magic line.

I hope I'm not too emotional when I say I hope SAS disappears from my industry and we embrace less adversarial licensing.


I don't think that's being emotional at all.

I'm being emotional when I say I have a soft spot for it because of some nostalgia and occasionally dropping in for some "rock star" programming moments with it. But that's the opposite of what I'd want if/when I were running my own ship.

I too almost always try to steer myself and others away from it now because of the licensing/customer hostility. It's absolutely ridiculous...


Do you have any resources that help explain these SAS performance measures? A book perhaps?

I have been trying to help with exactly this (and your breadcrumbs help), but it is tricky for me since I am used to the open source/*nix environment, where you can use very different tools and where information and tutorials are distributed much more widely.


Unfortunately not. With SAS I never used books and relied solely on having access to the fully licensed system at a previous job and all of the SAS PDFs floating around the internet and findable with specific searches.

That combined with a general computer science background and you can start to put the whole thing together.

I'd be lying if I said I hadn't considered writing one, but at my age I'd honestly ask why write one for an old proprietary system and make business for someone else when, if I ever go back long term, they can pay me an exorbitant amount as a consultant. Might as well start writing 'The Dark Arts of COBOL' :p


Just use https://diskframe.com and you will not be limited by memory!


For larger-than-RAM data I would recommend diskframe.com

It uses dplyr and data.table syntax to manipulate data on disk
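
A rough sketch of the workflow as I understand it from the disk.frame docs (big_df stands in for any large in-memory data frame; treat the details as approximate):

  library(disk.frame)
  library(dplyr)

  setup_disk.frame()            # start background workers for chunk-wise work

  # big_df is a placeholder for an ordinary (large) data frame;
  # as.disk.frame() writes it out as on-disk chunks
  dfr <- as.disk.frame(big_df)

  dfr %>%
    filter(year == 2019) %>%
    group_by(region) %>%
    summarise(total = sum(premium)) %>%
    collect()                   # only the aggregated result comes back into RAM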


I've not used diskframe.com, but from experience I can recommend the 'fst'[1] file format with 'fsttable'[2] for reading on-disk data tables.

[1] https://github.com/fstpackage/fst

[2] https://github.com/fstpackage/fsttable
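
For reference, the core fst API is tiny (fsttable adds a data.table-like interface on top); a minimal sketch:

  library(fst)

  # write a data frame to the compressed, random-access fst format
  write_fst(mtcars, "mtcars.fst")

  # read it back, optionally only selected columns and row ranges
  read_fst("mtcars.fst", columns = c("mpg", "cyl"), from = 1, to = 10)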


disk.frame uses fst as the underlying format


Thanks, so far we have just scaled up our VM RAM, but I might find a use for it.


This may be useful. I prefer dplyr's syntax. https://github.com/tidyverse/dtplyr
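
Roughly, dtplyr lets you write dplyr verbs that get translated into data.table code; a minimal sketch:

  library(dtplyr)
  library(dplyr)

  dt <- lazy_dt(mtcars)   # verbs are translated to data.table lazily

  dt %>%
    filter(cyl == 6) %>%
    group_by(gear) %>%
    summarise(mean_mpg = mean(mpg)) %>%
    as_tibble()           # forces evaluation and returns a tibble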


One of the reasons we use data.table is that it reduces the dependencies when building custom images, and its stability has been better than the tidyverse's in the past. That might not be the case in the future, but that is how we made our choice initially.


> I believe Julia to be the future but so far the adoption rate in house has been low.

Why do you believe it will be the future, and what do you see as the barriers to roll-out? I ask as someone who is curious about when/whether to start investing in Julia competence


> We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled. (Did we mention it should be as fast as C?)

https://julialang.org/blog/2012/02/why-we-created-julia

I've been playing around with it. As a Python/MATLAB guy, the syntax is very friendly. I can see it displacing Python in production code where you need speed and might avoid some of the heavy Python DS libraries. Overall it seems like a thoughtful combo of a lot of good numerical programming features.


Debugging is something I could not do as easily in Julia vs debugonce and trace in R. Compiling takes time.
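
For anyone unfamiliar, those are base-R debugging hooks; a minimal sketch:

  f <- function(x) sqrt(x) + 1

  debugonce(f)   # step through the next single call to f() in the debugger
  f(4)

  # or inject code at function entry without editing the source
  trace(f, tracer = quote(print(x)))
  f(9)
  untrace(f)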

But so far we have seen great development. Flux is a truly beautiful ML library. Being a compiled language removes a lot of headaches when building production images. The syntax, the full UTF support in variable names. Package management is great. Having that abstraction layer between CPU and GPU so you don't have to rewrite code. Dispatch based on signature, type management. I don't see it going away soon. It took me 13 years to get them to transition out of SAS; good thing cloud computing came around and someone realised the clusterfuck of having to manage SAS licences in the cloud.


Both R and pandas force you to wrap your problem around dataframes and vectorized operations. But sometimes you really do just want to write a loop that iterates over the data.

Right now the only way to do that without significant performance costs is to drop down into C or avoid the problem completely by using Julia.

Having worked with both R and Python on large datasets, I think both languages are really easy until they aren’t. Eventually you hit a performance wall.
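
For completeness: in R, "dropping down into C" is usually done through Rcpp these days. A minimal sketch of a hand-written loop with an early exit, the kind of thing that's awkward to vectorise:

  library(Rcpp)

  # compile a C++ loop and expose it as an R function
  cppFunction('
  double cum_product_cap(NumericVector x, double cap) {
    double acc = 1.0;
    for (int i = 0; i < x.size(); ++i) {
      acc *= x[i];
      if (acc > cap) return acc;   // early exit, hard to express vectorised
    }
    return acc;
  }')

  cum_product_cap(runif(1e6, 0.9, 1.1), 1e6)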


You can increase the speed of loops in Python using Numba. It's really a great performance booster with just a few decorators added.


You can drop down into the Numpy values array in Pandas to get your performance gain when iteration is otherwise slow.


I'm so thankful for R, its community, and their great libraries! I've built an eight-year (so far) career in data science using R to model data and perform experiments. I love R's functional programming style / dplyr, which makes manipulating data a delight. ggplot2 is such a great plotting library, well worth the investment to learn. Then there are all the stats tools like glm, MASS, through brms for advanced Bayesian analysis (https://github.com/paul-buerkner/brms#brms). With R and Python, it's a great time to be a statistician-programmer!
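
For anyone new to R, a small taste of that dplyr/ggplot2 style on a built-in dataset:

  library(dplyr)
  library(ggplot2)

  mtcars %>%
    group_by(cyl) %>%                          # group by cylinder count
    summarise(mean_mpg = mean(mpg)) %>%        # one row per group
    ggplot(aes(x = factor(cyl), y = mean_mpg)) +
    geom_col() +
    labs(x = "Cylinders", y = "Mean MPG")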

I recommend folks looking to start with R check out: https://r4ds.had.co.nz/


There is also "Advanced R" by Wickham, which goes into more technical detail on how the language itself works (data structures, etc.).

It is also available for free.


I cannot comment from personal impressions, as I have almost zero knowledge of R compared to several years of using Python for writing apps and working with data. I like R's focus on functional programming, though.

However, a couple of years ago, my wife tried to transition from business consulting to a data analytics / data science role. She started with taking an R course. She was put off by R's complexity and the course's early focus on the details of R syntax, function definitions, closures etc. and abandoned it.

The year after, she decided to try again and enrolled in a course that used Python (with numpy+pandas+scipy as data science stack) and she reported it to be much simpler, more intuitive and easier to learn compared to her previous experience with R. Now she has successfully completed the program and is employed as a data analyst.


We ran into this problem often teaching R when I was in grad school. However, in the years since, the tidyverse has put a strong focus on getting users manipulating data even if they don't yet know how to define a function, etc.!

Here's a useful post comparing the classic approach you mention to an alternative:

http://varianceexplained.org/r/teach-tidyverse/


R has a number of features that are intended to facilitate interactive use, which, despite being very convenient, can be confusing to someone who is trying to learn the language. With Python, on the other hand, it is easier for a novice to build a mental model of how things work. However, Python is pretty awful as an interactive language due to the way it interprets whitespace. Personally I think taking the time to learn R is well worth it.


I guess that's more an issue with the courses than the language per se. Sometimes it is a good idea to begin the course with direct application, instead of focusing on the language.


I have encountered a lot of really terrible R learning materials. One data viz course I took (a very, very reputable and widely-used course on a major MOOC platform) taught how to make several simple chart types in each of base R, a library called lattice that I've never encountered since, and ggplot2. I think a lot of it comes from R instructors who started out back before the tidyverse trying to teach the path they _took_ to learning the language, rather than the quickest path to being proficient in the language as it exists today.

The tidyverse is incredibly controversial in parts of the R community; it's essentially an opinionated set of packages that basically comes with its own "standard" library. But I think that wholeheartedly embracing it, and hiding the way to do things in R that you would do them without the affordances that the tidyverse offers, is absolutely the right way to teach R these days. Unfortunately, a lot of courses and books haven't caught up to that yet.


You only have to go through the learning process once. You are able to use the language for a lifetime. I find it so strange how much emphasis we tend to put on things being simple to learn and pick up.


Because if things aren't simple to learn and pick up, people will get discouraged and move on. As was the case in the comment above.

Great documentations and tutorials go a long way.


The most rewarding things I've learned in life have not been easy by any stretch of the imagination.


What's Next for R?

Doing the exact same thing we did before!

We have a new library called "dtplyr" (no, seriously!). It is designed to save users from the arcane and obtuse sides of R by combining the power of "dplyr" and "data.table", the two libraries that were designed to save users from the arcane and obtuse sides of packages such as "data.frame" and ....

I wish I were kidding. There is the absurd contention in the R world that by introducing yet another weirdly named package people can avoid having to learn and suffer through the "real" R.


I started at a company using Shiny for their applications and R as part of their data pipelines.

A huge pain point for us is the packaging system. It is absolutely awful. Packages constantly get overridden, so we have to install packages in a specific order. Whenever I have reached out to the community (including prominent members who have written R books), I have always been told to just use the latest version of all packages and get on with it, which, as anybody knows, isn't always possible, especially as there are constantly breaking API changes.

I understand R’s history and that, in general, it is a lot better than it used to be, but I would only recommend R for notebook-style work and would keep it well away from production.

We have migrated to Python, which isn’t perfect, but the difference in logging and packaging has been night and day.


I have also found R in production to be a nightmare. On packaging, the renv package seems to be the new way to try to manage things. It’s not perfect but seems to be a step up from what was around before. Have you tried it out at all?


I haven’t, thank you for the suggestion. I will give it a go.


At my old work we would “freeze” CRAN by downloading a complete dump of everything and setting up R to install from that snapshot instead of the online version, as a way of version-controlling packages.

So instead of defining our app to use version 1.4.5 of a package, we would use “latest version from 3rd of May”.
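
A minimal sketch of that kind of setup; MRAN's dated CRAN snapshots are one publicly hosted frozen mirror (the date below just echoes the example above):

  # install everything from the CRAN state as of a given date
  options(repos = c(CRAN = "https://mran.microsoft.com/snapshot/2019-05-03"))
  install.packages("data.table")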


Same experience here.

A lot of packages/functionality are not available in Python, however.


Disappointed in the lack of discussion of R-Shiny or Plumber.

R-Shiny is a full stack platform for web apps, and it’s how I leveraged my data science background to get into web development. It’s incredibly powerful in my opinion, with the only obvious limitation being the speed of R itself.

And Plumber. It’s become the de facto method for deploying R code behind a REST API. It too is still maturing, but I see it eventually becoming the Flask of R.

Truth be told, however, after developing quite a few projects on the Shiny/Plumber stack, I wouldn’t recommend anyone do it.

If for some reason you can only have an R interpreter, go for it. But learning multiple languages really is the best solution if you want to manage efficient applications. I say this, however, realizing that all of my colleagues writing R don’t have engineering backgrounds.

I can’t help but feel like R is like JavaScript in many ways. Ease of use and the ease of publishing packages very quickly clutters the repository.

R will always have a special place in my heart, after all it’s the language that made me discover programming. However, I can’t help but feel that my thirst for efficiency is making me outgrow it as a language quickly.


I understand where you are coming from and have had a similar experience.

After learning a fair bit of web-development, I feel R should focus on an analytics oriented path.

R just isn't designed for a web app. Web apps are much better and faster to develop in more focused languages/frameworks (Node/Python/Django/Express, etc.), and those can be seamlessly integrated to leverage R modules/scripts.


On the Shiny note: check out Streamlit, a declarative Python equivalent. It's pretty incredible how easy it is to use.


When I used R in university (I majored in Applied Mathematics and Statistics), I was always awestruck at how every sort of novel modeling technique, from GLMs to beta regressions to GARCH, is easily accessible for free, with proper academic papers and documentation, and with cohesive, standard support.

It was really useful to be able to apply most theory I was learning to actual research datasets. This is what I miss the most since moving to Python.

What I don't miss is R's terrible packaging system and how it made collaborating with colleagues near impossible. I can't count the number of times I had to debug dependencies in others' scripts just to be able to move forward with some team project.


What didn't you like about the packaging system? Even if you hate R the language, R has among the most user-friendly, cross-platform packaging systems I'm aware of.


https://stackoverflow.com/questions/10947159/writing-robust-...

Historically, the conventional way to write R code was one that tended to result in shadowed names (and hence brittle code).


R has actually come a long way on the environments front. Check out “renv” from the good people at RStudio.

Link: https://rstudio.github.io/renv/articles/renv.html
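
The day-to-day renv workflow is essentially three calls; a minimal sketch:

  renv::init()      # create a project-local library and renv.lock
  # ...install and use packages as usual...
  renv::snapshot()  # record the exact package versions in renv.lock
  renv::restore()   # recreate that library on another machine or CI image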


I used R when I took an online course on data analysis. I didn't like it at all. Its syntax is weird and painful to read. The only nice things about R are the tidyverse and ggplot. I found Python to be a better alternative. You can use pandas for data analysis and EDA, Matplotlib and Seaborn for plotting, and scikit-learn for training your models. An additional benefit is that Python is a general-purpose language that you can use to build a complete application.


In almost all of the use cases you mentioned, R blows Python out of the water.

Working with data frames in R is much, much more convenient than pandas (loc, iloc, etc.?).

Plotting is an obvious win for R. Matplotlib is horrible; it's powerful, yes, but it is an absolute pain compared to ggplot.

Scikit-learn is definitely unmatched, but caret is not so far behind. Also, R has a plethora of implemented models that Python lacks (from something as basic as decent quantile regression to time series analysis tools).

As for building a complete application, Python is indeed the go-to.

Syntax-wise, using magrittr's pipes is an absolute pleasure. Good luck doing that with Python.
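
For readers who haven't seen it, the pipe style being referred to looks roughly like this (dplyr re-exports magrittr's %>%):

  library(dplyr)

  mtcars %>%
    filter(hp > 100) %>%            # keep rows with more than 100 hp
    mutate(kpl = mpg * 0.425) %>%   # derive km per litre from mpg
    arrange(desc(kpl)) %>%
    head(5)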


Just as an FYI: the statsmodels Python package just released numerous new time series tools in version 0.11 rc1 [1] and also has functions for quantile regression [2].

[1] https://github.com/statsmodels/statsmodels/releases [2] https://www.statsmodels.org/dev/examples/notebooks/generated...


His #1 requirement was not being a painful language, and nothing but not being R can satisfy that.

I use R everyday for statistical analysis due to it having certain interfaces and I still hate it every day.


Exactly. I initially liked Pandas, but then I discovered what I can do with data frames in R, visualizations with ggplot, and the SQL-like data manipulation using dplyr w/ pipes from magrittr. R may have the steeper learning curve -- and for certain uses, be inferior to Python -- but it's a wonderful language.


The things currently missing are a better LSP (Language Server Protocol) implementation[1] (the current one supports only some of the LSP features), better linting[2] and static analysis, better integration with GitHub[3], and so on. Mostly on the tooling side, I believe.

[1] https://cran.r-project.org/web/packages/languageserver/readm...

[2] https://github.com/jimhester/lintr

[3] https://github.com/github/semantic/issues/382


I also got excited when I found out about R Markdown and how well it is integrated with RStudio. I believe it is a decent alternative to Jupyter Notebook.


I hope a hospice. Ugh that language has damaged me worse than Perl.


I know this is a dead horse, but I think R seriously shot itself in the foot with its data structures[1]. I don't really see a solution for this, as fixing it would never be backward compatible. I'll always pick Python over R because the data structures actually make sense to me as a programmer (objects that look like lists, dicts, matrices, etc. or any combination of the above, and they all behave in very predictable ways). I think this puts off a lot of other people like me.

[1]: https://jamesmccaffrey.wordpress.com/2016/05/02/r-language-v...


True, the default semantics of R's data structures are somewhat arcane (naturally, as they're based on S [1] from the '70s). And the current support for e.g. 64-bit integers leaves something to be desired.

But behind the scenes, R is just a lisp with some data structures that are adapted to statistics and data science.

All base data structures are immutable by default. And the vector type, for example, is extremely performant, as it's just a thinly wrapped C array. In Python you need to reach for NumPy for anything similar, and you do feel some pain when converting between native Python types and NumPy types for the various functions that support one or the other.

The data frame is immensely powerful, and it has excellent performance characteristics because it's built on vectors. A list of objects, like you'd make in Python, is just a lot slower and more unwieldy to deal with, and much harder to write generalizable functions over.

Hadley Wickham's tidyverse[2] is exactly an attempt to hide away the arcane details and create a modern, coherent, and consistent language on top of R while keeping the power of all the great statistics libraries. The fact that R behind the scenes is a Lisp, with support for macros, makes this possible. For doing data transformations and statistics, I can't think of anything currently as powerful as CRAN + the tidyverse.

[1] https://en.wikipedia.org/wiki/S_(programming_language)

[2] https://www.tidyverse.org/


In typical Lisps a vector would be a one-dimensional array, which by default is not specialized to a particular data-type. So the most general data type would be the n-dimensional array and a vector would be a one-dimensional array. A matrix would be a two dimensional array. In Common Lisp one can also ask Lisp to generate a type-specific array (like a string, a bitvector, an array of single-floats, ...).

In R it's slightly different. The vector (being generally without dimensions) is the base data type and n-dimensional arrays are made of a vector and dimensions. A matrix is then a 2d array. Also vectors/arrays are by default type-specific.
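
You can see that relationship directly in a session; attaching a dim attribute to a vector is what makes it an array/matrix:

  v <- 1:6          # a plain, dimensionless integer vector
  dim(v) <- c(2, 3) # attach dimensions: it is now a 2x3 matrix
  is.matrix(v)      # TRUE
  is.array(v)       # TRUE
  typeof(v)         # still "integer" -- the underlying storage is unchanged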

> support for macros

From what I've seen, R does not support macros, but functions which can retrieve/generate code at runtime. That's an early mechanism which got replaced by macros in Lisp. Macros in Lisp are source code transformers and can be compiled - thus they are not a runtime mechanism like in R or earlier Lisps with so-called FEXPRs.
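
A small illustration of that runtime mechanism in R: a function can capture its argument unevaluated with substitute() and decide later whether and where to evaluate it.

  f <- function(x) {
    expr <- substitute(x)            # capture the argument as an unevaluated expression
    cat("you passed:", deparse(expr), "\n")
    eval(expr, parent.frame())       # evaluate it only if/when we choose to
  }

  f(1 + 2)   # prints "you passed: 1 + 2", then returns 3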


This 5-minute video by Wickham was eye-opening for me regarding the Lispiness of R.

https://youtu.be/nERXS3ssntw


Modern Lisps don't use unquote/quote like that.

This looks more like 'FEXPRS' from decades ago.

In 1962 the idea of macros was introduced: macros are source code transformers which take source code and generate new source code. This can also be used in a compiled implementation, where macros translate the code before compiling.

FEXPRs are then functions which get arguments unevaluated and can decide at runtime which to evaluate and how.


> 64bit integers leaves something to be desired

This is something I wish there was more progress on. A serious limitation in some contexts.


From that link:

> A vector is what is called an array in all other programming languages except R

Vectors are called vectors in several "Lispy" languages: Common Lisp, Scheme, Clojure...

> An array with two dimensions is (almost) the same as a matrix.

I think it's the same, not "almost" the same. At least in the current version of R:

  > class(array(1, c(2,3)))
  [1] "matrix"
  > identical(array(1, c(2,3)), matrix(1, nrow=2, ncol=3))
  [1] TRUE
In 4.0 there will be a change and the class of a matrix will be both "matrix" and "array", but I think the fact that there is no difference between a 2-dimensional array and a matrix remains.


I look at it and don't see what the problem is. I think it is, in fact, a very sensible progression of structures?


It's based pretty directly on S, which was designed in the mid 1970s. Yeah, it has very rough edges here, but hard to argue that they should have foreseen the future back then.

That said, the real value in R seems to be the libraries. Has anyone looked at a shim that could make those libraries available to Python in a reasonably natural way? If that existed, the R language itself could be allowed to finally rest in peace.


There is something to be said for building a programming language with a certain task in mind.

Being vector-aware and having data frame support built in makes R much more elegant, to me, than Python's add-on libraries. It's like Scala building on top of Java and trying to bolt on an actor paradigm, versus Erlang being built from the get-go around concurrency and choosing actors as its main concurrency paradigm. You can see this in other languages too: PHP and C++ let you do OOP, but it's an afterthought compared to Ruby or Python.


I'm not unsympathetic to this idea, but after learning my 87th domain-specific language that couldn't be bothered with reasonable control structures, or even solid error checking, it's really starting to wear.

Statisticians aren't that interested in writing a really good programming language. And why should they be? They have better things to do. The trick is to not take on responsibility for something you don't care about, if you can help it.


You can embed an R interpreter in any language with a C interface. That said, most of the complaints I see about R reflect preferences and prior programming experience with newer programming languages. While there are things I don't like about R, it's a Scheme without s-expressions, and overall I like it.


There are various ways to call R from Python, or Python from R. They never end up being very idiomatic, which typically makes them a pain to work with.


The only thing you need to understand about R data structures is that everything is a vector, including scalars. You have atomic vectors and lists, which are a special kind of vector. Everything else is built on top of those.
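
A few lines in the console make the point:

  length(3.14)            # 1 -- a "scalar" is just a length-one double vector
  is.vector(3.14)         # TRUE
  x <- list(1, "a", TRUE)
  is.vector(x)            # TRUE  -- a list is a (generic) vector
  is.atomic(x)            # FALSE -- but not an atomic one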


This is an insight you gain very early in your R experience. It breaks down rather quickly. Not that it is not true — there are just too many details around the core concept.


It's possible if you provide a migration tool, something like Rust's `cargo fix`[1]. Apart from small obvious warnings, it can apply the migration from the Rust-2015 edition to the Rust-2018 one[2]. Introducing a new R edition and a similar tool could help with this.

[1] https://github.com/rust-lang/rustfix

[2] https://doc.rust-lang.org/nightly/edition-guide/editions/tra...



