One Year with R (github.com/reecegoding)
247 points by _dain_ on March 22, 2022 | 258 comments



R, and by R I mean R+tidyverse, is the world's best graphing calculator attached to an OK scheme.

By which I mean R is a highly optimized, well-oiled machine if you're using it for its highly-optimized, well-oiled purposes. I tend to have notebooks full of tiny fragments like this

    # aggregate to daily totals, then plot each series (requires tidyverse + lubridate)
    dat_min %>%
      group_by(ymd = make_date(year(date), month(date), day(date))) %>%
      summarize(vol_btc = sum(vol_btc), vol_usdt = sum(vol_usdt), tradecount = sum(tradecount)) %>%
      ungroup() %>%
      pivot_longer(cols = c(-ymd)) %>%
      ggplot(aes(ymd, value)) +
      geom_line() +
      facet_grid(name ~ ., scales = "free_y")
It's madness if you're not familiar with the tidyverse, but 3 dozen fragments like this are enough to eviscerate a fresh data set. Almost any question you can dream of is a 3-20 line set of transforms away from a beautiful plot or analysis answering your question. Very notably, this includes some of the finest modeling tools available today.

Terseness here is a huge advantage as well because in many data analysis workflows you are rerunning that same 10 line snippet over and over, making small changes, adjusting to eventually visualize the thing you're looking for perfectly. Having all of that in the same small block is ideal.

Finally, for the non-trivial number of folks in this specific scenario, the integration between Stan and R/RStudio is top-notch and makes using both tools very pleasant.

You can replicate all of this in Python, but optimal Python/Jupyter is still a far cry from R/RStudio for these specific sorts of tasks.


> best graphing calculator attached to an OK scheme.

I discovered "How To Design Programs" somewhere late in my first year of using R. Like most beginning R coders with nominal experience in other languages, I wrote a lot of monolithic scripts in a very imperative style. HtDP gave me a mental framework for decomposing larger problems into bite-sized chunks. The lispy roots of R lent themselves particularly well to the model of thinking presented in that book.

Ever since then, I've pined for the graphing calculator parts in a more modern Scheme. When ggplot and then the tidyverse (née hadleyverse) came on the scene, I was even more convinced that Scheme, especially Racket, was the ideal future for data science. If R could support a large ecosystem like tidyverse, just imagine what the metaprogramming facilities of Racket could do!

But I think those graphing calculator parts are hard to reproduce. Attempts to clone ggplot2 fall short year after year, because most other languages don't have grid graphics to build on top of. R is a deep ecosystem on "an OK scheme," which is damned hard to beat.

Aside: my first year with R was in an urban planning master's program, and I was terrified of my first big kid statistics course (taught in SPSS). I decided I'd give myself bonus work by learning R. While it was absurd to be doing my stats homework in SPSS, then R, then reviewing HtDP on top of the rest of my course load, I did ace that stats course. :-)


There is a better alternate universe where xlisp-stat doesn't fall behind and S doesn't happen.


Interestingly, there appears to be an attempted reboot: https://lisp-stat.dev/

My first reaction is "why not on a modern Scheme as opposed to Common Lisp" and in so thinking, I have demonstrated exactly why no lisp / scheme has ever achieved critical mass :-)


I would think that the reasons for "why not Scheme" would include the fact that CL has more existing libraries, better built-in support for multidimensional arrays, efficient low-level code (now even with SIMD on SBCL using SB-SIMD), application-controlled JIT compilation of specialized code, almost guaranteed support for native multithreading, etc.


Shocking. I'm still using the xlisp-stat-based social simulation book, first edition (20+ years ago). The move to ... does not fit my mind. I did not know there was a restart!!!


Obvious answer: 3rd party library support. Same thing that makes R ubiquitous.


I really don’t think this is that different from how a lot of us learned to code though. I learned by writing html and PHP specifically to solve a problem of having a website. The only difference I see is that the average CS student has to write reliable and working code as their career, while quants and statisticians tend to think that code is just a means to the end. Both are right I’d say depending on the problem.


> R is a highly optimized, well-oiled machine if you're using it for its highly-optimized, well-oiled purposes.

This hits home for me. We are just starting to use R for risk modeling where I work. R, more than any language I've ever used, makes me appreciate "worse is better". From a theoretical "aesthetic" perspective R is a mess. Yet for data processing all those theoretical concerns don't matter. It just works.

It's honestly kind of humbling that something so theoretically messy can be so practically coherent. It makes me question my assumptions about simplicity.


R "just works" now because a huge amount of effort has gone into improving the language over the last 10 or so years, in part spurred by the tidyverse movement, although not restricted in scope to tidyverse. When I was starting grad school around 2010, if someone sent you some R code, the chances that you would be able to "just run" it were basically zero: there would be weird version mismatches in how functions worked, file paths would be specified in inconsistent ways in different parts of the script, all kinds of crazy impenetrable errors were the norm. Now there are several R code snippets posted in these HN comments that will run without trouble. If I could have gone back in time and told myself that this is how R would develop, I would have been shocked (and happy).


Part of it has to do with strict testing in CRAN as well. Packages have to pass tests and be confirmed to compile. This adds reliability to package management across platforms.

That said I still run into trouble with package deprecations. I was trying to install the optmatch package (deprecated but still used by causal inference packages) and had a really tough time getting it to compile on macOS.


I dunno man, python has always seemed a little bit worse on this stuff to me. At least with R if you had a consistent version, everything off CRAN worked together.

I think R 3.0 introduced namespaces which fixed a lot of the really crazy stuff.

Also, I was writing Sweave in 2010 for my thesis, and I definitely wasn't alone.


Namespaces showed up around 2004, so somewhere around 2.0.0. I don't think they were mandatory until much later.


I suspect that has as much to do with the maturation of the data science community as it does the language environment. There have been pockets of R users who put much effort into reproducibility before the era you cite, such as Bioconductor.

When I think back to the era you're describing, what I recall was people flinging around hacky scripts being the norm regardless of their environment. While still not something I'd think of as software engineering best practices, what I see now is less Wild West.


I hadn't thought about R as a "worse is better" language, but that's a good way to think about it. Makes sense, too, since it came from the place that inspired worse is better.


R comes from New Zealand, no?


R is an implementation of S. John Chambers worked on it at Bell Labs starting in 1975.

https://en.wikipedia.org/wiki/S_%28programming_language%29


I’m trying to figure out if you were actually asking a question or if it was rhetorical and you were calling shots.


Is there any reason you chose R over Python? Is it just because that’s the go to language?


We asked the people who are going to be using it what they'd prefer. Many of them are recent graduates, and they told us they mostly used R during their university courses. It's just a pure familiarity play. The alternative was building a huge system on a mainframe (we're a legacy bank).

Really there wasn't a lot of thought put into the language. We figure that if it ends up being a total failure, we can just pivot.


Bravo! This is exactly right.


This is just a quick example - I would be grateful if people could recreate this brief look at UK COVID figures in another language:

  library(tidyverse)
  library(scales)
  
  download.file(url = "https://api.coronavirus.data.gov.uk/v2/data?areaType=overview&metric=covidOccupiedMVBeds&metric=newAdmissions&metric=newCasesBySpecimenDate&metric=newDeaths28DaysByDeathDate&metric=newPeopleReceivingFirstDose&format=csv", destfile = "./data.csv", method = "wget")
  
  read_csv("./data.csv") %>%
  pivot_longer(names_to = "Data", cols = c(newCasesBySpecimenDate,
                       covidOccupiedMVBeds,
                       newAdmissions,
                       newDeaths28DaysByDeathDate)) %>%
  mutate(Data = factor(Data)) %>%
  mutate(Data = recode_factor(Data, newCasesBySpecimenDate = "New Cases",
         newAdmissions = "Admissions",
         newDeaths28DaysByDeathDate = "Deaths",
         covidOccupiedMVBeds = "Ventilated")) %>%
  ggplot(aes(y = value, x = date, colour = Data))+
  geom_point(size = 1, colour = "gray", alpha = 0.6)+
  geom_smooth(method = "loess", span = 0.1)+
  labs(y = "Daily rate", x = "Date", colour = "UK COVID-19")+
  scale_x_date(date_breaks = "months", date_labels = "%b-%y")+
  scale_y_log10(labels = comma(10 ^ (0:5),
                 accuracy = 1),
         breaks = 10 ^ (0:5))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib.dates as mdates
    import seaborn as sn

    df = (pd.read_csv("/tmp/overview_2022-03-21.csv") # i just used curl beforehand
        .assign(date=lambda x: pd.to_datetime(x["date"]))
        .set_index("date")
        .melt(value_vars=[
                "newCasesBySpecimenDate",
                "covidOccupiedMVBeds",
                "newAdmissions",
                "newDeaths28DaysByDeathDate"],
            var_name="Data", ignore_index=False)
        .assign(Data=lambda x: x["Data"].replace({
            "newCasesBySpecimenDate": "New Cases",
            "newAdmissions": "Admissions",
            "newDeaths28DaysByDeathDate": "Deaths",
            "covidOccupiedMVBeds": "Ventilated"
        }))
    )
    ax = sn.scatterplot(data=df, x=df.index, y=df["value"], hue="Data")
    ax.set(xlabel="Date", ylabel="Daily rate", yscale="log")
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
    plt.show()

I spent 2 minutes on the pandas part and 20 minutes on the plotting part, which really says it all. Seaborn's support for smoothing is really bad and doesn't play nicely with datetimes for some reason, so if I wanted smoothing I'd need to do it myself. And the other stuff I left out requires going into matplotlib's documentation, which I don't want to spend time on.

pandas is as good or better than R's dataframe manipulation, but R's plotting tools are best in class. I hate all the python plotting libraries.


> I hate all the python plotting libraries.

I've given up discussing this. My personal opinion is that MPL is assembly to ggplot2's Python.

Yes, maybe there are some things that are doable in MPL but not in ggplot2, but never (and I mean - never -) has any one of my colleagues found a single example that I couldn't recreate in ggplot2 with more readable code that resulted in a better-looking plot.

MPL has this LaTeX-like "once you get the hang of it, you will never use anything else in life for anything" and OOP's "if it's good code then it is OOP and if it's not, then you're not doing correct OOP" mystique hanging over it. When confronted with bad MPL code that results in bad plots, it's always one of those two.

ggplot2 is the best plotting package out there and imo one of the best "end-user" packages in any language. Also, Hadley is a saint.


Check out plotnine. Really good clone of ggplot for python.

https://plotnine.readthedocs.io/en/stable/


Thanks for providing that. Interesting to see the '.' used like a pipe. Always thought of it used in an OOP context, but interesting that it can be functional too. I also had difficulty fitting a LOESS curve to the plot, but I could do a linear model. The LOESS would have been possible doing it manually I guess.


This is great.

You don't need the curl -- read_csv works with URLs directly.

The lambda can be replaced by passing parse_dates=["date"]


Thanks


This was fun to play around with. I made some very minor changes and posted at https://gist.github.com/hadley/d54895557fbb0fe0402d2277b9011....

It revealed to me that there's a buglet in `forcats::last()` (https://github.com/tidyverse/forcats/issues/303) and made me wonder if `pivot_longer()` should be able to rename the columns as you pivot them (https://github.com/tidyverse/tidyr/issues/1338)


This might be the most excited I've gotten about a comment in hackernews for a while! So interesting to see your style of code (I was actually on 4.0 so gave me the impetus to upgrade to get the new pipes). fct_reorder - what a hugely useful function I didn't know about. The chicks vignette was nicely illustrative. And label_date_short also super useful. Also curious generally about your bracket style (I just tend to pile them all up when closing, which is something I only do in R and probably shouldn't!).

Renaming factors is one of those things that always seems a bit awkward. I think I've used several methods. Passing a list of named vectors into `levels(x)` allows a many-to-one mapping but was quite dangerous. I've used revalue and mapvalues from plyr. fct_recode is new to me. But yes, renaming while reshaping could be quite convenient. Just looking at fct_recode now, it looks really nice. Seems to support many-to-one and being able to pass it a name vector is very convenient.
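
For reference, a minimal sketch of `fct_recode` on its own (the level names here just echo the COVID example upthread):

    library(forcats)
    f <- factor(c("newCasesBySpecimenDate", "newAdmissions", "newDeaths28DaysByDeathDate"))
    fct_recode(f,
      "New Cases"  = "newCasesBySpecimenDate",
      "Admissions" = "newAdmissions",
      "Deaths"     = "newDeaths28DaysByDeathDate"
    )
    # a named vector of new = old pairs can also be spliced in with !!!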

Learnt so much today, and this is even before trying out the python and julia examples! Many thanks for this and all your work in R!


If you enjoyed this you might like https://github.com/VictimOfMaths/COVID-19 :)


A Julia solution with Chain and Gadfly might look something like this, although I've translated the R fairly directly so it might not be very idiomatic.

    import CSV
    using Chain: @chain
    using DataFrames
    import Downloads
    using Gadfly
    using Dates

    @chain begin
        Downloads.download(
            "https://api.coronavirus.data.gov.uk/v2/data?areaType=overview&metric=covidOccupiedMVBeds&metric=newAdmissions&metric=newCasesBySpecimenDate&metric=newDeaths28DaysByDeathDate&metric=newPeopleReceivingFirstDose&format=csv",
        )
        CSV.File
        DataFrame
        stack(
            [:newCasesBySpecimenDate, :covidOccupiedMVBeds, :newAdmissions, :newDeaths28DaysByDeathDate];
            variable_name = :Data,
        )
        transform(
            :Data =>
                (
                    x -> replace(
                        x,
                        "newCasesBySpecimenDate" => "NewCases",
                        "newAdmissions" => "Admissions",
                        "newDeaths28DaysByDeathDate" => "Deaths",
                        "covidOccupiedMVBeds" => "Ventilated",
                    )
                ) => :Data,
        )
        subset(:value => ByRow(!ismissing)) # Can't plot Geom.smooth with missings
        plot(
            _,
            x = :date,
            y = :value,
            colour = :Data,
            layer(Geom.smooth(method = :loess, smoothing = 0.1)),
            layer(Geom.point),
            Scale.y_log10(),
            Guide.xlabel("Date"),
            Guide.ylabel("Daily rate"),
            Guide.xlabel("Angle"),
            Guide.colorkey(title = "UK COVID-19"),
        )
    end


Thank you for that, good to see there's an elegant Julia solution! The last time I was using 'pipes' with Julia, I think I was using DataFramesMeta. I also really like this interactive gadfly plot - reminds me of Matlab, but better. It's been a little while since using Julia, so I'd forgotten about the pre-compiling thing, but generally this code looks pretty nice and clear.


I used to use DataFramesMeta.jl, but eventually I found that the mini-DSL that DataFrames.jl has created is really powerful and not overly verbose. Now, going back to the Tidyverse's syntax makes me feel a little uneasy, like there's just too much magic going on behind the scenes, even though I used it for years with no problems.


You don't need the download file command -- read CSV works with URLs directly :-)


Oh nice, that's very useful to know!


I think these kind of common task challenges are great for comparison. You see a few approaches to a single task and you can do a more aligned and detailed comparison.

Unfortunately, it’s also a lot of work. In this case, you’ve posted an intermediate stage artifact from R. If one of the many Python programmers reading this want to produce a comparable artifact they need to understand or run that code. That alone reduces your likelihood of getting any substantial replies.

Maybe add a link to an image of the resulting plot?


Good point, looks like I'm too late to edit, but here's a link [1] (excuse the R style indexing).

Yes, I love these things and very curious to see what hackernews comes up with. Project Euler was an eye opener for how things could be optimised in different languages.

[1] https://i.imgur.com/M8DX98I.png


I also use R for any heavy data manipulation, but I primarily use the data.table package. The efficiency that both of these packages unlock is absolutely unparalleled in any other tabular data manipulation library, in any other language that I have used. And R has the top 2!!

My skin writhes every time I need to type:

table.loc[(table.column > 2) | (table.column2 < 3)].reset_index(drop=True)

when I want to subset a table.
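
For comparison, the rough data.table equivalent of that line might look like this (assuming the table has been converted with `as.data.table()`):

    library(data.table)
    table <- as.data.table(table)
    table[column > 2 | column2 < 3]   # no index to reset, columns referenced bare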


.loc lets you supply a callable, so you can write:

  table.loc[lambda df: df["column"].between(2, 3, inclusive="neither")]
this is useful when your dataframe has a long name, or when you have some long method chain and you need to subset at the end: table.foo().bar().baz().loc[lambda df: ...]

it is still more verbose, but I actually prefer always providing column names as strings. it's more explicit. I don't like R's environment-manipulation metaprogramming magic where you can give column names as symbols.

as for resetting the index all the time, this is something of an antipattern. if you set up your index right beforehand it isn't necessary so often.


Can't get around resetting the index, as far as I know, but for the filtering you can also do,

`table.query("column > 2 and column2 < 3")`


do any IDEs help you autocomplete the string argument? that's a big part of the non-standard evaluation magic in the tidyverse these days; you can use bare, _unquoted_ names, _and_ get excellent autocomplete, at least in RStudio.


Not to mention the auto complete that comes with RStudio. Is there any way to get equivalent functionality in Jupyter?


If you set up the Jupyter extension[0] and open your notebooks in VS Code you get IntelliSense (code completion, method info and hints etc).

0: https://marketplace.visualstudio.com/items?itemName=ms-tools...


IME this is strictly worse than the RStudio experience; most of the time I hit tab in a VSCode notebook I get way too many options that IMO are clearly not what I want, though at some level it's more pandas' fault (too many methods & attributes even before attaching every column name as an attribute) than VSCode or Intellisense.


In addition to the sibling answers:

If you know the class of an object, let's say Class, but the object has yet to be "constructed" so that IPython can correctly infer its type, you can type `Class.[TAB]` in IPython and look at its methods.

For example, in Sympy, you have a matrix type called Matrix. You can do `(A * B).diagonalize()`, or alternatively you can do `Matrix.diagonalize(A * B)`, which has some advantages because doing `(A * B).[TAB]` does nothing useful because Python can't infer types.

You can also do the same for modules. `ModuleName.[TAB]`

To be honest though, I found the experience smoother in R for some reason.


I use pycharm which has decent autocomplete. Pycharm has its own issue though, it fills out its autocomplete info by looking at the function that created the object, not the object itself. So if a function can return different types, autocomplete won’t work. That’s caused me quite a bit of pain.


My wife is a researcher and started delving into doing her own statistical analysis. It's been fun (and frustrating) learning R with her. I agree that dplyr and the tidyverse are some fantastic packages for a software engineer who thinks about spreadsheets as SQL tables.

I would say the most frustrating part about RStudio is that it is a workbook where you can execute code based on your cursor. For my wife, these workbooks become a total mess because things aren't necessarily run sequentially.


Here are two good strategies:

1. Always run from the top, using the "run previous chunks" button. When this gets too slow, you know that it's time to think harder about your workflow. For a more extreme version of the same idea, regularly restart R using Ctrl-Shift-0, and run from the top. It'll ensure your code is working right.

2. Have a setup chunk that always gets you to the same state. Make sure every other chunk works directly after calling the setup chunk. Then just alternate between "run setup chunk" and "run current chunk".
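
In an R Markdown notebook, such a setup chunk might look something like this (the file path and packages are just placeholders):

    ```{r setup, include=FALSE}
    library(tidyverse)
    dat <- read_csv("data/raw.csv")   # every chunk below assumes only `dat` and the tidyverse
    ```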


0. Don’t use notebooks. Neither RStudio nor Jupyter. They prevent people from developing good programming skills and good version control skills, and they encourage making a huge mess.


I mean that's fair, but if you have to write an academic paper, your options are more or less use a notebook, or copy and paste your results. Of course, the question is how much code you should write in the notebook, versus having it in a more organized set of functions and libraries. It's very easy to end up with a huge bloated document which contains thousands of lines of spaghetti.


I absolutely agree with this:

> the question is how much code you should write in the notebook, versus having it in a more organized set of functions and libraries. It's very easy to end up with a huge bloated document which contains thousands of lines of spaghetti.

But this isn't correct:

> your options are more or less use a notebook, or copy and paste your results.

What's wrong with writing scripts that write images to disk? That's how millions of academic papers were written before the advent of notebooks. You could use Makefiles if you like, or you could even use a technology such as Sweave to automatically mix images with LaTeX output.

I mean this as politely as possible but the fact that you think that the options are "use a notebook or copy and paste" I think shows that you've caught a notebook mentality disease! The fundamental point I'm trying to make is that we don't need to do everything interactively from REPLs. REPLs are great for trying things out, but when it comes to producing the images for your paper, those should be produced by scripts, not by commands entered into a REPL, or notebook. And those scripts should evolve via version control, which is the basis of evolving any good and correct software. And the scripts for producing images for a paper should be good and correct software.
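
For concreteness, the kind of script being described is just a plain R file you can run with Rscript (or from a Makefile); the paths and data here are placeholders:

    # fig1.R -- regenerates Figure 1 from the raw data
    library(tidyverse)
    dat <- read_csv("data/trial.csv")
    p <- ggplot(dat, aes(dose, response)) +
      geom_point() +
      geom_smooth(method = "lm")
    ggsave("figures/fig1.pdf", p, width = 6, height = 4)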


We might be talking at cross purposes. I'm thinking of e.g. a Rmarkdown notebook, which is indeed a script not (necessarily) an interactive notebook like Jupyter. And it can be automatically compiled, via a makefile or something similar, and put into version control.

The point is that it makes sense to mix english prose + code to e.g. produce tables or graphs, even if most of the heavy lifting is done separately in code files.


Ah, OK, yes I was talking at cross-purposes to some extent then, thanks. (Jupyter notebooks are hopeless in version control due to the JSON format, but I'm not familiar with Rmarkdown notebooks -- do you get sane diffs?)

Yep, so what you say makes sense. Isn't it sometimes a bit overly prescriptive to assume all collaborators use Rmarkdown? (Perhaps not! I used to work in biology and statistics and R was very ubiquitous.)


Rmarkdown indeed makes sane diffs, it's just a nice text-based format. You can use python or julia as well (see quarto.org for the latest version of this). Getting collaborators on board... yeah, that varies.


I love the phrase "eviscerate a fresh data set" :D


Great stuff. Could you maybe share a bit of these "3 dozen" scripts? This could be super helpful.


Looks like your notebooks focus on crypto currencies :-) Good use case for data analysis!


I was expecting a rant, but the OP's article is actually very thoughtful. He definitely knows what he's talking about.

The thing about R, for me and many others, is that it's very much an everyday grind language. Especially with RStudio, its natural domain is as one of the "notebook" languages like python, julia, matlab, and mathematica, but with a clearer focus towards the tasks of data analysis. I just tell the BI-tool people that R is Excel on 'roids.

R frustrates me a lot, however. But I think the frustration comes out of the fact that when I am using R and get stuck, I am always in the middle of doing something that I need to get done and I don't feel like diving into a long "vignette". Moreover, the documentation is usually too terse and generalized for me to just understand it immediately. Even though I've been using R for years (albeit in fits and starts rather than continuously), there are things about it that I've just never picked up-- I just DON'T KNOW (or care) what F S3 and S4 mean. Unlike the OP, who clearly knows more R than myself, I grit my teeth when I am looking at docs and see the "..." in the arg list.

I suspect that this is part of the heritage from R's beginnings. I once tried to read John Chambers' book but found the presentation completely ass-backwards and impractical for my immediate needs. The Tidyverse has been great, it's far more consistent and ggplot is a kick-ass tool to have in your box. The drawback is that it makes Base-R seem really alien, and if you want to be good at R, you have to know more than just the Tidyverse, IMHO.


The tidyverse docs are the only ones with the super frustrating ... of impenetrable gnostic "documentation" that I know of. In general the tidyverse documentation is horrible, almost as bad as typical Python docs, IMHO. Other parts of base R are wonderfully documented in my opinion.


Do you have any specific examples that illustrate the general problem? I'd love to better understand what you're looking for in docs.


Thanks for taking my aggressive comment with such spirit, it really speaks to a good community. (Sleep training an infant has me a bit frazzled)

I should have been more specific: the ... frustration for me comes up mostly in ggplot, which usually directs you to layer(), which gets parameter documentation strings like:

* geom - The geometric object to use display the data

* stat - The statistical transformation to use on the data for this layer, as a string.

These are two hugely important parameters, with really big concepts and abstractions under them, but the documentation is of the style "foobar(): this is a function that foos the bar", documentation that restates the information in the name, but with more words, and no insight on where to go next.

So now a person is two pages deep into documentation, and it's actually circular documentation because layer() has a ... argument that gets passed back to what? The function documentation that you came from? For a newcomer it's a completely twisty series of passages, and as an experienced user who reaches for ggplot before any other tool, it's confusing.

The other function based confusion is that the list of aesthetics is not connected quite well enough to the mapping argument from aes(). What aesthetic values the function understands is probably one of the most important things about looking up the function. But reading the parameter documentation, it's not clear that there's an entire section below that describes that crucial material, far further down the page. And on a long long page it's easy to accidentally skip over that section when skimming.

(These are the sorts of frustration I have with typical Python documentation, btw, so maybe my brain is just different from typical engineers)


Ah yeah, connecting the dots in ggplot2 docs is hard. It's hard for us to document because, under the hood, the pieces are quite decoupled and different pieces are responsible for different arguments. But since we last took a deep dive on the ggplot2 docs, we've gotten much better at generating docs with code, so maybe it's time to have another look. I've filed an issue (https://github.com/tidyverse/ggplot2/issues/4770) so we don't forget about it, but no guarantees about when it might get done.


Is there a tutorial someplace that explains how ggplot actually manages plotting? Or the architecture and layers between the high level code and how a plot is drawn? Meaning, I love being able to express what I want and ggplot figures out a good plot for me. But I know there are many layers that can be manipulated, but I just don’t understand the layers.

One of the best compliments I can think of is that with ggplot, easy things are easy and hard things are possible. But I haven’t been able to figure out how to fully work the system.

(Thanks for all of the work!)


That should be covered in the original A layered grammar of graphics paper: https://vita.had.co.nz/papers/layered-grammar.html

And then there is an entire ggplot2 book (there are many, but this one was written by Hadley): https://ggplot2-book.org/


That's very helpful. I think this chapter was what I was looking for:

https://ggplot2-book.org/internals.html


I think the issue with some of this documentation is that for other packages, the function documentation is largely self-contained. If I look up glm() it tells me how to use glm(). However, for ggplot2 there is an assumption that you have some level of knowledge of how the pieces should be strung together. So when I know I want a boxplot, and I find the geom_boxplot() documentation, it wonderfully describes the options for itself, and gives examples for its use. But sometimes it doesn't give a good idea of the context of how the other pieces might interact. It makes complete sense if you read the book and just want to refresh your memory, but if you are coming in as a new user it really can be difficult to use the documentation exclusively.


Having read an online book of some sort on ggplot2, on one of the tidyverse sites, I found the per-function documentation difficult to use and difficult to match to the concepts I had learned. This may be because I'm used to using the parameters section of a function as the primary resource for understanding the inputs. But with ggplot it's scattered in other places, and the holes are not apparent unless you know the specific terminology (not concepts) to match up.

All that said, I find the documentation to be saying a lot more than it did in the past, and it sounds like it has been continually improving.


He may be right in specific instances, but I think he's way wrong in general. Tidyverse is generally a triumph of documentation, and part of that is that it doesn't tell you too much. Lots of how-to, not too much implementation detail. It's appreciated.


After 15+ years of shipping open source stuff, I've rather concluded that any given piece of documentation is either going to be too terse or too verbose for any given user and all you can really do is mix judgement and balancing how many of each type of complaint you receive.

It's probably possible at least in theory to structure docs so you have a terse section followed by a verbose section for each thing, but I've yet to develop the discipline or the competence to pull that off remotely regularly.

Maybe in another 15 years.


> almost as bad as typical Python docs

I found numpy, scipy, pandas, and plotly docs to be quite clear and extensive. The only docs I have found to be confusing are matplotlib's and the Python standard library's. Not sure what packages you are referring to?


On the other hand, I'm curious what you've found to be lacking about the standard library documentation. I've found it to be generally very thorough, in some cases fantastic, though there's an occasional weak point.


For me, the stdlib docs aren't lacking in material, but they are hard to navigate. I think that is partly because they mix different types of documentation together (tutorial, reference, and changelog). They should split each module's docs into two pages: a tutorial with examples of the most common use-cases of the module, and a reference listing all of the classes, methods, etc. of the module.

Also, the built-in types are documented in one page, going from boolean to sequence types and even type annotations. Every Ctrl + F gives me 20 different results, which is annoying as hell.


I've used R for 19 years and do not have any other programming language ability. I am curious: What is frustrating to you about R relative to other languages?


In my experience:

- It has a bunch of different types of classes, and they all behave differently. Debugging isn't awful, but it's harder than it should be. Also, the documentation isn't clear about which classes to use.

- A lot of the workhorse functions suffer from parameter glut. Despite having different kinds of classes, almost all functions expect plain vectors. Packages like survival show how objects make it easier to read code, reuse data, and validate data. Without the base packages doing it more, everyone's chosen their own systems. The community's been gravitating to organizing "objects" as rows in tables (i.e. tidy).

- The way a function uses an argument might surprisingly change based on the other arguments given (e.g., `binom.test`; see the sketch after this list). And then the documentation won't have examples for the different use cases.

- Most users don't have the time or desire to become better R programmers. They have other work to do. For my own work, I write packages with custom classes, functions, and template documents. For collaboration, I keep things very plain and rarely go beyond dplyr; very often, the script goes between two steps executed in a GUI software.
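
To make the `binom.test` point concrete, a small sketch (per its help page, `n` is ignored whenever `x` has length 2):

    binom.test(7, 10)         # x = number of successes, n = number of trials
    binom.test(c(7, 3), 10)   # x = c(successes, failures); here the n = 10 is ignored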


I'm not OP.

To me, it's the tooling around it. Everything is done in R-studio and its focus is to generate statistical documents.

The result is a suboptimal solution. It lacks good tooling around installing and running R programs. R programs don't import, they include. It doesn't make it more readable. R-studio is very Emacs-like in the sense that it just lacks a decent editor. Due to R-studio being the default, there's not much support for other editors.


> Due to R-studio being the default, there's not much support for other editors.

But RStudio is amazing... Easily best environment I've used for any programming language (well, except Pharo).


I use Vim with the R command line wrapped around Makefiles. I don't even have RStudio installed. Works great.

I can even pop up an interactive R command prompt session and do whatever I want in it, even quick ggplot2 graphs. Help shows up just as you would expect, and plots pop up in new windows. RStudio is much less advanced than people think it is, it's really just managing R's windows for you and doing generic IDE work. R is doing all the heavy lifting.

If you're wanting a more "import" like thing you might want to look into making R packages instead of scripts. You don't have to submit them to CRAN, and you can execute them pretty simply on the R command line as well.
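
A minimal sketch of what that looks like without CRAN (the path and package name are hypothetical; devtools is an extra dependency):

    # during development, load the package straight from its source directory
    devtools::load_all("~/code/myanalysis")

    # or install it locally and attach it like any other package
    install.packages("~/code/myanalysis", repos = NULL, type = "source")
    library(myanalysis)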

That being said, it's really not an OOP or software development tool. It's definitely geared toward data science, but you can automate it very well for generating graphs and reports for whatever reason needed.


Do you have notes or documentation on how to get this working without RStudio? Preferably on a Macbook?


I use Linux, but it should be pretty straightforward on Mac as well. Just type "R" in the command prompt and check the manpage for R itself ("man R").

I'd also look into the "knitr" package, which is what all of the Rmarkdown is based around. So for instance, most of my Makefiles are based around a simple command like:

    R -e "library(knitr); knit2html('index.Rmd')"
Then I just code using VIM on index.Rmd. You can probably set this up however you like with the R command line.

For interactive it literally is just typing "R" in the command prompt. For help, things like "?ggplot", "??knitr", or whatever, so you can open multiple interactive sessions like you were using IPython or something. When you print a plot, it just pops up in a new window.

You can also use "R" to just execute R raw if you are trying to do it without Rmarkdown. I just prefer the HTML output. Pretty sure all the RStudio RMarkdown stuff just calls knitr as well.

The output looks the same as anything on RPubs (and there is a way to publish to RPubs, I used to have to do that at one point), random one from the first page:

https://rpubs.com/mnguy1019/881028


Emacs has had excellent R support since much before R-studio came around. In fact to me R-studio has always felt like a stand-alone implementation of ESS (minus Emacs).


FWIW, I do prefer R to all other "notebook" languages. It has everything I need for answering questions about data, working with files, text, and visualization. Libraries that do everything I could imagine are easily available.

For me it's a pragmatic workhorse tool that I use and aside from frequently getting frustrated with the task at hand, it has never failed me in the end.

I think R is much like a handheld power tool. I have no interest in diving deep into the workings of the tool because when I need to use a drill, for instance, I just need to drill holes and anything else is an annoying distraction (I realize that sounds bad!).

I've also worked with Mathematica and JMP in the past. They're very capable, but not as good at general-purpose data-wrangling as R is today (given Rstudio, knitr, shiny, all the specialized libraries, and most especially the tidyverse).


In decades I have not encountered any development environment where the obscurity of error message presentation can touch R. If one uses other languages or build environments the error messages can frequently be used to diagnose the issue.

In R they default to just vomiting some internal exception often with no context, and I can count the times I have encountered helpful or even seemingly deliberately constructed error messaging in single-place base five. Even kernel development is in some sense better because at least there you are in context and the layers are traversable.

Seemingly one of the primary skills of an R programmer is serving as an informal database of “what the fuck does this mean?” when the issue is something trivially detectable like passing a list vs a vector.


I love R more than any other language I have ever used. Perhaps more than any piece of software I've ever used. All of these points are valid, and yes, it's messy, and if you try to write the same type of code that you would in Python, it will frustrate you.

And yet.. it somehow works. It makes data analysis and statistical modelling a pleasure. It somehow gives off a sense of lightness, and makes it easy to investigate and explore. I would guess I am genuinely 2x as productive in R as I would be in Python on similar tasks.

I know it's not a "proper" language, but I think that, maybe, not everything has to be exactly like "proper" software engineering?


I very much agree with this. I use python for (different types of) data analysis too, and in python in particular it feels like the "boilerplate" to "science" ratio is rather high in the direction of "boilerplate". R manages to abstract this away very effectively, as the article highlights.

The beauty of R is that you can write one line of code and use some hot-off-the-PhD-thesis cutting-edge-just-published-in-J.-Stat.-Soft-chunk of statistical analysis in your totally different, completely whacky problem, and it's fast, and (by and large) works.

Of course, that's its biggest problem as well. Scientifically, it will quite happily give you a 150 mm howitzer to aim at your foot, assuming you know best.


> hot-off-the-PhD-thesis cutting-edge-just-published-in-J.-Stat.-Soft-chunk of statistical analysis

I think you mean "poorly-documented-cobbled-together-under-deadlines-never-to-be-maintained by someone who has no idea of software principles". Very few labs have a dedicated software engineer to actually turn this software into a usable/hackable tool let alone maintain it.


that's an unnecessarily negative stance. not every algorithm needs to be scalable and over-optimized to be useful in most cases. and if something becomes really useful in R it ends up being reimplemented in more effective ways down the road.


No, but it does need to be tested and reliable.


Coming from Matlab, I have the opposite feeling.

I truly, genuinely dislike the language. I think it's very productive, and I appreciate that Matlab costs an arm and a leg (and god help you once you start paying for some of the nicer packages on top) - but Matlab has spoiled me immensely on the language front.

To me, Matlab feels like a language that was designed with an intent to appeal to folks with some understanding of traditional procedural programming, but nudged into treating matrices as first class citizens.

R feels like a language that was built for people who were using excel, and have never written a line of code in their life - it's riddled with completely unintuitive, frustrating, intentionally obtuse operators and terms for things that have perfectly fine definitions in normal programming.

The difference is that I have 20+ years of programming experience (including quite a bit of functional programming) that I can easily port over to Matlab, and which becomes literal baggage trying to use R. The end result is that I will use R, but I basically always walk away frustrated and infuriated, even when the problem is solved.


> R feels like a language that was built for people who were using excel

The S language predates the first release of Excel by 11 years.

> and which becomes literal baggage trying to use R

I've had the opposite experience. My experience was that having a broad array of programming experience made it easier to pick up the weirder corners of R. It became more likely that I'd seen *something* similar to that construct in the past. The converse has also been true. Seeing all the weird corners in R has made it easier to pick up new concepts in other languages & paradigms as it's been more likely I've seen *something* similar from R.


Using pipes and tidyverse/data.table allows for great things in R, and has a strong functional feel. It can be quite beautiful reshaping data, splitting, mapping, recombining and plotting it.
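
A small sketch of that split / map / recombine pattern (broom is an extra dependency here; iris is built in):

    library(tidyverse)

    iris %>%
      group_split(Species) %>%                               # split into a list of data frames
      map(~ lm(Petal.Length ~ Sepal.Length, data = .x)) %>%  # fit one model per piece
      map_dfr(broom::tidy, .id = "group")                    # recombine into a single tibble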

It doesn't go well at all with a procedural method.


> R feels like a language that was built for people who were using excel,

I don't think so. Most people who come to R after years of Excel find it just as alien as you do.


I recall when the pipe operator was first being proposed, the argument for it was that it'd enable workflows that felt more like Excel. The implication being that indeed, base R is alien to an Excel user.

I also recall my pushback was along the lines of "who on earth would want that". Yeah, it's a good thing I'm not the person coming up with these things :)


> I recall when the pipe operator was first being proposed the argument for it was that it'd enable workflows that felt more like Excel.

Where are you getting that from? To start with, the pipe operator has been independently reinvented multiple times in R, and neither ‘magrittr’ nor ‘dplyr’ was the first to introduce the pipe operator into R. And (at least when I was exposed to it), the pipe operator had nothing whatsoever to do with Excel. Instead, it was an attempt to introduce the composability concepts from the UNIX shell and Haskell composition into R.


You have me second-guessing myself; perhaps I’m conflating it with the convo around dplyr in its early days

EDIT: I found the conversation in question but it involved deleted tweets. And those deleted tweets are the ones that reference the package name. Sigh. It was just after the release of magrittr and several months after dplyr


> I recall when the pipe operator was first being proposed the argument for it was that it'd enable workflows that felt more like Excel.

I have no idea where you get that impression, most Excel power users I have met take a long time to understand how to use the pipe operator in R.


How do you feel about the pipe operator these days?


I haven't used R enough in the last 10 years to have an R-specific opinion. And to be honest it was more an unlearned statement on my part as it was an "ew, Excel" response and not thinking about the underlying workflow.

In the intervening time I've become a large advocate for the pattern of chained operators. So I'd imagine I'd enjoy piping in R. And if that means I'm emulating a common Excel workflow, that's fine. I won't have the childish response of "ew, Excel" :)


100% this :)


Ha ha, I love that this is your only comment here! Thanks for all your work on R.

I came here with sleeves rolled up to defend the language, but was pleasantly surprised to find it was already being done much better than I could have.

It's interesting to see how R elicits such a reaction from some programmers. I think it's frequently misunderstood, and R needs to be used in a particular way to allow it to fly.

When I've tried to recreate analyses in Python or Julia, they have nowhere near the fluency of R. You won't see this if you're messing around with if statements and other procedural methods of achieving things that are better suited to other languages, but rather when crunching data for analysis and graphically visualising the results.

I also understand that it's due to R's lisp-y-ness that allows us to have tidyverse in the first place.

Question for Hadley - there have been a couple of projects to fuse the speed of data.table and tidyverse. What do you think of this aim and are you tempted to change tidyverse to get to the speeds of data.table, or would that require too much of a fundamental change?


Have you seen https://dtplyr.tidyverse.org? It gives you the syntax of dplyr and (almost all of) the speed of data.table.
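
Roughly, the workflow looks like this (following the dtplyr README; mtcars is just a stand-in):

    library(dtplyr)
    library(dplyr)

    mtcars_dt <- lazy_dt(mtcars)     # wrap once; dplyr verbs are then translated to data.table

    mtcars_dt %>%
      group_by(cyl) %>%
      summarise(mpg = mean(mpg)) %>%
      as_tibble()                    # nothing is computed until you collect the result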


Oh, I feel a bit silly now! I'd seen a couple of attempts to combine the two, but didn't realise this one was official. It looks great. I have an analysis coming up that borders between dplyr and data.table in terms of size, so will check it out then!


There is also the tidytable package. But dtplyr works really well. Have used it in a couple of shiny apps that wrangle some heavy input files.


Before downvoting this short one-liner make sure you check who wrote it!


Thanks for all your work on the tidyverse!


Not a lot of people would realize that Hadley is a minor celebrity in many scientific fields due to Tidyverse! Thanks for all the work.


Bravo!


Yeah, and the growing user base, widening ecosystem, and continual stream of analysis packages being written only in R suggests that lots of others agree.

An important factor not often mentioned is that I think R really helps individual developers/very small teams to be productive.


I feel the exact same way! I've used R for the past decade. Once you learn the philosophy behind it, it just works. Yesterday my boss asked me a question about a dataset and I wrote code to analyze it while talking through the problem in real time.


The main issue I've had is speed. As soon as you have problems that can't be vectorized, models that take 30 hours to run in R take 30 minutes in python.


In my limited experience, problems that cannot be vectorized really shouldn't be written in python either (assuming you mean python loops). But indeed the edge that Python has is the ease of use of drop-in solutions like Numba, allowing you to continue to write Python that isn't really Python.


Mind giving an example? The only time I faced this was due to an autoregressive model, which was super easy to delegate to C++.

I've been working with Python for the last year and appreciate how much it helps with general IT problems, but I would still stick to R for statistical/data analysis.
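
For what it's worth, that kind of delegation can be a handful of lines with Rcpp; a sketch of an AR(1)-style recursion (purely illustrative):

    library(Rcpp)

    cppFunction('
    NumericVector ar1_path(NumericVector eps, double phi) {
      int n = eps.size();
      NumericVector x(n);
      x[0] = eps[0];
      for (int i = 1; i < n; ++i) {
        x[i] = phi * x[i - 1] + eps[i];   // each value depends on the previous one, slow as a plain R loop
      }
      return x;
    }')

    x <- ar1_path(rnorm(1e6), 0.9)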


Example, please?

This seems highly unlikely, based on my 20+ years with R. Yes, using wrong data structures/algorithms can lead to slow code, but switching languages won't fix this.

rprof and microbenchmark are your friends if you really need to optimize your code.

and (as in python, and as several others have pointed out), if you have something especially challenging, write it in C/C++/fortran instead, and link it to R.


In both languages, you can write/use C extensions.


you can insert C code very easily in R for when you need more speed.


Thanks for expressing how I feel about R so succinctly.


The common trope with R is that statisticians love it and developers hate it.

The main reason that statisticians love it is that the libraries useful to them are much better in R than elsewhere (though Python keeps encroaching on that turf, and "real developers" dislike Python a lot less than they do R).

The main reason that developers hate it is that it is very unlike almost all other languages that they're used to. This is very valid since outside the narrow domain of statistics, there's probably nothing that R does better than other languages. So for a dev who occasionally dabbles with R by necessity, the otherness serves nothing but frustration.

Still, I wonder how much criticism there is against R as a programming language that is not some variation on "this works very differently from other languages". IMHO the subsetting syntax and the countless x-apply variations are big warts. I'm not a big fan of Tidyverse, and even less of the schism between base- and Tidy-R. I read some seemingly fundamental criticism about R's deficient scoping rules, but I'm not nearly knowledgeable enough to judge their merits.

I guess it doesn't help that almost nobody learned R as their first computer (as opposed to statistics) language. Personally, I learned C, Matlab, Python/numpy, SQL, R in that order. R does seem to be quirkier than all the others, except maybe SQL. But I don't dislike working in R any more than working in any other language.


> it doesn't help that almost nobody learned R as their first computer (as opposed to statistics) language.

Aside from two statisticians I had as professors, I am yet to meet someone with deep understanding of statistics who doesn't speak R as first language ...

I found it way easier to grasp the meaning of statistics by playing with R than by reading the maths.


I used RStudio to work through problem sets, textbooks, and ideas constantly during my time as an applied math student. My concentration was stats, but still found RStudio invaluable for pretty much every math class I took. I say RStudio specifically because it offered the complete package for what I needed at the time. Built-in graph viewing, workspace management, etc. As another commenter said, pretty much the best graphing/scientific calculator I could ask for.


You’re right, but I’d think in recent years most stats people would have Python as their first language. It probably comes preinstalled on their OS at this point (not sure if this applies to Windows outside of WSL yet.)


Python has a big advantage in deep learning but R still has an edge in classical statistics.


I had some Matlab experience about 3 decades ago. What’s your take on Matlab vs R as programming languages?


If signal processing and matrix algorithms are your thing, you should (and probably would) be using Matlab. Most statisticians don't really do much of that; what they mostly do is data management, trying to mold their tables into some form accepted by an existing R package (or even Stata). As far as I remember, Matlab was pretty horrible for data munging, even worse than R for anything non-numerical. But my Matlab experience is also almost 2 decades old, so I don't know if it's any better today.


My take is that Matlab is better in basically every regard, with the exception that the same functionality will cost you a considerable amount of real world money.

Having used both in a professional setting - and coming in with a fair bit of programming experience - Matlab is generally a pleasure to use. It's different where it needs to be in order to treat matrices as first class citizens, but otherwise you can apply many of the same intuitions and paradigms that you would in any other language.

R on the other hand... R is a fucking disaster of inconsistency. I find myself incredibly frustrated attempting to do simple and sane things - things that I know are only a line or two in Matlab (or even python) and instead fighting with a "which version of the 12 different slight variations of this operator are you attempting to use today!" hellscape.

My strong guess is that if you have no coding experience, and you learn R fairly thoroughly - it will feel very nice. My problem is that for anyone with actual coding experience, it's like being given a keyboard with a qwerty layout, but which is actually using dvorak. All your intuitions are pointlessly wrong - not because they are actually problematic, but because R has decided that the A key is really on the other side of the fucking keyboard.


But isn’t the consensus that R is a better language? At least that is what the cool kids said when I went to college (I never used Matlab except maybe a handful of times.)


define "better"?

They each have a few strengths over the other, but generally speaking, I much prefer the language consistency of Matlab.

My general experience is - industrial shops will be using Matlab. Almost all of my Matlab work was aerospace related (think sigint/radar/signal processing/modeling).

R is more popular in education environments - but I strongly suspect that's just because it's free. Post-grads don't have much lab funding to work with at the best of times, and Matlab with an associated set of plugins/libraries specific for your task can easily run 30k a seat.

Personally - I find it pretty telling that most places with money choose Matlab. Doesn't inherently make it better, but it does mean Matlab is getting used in places where mistakes are expensive, and there's a focus and consistency to the tooling that I just think is desperately lacking in R.


I've had to translate a lot of Matlab to R in college (physics and econometrics).

I rarely found an important difference between the languages besides having to transpose some matrices here and there.


Not the grandparent. Matlab is much simpler for matrices than R, approaching Python's numpy in ease of use (and very reminiscent of Fortran and Julia).


> "real developers" dislike Python a lot less than they do R

I thought that was real Scotsmen.

Because real Scotsmen prefer:

table.loc[(table.column > 2) | (table.column2 < 3)].reset_index(drop=True)

to

table[column > 2 & column2 < 3, ]

and everyone knows this!

not to mention, if you aren't managing 100 virtual environments and 100 conda environments (with different syntax for requirements), you aren't a real Scotsman!


You can do

    table.query('column > 2 or column2 < 3')
If you want. I'm not sure why you're dropping the index there.


Yes, I definitely put that up as a real Scotsman. The "real (Java) engineers" at my company scoff at the loosey-goosey attempts of the Python engineers trying to productionalize the numpy mess produced by our ML engineers. I mean, how can you "productionalize" anything without Builders and Factories?


I think this is really interesting. The author certainly isn't an expert, for example `result[which(result < 0.5)] <- 0` is a mistake for `result[result < 0.5] <- 0`.

But that's just why it's useful - R is great when you are an expert, but becoming an expert takes years. The perspective of new users is really important. (I've been using R almost 20 years, have written several packages, and still feel like an amateur. Indeed, I'd never heard of `**` as an alias for `^` until today; nor `sequence`, which apparently has always been in base; and I still can't remember what `sweep` does.)

I thought some of these arguments were better than others. True that base R regex is confusing and messy (and that stringi/stringr are improvements). False that allowing string concatenation with `+` would be a good idea. That's just a footgun waiting to go off, given that R also is weakly typed. Expecting `nchar(1000)` to magically work seems naïve. `<<-` (roughly, global assignment) is an ugly necessity and a code smell, not a cool language feature.

An awful lot of these problems are fixed, or try to be fixed, in the tidyverse. Not using tidyverse is a bit unusual because most beginners nowadays, I think, start with the tidyverse more than with base R.

For me the worst part of R is simply that it fails silently. This is really deadly, especially when you are producing scientific results. There are so many places where R will plug gamely on after you have done something deeply inappropriate. Given how badly scientists code, one has to worry.
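
A minimal sketch of the sort of thing I mean (my own toy example, not one from the article): vector recycling quietly pads the shorter operand instead of erroring.

    x <- c(1, 2, 3, 4, 5, 6)
    w <- c(0.5, 2)        # weights that are accidentally too short
    x * w                 # 0.5 4 1.5 8 2.5 12 -- silently recycled, no warning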

I don't agree that "R won’t change" is the base problem. It's not so simple. R is used for science. I like very much that my code from 2008 will probably still work if someone wants to replicate my results. I appreciate the R-core team's work in making this true. There are genuine trade-offs here.

If you want emotional relief, it's worth following https://twitter.com/whydoesR.

Maybe Julia is the way forward? Or is R "worse is better"?


> Maybe Julia is the way forward?

Julia is well worth learning, if you do computationally-expensive work. It is kind of a pain to use interactively, though. I use both R and Julia in my research. Think of Julia as the new Fortran, though, not the new R.


Isn’t Julia a compiled language “pretending” to be interpreted? I thought it was interpreted or JIT’d up until its last release or so, when they mentioned it has to actually compile. It’s fast, sure, but is it faster than other compiled languages?


I do not have deep experience with this, but the word on the street is that Julia can be faster than other compiled languages. I think that's partly because it can find a good algorithm, based on your data structure; think of loop unrolling, etc.

You can inspect the assembly code for anything you're working on, and that can be quite helpful at times; see e.g. https://youtu.be/wU6c8CDRXJE?t=3887.

I think the reason why quite a few high-performance people (I mean in the science community -- I don't know much about other communities) are excited about Julia is simply that well-respected experts are also excited. An example is the Julia implementation of the MIT GCM (general circulation model) for the ocean; for similar projects, see https://github.com/CliMA.

Programmer effort is also a factor in scientific computation. If the system can do some of your work for you, so much the better; see e.g. https://www.youtube.com/watch?v=rZS2LGiurKY for a lecture that touches upon how the framework of Julia eases the burden of machine-learning tasks.

As I say, though, I do not have deep experience with Julia. I've rewritten one of my numerical models in Julia and the speed is about the same as before, but my code is much shorter and easier to understand. I would not burn up 6 months translating a complex code, but nor would I start a 6-month coding project in Fortran anymore.


In general, Julia has similar performance characteristics to other fast compiled languages (Fortran/C++). There are some performance differences due to different semantics (e.g. bounds checks by default), but Julia is good about giving you the ability to opt out of these easily. There are also definitely places where Julia makes it a lot easier to get better code by making it a lot easier to use better algorithms, so in general, simple Julia code tends to have similar or better performance than Fortran/C++, and optimized Julia code tends to be about the same speed as Fortran/C++, but with 10x less code.


I agree that Julia isn't yet an R replacement, but I think that in addition to being a new Fortran, it also does well as a new Matlab/numpy.


I think a lot of the problem with comparing R to other languages is that a lot of people don't get the problem space that R is working in. Science deals a lot with categorical variables, missing data and high-dimensional data, and the 'table' or 'dataframe' is adept at storing and working with this information. Under the hood it's just a load of optimised Fortran code working on matrices, but the code clearly shows what kinds of data manipulations and transformations you are doing to eke the right information and visualisations out of the dataset.

I see problems when people take an imperative approach to solving numerical problems, and something like Python is better suited to that. Also, R isn't really set up to work with matrices like Matlab/Julia are.


The points the author mentions are fair but something feels amiss. I have used R heavily and still use it from time to time and I never use most of the functions mentioned in the post. For instance I have never used switch().

R is for data manipulation. 90% of what I do in R is manipulate dataframes or matrices and then run machinelearningmodel(mydataframe) or ggplot(mydataframe). And for this it is incredibly efficient. You can rightly argue that some elements of the language are quirky but that's missing the point.

> Asked over 100 Stack Overflow R questions.

As a tangent, I find a hundred questions asked in the first year of using a very mature language to be a lot.


Yes, I think if you are using switch() for an analysis in R, you're either using the wrong language, or you're doing R wrong.


What always surprises me is how many people make beautiful, lovingly-crafted band-aids to the language's warts and problems. Not just code + packages but social band-aids too.

In a way, you could argue that the entire tidyverse is a huge effort of a band-aid.

So for all the irritating design choices and idiosyncrasies, R is still a network of islands that work incredibly well for people, as long as they don't ever go to sea.


I've written an interpreter for R (a subset; it was for school and I left out some features like S4 and the condition system), so I have done a pretty deep dive into the language reference and GNU R source.

I agree with the author's sentiment - I love a lot of what R has, but there are a lot of small madnesses.

There are so many unique PL ideas in R (they may not actually be unique, but they're certainly unique among common languages today):

- first-class environments

- named, default parameters and even the ... parameter, which encourages the pattern of hierarchical library functions: there's one large customizable main workhorse function, and many wrapper functions that specify some defaults or add some behavior, but all the underlying customizations are exposed through ... (see the sketch below)

- copy on write as a default

- ability to choose evaluation strategy
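
A rough sketch of that hierarchical-wrapper pattern (my illustration, with made-up function names):

    plot_series <- function(x, y, col = "black", lwd = 1, ...) {
      # the workhorse: anything not named here passes straight through to plot()
      plot(x, y, type = "l", col = col, lwd = lwd, ...)
    }
    plot_series_red <- function(x, y, ...) {
      # a thin wrapper: pin one default, keep ... open for everything else
      plot_series(x, y, col = "red", ...)
    }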

But I also wonder how many of these cool ideas would actually work well in a saner language


Is the subset R interpreter you wrote available on the net?


Sorry for getting back to you so late; I'd rather not link my github here as I'd rather remain as anonymous as possible on public forums


R is a truly terrible language with a handful of bright spots, such as its visualization libraries.

The boost you get from the slightly better expressiveness of R over something like Julia or Python is not worth the headaches you'll run into down the road in trying to maintain whatever you wrote 6 months later, or God forbid, trying to integrate your code into someone else's work.

R was my first language and in hindsight that was a HUGE mistake. So much of the R code out there is horribly written, and even when it isn't, you still have to deal with all of the issues the author here points out. If you pick up R as your first language, you will end up picking up all sorts of bad habits.

R is fine if you're working solo and you don't plan on maintaining or reusing your code. For everything else, R is garbage. It took me a year or more to undo all of the bad habits I picked up learning R.

I don't agree with the "worse is better" comparison in the comments here. "Worse is Better" was meant to refer to the idea of "Don't make the perfect the enemy of the good", among other things. It was not meant to be used as a justification for poor design. If anything, python for data analysis fits the "worse is better" philosophy much better than R. It's not as well optimized for data work compared to R, but it's much simpler, more consistent, less error prone, and it plays well with others.


Counterpoint: a lot of work in R is not "development" in the ordinary sense. The outcome is not a piece of maintainable code that needs to be built on later or be generally useful in any way other than copying an occasional snippet.

In some research fields (e.g. scientific fields that use R) the ground rules are that the code needs to be understandable and it needs to be clear that the libraries involved were used correctly. That's basically it. Even hardcoded directories are common. Good development practices are not widely understood to be important and in general many people are just starting to get the hang of version control and might not use it at all.

If R enables you to solve a statistical problem you have right now and it does this in a way that is better or more comprehensible for the people who use it, that means it has a niche. As someone with software development experience in a bunch of other languages, I agree with you that R is full of weird warts, but let's not forget that there are areas where its value is still obvious.

Citation: my partner works in a scientific field where R is predominant.


> I agree with you that R is full of weird warts, but let's not forget that there are areas where its value is still obvious.

For sure. As much frustration as I had with R, at the time it was an enormous improvement over the stuff that came before it. And its emergence and success led to other languages improving their data and analytics capabilities.


> R has two types of empty string: character(0) and "".

I understand it's frustrating trying to use a language you don't understand. And instead of reading the language manual you go on rambling.

"" is an empty string (almost) as you know it from other languages.

character(0) is an empty vector of type character (i.e. a vector with no elements). This vector doesn't even contain an empty string.

R is a vectorized language. You almost always deal with vectors. "" actually is a character(1), a character vector of length 1. Once you understand this, there is a chance for you to enjoy R.
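
To make the distinction concrete (standard base R behaviour):

    length("")                    # 1 -- a character vector holding one empty string
    length(character(0))          # 0 -- a character vector with no elements at all
    nchar("")                     # 0
    identical("", character(0))   # FALSE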


I'm going to side with the author here: if he read "Advanced R", "R for data Science", "The R Inferno", "Rtips. Revival 2014!", the official "An Introduction to R", "R Language Definition", and "R FAQ", and yet he still has problems with the language, then maybe the language is to blame.

And even if the author is the problem, I wouldn't accuse them of not reading enough.


Ok, but if someone claims to have read all the Python manuals and wrote something like

> Python has two types of empty string, array('u',) and ""

you'd probably conclude that he hasn't really understood what he read.


For Python (at least Python3) a better example might have been b"" and "". They are not equal, and they are empty. You have to decode or encode one, for instance, and different functions return different things. Then different functions might return False, None, (), {}, etc.

This OP complaint seems like weird nitpicking about R. Many languages have different empty/null-types for different variable-types. Also, don't get me started on "nulls" in C, C-strings, C++ strings, or memory allocation.

All languages are complicated.


does f"" also count as an empty string? Because oddly enough, I don't see anything in that syntax which suggests that it is really a function which returns a formatted version of whatever is in the "".


Or maybe the language just isn't for everyone and for every use case. I would be hesitant to write something customer-facing in R. But it's great for doing statistics. The main problem with R is that people underestimate how different it is and thus don't care to learn practices for writing robust R code.


The issue isn't so much that character(0) is a zero-length vector and "" is a length 1 vector containing the empty string, it's that you can't necessarily rely on other people's code returning one or the other: things that 'nearly' always return a one element vector (which may contain the empty string for 'nothing') can vary unexpectedly if it fails on an edge case. And, unless you catch it correctly, this can cause downstream failures with little in the way of warning (because a zero-length vector is obviously a 'sensible' thing to return from a function in the usual case).

In that sense, it's similar to the problem many languages have with NULL, but on steroids: you can have NULL, NA, character(0) (or anythingelse(0)), or '' as your null result, and each of them is tested for in a different way.

Obviously this won't be a problem for the various battle-tested standard libraries, but a lot of my work in R at least is assembling somewhat-novel analysis pipelines based on quite new statistics code.
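
A sketch of that failure mode (my own contrived example, not code from any real package):

    # a stand-in for a function that "nearly always" returns a one-element character vector
    extract_label <- function(x) x$label[x$label != ""]   # returns character(0) on the edge case
    res <- extract_label(list(label = ""))
    if (res == "") res <- "unknown"    # Error in if (res == "") ...: argument is of length zero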


To some extent, this is rather a general problem with quality of code in dynamically typed languages. Public functions should return predictable results. This is a matter of testing. In my experience, packages on CRAN are well tested.

With respect to dealing with return values, you can circumvent some pain points by using identical(), isTRUE(), isFALSE() in if conditions instead of, e.g., `==` which many people use because this is what they know from other languages. The assertive package is also nice.
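
For example (my sketch), `==` on a zero-length vector produces logical(0) and breaks `if()`, whereas the safer forms just return FALSE:

    x <- character(0)
    # if (x == "") ...        # error: argument is of length zero
    isTRUE(x == "")           # FALSE, no error
    identical(x, "")          # FALSE, no error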


That could have been expressed more diplomatically, but I think you're right. IMHO, what people should try to understand first and most thoroughly about a new language is its native data types. This is more fundamental than the syntactical constructs.

R's data types are one of its most alien parts, and that's why I think if you're coming from another language, chapter 20 of Hadley Wickham's book[1] is the most important one.

[1] https://r4ds.had.co.nz/vectors.html


> I understand it's frustrating trying to use a language you don't understand. And instead of reading the language manual you go on rambling.

I'd agree with this assessment. If you start doing R and it feels weird to you then -- in my opinion -- you're probably in the wrong place. Meanwhile, for the cognoscenti -- the researcher, the statistician -- R behaves just as you'd expect. That is the draw -- a language developed around statistics.

R is not a great computing environment for computer science, e.g. writing iterative algorithms. Almost everything worth a damn in R is written in C++ and then FFI'd in. Those who do not want to use C++ can write their algorithms in Python or Julia -- and they often do. Arguably the de facto language for computation-oriented machine learning is Python, not R.


The popularity of the Tidyverse is a major blow to your motivation to learn R. Why would anyone want to learn a language that is treated as secondary to some packages? Worse still, if that turns out to be the best way to use R, then you’re forced to admit that R is a polished turd with a fragmented community.

As others have mentioned, just use tidyverse. I picked it up 4 years ago, and last week I went back to the code I wrote then.

I was productive in minutes. I could read the code, modify it, and easily test it in the REPL. The docs for dplyr are good.

ggplot2 is still awesome and the docs are good there too. ggplot2 is the fastest way to figure out what you want and make a pretty plot.

(However one thing that still annoys me is that R moves faster than Debian. So it's possible to do install.packages() in R, and it will break telling you your Debian R interpreter is too old. There is no easy solution for this, just a bunch of workarounds)

-----

OK, sure you can call it a polished turd, and to some degree that's true. But a polished turd is better than just using ... a turd!

The error messages in R are not quite as good as Python, but I wouldn't call it a problem. I'm able to localize the source of an error, even when using tidyverse.

My article comparing tidyverse to some other solutions:

What Is a Data Frame? (In Python, R, and SQL) http://www.oilshell.org/blog/2018/11/30.html

----

But would I recommend learning it to anyone else? Absolutely not. We can do so much better.

I would recommend it, with the caveat that it's one of the hardest languages I've had to learn. However, that is partly because it changes how you think. But if you have a certain type of problem then you have to change how you think, or you'll never get it done. Data analysis is surprisingly laborious, even for people who have, say, written compilers and such.


Lot of useful insights in the comments here. I wanted to address one specific comment -

>can't remember the last time I saw a project someone did in R get very much traction anywhere...the only time people talk about R on the internet is to discuss the language itself which is definitely frustrating

There is a lot of R deployed in industry, even in Silicon Valley, but you have to be in-the-know. R gets plenty of use in statarb & model checking in finance - speaking from personal experience at GS & BofA/ML. My one non-trivial project at Twitter involved working with this team building a model & I remarked - hey, this can be done rather easily if you use this library in R - and the team lead says, yeah, that's how we're doing it! But I thought we are a Scala shop, I said. So he says, yeah, but imagine building that entire library in Scala from scratch, it'll take forever! So I enquired how he gets it done - you basically spin up a socket server & the JVM sends R commands plus data as payload over the socket, the server runs R and returns the result of the model back as a string, boom done! I said it was kinda janky & he says - I won't tell if you don't! So that's R for you - it gets the job done & it's fast & somewhat messy, but it is used everywhere, yet people won't openly admit to it because it's a 30-year-old language & we all want to be using the latest & greatest tool.

I now work at a news startup with a few million users, & all of the news personalization is done in R. So when these millions of viewers watch TV, the piece of code that decides which news clip should be shown ahead of which other news clip & which clip comes after - all of that is decided by a block of R code that I wrote. ~ 300 lines of R, uses quanteda, tidytext & parallel under the hood. Pretty much everything I do involves mcmapply, which parallelizes your compute & uses as many cores as you specify. But that's sort of the thing with R - you have to know which functions/libs to use & which ones to avoid. Just switching from tm to quanteda got us a 200% bump in perf. Switching sapply's to mcmapply was another winner. These things aren't documented cleanly - you have to keep up with cran, experiment & see what works best for you.
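
Roughly the shape of the mcmapply usage I mean (the data and scoring function names below are made up, not the real production code):

    library(parallel)
    # score each clip on its own core; mc.cores caps the parallelism (POSIX only)
    scores <- mcmapply(score_clip, clip_id = clips$id, text = clips$text,
                       mc.cores = 16)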


I would say about 90% of the posts / articles / comments I see on the internet which discuss R are usually of the "meta" format. They talk about R's strengths or weaknesses, about the difference between R and Python, about how much they love or hate R, or any other high level subject.

I can't remember the last time I saw a project someone did in R, or a tutorial on how to do something in R, get very much traction anywhere. It seems the only time people talk about R on the internet is to discuss the language itself (which is definitely frustrating), and it's getting old. Even this awesome, comprehensive document, which I would usually be foaming at the mouth to read, has me going "meh". I'm tired of the subject.


> I can't remember the last time I saw a project someone did in R, or a tutorial on how to do something in R, get very much traction anywhere.

well, you know, I'm not very active in C++ any more, and I haven't seen an article in over a decade on C++ which received any traction at all. So I guess C++ isn't getting any traction any more either.


They don't often come with code, but one of my recent sources of R programming joy is the folks posting their generative art to twitter: https://twitter.com/search?q=%23rstats%20%23generativeart&sr...


This is definitely true on HN, at least. I think the vast majority of R-users are just plugging away on their domain specific problems daily, and tend not to participate in these conversations.

Dark-matter statisticians, I guess?


What needs to be added is that before R, the reproducibility problem in science was compounded by the fact that analyses were done with proprietary software, limiting communication and replication of those analyses. This was and continues to be a major problem, particularly in some fields, but at least now there is a common, widely used language that can be used to overcome this. I wouldn't focus on idiosyncrasies but rather on the major problem it addresses. Any large system will grow over time and have some inconsistencies, but after a while you learn the workarounds, so they are less important than the big picture.


On the contrary, R's packaging system is too broken for R to be reliably reproducible. No one specifies package versions or R versions. Base R has no way to install a specific version of a package. There's a package that lets you do that, but, well, you might need a specific version of it. Particularly if you need to run an old version of R to reproduce an old script, it may be impossible to use any standard tool to install the correct packages thanks to this problem - the version of devtools that install.packages gets won't be compatible with your old R, but you need that package to request another version. Instead everyone just ignores it and hopes package versions don't matter.


I don't see how R specifically addresses the reproducibility problem. It's been around for almost 30 years, and before its recent rise in popularity lots of science was done in C, Perl, Fortran, etc. Not to mention that actual dependency versioning is pretty poor. I struggle to run other people's R code after about 6 months (especially if they used the tidyverse, as it pulls in hundreds of unstable dependencies), and nobody records what package versions are used and functions are seemingly deprecated every week.


1. Before R, commercial statistical packages were mainly used. You can, in principle, just use assembler too and develop everything yourself, but it isn't practical. Regarding C/C++ and Fortran, many R packages are, in fact, wrappers around code in those or other languages, making it easier to access them. From that point of view R can be regarded as a glue language.

2. Regarding keeping versions straight, all past versions of packages in the CRAN repository are kept on CRAN. The Microsoft MRAN repository also maintains histories of packages that can be accessed via the checkpoint package, which will install packages as they existed on a given date. Furthermore, install_version in the remotes and devtools packages can install specific versions (sketch below).

3. Regarding tidyverse dependencies, you can reduce the number of packages you load by not using library(tidyverse) and instead loading the specific packages you need. This will result in fewer packages being loaded.
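
For concreteness, a sketch of the tools mentioned in point 2 (the dates and versions are purely illustrative):

    # install packages as they existed in the snapshot for a given date
    checkpoint::checkpoint("2020-06-01")
    # or pin a single package to a known version
    remotes::install_version("dplyr", version = "0.8.5")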


> Before R commercial statistical packages were mainly used.

Maybe in your field, I work in bioinformatics - before R, perl was widely used as a high-level language.

> Regarding keeping versions straight, all past versions of packages in the CRAN repository are kept on CRAN...

This is woefully inadequate if you need to replicate somebody else's environment. Nobody should think manually guessing and then typing in each package version and hoping they're compatible is a viable option. Not to mention even if you specify an older version of a package it doesn't pull in compatible dependencies, it just pulls in the latest version. There's renv but it's not reached widespread use.

> Regarding tidyverse dependencies you can reduce the number of packages you load by not using library(tidyverse) and instead load the specific packages you need. This will result in fewer packages being loaded

We're talking about replicating other people's work. We don't have any control over their code, and R users are largely ignorant of best software practices.


Totally agree. I find it frustrating trying to reproduce other people's work in R. How has this situation been allowed to continue for so long? It's unacceptable, especially when used for science. It's impossible to replicate anything unless you are lucky enough to find which package version introduced the breaking changes, and even then this is something you have to do repeatedly for every code break. Even with renv, it's a library you have to install within your R environment, which is pointless. Where is a dependency solver like conda for R? Not that conda is perfect, but I've been happy with its drop-in replacement, mamba, recently.


The packages that were used in statistics were SAS, SPSS and Stata. perl is not a statistical package and has nowhere near the depth of statistical capabilities of R.

Don't forget that I also mentioned the checkpoint package in my post. You only need to know the date for that, not the version of each of the packages.

In your last paragraph I think you are referring more to software development practices than what is available through R. Simply using R or any language doesn't guarantee this.


That's a very roundabout way to solve an actual problem. In many cases you don't pin your package version to _latest_ (whatever that date is) and you need a more fine-grained solution to keeping package versions. I don't think that solves this and I don't know if you can do it with checkpoint.


Of course it is possible to screw up but if you don't update your packages and record the date that does not seem to be R's fault.


um... this is about statistics; before R, people used to finish their analyses in MATLAB


I don’t like this. Much of this is:

1. pointing out that, like every other language, base R has idiosyncrasies

2. how use of R is more complex when you're largely ignorant of the tidyverse, which is crucial for the vast majority of today's use of R

3. frustration because you’re using a language/ecosystem, that’s targeted for a few specific uses, as a general purpose programming language


> how use of R is more complex when you’re largely ignorant of the tidyverse

This.

I'm interested in non-flamewar non-religious reasons that the tidyverse is bad. He does give some. I think his complaints about inconsistency and a moving target have some validity. However, the price of not using tidyverse is (roughly) paid in the rest of the article. I would definitely not use R without it.

Read his Section 5 on the tidyverse... and see how absolutely minimal his complaints in that section are. E.g. to "purrr" his objections are "largely philosophical"... but he's complaining in the previous section about the annoyance of writing lambdas (which purrr makes even easier).

Yes, R has a big community and there's a lot of quirks in individual packages, especially less-used ones. Yes, there are packages presenting unified interfaces to other quirky outputs (e.g., broom). The necessity of this is not good. The existence of it is good.

HN readers - do you have an "up and coming" language that you think has better structured the fundamentals from R, that you hope will someday have enough capabilities you can use it instead of R? I've tried Julia, which is beautiful but the startup/compilation times were difficult to get over. Is it reasonable to hope Julia will be good for interactive usage someday? Is it already? Are there other candidates in this area?


> Yes, R has a big community and there's a lot of quirks in individual packages, especially less-used ones

Most of his examples of WTF's are from base-R. And he's definitely not wrong, as many of these have bitten me a bunch over the years.

> I'm interested in non-flamewar non-religious reasons that the tidyverse is bad.

For the very reason that it's great to use, it's a nightmare to develop with. NSE is super handy as a user, but it's an absolute nightmare to build new functions on top of (dplyr specifically). Like, I now know 2-3 different ways in which quoting/substituting etc can be done for the tidyverse, and I've had to maintain code using them a bunch of times.

It's incredibly annoying, and every time I do it I need to look up Hadley's new approach to NSE (don't get me wrong, I adore using the tidyverse, but I absolutely despise programming with it).
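
For anyone curious, this is roughly what "programming with the tidyverse" looks like under the current tidy-eval approach (one of the 2-3 ways mentioned above; my sketch, not anyone's production code):

    library(dplyr)
    summarise_by <- function(df, group, var) {
      df %>%
        group_by({{ group }}) %>%                  # "embrace" the bare column names
        summarise(mean = mean({{ var }}), n = n(), .groups = "drop")
    }
    summarise_by(mtcars, cyl, mpg)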


> HN readers - do you have an "up and coming" language that you think has better structured the fundamentals from R, that you hope will someday have enough capabilities you can use it instead of R?

Hope is the operative word here!

I'm writing a language to compete in this area. It's called Mech and I'll be releasing the first beta in October. You can think of it like Matlab + Excel. It's very fast, has default-parallel semantics for operators and functions like Matlab, reactive dataflow like Excel, and supports full interactive coding with no startup/compilation latency issues. It's meant for robots, but I've also designed it to be a better Matlab, and I think it should take on R handily. Fair warning, it's public alpha now so error messages are sparse and the happy path is narrow.

https://github.com/mech-lang/mech


> I'm interested in non-flamewar non-religious reasons that the tidyverse is bad.

Going to answer with a question: Why is tidyverse == R considered true?

I use ggplot frequently, but for data manipulation data.table is orders of magnitude more powerful. And more stable.


data.table also uses OpenMP to parallelize operations, so it tends to be much faster


Tidyverse is so much more verbose than data.table, it’s painful. I don’t see the draw to it, to be honest.
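
For a flavour of the difference, the same aggregation both ways (my sketch; which reads better is, of course, the argument):

    library(data.table); library(dplyr)
    dt <- as.data.table(mtcars)
    # data.table
    dt[mpg > 20, .(mean_hp = mean(hp)), by = cyl]
    # dplyr
    mtcars %>%
      filter(mpg > 20) %>%
      group_by(cyl) %>%
      summarise(mean_hp = mean(hp))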


Worst part of the tidyverse is learning it, and then looking up how to use specific functions. The bad documentation is mostly in the ggplot lib though.

It's a pleasure to use, though!


The “bad” ggplot documentation is mostly a function of one’s own understanding of the grammar of graphics. That’s what the “gg” in ggplot stands for.

If you don’t understand the GG, then ggplot will seem opaque, and no goodness of documentation will suffice.

I don’t mean to blame the user. Perhaps the ggplot documentation could improve by reinforcing the need to understand that or referencing it more frequently?


It's a good reading but some of the complaints are hard to understand.

For example, in 4.5.1:

  Selecting and deleting at the same time doesn’t work either. For example, data[c(-1, 5)] is an error.
What would it mean for that to work? He seems to acknowledge that "selecting and deleting at the same time" doesn't make sense in 4.11.1

  Can you guess what data[-1:5] returns? I can’t either, so don’t ever try it. If you must know, it’s actually an error.
Also in 4.11.1:

  The : operator is absolutely lovely… until it screws you. The solution is to prefer the seq() functions to using : [....] As I’ve said, seq() and its related functions usually fix this issue.
Maybe the "related functions" fix some issues but seq(a,b) is not different from a:b

In 4.11:

  Now what do you think names(foo) <- names(bar) does? Seriously, can you guess? I can think of roughly four realistic guesses. Is it even valid syntax? 
How is that surprising? Can the author also think of four realistic guesses about the effect of A[1,2] <- B[3,4] for example?

In 4.13:

  The index in a for loop uses the same environment as its caller, so loops like for(i in 1:10) will overwrite any variable called i in the parent environment and set it to 10 when the loop finishes. [...] This sounds awful, but I’ve never encountered it in practice.
Is it awful? The same happens in other languages like Python or C if I'm not mistaken.

  The plot() function has some strange defaults. For example, you need to have a plot before you can plot points [...]
I have no idea what that means. You can plot points using plot() without having a plot beforehand.

Edited to add: In 4.5.3:

  The $ operator is another case of R quietly changing your data structures.
Is it unexpected that when we extract an element from a data structure we get a different kind of data structure? Is A[1,1] another example of silently changing one data structure (matrix) to another (number)?


> Is it awful? The same happens in other languages like Python or C if I'm not mistaken.

It annoys the shit out of me in python. I much prefer perl's

    for my $x (@array) { ... }
or ES6's

    for (let x of array) { ... }
Note that given python and ruby only do function level scoping rather than block level, I can -understand- why they work the way they do even if it annoys me. R already has the necessary granular scoping to do the (IMNSHO) sensible thing so it seems like a pointless wart.

My -guess- would be that if it was intentional, it came about in R because for loops are rare enough that you want to know the last index more often than you don't because if you don't care about the index presumably you'd've written something else.

(also you could argue the ES6 version would be better written using 'const', but I've lisped sufficiently my fingers invariably generate 'let' when left to their own devices - caveat emptor)


> Here’s a challenge: Find the function that checks if "es" is in "test". You’ll be on for a while.

grepl("es", "test")


I haven't used R at all in years and only used it a couple times in passing many years ago to try it out.

I searched "R string functions", saw "grep" and wondered if there was something I was missing in the author's challenge.

Is it because it uses regular expressions that they don't consider it the correct answer, or is it because they aren't as familiar with regular expressions as some other people are, I wonder?
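
For the record, both forms work in base R; fixed = TRUE skips the regex machinery entirely:

    grepl("es", "test")                 # TRUE, pattern treated as a regex
    grepl("es", "test", fixed = TRUE)   # TRUE, plain substring match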


Personally, I feel like the biggest problem with Python for math is the lack of a native vector datatype and our subsequent reliance on NumPy, which really disrupts the elegance/terseness of working with vectors/matrices.

First, there's the constant inelegance/clutter/inefficiency of having to cast into and out of arrays and lists, even when doing basic list comprehension. R, Julia, and Matlab are all vector-based languages (I think), so you avoid having this casting as much.

Secondly, having a native vector type means you don't have to worry about the performance penalty of operating directly on arrays if an existing prebuilt method exists. Since the efficiency of NumPy comes from calling its underlying C library, you're forced to memorize and use prebuilt NumPy functions rather than just use the more obvious and elegant array manipulations. For example, rather than calculating the cumulative sum like this:

  from itertools import accumulate
  cumsum = list(accumulate([1, 2, 3, 4]))

We have to do this:

  cumsum = np.cumsum([1,2,3,4])
(There are better examples, but this is all I can think of right now).

And once you add something like PyTorch tensors on top of this, we now have an additional layer of casting/redundancy/memorization of prebuilt functions!


Excellent read. I agree with a lot (after only a cursory read). One thing the author seemingly forgives R for, by not mentioning it, is how harsh and discouraging to beginners the community was in its early days. That was my experience around 2002-2006.


I'm not sure that's true anymore. The one thing that I find interesting about the R community (being at home in Pythonland myself) is the availability of great teaching and learning resources. At least that's the case in the social sciences, where R has pretty strong adoption.


Yeah I tried to publish a package back in like 2015 and was dealt with very harshly, got banned for a week from submitting for asking questions about the process after the first attempt didn't work. It really turned me off R and frankly I haven't looked back.


R is designed for data analysis, not for general computing. Its syntax differs from that of other systems. Python's syntax also differs from other systems. Same for Matlab. And so on.

Non-uniformity imposes a burden that will be too much to bear, unless the system offers particular advantages. The fact that several systems co-exist is proof that the advantage-burden balance is favourable in each case.

There is no need to converge on a single tool. Carpenters need both saws and hammers.

In practical applications, language syntax is just part of the story. One must also consider the issue of available libraries. One thing that really stands out with R is its immense collection of well-vetted and well-documented packages. Python and Matlab -- the two main alternatives in my discipline -- fall far behind R in this respect. If there's a journal article on a new statistical technique, then there's a pretty good chance of a package written by the same author. And, if that package is on CRAN (the repository for such things), then it has undergone quite rigorous testing on several types of computer, with several versions of R.


AND! packages don't update every 3 weeks breaking things!

My deity! Someone was complaining about inconsistent syntax but doesn't recognize inconsistent dependencies?


My favorite operator is the pipe operator. When I first found out you could do a simple `ls | more` to read long outputs, it was an eye opening experience. In Clojure, we have the threading macros, `->` and `->>` that do a very similar thing. In R, we have `%>%` and now the native `|>`. Whenever a language has this operator and it is widely used, I know I am going to love it.
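
Side by side, for anyone who hasn't seen both (a small sketch):

    library(magrittr)
    mtcars %>% subset(mpg > 25) %>% nrow()   # magrittr pipe
    mtcars |>  subset(mpg > 25) |>  nrow()   # native pipe, R >= 4.1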


I credit F# with much of the popularity of the forward pipe operator. Unlike Haskell etc. which emphasize function binding (>>), idiomatic F# has pipes all over.

    [1..10]
    |> Seq.filter (fun x -> x % 2 = 0)
    |> Seq.map (fun x -> x * x * x)


Looks like I need to check out F#! Thanks!


I only briefly knew R in grad school (circa 2007) but I lived inside Stata for about 7 years, and yeah, while specific frustrations vary, the general tenor...

Then -- I once made a meme to that Oliver Stone Vietnam movie that said "This is my copy of Stata. Without me it is useless. Without it, I am useless. I must cherish it as I cherish my life..." (In the original said of a rifle.) I was good with Stata, fast and precise and never "ugh, okay, let's open a quickie notebook... there goes my morning..."


I support bioinformatics researchers and my R problem isn't the language itself but the increasingly fragile tower of packages that users cobble together.

At this point, I see R users (typically PhD students and post-docs) doing "science" in R by playing with parameters to functions in poorly-understood packages and publishing papers on which parameters are "best" for data generated from some specialty source.

A very common situation for me is to be pulled in only after a package has been created with some vague hope of fixing performance problems (which R, Rcpp, and RcppParallel make fun to do for me, but I have some C++ background for scientific computing, ymmv). It is extremely common to find that these packages contain fundamental logic errors that probably should invalidate the (already published) results but never got caught because the code ran without actually failing. I guess I'm complaining that people are using buggy packages to write more buggy packages and it just bothers me.

Library-driven development is just how the world works these days. And it should! But I'm not confident that the R bioinformatics world has the kind of guardrails I would prefer to see. I mean, I am reasonably confident tensorflow is functionally correct. Any R package that pulls in too many other R packages to begin with is probably not.

As for the language itself - I guess it is ok. I have some lisp in my background and a fair amount of love for non-traditional array languages. But I don't see much R code that seems to stick to the R "standard library" rather than pulling in a million packages to do anything . . .


> I see R users (typically PhD students and post-docs) doing "science" in R by playing with parameters to functions in poorly-understood packages and publishing papers on which parameters are "best" for data generated from some specialty source.

That's a lot of bioinformatics, and not specific to R. It is a huge issue with anything vaguely pushbutton in the bioinformatics domain.


“I don't see much R code that seems to stick to the R "standard library" rather than pulling in a million packages to do anything.”

People teach the tidyverse to new r users. It makes them think that it’s standard practice to pull in lots of unnecessary but possibly convenient packages. Simple string manipulation should not require an extra package like stringr, but for many users it does. Often, they were taught this way.


I am not a developer by profession but have been programming since the early 90s, starting with BASIC, Fortran, C++ and Matlab. I learnt JavaScript, Python and Lua as well along the way for various reasons.

I found R in 2010/11 when I was looking for a free alternative to Matlab.

Nowadays R is my go-to scripting language any time I just want to get to the results and don't care about reproducibility.

I also use Shiny as an alternative for multiuser scenarios involving spreadsheets, since I work in a financial firm where Excel and VBA still dominate most of the front office functions.

Sure, Python could be a good tool, but once you become fluent with the R ecosystem, moving to Python just feels like too much work for things that can be done with a few lines of R code.

For me, the conciseness of data.table and the ability to cook up Shiny web apps with very few lines of code is the biggest pull.


I am looking forward to reading this but I need to point out that section 1.3 Ignorance is both interesting and necessary. What 1.3 basically outlines is, "I used R for a year, but the way I used R is very different than what most R people use R for."


I find R's strengths lie in its unmatched collection of statistical libraries, but I dislike R's syntax so much that, if forced to use it, I would call an R package from Python (using RPy2), or just use a Python alternative (e.g. Plotnine).


> [discussing how c(list(1, 2, 3), LETTERS[1:5]) is not what the author would expect] To get list(1, 2, 3, LETTERS[1:5]), you must do something like x <- list(1, 2, 3); x[[4]] <- LETTERS[1:5].

The following works and it looks quite natural:

c(list(1, 2, 3), list(LETTERS[1:5]))


A minor tip: I use the hashtag #learning to annotate pieces of code which I can then revisit using a search (could be just grep '#learning' *.R | grep data.table, say) in case I'm stuck. This could work with any language, of course, but in the initial days I found it very useful with R in particular, given its idiosyncrasies. (For the oldies on here, I grew up in the era of del.icio.us, so hashtag-ging code felt like a natural-but-novel idea :-) )

Examples:

#learning : a data.frame is a list. x = df with 10 rows, 21 columns, say. as.list(x) gives you a list with 21 elements, one per column

#learning : Above = getting a row and its previous row using .I() in data.table


I am surprised no one has mentioned the awful garbage collector: https://stackoverflow.com/q/14580233/850781

The R garbage collector is imperfect in the following (not so) subtle way: it does not move objects (i.e., it does not compact memory) because of the way it interacts with C libraries. (Some other languages/implementations suffer from this too, but others, despite also having to interact with C, manage to have a compacting generational GC which does not suffer from this problem).


Honestly, probably because so many languages have non-compacting collectors that people just accept it as a trade-off - in the sense that compacting collectors are non-trivial and without lots of work can produce significantly higher GC pauses, and so doing it really well requires a lot of engineering effort that you might prefer to be spent elsewhere adding features you want more.

golang's collector isn't compacting either - though it uses per-size-class arenas for allocation so you don't end up with fragmentation bloat to nearly the same extent. Part of me wonders if simply building R against jemalloc would get a decent chunk of the same advantages.


I've recently started getting into computational archaeology and found the entire ecosystem is built around R, meaning I am now starting to learn about it. Anyone have a suggestion of the standard books/courses one should start with?

I found it pretty interesting that the alternative to R is Haskell for general CLI tools! Seeing some open issues in a popular tool for dealing with ancient DNA (aDNA) about making invalid states impossible within the type system made me genuinely laugh out loud in amazement. I didn't expect that level of technical knowledge within the world of archaeology.


We used the books Hadley Wickham has published for R courses in my stats program [1].

I supplemented the theory parts of my other courses with some of these [2] R books about using the methods instead of deriving and proving properties about them.

There are also some R studio cheat sheets [3].

[1] https://hadley.nz/

[2] https://www.routledge.com/Chapman--HallCRC-The-R-Series/book...

[3] https://www.rstudio.com/resources/cheatsheets/


Any chance of sharing which tool(s)? I know some people who're involved in making the haskell ecosystem a better place for less mainstream users and I'm pretty sure they'd want to know more about this.

(and if it turns out they already know, -I- don't, and that sounds pretty cool and fun to read up on :)


Sure! It’s the department of archaeogenetics at the Max Planck Institute for Evolutionary Anthropology

https://poseidon-framework.github.io/#/


Much appreciated, passed on in the hopes it's useful, and when it's not 0030 and I'm awake enough to actually understand any of it I'll be having a read through myself.

Cheers!


I’m a total noob but offered my development skills in trade of archaeological domain knowledge, so just getting familiar with the code base myself atm :) if you/your friends wanted any sort of intro to the team I could arrange that as I’m in a slack channel with some of them!

Fwiw, it's mostly a tool for parsing and processing this 'janno' data schema https://poseidon-framework.github.io/#/janno_details


To rephrase the famous quote attributed to G. Box,

`All languages are wrong, some are useful.`

And R is one of them.


I first learned how to code in R before moving on to Python, then some C and Go. I think a big cause of the SWE hate of R is that it's not OO programming. R is a functional language for data analysis. If you don't grok that, then I can understand why looking at it would make you barf. Going the other way, from functional to OO, caused me physical pain as well.

R is amazing for data analysis. Also, RStudio is a much more efficient solution for iteratively exploring data than Jupyter. Don't make fun of a screwdriver for not being a hammer.


I made the same transition from R to Python and I still resist using OO. I never understood why people would use it instead of functions.


Considering how loved Elixir is, functional definitely isn't a problem.


I agree with the concerns about lists. They're a poor substitute for structs in a language with static typing. By now I've accumulated lots of knowledge and helper functions.

You can dramatically simplify your life by using lists with lapply and related functions. I teach students with no previous programming experience to do some things that would otherwise be far too complex, but I also have to write a helper function to convert the output into a usable form for further analysis like plotting.
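
The pattern I have in mind looks roughly like this (my sketch): lapply over the inputs, then a small helper to flatten the list for plotting.

    # fit one model per group, keeping the results in a named list
    fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))
    # helper: collapse the list into a data frame that plotting code can use
    coefs <- do.call(rbind, lapply(names(fits), function(cyl) {
      data.frame(cyl = cyl, slope = coef(fits[[cyl]])[["wt"]])
    }))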


I think (and I usually anger at least some people when I say this) that it's wrong to see R as a 'programming language'. I mean, it looks like one, and it's Turing complete so if you use that as a criterion it is a programming language, but I think it's more useful to see it as a stats software package with a text-only user interface. Approaching it this way instead of as just another language to 'pick up' makes using it much less frustrating.


I suppose it depends on how you define a 'programming language'.

I'm not angered, I'm more wondering about the usefulness of the arbitrary line you've drawn in the sand, and even the shape of the line.

R Shiny lets you build interactive webpages with advanced GUIs; if <webdev stack> counts as a programming language, why not R?

Many people like Python because it lets you script things, and you can even make your script executable with a shebang at the start (#!/usr/bin/python) -- and while true, that isn't built into R, you can run R scripts programmatically (> Rscript myfile.R), or make this executable by putting it in a standard shell script.


I just got done making a DockerLambda in R. The lambda itself is simple. It takes a few inputs, gets data from S3/Files into dataframes, and passes it off to the real calculation. The actual math is done by another team.

I approached it like you described. I wouldn't want to do a really complex REST API in it, but as a wrapper for calculations, we've got a repeatable pattern to run them in a cost effective manner.


I've grown to really rely on R pipes, it's just how I like thinking about problems with small short lambdas. I wish python had better support (and better lambdas, without typing out 'lambda' and with tuple arguments) but for when I need to use it, toolz's pipe is pretty ok as long as you use lazy map/filter/etc

It's just so close to being as good, but not quite


this comes immediately to mind http://arrgh.tim-smith.us/


As somebody who also used to enjoy bashing R for similar reasons, I think the new preamble to that document is an important and nice addition. (Wasn't there last time I read it.)


There are a few things the author did not mention, such as RStudio Server and Shiny. If you get to know them, you will find they do certain tasks extremely well, and there is simply no equivalent of these in other data programming ecosystems.

Comparing R with others as merely a language is close to meaningless. You have to take the whole ecosystems into account.


There's no greater joy than whipping up a POC in a few hours using R Shiny and showing it to senior leaders when they were told that it will take months to get a POC ready.


“To put it plainly, R won’t change. If something about R frustrates you today, it always will.”

LMFAO. I only read the above line and the bit about how tidyverse nearly fixes all the badness that is r. Max lol.


It would help to know this person's background a bit more. Is this a former software engineer, or a manager trying to pick up a technical skill?

That info could help contextualize the entire piece.


I gave up on R immediately when installing on a Macbook was a nightmare and especially when I was already proficient in Pandas and Numpy.


OT: does anyone know what happened to rweekly.org? It hasn't been updated in a long time and I don't know a good substitute.


I should write Frustration: one year with Swift & SwiftUI or my favorite Frustration: years with PHP ;-)


For my use case, R is absolutely terrible compared to some for-profit statistical package / language. Using R feels like using an outdated, complicated and messy tool.

But guess what: it's free.


> For my use case, R is absolutely terrible compared to some for profit statistical package / language.

Which one? I've switched most of my work over to Julia, but I'd much rather use R or Python than Stata or SPSS.


What is your background? I'm assuming you either come from a programming background or at least enjoy programming and are pretty good at it. Most people I know who are either statisticians or scientists first and programmers only reluctantly love Stata and SPSS.


> What is your background?

Academic research, so more econometrics/data science work, but I have some experience with application programming that I've managed to leverage.

> Most people I know who are either statisticians or scientists first and programmers only reluctantly love Stata and SPSS

They are good to the extent that a lot of published - social science - research uses terms and methods that assume you're using one of the two. Of course, having an ok point and click interface also helps.

However, data access, aggregation, and cleaning are easily ninety percent of what's involved in even basic econometric(y) research. It is orders of magnitude easier to do all of this programmatically in R. Once you start working with larger datasets, or once performance becomes an issue, you pretty much have to transition to Python, Julia, or something similar by default.


I don't understand the use case for SPSS. My local university is training their neuroscience researchers on it, which seems so odd in 2022 with Julia or python sitting right there.


I have been using SPSS for decades and I think it is a good statistical tool for those who do not want to do much programming. It gives me something I can easily explain to and teach other statistical users (e.g. team members who hand in many papers with some regression analysis in them, have no programming experience, and do NOT want to learn much beyond the absolute minimum necessary; they are social scientists and that is it). I know R, Python, SPSS, SIR/DBMS and most of SAS can all do this. Frankly, SPSS is the only one they can use. And I believe they will still be able to use it after I am no longer in the picture. That is the use case of SPSS. There are more people in that hole than you think.


It's 80% about having a point and click interface, 10% about path dependency effects, and 10% about whether or not they have a paid license.


Teaching someone who knows a bit of Excel and very little programming how to do statistical analysis in SPSS is easy and lets you focus on the statistics.

Teaching them to do statistical analysis in Julia will involve you spending 80% of your time teaching them Julia and maybe 20% of your time teaching them statistical analysis.


> Teaching them to do statistical analysis in Julia will involve you spending 80% of your time teaching them Julia and maybe 20% of your time teaching them statistical analysis.

This works until they run into a use case that doesn't involve running various forms of regression analysis on panel data.

In the parent comment's case, I could imagine that there's an expectation that someone doing neuroscience research will eventually have to expand beyond what's possible in SPSS. In this case, it may make sense to go through the effort of teaching them how to program in Python or Julia.


I agree with you, though I wonder what's included in your definition of "do statistical analysis" ? Is it just using the stats functions as blackboxes without understanding what is going on under the hood?

I find using Python and/or R to be very helpful for teaching, since you can implement the stats procedures using primitives (prob. calculations), so you get some experience with how things work.

Sure it requires some "coding" but nothing harder than using a calculator, so I think it's worth learning.

Julia is a bit more involved (need to learn something about data types), but still would be manageable.


SPSS was originally created for social scientists and psychologists. It allows people who usually don't really have a clue what they are doing to create something that looks like science. Later on it was marketed as a predictive analytics suite for business-minded people.

From time to time, I still have to use SPSS. Again and again I'm flabbergasted at how bad this overpriced piece of software is.


Ease of use. I haven't used it personally, but I'm pretty sure you can do everything with a mouse - no need to learn actual code.


JMP and Stata are really solid tools, but I feel like the advantage stems from the fact that they help me discover new statistical methods. In the GUI I can see an option for something I've never seen before, read the documentation, and thereby expand my stats skillset. This type of discoverability is a bit harder with R/Python, just due to the nature of it being purely script-driven.


The language itself can be criticized, but when academic statisticians publish a new method, they often post an R package, too, so it has recent functionality unavailable elsewhere.


What makes R interesting is the amazing libraries for statistical ideas that are not completely run of the mill.

You almost literally can't come up empty on CRAN.


What for-profit statistical package feels better? SAS literally has 8-character name limits in places. The data input command is literally called CARDS. It feels ancient. Minitab and SPSS aren't much better in syntax with regard to scripting, which is important for reproducibility.


Maybe you can switch to Python? It's also free and has a lot of statistical packages.


Long-time R user here. Yes, many of these points are valid, but I still think R is unbeaten when it comes to speed in (tabular) data exploration. In the article you mention that you missed using data.table - a significant portion of the problems you named would be solved, or at least weakened, by using data.table. I started working with it many years ago and never looked back. It's easy, powerful and efficient to use. I also find its Python counterpart, pandas, much less intuitive to use (.loc() anyone?), although they are comparable performance-wise.
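
As a rough sketch of what that exploration speed looks like (the file and column names here are invented for illustration), a daily group-by summary is a single i/j/by expression:

    library(data.table)

    dt <- fread("trades.csv")  # hypothetical file with ts, symbol and vol columns

    # daily volume and trade count per symbol
    dt[, .(vol = sum(vol), n = .N), by = .(symbol, day = as.Date(ts))]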


>.loc() anyone?

What is wrong with .loc, in your view? Genuine question. I used to dislike it but I've been using pandas for a while and I've gotten comfortable with it, and I've forgotten the reasons I used to dislike it.


.loc works smashingly. %>%?


Well, that's the thing with pandas: which one is it? [], .loc, .iloc, . ? Why do I have to reset_index so often? I agree with the OP that R has pandas beaten when it comes to accessing data.


I will agree about [] being overloaded, but .loc and .iloc are distinct for a good reason. .loc is for operations according to the index, .iloc is for operations according to position. You have to reset_index() so often probably because you are not using the index properly. Effective use of Pandas means effective use and consideration of the indexes on your dataframes and series.

Recommended watching (32:00 onwards): https://www.youtube.com/watch?v=mWtfZaT7iSc


iloc - integer position, loc - index label

I rarely reset index -- perhaps it's a difference in familiarity? (I use R but it isn't my background; perhaps there is a forced R pattern that is a general antipattern for indexes?)


You can and should use data.table without pipes.

You will be hard pressed to use pandas without .loc and resetting your index.
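
For illustration, here is a small data.table chain on the built-in mtcars data - successive [] calls take the place of a pipe:

    library(data.table)
    dt <- as.data.table(mtcars)

    # filter, aggregate, then sort, all by chaining [] -- no %>% needed
    dt[hp > 100,
       .(mean_mpg = mean(mpg), n = .N),
       by = cyl][order(cyl)]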


I have been using R for almost 20 years now. I work on a medium-sized quant team at a large asset manager and we run several $BN off R - we mostly trade equities and vanilla derivatives. Our models are primarily statistical/econometric-based. In aggregate, we probably have about a hundred scheduled jobs associated with a variety of models and on the order of 15 shiny applications to facilitate implementation. We have an internal CRAN-like repo and everything we produce is packaged/versioned with gitlab CI/CD. We have RStudio Server at my firm and half my team uses that for development; the other half, including myself, uses emacs/ess. All of us use RConnect for scheduling & application hosting - it has its quirks, but it's excellent in a constrained IT environment.

I often chuckle when people complain about R in production and how it isn't a good general purpose programming language, my experience has been the polar opposite. You can write bad code in any language, and R is no exception, but R allows you to write so much less code and R-core is truly exceptional at backwards compatibility. Our approach to R is basically:

- Don't have a lot of dependencies, and when you do have dependencies, make sure they themselves don't have a lot of dependencies. While we do use shiny as mentioned above, our core models are very dependency light and shiny is just a basic front end.

- data.table (which was designed by quants) is a zero-dependency package that is by far the best tabular data manipulation package that has ever been created since the dawn of time. We generally work on an EC2 instance running Linux with a ton of memory. In the < .01% of cases where a dataset doesn't fit in memory (e.g. tick data), we do initial parsing with awk if file-based or SQL if DB-based and then work in R. (A sketch of the awk-into-R pattern follows after this list.)

- Check/coerce argument types and lengths on function input to catch and avoid all the quirky edge cases that drive people nuts - it's so easy! (A sketch follows after this list.)

- I hate OOP and I love that R doesn't encourage it. Mutable state, especially for non-software engineers, is the devil. Don't get me wrong, OOP has its place, but the fact that R encourages functional programming is one of the best things about it. The slight inefficiency this produces is almost never a problem.

- R is not slow at all when used correctly. Additionally, the C API is a joy to use when necessary.

- Stick to the base types: vectors, matrices, lists, environments and data.tables (the only exception). The fact that you can name, and then use names to index, all of the above is stunningly powerful. The only "objects" we really create are lightweight extensions of lists with an S3 print method (sketched after this list).

- We have an internal version of renv/packrat that creates a plain text "dependency file" for projects and we pin package versions in docker containers. RConnect doesn't use docker right now, but they do have a versioning system that works quite well in my experience.
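
A sketch of the awk-then-R pattern from the data.table bullet above - the file name, field position and threshold are invented for illustration. fread() can read from a shell command via its cmd argument, so the filter runs before anything reaches R's memory:

    library(data.table)

    # keep the header plus rows whose 5th field exceeds a threshold,
    # so R only ever sees the filtered subset
    ticks <- fread(cmd = "awk -F',' 'NR == 1 || $5 > 1000' ticks.csv")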
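
For the argument-checking bullet, a minimal base-R sketch (the function and its contract are made up):

    # fail fast with a clear error instead of letting R recycle or coerce silently
    scale_weights <- function(w, method = c("sum", "max")) {
      stopifnot(is.numeric(w), length(w) >= 1L, !anyNA(w))
      method <- match.arg(method)
      if (method == "sum") w / sum(w) else w / max(w)
    }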
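
And for the base-types bullet, a sketch of a "lightweight list with an S3 print method" (the object and its fields are invented):

    # a plain named list tagged with a class; names double as the index
    new_signal <- function(weights, asof = Sys.Date()) {
      stopifnot(is.numeric(weights), !is.null(names(weights)))
      structure(list(weights = weights, asof = asof), class = "signal")
    }

    print.signal <- function(x, ...) {
      cat("signal as of", format(x$asof), "with", length(x$weights), "names\n")
      invisible(x)
    }

    s <- new_signal(c(AAPL = 0.4, MSFT = 0.6))
    s$weights["AAPL"]  # index by name
    s                  # dispatches to print.signal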

I definitely wouldn't want to build something like a company website in R, but I wouldn't want to build that in C either. R definitely has its place as a server-side language, even outside its assumed domain of statistics.

Haters gonna hate, but the joke is on them.


Never liked R. It always felt like a cheap alternative to either the proprietary solutions (Stata and SAS) or the simply better experience of Python.


Fun fact: there exists a "Why R? Foundation" which holds yearly conferences, because after all these years the R community just cannot find a sensible reason to use the language.


I used to do R intensely. What I found, after moving to more "SE"-centric languages such as Python or C++, was that R becomes quite frustrating when you need to build something maintainable - which should basically be true of every package. As soon as S3, R6 and whatnot come into play at multiple levels, I'm better off moving to C in order to minimize interaction with R's class system and keep it only at the topmost layer.



