One Year with R (github.com/reecegoding)
247 points by _dain_ on March 22, 2022 | 258 comments



R, and by R I mean R+tidyverse, is the world's best graphing calculator attached to an OK scheme.

By which I mean R is a highly optimized, well-oiled machine if you're using it for its highly-optimized, well-oiled purposes. I tend to have notebooks full of tiny fragments like this

    # aggregate to daily totals, then plot each series (requires tidyverse + lubridate)
    dat_min %>%
      group_by(ymd = make_date(year(date), month(date), day(date))) %>%
      summarize(vol_btc = sum(vol_btc), vol_usdt = sum(vol_usdt), tradecount = sum(tradecount)) %>%
      ungroup() %>%
      pivot_longer(cols = c(-ymd)) %>%
      ggplot(aes(ymd, value)) +
      geom_line() +
      facet_grid(name ~ ., scales = "free_y")
It's madness if you're not familiar with the tidyverse, but 3 dozen fragments like this are enough to eviscerate a fresh data set. Almost any question you can dream of is a 3-20 line set of transforms away from a beautiful plot or analysis answering your question. Very notably, this includes some of the finest modeling tools available today.

Terseness here is a huge advantage as well because in many data analysis workflows you are rerunning that same 10 line snippet over and over, making small changes, adjusting to eventually visualize the thing you're looking for perfectly. Having all of that in the same small block is ideal.

Finally, for the non-trivial number of folks in this specific scenario, the integration between Stan and R/RStudio is top-notch and makes using both tools very pleasant.

You can replicate all of this in Python, but optimal Python/Jupyter is still a far cry from R/RStudio for these specific sorts of tasks.


> best graphing calculator attached to an OK scheme.

I discovered "How To Design Programs" somewhere late in my first year of using R. Like most beginning R coders with nominal experience in other languages, I wrote a lot of monolithic scripts in a very imperative style. HtDP gave me a mental framework for decomposing larger problems into bite-sized chunks. The lispy roots of R lent themselves particularly well to the model of thinking presented in that book.

Ever since then, I've pined for the graphing calculator parts in a more modern Scheme. When ggplot and then the tidyverse (née hadleyverse) came on the scene, I was even more convinced that Scheme, especially Racket, was the ideal future for data science. If R could support a large ecosystem like tidyverse, just imagine what the metaprogramming facilities of Racket could do!

But I think those graphing calculator parts are hard to reproduce. Attempts to clone ggplot2 fall short year after year, because most other languages don't have grid graphics to build on top of. R is a deep ecosystem on "an OK scheme," which is damned hard to beat.

Aside: my first year with R was in an urban planning master's program, and I was terrified of my first big kid statistics course (taught in SPSS). I decided I'd give myself bonus work by learning R. While it was absurd to be doing my stats homework in SPSS, then R, then reviewing HtDP on top of the rest of my course load, I did ace that stats course. :-)


There is a better alternate universe where xlisp-stat doesn't fall behind and S doesn't happen.


Interestingly, there appears to be an attempted reboot: https://lisp-stat.dev/

My first reaction is "why not on a modern Scheme as opposed to Common Lisp" and in so thinking, I have demonstrated exactly why no lisp / scheme has ever achieved critical mass :-)


I would think that the reasons for "why not Scheme" would include the fact that CL has more existing libraries, better built-in support for multidimensional arrays, efficient low-level code (now even with SIMD on SBCL using SB-SIMD), application-controlled JIT compilation of specialized code, almost guaranteed support for native multithreading, etc.


Shocking. I'm still using the xlisp-stat-based social simulation book, first edition (20+ years ago). The move to ... does not fit my mind. I did not know there was a restart!!!


Obvious answer: 3rd party library support. Same thing that makes R ubiquitous.


I really don’t think this is that different from how a lot of us learned to code though. I learned by writing html and PHP specifically to solve a problem of having a website. The only difference I see is that the average CS student has to write reliable and working code as their career, while quants and statisticians tend to think that code is just a means to the end. Both are right I’d say depending on the problem.


> R is a highly optimized, well-oiled machine if you're using it for its highly-optimized, well-oiled purposes.

This hits home for me. We are just starting to use R for risk modeling where I work. R, more than any language I've ever used, makes me appreciate "worse is better". From a theoretical "aesthetic" perspective R is a mess. Yet for data processing all those theoretical concerns don't matter. It just works.

It's honestly kind of humbling that something so theoretically messy can be so practically coherent. It makes me question my assumptions about simplicity.


R "just works" now because a huge amount of effort has gone into improving the language over the last 10 or so years, in part spurred by the tidyverse movement, although not restricted in scope to tidyverse. When I was starting grad school around 2010, if someone sent you some R code, the chances that you would be able to "just run" it were basically zero: there would be weird version mismatches in how functions worked, file paths would be specified in inconsistent ways in different parts of the script, all kinds of crazy impenetrable errors were the norm. Now there are several R code snippets posted in these HN comments that will run without trouble. If I could have gone back in time and told myself that this is how R would develop, I would have been shocked (and happy).


Part of it has to do with strict testing in CRAN as well. Packages have to pass tests and be confirmed to compile. This adds reliability to package management across platforms.

That said I still run into trouble with package deprecations. I was trying to install the optmatch package (deprecated but still used by causal inference packages) and had a really tough time getting it to compile on macOS.


I dunno man, python has always seemed a little bit worse on this stuff to me. At least with R if you had a consistent version, everything off CRAN worked together.

I think R 3.0 introduced namespaces which fixed a lot of the really crazy stuff.

Also, I was writing Sweave in 2010 for my thesis, and I definitely wasn't alone.


Namespaces showed up around 2004, so somewhere around 2.0.0. I don't think they were mandatory until much later.


I suspect that has as much to do with the maturation of the data science community as it does the language environment. There have been pockets of R users who put much effort into reproducibility before the era you cite, such as Bioconductor.

When I think back to the era you're describing, what I recall was people flinging around hacky scripts being the norm regardless of their environment. While still not something I'd think of as software engineering best practices, what I see now is less Wild West.


I hadn't thought about R as a "worse is better" language, but that's a good way to think about it. Makes sense, too, since it came from the place that inspired worse is better.


R comes from New Zealand, no?


R is an implementation of S. John Chambers worked on it at Bell Labs starting in 1975.

https://en.wikipedia.org/wiki/S_%28programming_language%29


I’m trying to figure out if you were actually asking a question or if it was rhetorical and you were calling shots.


Is there any reason you chose R over Python? Is it just because that’s the go to language?


We asked the people who are going to be using it what they'd prefer. Many of them are recent graduates, and they told us they mostly used R during their university courses. It's just a pure familiarity play. The alternative was building a huge system on a mainframe (we're a legacy bank).

Really there wasn't a lot of thought put into the language. We figure that if it ends up being a total failure, we can just pivot.


Bravo! This is exactly right.


This is just a quick example - I would be grateful if people could recreate this brief look at UK COVID figures in another language:

  library(tidyverse)
  library(scales)
  
  download.file(url = "https://api.coronavirus.data.gov.uk/v2/data?areaType=overview&metric=covidOccupiedMVBeds&metric=newAdmissions&metric=newCasesBySpecimenDate&metric=newDeaths28DaysByDeathDate&metric=newPeopleReceivingFirstDose&format=csv", destfile = "./data.csv", method = "wget")
  
  read_csv("./data.csv") %>%
  pivot_longer(names_to = "Data", cols = c(newCasesBySpecimenDate,
                       covidOccupiedMVBeds,
                       newAdmissions,
                       newDeaths28DaysByDeathDate)) %>%
  mutate(Data = factor(Data)) %>%
  mutate(Data = recode_factor(Data, newCasesBySpecimenDate = "New Cases",
         newAdmissions = "Admissions",
         newDeaths28DaysByDeathDate = "Deaths",
         covidOccupiedMVBeds = "Ventilated")) %>%
  ggplot(aes(y = value, x = date, colour = Data))+
  geom_point(size = 1, colour = "gray", alpha = 0.6)+
  geom_smooth(method = "loess", span = 0.1)+
  labs(y = "Daily rate", x = "Date", colour = "UK COVID-19")+
  scale_x_date(date_breaks = "months", date_labels = "%b-%y")+
  scale_y_log10(labels = comma(10 ^ (0:5),
                 accuracy = 1),
         breaks = 10 ^ (0:5))+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib.dates as mdates
    import seaborn as sn

    df = (pd.read_csv("/tmp/overview_2022-03-21.csv") # i just used curl beforehand
        .assign(date=lambda x: pd.to_datetime(x["date"]))
        .set_index("date")
        .melt(value_vars=[
                "newCasesBySpecimenDate",
                "covidOccupiedMVBeds",
                "newAdmissions",
                "newDeaths28DaysByDeathDate"],
            var_name="Data", ignore_index=False)
        .assign(Data=lambda x: x["Data"].replace({
            "newCasesBySpecimenDate": "New Cases",
            "newAdmissions": "Admissions",
            "newDeaths28DaysByDeathDate": "Deaths",
            "covidOccupiedMVBeds": "Ventilated"
        }))
    )
    ax = sn.scatterplot(data=df, x=df.index, y=df["value"], hue="Data")
    ax.set(xlabel="Date", ylabel="Daily rate", yscale="log")
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%b"))
    plt.show()

I spent 2 minutes on the pandas part and 20 minutes on the plotting part, which really says it all. Seaborn's support for smoothing is really bad and doesn't play nicely with datetimes for some reason, so if I wanted smoothing I'd need to do it myself. And the other stuff I left out requires going into matplotlib's documentation, which I don't want to spend time on.

pandas is as good or better than R's dataframe manipulation, but R's plotting tools are best in class. I hate all the python plotting libraries.


> I hate all the python plotting libraries.

I've given up discussing this. My personal opinion is that MPL is assembly to ggplot2's Python.

Yes, maybe there are some things that are doable in MPL but not in ggplot2, but never (and I mean - never -) has any one of my colleagues found a single example that I couldn't recreate in ggplot2 with more readable code that resulted in a better-looking plot.

MPL has this LaTeX-like "once you get the hang of it, you will never use anything else in life for anything" and OOP's "if it's good code then it is OOP and if it's not, then you're not doing correct OOP" mystique hanging over it. When confronted with bad MPL code that results in bad plots, it's always one of those two.

ggplot2 is the best plotting package out there and imo one of the best "end-user" packages in any language. Also, Hadley is a saint.


Check out plotnine. Really good clone of ggplot for python.

https://plotnine.readthedocs.io/en/stable/


Thanks for providing that. Interesting to see the '.' used like a pipe. Always thought of it used in an OOP context, but interesting that it can be functional too. I also had difficulty fitting a LOESS curve to the plot, but I could do a linear model. The LOESS would have been possible doing it manually I guess.


This is great.

You don't need the curl -- read_csv works with URLs directly.

The lambda can be replaced by passing parse_dates=["date"]


Thanks


This was fun to play around with. I made some very minor changes and posted at https://gist.github.com/hadley/d54895557fbb0fe0402d2277b9011....

It revealed to me that there's a buglet in `forcats::last()` (https://github.com/tidyverse/forcats/issues/303) and made me wonder if `pivot_longer()` should be able to rename the columns as you pivot them (https://github.com/tidyverse/tidyr/issues/1338)


This might be the most excited I've gotten about a comment in hackernews for a while! So interesting to see your style of code (I was actually on 4.0 so gave me the impetus to upgrade to get the new pipes). fct_reorder - what a hugely useful function I didn't know about. The chicks vignette was nicely illustrative. And label_date_short also super useful. Also curious generally about your bracket style (I just tend to pile them all up when closing, which is something I only do in R and probably shouldn't!).

Renaming factors is one of those things that always seems a bit awkward. I think I've used several methods. Passing a list of named vectors into `levels(x)` allows a many-to-one mapping but was quite dangerous. I've used revalue and mapvalues from plyr. fct_recode is new to me. But yes, renaming while reshaping could be quite convenient. Just looking at fct_recode now, it looks really nice. Seems to support many-to-one and being able to pass it a name vector is very convenient.
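
For reference, a minimal sketch of `fct_recode` on its own (the level names here just echo the COVID example upthread):

    library(forcats)
    f <- factor(c("newCasesBySpecimenDate", "newAdmissions", "newDeaths28DaysByDeathDate"))
    fct_recode(f,
      "New Cases"  = "newCasesBySpecimenDate",
      "Admissions" = "newAdmissions",
      "Deaths"     = "newDeaths28DaysByDeathDate"
    )
    # a named vector of new = old pairs can also be spliced in with !!!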

Learnt so much today, and this is even before trying out the python and julia examples! Many thanks for this and all your work in R!


If you enjoyed this you might like https://github.com/VictimOfMaths/COVID-19 :)


A Julia solution with Chain and Gadfly might look something like this, although I've translated the R fairly directly so it might not be very idiomatic.

    import CSV
    using Chain: @chain
    using DataFrames
    import Downloads
    using Gadfly
    using Dates

    @chain begin
        Downloads.download(
            "https://api.coronavirus.data.gov.uk/v2/data?areaType=overview&metric=covidOccupiedMVBeds&metric=newAdmissions&metric=newCasesBySpecimenDate&metric=newDeaths28DaysByDeathDate&metric=newPeopleReceivingFirstDose&format=csv",
        )
        CSV.File
        DataFrame
        stack(
            [:newCasesBySpecimenDate, :covidOccupiedMVBeds, :newAdmissions, :newDeaths28DaysByDeathDate];
            variable_name = :Data,
        )
        transform(
            :Data =>
                (
                    x -> replace(
                        x,
                        "newCasesBySpecimenDate" => "NewCases",
                        "newAdmissions" => "Admissions",
                        "newDeaths28DaysByDeathDate" => "Deaths",
                        "covidOccupiedMVBeds" => "Ventilated",
                    )
                ) => :Data,
        )
        subset(:value => ByRow(!ismissing)) # Can't plot Geom.smooth with missings
        plot(
            _,
            x = :date,
            y = :value,
            colour = :Data,
            layer(Geom.smooth(method = :loess, smoothing = 0.1)),
            layer(Geom.point),
            Scale.y_log10(),
            Guide.xlabel("Date"),
            Guide.ylabel("Daily rate"),
            Guide.xlabel("Angle"),
            Guide.colorkey(title = "UK COVID-19"),
        )
    end


Thank you for that, good to see there's an elegant Julia solution! The last time I was using 'pipes' with Julia, I think I was using DataFramesMeta. I also really like this interactive gadfly plot - reminds me of Matlab, but better. It's been a little while since using Julia, so I'd forgotten about the pre-compiling thing, but generally this code looks pretty nice and clear.


I used to use DataFramesMeta.jl, but eventually I found that the mini-DSL that DataFrames.jl has created is really powerful and not overly verbose. Now, going back to the Tidyverse's syntax makes me feel a little uneasy, like there's just too much magic going on behind the scenes, even though I used it for years with no problems.


You don't need the download file command -- read CSV works with URLs directly :-)


Oh nice, that's very useful to know!


I think these kind of common task challenges are great for comparison. You see a few approaches to a single task and you can do a more aligned and detailed comparison.

Unfortunately, it’s also a lot of work. In this case, you’ve posted an intermediate stage artifact from R. If one of the many Python programmers reading this want to produce a comparable artifact they need to understand or run that code. That alone reduces your likelihood of getting any substantial replies.

Maybe add a link to an image of the resulting plot?


Good point, looks like I'm too late to edit, but here's a link [1] (excuse the R style indexing).

Yes, I love these things and very curious to see what hackernews comes up with. Project Euler was an eye opener for how things could be optimised in different languages.

[1] https://i.imgur.com/M8DX98I.png


I also use R for any heavy data manipulation, but I primarily use the data.table package. The efficiency that both of these packages unlock is absolutely unparalleled in any other tabular data manipulation library, in any other language that I have used. And R has the top 2!!

My skin writhes every time I need to type:

table.loc[(table.column > 2) | (table.column2 < 3)].reset_index(drop=True)

when I want to subset a table.
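
For comparison, the rough data.table equivalent of that line might look like this (assuming the table has been converted with `as.data.table()`):

    library(data.table)
    table <- as.data.table(table)
    table[column > 2 | column2 < 3]   # no index to reset, columns referenced bare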


.loc lets you supply a callable, so you can write:

  table.loc[lambda df: df["column"].between(2, 3, inclusive="neither")]
this is useful when your dataframe has a long name, or when you have some long method chain and you need to subset at the end: table.foo().bar().baz().loc[lambda df: ...]

it is still more verbose, but I actually prefer always providing column names as strings. it's more explicit. I don't like R's environment-manipulation metaprogramming magic where you can give column names as symbols.

as for resetting the index all the time, this is something of an antipattern. if you set up your index right beforehand it isn't necessary so often.


Can't get around resetting the index, as far as I know, but for the filtering you can also do,

`table.query("column > 2 and column2 < 3")`


do any IDEs help you autocomplete the string argument? that's a big part of the non-standard evaluation magic in the tidyverse these days; you can use bare, _unquoted_ names, _and_ get excellent autocomplete, at least in RStudio.


Not to mention the auto complete that comes with RStudio. Is there any way to get equivalent functionality in Jupyter?


If you set up the Jupyter extension[0] and open your notebooks in VS Code you get IntelliSense (code completion, method info and hints etc).

0: https://marketplace.visualstudio.com/items?itemName=ms-tools...


IME this is strictly worse than the RStudio experience; most of the time I hit tab in a VSCode notebook I get way too many options that IMO are clearly not what I want, though at some level it's more pandas' fault (too many methods & attributes even before attaching every column name as an attribute) than VSCode or Intellisense.


In addition to the sibling answers:

If you know the class of an object, let's say Class, but the object has yet to be "constructed" so that IPython can correctly infer its type, you can type `Class.[TAB]` in IPython and look at its methods.

For example, in Sympy, you have a matrix type called Matrix. You can do `(A * B).diagonalize()`, or alternatively you can do `Matrix.diagonalize(A * B)`, which has some advantages because doing `(A * B).[TAB]` does nothing useful because Python can't infer types.

You can also do the same for modules. `ModuleName.[TAB]`

To be honest though, I found the experience smoother in R for some reason.


I use pycharm which has decent autocomplete. Pycharm has its own issue though, it fills out its autocomplete info by looking at the function that created the object, not the object itself. So if a function can return different types, autocomplete won’t work. That’s caused me quite a bit of pain.


My wife is a researcher and started delving into doing her own statistical analysis. It's been fun (and frustrating) learning R with her. I agree that dplyr and the tidyverse are some fantastic packages for a software engineer who thinks about spreadsheets as SQL tables.

I would say the most frustrating part about RStudio is that it is a workbook where you can execute code based on your cursor. For my wife, these workbooks become a total mess because things aren't necessarily run sequentially.


Here are two good strategies:

1. Always run from the top, using the "run previous chunks" button. When this gets too slow, you know that it's time to think harder about your workflow. For a more extreme version of the same idea, regularly restart R using Ctrl-Shift-0, and run from the top. It'll ensure your code is working right.

2. Have a setup chunk that always gets you to the same state. Make sure every other chunk works directly after calling the setup chunk. Then just alternate between "run setup chunk" and "run current chunk".
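
In an R Markdown notebook, such a setup chunk might look something like this (the file path and packages are just placeholders):

    ```{r setup, include=FALSE}
    library(tidyverse)
    dat <- read_csv("data/raw.csv")   # every chunk below assumes only `dat` and the tidyverse
    ```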


0. Don’t use notebooks. Neither RStudio nor Jupyter. They prevent people from developing good programming skills and good version control skills, and they encourage making a huge mess.


I mean that's fair, but if you have to write an academic paper, your options are more or less use a notebook, or copy and paste your results. Of course, the question is how much code you should write in the notebook, versus having it in a more organized set of functions and libraries. It's very easy to end up with a huge bloated document which contains thousands of lines of spaghetti.


I absolutely agree with this:

> the question is how much code you should write in the notebook, versus having it in a more organized set of functions and libraries. It's very easy to end up with a huge bloated document which contains thousands of lines of spaghetti.

But this isn't correct:

> your options are more or less use a notebook, or copy and paste your results.

What's wrong with writing scripts that write images to disk? That's how millions of academic papers were written before the advent of notebooks. You could use Makefiles if you like, or you could even use a technology such as Sweave to automatically mix images with LaTeX output.

I mean this as politely as possible but the fact that you think that the options are "use a notebook or copy and paste" I think shows that you've caught a notebook mentality disease! The fundamental point I'm trying to make is that we don't need to do everything interactively from REPLs. REPLs are great for trying things out, but when it comes to producing the images for your paper, those should be produced by scripts, not by commands entered into a REPL, or notebook. And those scripts should evolve via version control, which is the basis of evolving any good and correct software. And the scripts for producing images for a paper should be good and correct software.
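
For concreteness, the kind of script being described is just a plain R file you can run with Rscript (or from a Makefile); the paths and data here are placeholders:

    # fig1.R -- regenerates Figure 1 from the raw data
    library(tidyverse)
    dat <- read_csv("data/trial.csv")
    p <- ggplot(dat, aes(dose, response)) +
      geom_point() +
      geom_smooth(method = "lm")
    ggsave("figures/fig1.pdf", p, width = 6, height = 4)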


We might be talking at cross purposes. I'm thinking of e.g. a Rmarkdown notebook, which is indeed a script not (necessarily) an interactive notebook like Jupyter. And it can be automatically compiled, via a makefile or something similar, and put into version control.

The point is that it makes sense to mix english prose + code to e.g. produce tables or graphs, even if most of the heavy lifting is done separately in code files.


Ah, OK, yes I was talking at cross-purposes to some extent then, thanks. (Jupyter notebooks are hopeless in version control due to the JSON format, but I'm not familiar with Rmarkdown notebooks -- do you get sane diffs?)

Yep, so what you say makes sense. Isn't it sometimes a bit overly prescriptive to assume all collaborators use Rmarkdown? (Perhaps not! I used to work in biology and statistics and R was very ubiquitous.)


Rmarkdown indeed makes sane diffs, it's just a nice text-based format. You can use python or julia as well (see quarto.org for the latest version of this). Getting collaborators on board... yeah, that varies.


I love the phrase "eviscerate a fresh data set" :D


Great stuff. Could you maybe share a bit of these "3 dozen" scripts? This could be super helpful.


Looks like your notebooks focus on crypto currencies :-) Good use case for data analysis!


I was expecting a rant, but the OP's article is actually very thoughtful. He definitely knows what he's talking about.

The thing about R, for me and many others, is that it's very much an everyday grind language. Especially with RStudio, its natural domain is as one of the "notebook" languages like python, julia, matlab, and mathematica, but with a clearer focus towards the tasks of data analysis. I just tell the BI-tool people that R is Excel on 'roids.

R frustrates me a lot, however. But I think the frustration comes out of the fact that when I am using R and get stuck, I am always in the middle of doing something that I need to get done and I don't feel like diving into a long "vignette". Moreover, the documentation is usually too terse and generalized for me to just understand it immediately. Even though I've been using R for years (albeit in fits and starts rather than continuously), there are things about it that I've just never picked up-- I just DON'T KNOW (or care) what F S3 and S4 mean. Unlike the OP, who clearly knows more R than myself, I grit my teeth when I am looking at docs and see the "..." in the arg list.

I suspect that this is part of the heritage from R's beginnings. I once tried to read John Chambers' book but found the presentation completely ass-backwards and impractical for my immediate needs. The Tidyverse has been great, it's far more consistent and ggplot is a kick-ass tool to have in your box. The drawback is that it makes Base-R seem really alien, and if you want to be good at R, you have to know more than just the Tidyverse, IMHO.


The tidyverse docs are the only ones with the super frustrating ... of impenetrable gnostic "documentation" that I know of. In general the tidyverse documentation is horrible, almost as bad as typical Python docs, IMHO. Other parts of base R are wonderfully documented in my opinion.


Do you have any specific examples that illustrate the general problem? I'd love to better understand what you're looking for in docs.


Thanks for taking my aggressive comment with such spirit, it really speaks to a good community. (Sleep training an infant has me a bit frazzled)

I should have been more specific: the ... frustration for me comes up mostly in ggplot, which usually directs you to layer(), which gets parameter documentation strings like:

* geom - The geometric object to use display the data

* stat - The statistical transformation to use on the data for this layer, as a string.

These are two hugely important parameters, with really big concepts and abstractions under them, but the documentation is of the style "foobar(): this is a function that foos the bar", documentation that restates the information in the name, but with more words, and no insight on where to go next.

So now a person is two pages deep into documentation, and it's actually circular documentation because layer() has a ... argument that gets passed back to what? The function documentation that you came from? For a newcomer it's a completely twisty series of passages, and as an experienced user who reaches for ggplot before any other tool, it's confusing.

The other function based confusion is that the list of aesthetics is not connected quite well enough to the mapping argument from aes(). What aesthetic values the function understands is probably one of the most important things about looking up the function. But reading the parameter documentation, it's not clear that there's an entire section below that describes that crucial material, far further down the page. And on a long long page it's easy to accidentally skip over that section when skimming.

(These are the sorts of frustration I have with typical Python documentation, btw, so maybe my brain is just different from typical engineers)


Ah yeah, connecting the dots in ggplot2 docs is hard. It's hard for us to document because, under the hood, the pieces are quite decoupled and different pieces are responsible for different arguments. But since we last took a deep dive on the ggplot2 docs, we've gotten much better at generating docs with code, so maybe it's time to have another look. I've filed an issue (https://github.com/tidyverse/ggplot2/issues/4770) so we don't forget about it, but no guarantees about when it might get done.


Is there a tutorial someplace that explains how ggplot actually manages plotting? Or the architecture and layers between the high level code and how a plot is drawn? Meaning, I love being able to express what I want and ggplot figures out a good plot for me. But I know there are many layers that can be manipulated, but I just don’t understand the layers.

One of the best compliments I can think of is that with ggplot, easy things are easy and hard things are possible. But I haven’t been able to figure out how to fully work the system.

(Thanks for all of the work!)


That should be covered in the original A layered grammar of graphics paper: https://vita.had.co.nz/papers/layered-grammar.html

And then there is an entire ggplot2 book (there are many, but this one was written by Hadley): https://ggplot2-book.org/


That's very helpful. I think this chapter was what I was looking for:

https://ggplot2-book.org/internals.html


I think the issue with some of this documentation is that for other packages, the function documentation is largely self-contained. If I look up glm() it tells me how to use glm(). However, for ggplot2 there is an assumption that you have some level of knowledge of how the pieces should be strung together. So when I know I want a boxplot, and I find the geom_boxplot() documentation, it wonderfully describes the options for itself, and gives examples for its use. But sometimes it doesn't give a good idea of the context of how the other pieces might interact. It makes complete sense if you read the book and just want to refresh your memory, but if you are coming in as a new user it really can be difficult to use the documentation exclusively.


Having read an online book of some sort on ggplot2, on one of the tidyverse sites, I found the per-function documentation difficult to use and difficult to match to the concepts I had learned. This may be because I'm used to using the parameters section of a function as the primary resource for understanding the inputs. But with ggplot it's scattered in other places, and the holes are not apparent unless you know the specific terminology (not concepts) to match up.

All that said, I find the documentation to be saying a lot more than it did in the past, and it sounds like it has been continually improving.


He may be right in specific instances, but I think he's way wrong in general. Tidyverse is generally a triumph of documentation, and part of that is that it doesn't tell you too much. Lots of how-to, not too much implementation detail. It's appreciated.


After 15+ years of shipping open source stuff, I've rather concluded that any given piece of documentation is either going to be too terse or too verbose for any given user and all you can really do is mix judgement and balancing how many of each type of complaint you receive.

It's probably possible at least in theory to structure docs so you have a terse section followed by a verbose section for each thing, but I've yet to develop the discipline or the competence to pull that off remotely regularly.

Maybe in another 15 years.


> almost as bad as typical Python docs

I found numpy, scipy, pandas, and plotly docs to be quite clear and extensive. The only docs I have found to be confusing are matplotlib's and the Python standard library's. Not sure what packages you are referring to?


On the other hand, I'm curious what you've found to be lacking about the standard library documentation. I've found it to be generally very thorough, in some cases fantastic, though there's an occasional weak point.


For me, the stdlib docs aren't lacking in material, but they are hard to navigate. I think that is partly because they mix different types of documentation together (tutorial, reference, and changelog). They should split each module's docs into two pages: a tutorial with examples of the most common use-cases of the module, and a reference listing all of the classes, methods, etc. of the module.

Also, the built-in types are documented in one page, going from boolean to sequence types and even type annotations. Every Ctrl + F gives me 20 different results, which is annoying as hell.


I've used R for 19 years and do not have any other programming language ability. I am curious: What is frustrating to you about R relative to other languages?


In my experience:

- It has a bunch of different types of classes, and they all behave differently. Debugging isn't awful, but it's harder than it should be. Also, the documentation isn't clear about which classes to use.

- A lot of the workhorse functions suffer from parameter glut. Despite having different kinds of classes, almost all functions expect plain vectors. Packages like survival show how objects make it easier to read code, reuse data, and validate data. Without the base packages doing it more, everyone's chosen their own systems. The community's been gravitating to organizing "objects" as rows in tables (i.e. tidy).

- The way a function uses an argument might surprisingly change based on the other arguments given (e.g., `binom.test`; see the sketch after this list). And then the documentation won't have examples for the different use cases.

- Most users don't have the time or desire to become better R programmers. They have other work to do. For my own work, I write packages with custom classes, functions, and template documents. For collaboration, I keep things very plain and rarely go beyond dplyr; very often, the script goes between two steps executed in a GUI software.
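
To make the `binom.test` point concrete, a small sketch (per its help page, `n` is ignored whenever `x` has length 2):

    binom.test(7, 10)         # x = number of successes, n = number of trials
    binom.test(c(7, 3), 10)   # x = c(successes, failures); here the n = 10 is ignored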


I'm not OP.

To me, it's the tooling around it. Everything is done in R-studio and its focus is to generate statistical documents.

The result is a suboptimal solution. It lacks good tooling around installing and running R programs. R programs don't import, they include. It doesn't make it more readable. R-studio is very Emacs-like in the sense that it just lacks a decent editor. Due to R-studio being the default, there's not much support for other editors.


> Due to R-studio being the default, there's not much support for other editors.

But RStudio is amazing... Easily best environment I've used for any programming language (well, except Pharo).


I use Vim with the R command line wrapped around Makefiles. I don't even have RStudio installed. Works great.

I can even pop up an interactive R command prompt session and do whatever I want in it, even quick ggplot2 graphs. Help shows up just as you would expect, and plots pop up in new windows. RStudio is much less advanced than people think it is, it's really just managing R's windows for you and doing generic IDE work. R is doing all the heavy lifting.

If you're wanting a more "import" like thing you might want to look into making R packages instead of scripts. You don't have to submit them to CRAN, and you can execute them pretty simply on the R command line as well.
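
A minimal sketch of what that looks like without CRAN (the path and package name are hypothetical; devtools is an extra dependency):

    # during development, load the package straight from its source directory
    devtools::load_all("~/code/myanalysis")

    # or install it locally and attach it like any other package
    install.packages("~/code/myanalysis", repos = NULL, type = "source")
    library(myanalysis)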

That being said, it's really not an OOP or software development tool. It's definitely geared toward data science, but you can automate it very well for generating graphs and reports for whatever reason needed.


Do you have notes or documentation on how to get this working without RStudio? Preferably on a Macbook?


I use Linux, but it should be pretty straightforward on Mac as well. Just type "R" in the command prompt and check the manpage for R itself ("man R").

I'd also look into the "knitr" package, which is what all of the Rmarkdown is based around. So for instance, most of my Makefiles are based around a simple command like:

    R -e "library(knitr); knit2html('index.Rmd')"
Then I just code using VIM on index.Rmd. You can probably set this up however you like with the R command line.

For interactive it literally is just typing "R" in the command prompt. For help, things like "?ggplot", "??knitr", or whatever, so you can open multiple interactive sessions like you were using IPython or something. When you print a plot, it just pops up in a new window.

You can also use "R" to just execute R raw if you are trying to do it without Rmarkdown. I just prefer the HTML output. Pretty sure all the RStudio RMarkdown stuff just calls knitr as well.

The output looks the same as anything on RPubs (and there is a way to publish to RPubs, I used to have to do that at one point), random one from the first page:

https://rpubs.com/mnguy1019/881028


Emacs has had excellent R support since much before R-studio came around. In fact to me R-studio has always felt like a stand-alone implementation of ESS (minus Emacs).


FWIW, I do prefer R to all other "notebook" languages. It has everything I need for answering questions about data, working with files, text, and visualization. Libraries that do everything I could imagine are easily available.

For me it's a pragmatic workhorse tool that I use and aside from frequently getting frustrated with the task at hand, it has never failed me in the end.

I think R is much like a handheld power tool. I have no interest in diving deep into the workings of the tool because when I need to use a drill, for instance, I just need to drill holes and anything else is an annoying distraction (I realize that sounds bad!).

I've also worked with Mathematica and JMP in the past. They're very capable, but not as good at general-purpose data-wrangling as R is today (given Rstudio, knitr, shiny, all the specialized libraries, and most especially the tidyverse).


In decades I have not encountered any development environment where the obscurity of error message presentation can touch R. If one uses other languages or build environments the error messages can frequently be used to diagnose the issue.

In R they default to just vomiting some internal exception often with no context, and I can count the times I have encountered helpful or even seemingly deliberately constructed error messaging in single-place base five. Even kernel development is in some sense better because at least there you are in context and the layers are traversable.

Seemingly one of the primary skills of an R programmer is serving as an informal database of “what the fuck does this mean?” when the issue is something trivially detectable like passing a list vs a vector.


I love R more than any other language I have ever used. Perhaps more than any piece of software I've ever used. All of these points are valid, and yes, it's messy, and if you try to write the same type of code that you would in Python, it will frustrate you.

And yet.. it somehow works. It makes data analysis and statistical modelling a pleasure. It somehow gives off a sense of lightness, and makes it easy to investigate and explore. I would guess I am genuinely 2x as productive in R as I would be in Python on similar tasks.

I know it's not a "proper" language, but I think that, maybe, not everything has to be exactly like "proper" software engineering?


I very much agree with this. I use python for (different types of) data analysis too, and in python in particular it feels like the "boilerplate" to "science" ratio is rather high in the direction of "boilerplate". R manages to abstract this away very effectively, as the article highlights.

The beauty of R is that you can write one line of code and use some hot-off-the-PhD-thesis cutting-edge-just-published-in-J.-Stat.-Soft-chunk of statistical analysis in your totally different, completely whacky problem, and it's fast, and (by and large) works.

Of course, that's its biggest problem as well. Scientifically, it will quite happily give you a 150 mm howitzer to aim at your foot, assuming you know best.


> hot-off-the-PhD-thesis cutting-edge-just-published-in-J.-Stat.-Soft-chunk of statistical analysis

I think you mean "poorly-documented-cobbled-together-under-deadlines-never-to-be-maintained by someone who has no idea of software principles". Very few labs have a dedicated software engineer to actually turn this software into a usable/hackable tool let alone maintain it.


that's an unnecessarily negative stance. not every algorithm needs to be scalable and over-optimized to be useful in most cases. and if something becomes really useful in R it ends up being reimplemented in more effective ways down the road.


No, but it does need to be tested and reliable.


Coming from Matlab, I have the opposite feeling.

I truly, genuinely dislike the language. I think it's very productive, and I appreciate that Matlab costs an arm and a leg (and god help you once you start paying for some of the nicer packages on top) - but Matlab has spoiled me immensely on the language front.

To me, Matlab feels like a language that was designed with an intent to appeal to folks with some understanding of traditional procedural programming, but nudged into treating matrices as first class citizens.

R feels like a language that was built for people who were using excel, and have never written a line of code in their life - it's riddled with completely unintuitive, frustrating, intentionally obtuse operators and terms for things that have perfectly fine definitions in normal programming.

The difference is that I have 20+ years of programming experience (including quite a bit of functional programming) that I can easily port over to Matlab, and which becomes literal baggage trying to use R. The end result is that I will use R, but I basically always walk away frustrated and infuriated, even when the problem is solved.


> R feels like a language that was built for people who were using excel

The S language predates the first release of Excel by 11 years.

> and which becomes literal baggage trying to use R

I've had the opposite experience. My experience was that having a broad array of programming experience made it easier to pick up the weirder corners of R. It became more likely that I'd seen *something* similar to that construct in the past. The converse has also been true. Seeing all the weird corners in R has made it easier to pick up new concepts in other languages & paradigms as it's been more likely I've seen *something* similar from R.


Using pipes and tidyverse/data.table allows for great things in R, and has a strong functional feel. It can be quite beautiful reshaping data, splitting, mapping, recombining and plotting it.
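
A small sketch of that split / map / recombine pattern (broom is an extra dependency here; iris is built in):

    library(tidyverse)

    iris %>%
      group_split(Species) %>%                               # split into a list of data frames
      map(~ lm(Petal.Length ~ Sepal.Length, data = .x)) %>%  # fit one model per piece
      map_dfr(broom::tidy, .id = "group")                    # recombine into a single tibble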

It doesn't go well at all with a procedural method.


> R feels like a language that was built for people who were using excel,

I don't think so. Most people who come to R after years of Excel find it just as alien as you do.


I recall when the pipe operator was first being proposed, the argument for it was that it'd enable workflows that felt more like Excel. The implication being that indeed, base R is alien to an Excel user.

I also recall my pushback was along the lines of "who on earth would want that". Yeah, it's a good thing I'm not the person coming up with these things :)


> I recall when the pipe operator was first being proposed the argument for it was that it'd enable workflows that felt more like Excel.

Where are you getting that from? To start with, the pipe operator has been independently reinvented multiple times in R, and neither ‘magrittr’ nor ‘dplyr’ was the first to introduce the pipe operator into R. And (at least when I was exposed to it), the pipe operator had nothing whatsoever to do with Excel. Instead, it was an attempt to introduce the composability concepts from the UNIX shell and Haskell composition into R.


You have me second-guessing myself; perhaps I’m conflating it with the convo around dplyr in its early days

EDIT: I found the conversation in question but it involved deleted tweets. And those deleted tweets are the ones that reference the package name. Sigh. It was just after the release of magrittr and several months after dplyr


> I recall when the pipe operator was first being proposed the argument for it was that it'd enable workflows that felt more like Excel.

I have no idea where you get that impression, most Excel power users I have met take a long time to understand how to use the pipe operator in R.


How do you feel about the pipe operator these days?


I haven't used R enough in the last 10 years to have an R-specific opinion. And to be honest it was more an unlearned statement on my part as it was an "ew, Excel" response and not thinking about the underlying workflow.

In the intervening time I've become a large advocate for the pattern of chained operators. So I'd imagine I'd enjoy piping in R. And if that means I'm emulating a common Excel workflow, that's fine. I won't have the childish response of "ew, Excel" :)


100% this :)


Ha ha, I love that this is your only comment here! Thanks for all your work on R.

I came here with sleeves rolled up to defend the language, but was pleasantly surprised to find it was already being done much better than I could have.

It's interesting to see how R elicits such a reaction from some programmers. I think it's frequently misunderstood, and R needs to be used in a particular way to allow it to fly.

When I've tried to recreate analyses in Python or Julia, they have nowhere near the fluency of R. You won't see this if you're messing around with if statements and other procedural methods of achieving things that are better suited to other languages, but rather when crunching data for analysis and graphically visualising the results.

I also understand that it's due to R's lisp-y-ness that allows us to have tidyverse in the first place.

Question for Hadley - there have been a couple of projects to fuse the speed of data.table and tidyverse. What do you think of this aim and are you tempted to change tidyverse to get to the speeds of data.table, or would that require too much of a fundamental change?


Have you seen https://dtplyr.tidyverse.org? It gives you the syntax of dplyr and (almost all of) the speed of data.table.
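
Roughly, the workflow looks like this (following the dtplyr README; mtcars is just a stand-in):

    library(dtplyr)
    library(dplyr)

    mtcars_dt <- lazy_dt(mtcars)     # wrap once; dplyr verbs are then translated to data.table

    mtcars_dt %>%
      group_by(cyl) %>%
      summarise(mpg = mean(mpg)) %>%
      as_tibble()                    # nothing is computed until you collect the result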


Oh, I feel a bit silly now! I'd seen a couple of attempts to combine the two, but didn't realise this one was official. It looks great. I have an analysis coming up that borders between dplyr and data.table in terms of size, so will check it out then!


There is also the tidytable package. But dtplyr works really well. Have used it in a couple of shiny apps that wrangle some heavy input files.


Before downvoting this short one-liner make sure you check who wrote it!


Thanks for all your work on the tidyverse!


Not a lot of people would realize that Hadley is a minor celebrity in many scientific fields due to Tidyverse! Thanks for all the work.


Bravo!


Yeah, and the growing user base, widening ecosystem, and continual stream of analysis packages being written only in R suggests that lots of others agree.

An important factor not often mentioned is that I think R really helps individual developers/very small teams to be productive.


I feel the exact same way! I've used R for the past decade. Once you learn the philosophy behind it, it just works. Yesterday my boss asked me a question about a dataset and I wrote code to analyze it while talking through the problem in real time.


The main issue I've had is speed. As soon as you have problems that can't be vectorized, models that take 30 hours to run in R take 30 minutes in python.


In my limited experience, problems that cannot be vectorized really shouldn't be written in python either (assuming you mean python loops). But indeed the edge that Python has is the ease of use of drop-in solutions like Numba, allowing you to continue to write Python that isn't really Python.


Mind giving an example? The only time I faced this was due to an autoregressive model, which was super easy to delegate to C++.

I've been working with Python for the last year and appreciate how much it helps with general IT problems, but I would still stick to R for statistical/data analysis.
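
For what it's worth, that kind of delegation can be a handful of lines with Rcpp; a sketch of an AR(1)-style recursion (purely illustrative):

    library(Rcpp)

    cppFunction('
    NumericVector ar1_path(NumericVector eps, double phi) {
      int n = eps.size();
      NumericVector x(n);
      x[0] = eps[0];
      for (int i = 1; i < n; ++i) {
        x[i] = phi * x[i - 1] + eps[i];   // each value depends on the previous one, slow as a plain R loop
      }
      return x;
    }')

    x <- ar1_path(rnorm(1e6), 0.9)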


Example, please?

This seems highly unlikely, based on my 20+ years with R. Yes, using wrong data structures/algorithms can lead to slow code, but switching languages won't fix this.

rprof and microbenchmark are your friends if you really need to optimize your code.

and (as in python, and as several others have pointed out), if you have something especially challenging, write it in C/C++/fortran instead, and link it to R.


In both languages, you can write/use C extensions.


you can insert C code very easily in R for when you need more speed.


Thanks for expressing how I feel about R so succinctly.


The common trope with R is that statisticians love it and developers hate it.

The main reason that statisticians love it is that the libraries useful to them are much better in R than elsewhere (though Python keeps encroaching on that turf, and "real developers" dislike Python a lot less than they do R).

The main reason that developers hate it is that it is very unlike almost all other languages that they're used to. This is very valid since outside the narrow domain of statistics, there's probably nothing that R does better than other languages. So for a dev who occasionally dabbles with R by necessity, the otherness serves nothing but frustration.

Still, I wonder how much criticism there is against R as a programming language that is not some variation on "this works very differently from other languages". IMHO the subsetting syntax and the countless x-apply variations are big warts. I'm not a big fan of Tidyverse, and even less of the schism between base- and Tidy-R. I read some seemingly fundamental criticism about R's deficient scoping rules, but I'm not nearly knowledgeable enough to judge their merits.

I guess it doesn't help that almost nobody learned R as their first computer (as opposed to statistics) language. Personally, I learned C, Matlab, Python/numpy, SQL, R in that order. R does seem to be quirkier than all the others, except maybe SQL. But I don't dislike working in R any more than working in any other language.


> it doesn't help that almost nobody learned R as their first computer (as opposed to statistics) language.

Aside from two statisticians I had as professors, I am yet to meet someone with deep understanding of statistics who doesn't speak R as first language ...

I found it way easier to grasp the meaning of statistics by playing with R than by reading the maths.


I used RStudio to work through problem sets, textbooks, and ideas constantly during my time as an applied math student. My concentration was stats, but still found RStudio invaluable for pretty much every math class I took. I say RStudio specifically because it offered the complete package for what I needed at the time. Built-in graph viewing, workspace management, etc. As another commenter said, pretty much the best graphing/scientific calculator I could ask for.


You’re right, but I’d think in recent years most stats people would have Python as their first language. It probably comes preinstalled on their OS at this point (not sure if this applies to Windows outside of WSL yet.)


Python has a big advantage in deep learning but R still has an edge in classical statistics.


I had some Matlab experience about 3 decades ago. What’s your take on Matlab vs R as programming languages?


If signal processing and matrix algorithms are your thing, you should (and probably would) be using Matlab. Most statisticians don't really do much of that; what they mostly do is data management, trying to mold their tables into some form accepted by an existing R package (or even Stata). As far as I remember, Matlab was pretty horrible for data munging, even worse than R for anything non-numerical. But my Matlab experience is also almost 2 decades old, so I don't know if it's any better today.


My take is that Matlab is better in basically every regard, with the exception that the same functionality will cost you a considerable amount of real world money.

Having used both in a professional setting - and coming in with a fair bit of programming experience - Matlab is generally a pleasure to use. It's different where it needs to be in order to treat matrices as first class citizens, but otherwise you can apply many of the same intuitions and paradigms that you would in any other language.

R on the other hand... R is a fucking disaster of inconsistency. I find myself incredibly frustrated attempting to do simple and sane things - things that I know are only a line or two in Matlab (or even python) and instead fighting with a "which version of the 12 different slight variations of this operator are you attempting to use today!" hellscape.

My strong guess is that if you have no coding experience, and you learn R fairly thoroughly - it will feel very nice. My problem is that for anyone with actual coding experience, it's like being given a keyboard with a qwerty layout, but which is actually using dvorak. All your intuitions are pointlessly wrong - not because they are actually problematic, but because R has decided that the A key is really on the other side of the fucking keyboard.


But isn’t the consensus that R is a better language? At least that is what the cool kids said when I went to college (I never used Matlab except maybe a handful of times.)


define "better"?

They each have a few strengths over the other, but generally speaking, I much prefer the language consistency of Matlab.

My general experience is - industrial shops will be using Matlab. Almost all of my Matlab work was aerospace related (think sigint/radar/signal processing/modeling).

R is more popular in education environments - but I strongly suspect that's just because it's free. Post-grads don't have much lab funding to work with at the best of times, and Matlab with an associated set of plugins/libraries specific for your task can easily run 30k a seat.

Personally - I find it pretty telling that most places with money choose Matlab. Doesn't inherently make it better, but it does mean Matlab is getting used in places where mistakes are expensive, and there's a focus and consistency to the tooling that I just think is desperately lacking in R.


I've had to translate a lot of Matlab to R in college (physics and econometrics).

I rarely found an important difference between the languages besides having to transpose some matrices here and there.


Not the grandparent. Matlab is much simpler for matrices than R, approaching Python's numpy in ease of use (and very reminiscent of Fortran and Julia).


> "real developers" dislike Python a lot less than they do R

I thought that was real Scotsmen.

Because real Scotsmen prefer:

table.loc[(table.column > 2) | (table.column2 < 3)].reset_index(drop=True)

to

table[column > 2 & column2 < 3, ]

and everyone knows this!

not to mention, if you aren't managing 100 virtual environments and 100 conda environments (with different syntax for requirements), you aren't a real Scotsman!


You can do

    table.query('column > 2 or column2 < 3')
If you want. I'm not sure why you're dropping the index there.


Yes, I definitely put that up as a real Scotsman. The "real (Java) engineers" at my company scoff at the loosey-goosey attempts of the Python engineers trying to productionalize the numpy mess produced by our ML engineers. I mean, how can you "productionalize" anything without Builders and Factories?


I think this is really interesting. The author certainly isn't an expert, for example `result[which(result < 0.5)] <- 0` is a mistake for `result[result < 0.5] <- 0`.

But that's just why it's useful - R is great when you are an expert, but becoming an expert takes years. The perspective of new users is really important. (I've been using R almost 20 years, have written several packages, and still feel like an amateur. Indeed, I'd never heard of `**` as an alias for `^` until today; nor `sequence`, which apparently has always been in base; and I still can't remember what `sweep` does.)

I thought some of these arguments were better than others. True that base R regex is confusing and messy (and that stringi/stringr are improvements). False that allowing string concatenation with `+` would be a good idea. That's just a footgun waiting to go off, given that R also is weakly typed. Expecting `nchar(1000)` to magically work seems naïve. `<<-` (roughly, global assignment) is an ugly necessity and a code smell, not a cool language feature.

An awful lot of these problems are fixed, or try to be fixed, in the tidyverse. Not using tidyverse is a bit unusual because most beginners nowadays, I think, start with the tidyverse more than with base R.

For me the worst part of R is simply that it fails silently. This is really deadly, especially when you are producing scientific results. There are so many places where R will plug gamely on after you have done something deeply inappropriate. Given how badly scientists code, one has to worry.
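
A minimal sketch of the sort of thing I mean (my own toy example, not one from the article): vector recycling quietly pads the shorter operand instead of erroring.

    x <- c(1, 2, 3, 4, 5, 6)
    w <- c(0.5, 2)        # weights that are accidentally too short
    x * w                 # 0.5 4 1.5 8 2.5 12 -- silently recycled, no warning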

I don't agree that "R won’t change" is the base problem. It's not so simple. R is used for science. I like very much that my code from 2008 will probably still work if someone wants to replicate my results. I appreciate the R-core team's work in making this true. There are genuine trade-offs here.

If you want emotional relief, it's worth following https://twitter.com/whydoesR.

Maybe Julia is the way forward? Or is R "worse is better"?


> Maybe Julia is the way forward?

Julia is well worth learning, if you do computationally-expensive work. It is kind of a pain to use interactively, though. I use both R and Julia in my research. Think of Julia as the new Fortran, though, not the new R.


Isn’t Julia a compiled language “pretending” to be interpreted? I thought it was interpreted or JIT’d up until its last release or so, when they mentioned it has to actually compile. It’s fast, sure, but is it faster than other compiled languages?


I do not have deep experience with this, but the word on the street is that Julia can be faster than other compiled languages. I think that's partly because it can find a good algorithm, based on your data structure; think of loop unrolling, etc.

You can inspect the assembly code for anything you're working on, and that can be quite helpful at times; see e.g. https://youtu.be/wU6c8CDRXJE?t=3887.

I think the reason why quite a few high-performance people (I mean in the science community -- I don't know much about other communities) are excited about Julia is simply that well-respected experts are also excited. An example is the Julia implementation of the MIT GCM (general circulation model) for the ocean; for similar projects, see https://github.com/CliMA.

Programmer effort is also a factor in scientific computation. If the system can do some of your work for you, so much the better; see e.g. https://www.youtube.com/watch?v=rZS2LGiurKY for a lecture that touches upon how the framework of Julia eases the burden of machine-learning tasks.

As I say, though, I do not have deep experience with Julia. I've rewritten one of my numerical models in Julia and the speed is about the same as before, but my code is much shorter and easier to understand. I would not burn up 6 months translating a complex code, but nor would I start a 6-month coding project in Fortran anymore.


In general, Julia has similar performance characteristics to other fast compiled languages (Fortran/C++). There are some performance differences due to different semantics (e.g. bounds checks by default), but Julia is good about giving you the ability to opt out of these easily. There are also definitely places where Julia makes it a lot easier to get better code by making it a lot easier to use better algorithms, so in general, simple Julia code tends to have similar or better performance than Fortran/C++, and optimized Julia code tends to be about the same speed as Fortran/C++, but with 10x less code.


I agree that Julia isn't yet an R replacement, but I think that in addition to being a new Fortran, it also does well as a new Matlab/numpy.


I think a lot of the problem with comparing R to other languages is that a lot of people don't get the problem space that R is working in. Science deals a lot with categorical variables, missing data and high-dimensional data, and the 'table' or 'dataframe' is adept at storing and working with this information. Under the hood it's just a load of optimised Fortran code working on matrices, but the code clearly shows what kinds of data manipulations and transformations you are doing to eke the right information and visualisations out of the dataset.

I see problems when people take an imperative approach to solving numerical problems, and something like Python is better suited to that. Also, R isn't really set up to work with matrices like Matlab/Julia are.


The points the author mentions are fair but something feels amiss. I have used R heavily and still use it from time to time and I never use most of the functions mentioned in the post. For instance I have never used switch().

R is for data manipulation. 90% of what I do in R is manipulate dataframes or matrices and then run machinelearningmodel(mydataframe) or ggplot(mydataframe). And for this it is incredibly efficient. You can rightly argue that some elements of the language are quirky but that's missing the point.

> Asked over 100 Stack Overflow R questions.

As a tangent, I find a hundred questions asked in the first year of using a very mature language to be a lot.


Yes, I think if you are using switch() for an analysis in R, you're either using the wrong language, or you're doing R wrong.


What always surprises me is how many people make beautiful, lovingly-crafted band-aids to the language's warts and problems. Not just code + packages but social band-aids too.

In a way, you could argue that the entire tidyverse is a huge effort of a band-aid.

So for all the irritating design choices and idiosyncrasies, R is still a network of islands that work incredibly well for people, as long as they don't ever go to sea.


I've written an interpreter for R (a subset; it was for school and I left out some features like S4 and the condition system), so I have done a pretty deep dive into the language reference and GNU R source.

I agree with the author's sentiment - I love a lot of what R has, but there are a lot of small madnesses.

There are so many unique PL ideas in R (they may not actually be unique, but they're certainly unique among common languages today):

- first-class environments

- named, default parameters and even the ... parameter, which encourages the pattern of hierarchical library functions: there's one large customizable main workhorse function, and many wrapper functions that specify some defaults or add some behavior, but all the underlying customizations are exposed through ... (see the sketch below)

- copy on write as a default

- ability to choose evaluation strategy
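
A rough sketch of that hierarchical-wrapper pattern (my illustration, with made-up function names):

    plot_series <- function(x, y, col = "black", lwd = 1, ...) {
      # the workhorse: anything not named here passes straight through to plot()
      plot(x, y, type = "l", col = col, lwd = lwd, ...)
    }
    plot_series_red <- function(x, y, ...) {
      # a thin wrapper: pin one default, keep ... open for everything else
      plot_series(x, y, col = "red", ...)
    }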

But I also wonder how many of these cool ideas would actually work well in a saner language


Is the subset R interpreter you wrote available on the net?


Sorry for getting back to you so late; I'd rather not link my github here as I'd rather remain as anonymous as possible on public forums


R is a truly terrible language with a handful of bright spots, such as its visualization libraries.

The boost you get from the slightly better expressiveness of R over something like Julia or Python is not worth the headaches you'll run into down the road in trying to maintain whatever you wrote 6 months later, or God forbid, trying to integrate your code into someone else's work.

R was my first language and in hindsight that was a HUGE mistake. So much of the R code out there is horribly written, and even when it isn't, you still have to deal with all of the issues the author here points out. If you pick up R as your first language, you will end up picking up all sorts of bad habits.

R is fine if you're working solo and you don't plan on maintaining or reusing your code. For everything else, R is garbage. It took me a year or more to undo all of the bad habits I picked up learning R.

I don't agree with the "worse is better" comparison in the comments here. "Worse is Better" was meant to refer to the idea of "Don't make the perfect the enemy of the good", among other things. It was not meant to be used as a justification for poor design. If anything, python for data analysis fits the "worse is better" philosophy much better than R. It's not as well optimized for data work compared to R, but it's much simpler, more consistent, less error prone, and it plays well with others.


Counterpoint: a lot of work in R is not "development" in the ordinary sense. The outcome is not a piece of maintainable code that needs to be built on later or be generally useful in any way other than copying an occasional snippet.

In some research fields (e.g. scientific fields that use R) the ground rules are that the code needs to be understandable and it needs to be clear that the libraries involved were used correctly. That's basically it. Even hardcoded directories are common. Good development practices are not widely understood to be important and in general many people are just starting to get the hang of version control and might not use it at all.

If R enables you to solve a statistical problem you have right now and it does this in a way that is better or more comprehensible for the people who use it, that means it has a niche. As someone with software development experience in a bunch of other languages, I agree with you that R is full of weird warts, but let's not forget that there are areas where its value is still obvious.

Citation: my partner works in a scientific field where R is predominant.


> I agree with you that R is full of weird warts, but let's not forget that there are areas where its value is still obvious.

For sure. As much frustration as I had with R, at the time it was an enormous improvement over the stuff that came before it. And its emergence and success led to other languages improving their data and analytics capabilities.


> R has two types of empty string: character(0) and "".

I understand it's frustrating trying to use a language you don't understand. And instead of reading the language manual you go on rambling.

"" is an empty string (almost) as you know it from other languages.

character(0) is an empty vector of type character (i.e. a vector with no elements). This vector doesn't even contain an empty string.

R is a vectorized language. You almost always deal with vectors. "" actually is a character(1), a character vector of length 1. Once you understand this, there is a chance for you to enjoy R.
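
To make the distinction concrete (standard base R behaviour):

    length("")                    # 1 -- a character vector holding one empty string
    length(character(0))          # 0 -- a character vector with no elements at all
    nchar("")                     # 0
    identical("", character(0))   # FALSE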


I'm going to side with the author here: if he read "Advanced R", "R for data Science", "The R Inferno", "Rtips. Revival 2014!", the official "An Introduction to R", "R Language Definition", and "R FAQ", and yet he still has problems with the language, then maybe the language is to blame.

And even if the author is the problem, I wouldn't accuse them of not reading enough.


Ok, but if someone claims to have read all the Python manuals and wrote something like

> Python has two types of empty string, array('u',) and ""

you'd probably conclude that he hasn't really understood what he read.


For Python (at least Python3) a better example might have been b"" and "". They are not equal, and they are empty. You have to decode or encode one, for instance, and different functions return different things. Then different functions might return False, None, (), {}, etc.

This OP complaint seems like weird nitpicking about R. Many languages have different empty/null-types for different variable-types. Also, don't get me started on "nulls" in C, C-strings, C++ strings, or memory allocation.

All languages are complicated.


does f"" also count as an empty string? Because oddly enough, I don't see anything in that syntax which suggests that it is really a function which returns a formatted version of whatever is in the "".


Or maybe the language just isn't for everyone and for every use case. I would be hesitant to write something customer-facing in R. But it's great for doing statistics. The main problem with R is that people underestimate how different it is and thus don't care to learn practices for writing robust R code.


The issue isn't so much that character(0) is a zero-length vector and "" is a length 1 vector containing the empty string, it's that you can't necessarily rely on other people's code returning one or the other: things that 'nearly' always return a one element vector (which may contain the empty string for 'nothing') can vary unexpectedly if it fails on an edge case. And, unless you catch it correctly, this can cause downstream failures with little in the way of warning (because a zero-length vector is obviously a 'sensible' thing to return from a function in the usual case).

In that sense, it's similar to the problem many languages have with NULL, but on steroids: you can have NULL, NA, character(0) (or anythingelse(0)), or '' as your null result, and each of them is tested for in a different way.

Obviously this won't be a problem for the various battle-tested standard libraries, but a lot of my work in R at least is assembling somewhat-novel analysis pipelines based on quite new statistics code.
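
A sketch of that failure mode (my own contrived example, not code from any real package):

    # a stand-in for a function that "nearly always" returns a one-element character vector
    extract_label <- function(x) x$label[x$label != ""]   # returns character(0) on the edge case
    res <- extract_label(list(label = ""))
    if (res == "") res <- "unknown"    # Error in if (res == "") ...: argument is of length zero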


To some extent, this is rather a general problem with quality of code in dynamically typed languages. Public functions should return predictable results. This is a matter of testing. In my experience, packages on CRAN are well tested.

With respect to dealing with return values, you can circumvent some pain points by using identical(), isTRUE(), isFALSE() in if conditions instead of, e.g., `==` which many people use because this is what they know from other languages. The assertive package is also nice.
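
For example (my sketch), `==` on a zero-length vector produces logical(0) and breaks `if()`, whereas the safer forms just return FALSE:

    x <- character(0)
    # if (x == "") ...        # error: argument is of length zero
    isTRUE(x == "")           # FALSE, no error
    identical(x, "")          # FALSE, no error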


That could have been expressed more diplomatically, but I think you're right. IMHO, what people should try to understand first and most thoroughly about a new language is its native data types. This is more fundamental than the syntactical constructs.

R's data types are one of its most alien parts, and that's why I think if you're coming from another language, chapter 20 of Hadley Wickham's book[1] is the most important one.

[1] https://r4ds.had.co.nz/vectors.html


> I understand it's frustrating trying to use a language you don't understand. And instead of reading the language manual you go on rambling.

I'd agree with this assessment. If you start doing R and it feels weird to you then -- in my opinion -- you're probably in the wrong place. Meanwhile, for the cognoscenti -- the researcher, the statistician -- R behaves just as you'd expect. That is the draw -- a language developed around statistics.

R is not a great computing environment for computer science, e.g. writing iterative algorithms. Almost everything worth a damn in R is written in C++ and then FFI'd in. Those who do not want to use C++ can write their algorithms in Python or Julia -- and they often do. Arguably the de facto language for computation-oriented machine learning is Python, not R.


The popularity of the Tidyverse is a major blow to your motivation to learn R. Why would anyone want to learn a language that is treated as secondary to some packages? Worse still, if that turns out to be the best way to use R, then you’re forced to admit that R is a polished turd with a fragmented community.

As others have mentioned, just use tidyverse. I picked it up 4 years ago, and last week I went back to the code I wrote then.

I was productive in minutes. I could read the code, modify it, and easily test it in the REPL. The docs for dplyr are good.

ggplot2 is still awesome and the docs are good there too. ggplot2 is the fastest way to figure out what you want and make a pretty plot.

(However one thing that still annoys me is that R moves faster than Debian. So it's possible to do install.packages() in R, and it will break telling you your Debian R interpreter is too old. There is no easy solution for this, just a bunch of workarounds)

-----

OK, sure you can call it a polished turd, and to some degree that's true. But a polished turd is better than just using ... a turd!

The error messages in R are not quite as good as Python, but I wouldn't call it a problem. I'm able to localize the source of an error, even when using tidyverse.

My article comparing tidyverse to some other solutions:

What Is a Data Frame? (In Python, R, and SQL) http://www.oilshell.org/blog/2018/11/30.html

----

But would I recommend learning it to anyone else? Absolutely not. We can do so much better.

I would recommend it, with the caveat that it's one of the hardest languages I've had to learn. However, that is partly because it changes how you think. But if you have a certain type of problem then you have to change how you think, or you'll never get it done. Data analysis is surprisingly laborious, even for people who have, say, written compilers and such.


Lot of useful insights in the comments here. I wanted to address one specific comment -

>can't remember the last time I saw a project someone did in R get very much traction anywhere...the only time people talk about R on the internet is to discuss the language itself which is definitely frustrating

There is a lot of R deployed in industry, even in Silicon Valley, but you have to be in-the-know. R gets plenty of use in statarb & model checking in finance - speaking from personal experience at GS & BofA/ML. My one non-trivial project at Twitter involved working with this team building a model & I remarked - hey, this can be done rather easily if you use this library in R - and the team lead says, yeah, that's how we're doing it! But I thought we are a Scala shop, I said. So he says, yeah, but imagine building that entire library in Scala from scratch, it'll take forever! So I enquired how he gets it done - you basically spin up a socket server & the JVM sends R commands plus data as payload over the socket, the server runs R and returns the result of the model back as a string, boom done! I said it was kinda janky & he says - I won't tell if you don't! So that's R for you - it gets the job done & it's fast & somewhat messy, but it is used everywhere, yet people won't openly admit to it because it's a 30-year-old language & we all want to be using the latest & greatest tool.

I now work at a news startup with a few million users, & all of the news personalization is done in R. So when these millions of viewers watch TV, the piece of code that decides which news clip should be shown ahead of which other news clip & which clip comes after - all of that is decided by a block of R code that I wrote. ~ 300 lines of R, uses quanteda, tidytext & parallel under the hood. Pretty much everything I do involves mcmapply, which parallelizes your compute & uses as many cores as you specify. But that's sort of the thing with R - you have to know which functions/libs to use & which ones to avoid. Just switching from tm to quanteda got us a 200% bump in perf. Switching sapply's to mcmapply was another winner. These things aren't documented cleanly - you have to keep up with cran, experiment & see what works best for you.
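
Roughly the shape of the mcmapply usage I mean (the data and scoring function names below are made up, not the real production code):

    library(parallel)
    # score each clip on its own core; mc.cores caps the parallelism (POSIX only)
    scores <- mcmapply(score_clip, clip_id = clips$id, text = clips$text,
                       mc.cores = 16)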


I would say about 90% of the posts / articles / comments I see on the internet which discuss R are usually of the "meta" format. They talk about R's strengths or weaknesses, about the difference between R and Python, about how much they love or hate R, or any other high level subject.

I can't remember the last time I saw a project someone did in R, or a tutorial on how to do something in R, get very much traction anywhere. It seems the only time people talk about R on the internet is to discuss the language itself (which is definitely frustrating), and it's getting old. Even this awesome, comprehensive document, which I would usually be foaming at the mouth to read, has me going "meh". I'm tired of the subject.


> I can't remember the last time I saw a project someone did in R, or a tutorial on how to do something in R, get very much traction anywhere.

well, you know, I'm not very active in C++ any more, and I haven't seen an article in over a decade on C++ which received any traction at all. So I guess C++ isn't getting any traction any more either.


They don't often come with code, but one of my recent sources of R programming joy is the folks posting their generative art to twitter: https://twitter.com/search?q=%23rstats%20%23generativeart&sr...


This is definitely true on HN, at least. I think the vast majority of R-users are just plugging away on their domain specific problems daily, and tend not to participate in these conversations.

Dark-matter statisticians, I guess?


What needs to be added is that before R, the reproducibility problem in science was compounded by the fact that analyses were done with proprietary software, limiting communication and replication of those analyses. This was and continues to be a major problem, particularly in some fields, but at least now there is a common, widely used language that can be used to overcome this. I wouldn't focus on idiosyncrasies but rather on the major problem it addresses. Any large system will grow over time and have some inconsistencies, but after a while you learn the workarounds, so they are less important than the big picture.


On the contrary, R's packaging system is too broken for R to be reliably reproducible. No one specifies package versions or R versions. Base R has no way to install a specific version of a package. There's a package that lets you do that, but, well, you might need a specific version of it. Particularly if you need to run an old version of R to reproduce an old script, it may be impossible to use any standard tool to install the correct packages thanks to this problem - the version of devtools that install.packages gets won't be compatible with your old R, but you need that package to request another version. Instead everyone just ignores it and hopes package versions don't matter.


I don't see how R specifically addresses the reproducibility problem. It's been around for almost 30 years, and before its recent rise in popularity lots of science was done in C, Perl, Fortran, etc. Not to mention that actual dependency versioning is pretty poor. I struggle to run other people's R code after about 6 months (especially if they used the tidyverse, as it pulls in hundreds of unstable dependencies), and nobody records what package versions are used and functions are seemingly deprecated every week.


1. Before R, commercial statistical packages were mainly used. You can, in principle, just use assembler too and develop everything yourself, but it isn't practical. Regarding C/C++ and Fortran, many R packages are, in fact, wrappers around code in those or other languages, making it easier to access them. From that point of view R can be regarded as a glue language.

2. Regarding keeping versions straight, all past versions of packages in the CRAN repository are kept on CRAN. The Microsoft MRAN repository also maintains histories of packages that can be accessed via the checkpoint package, which will install packages as they existed on a given date. Furthermore, install_version in the remotes and devtools packages can install specific versions (sketch below).

3. Regarding tidyverse dependencies, you can reduce the number of packages you load by not using library(tidyverse) and instead loading the specific packages you need. This will result in fewer packages being loaded.
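
For concreteness, a sketch of the tools mentioned in point 2 (the dates and versions are purely illustrative):

    # install packages as they existed in the snapshot for a given date
    checkpoint::checkpoint("2020-06-01")
    # or pin a single package to a known version
    remotes::install_version("dplyr", version = "0.8.5")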


> Before R commercial statistical packages were mainly used.

Maybe in your field, I work in bioinformatics - before R, perl was widely used as a high-level language.

> Regarding keeping versions straight, all past versions of packages in the CRAN repository are kept on CRAN...

This is woefully inadequate if you need to replicate somebody else's environment. Nobody should think manually guessing and then typing in each package version and hoping they're compatible is a viable option. Not to mention even if you specify an older version of a package it doesn't pull in compatible dependencies, it just pulls in the latest version. There's renv but it's not reached widespread use.

> Regarding tidyverse dependencies you can reduce the number of packages you load by not using library(tidyverse) and instead load the specific packages you need. This will result in fewer packages being loaded

We're talking about replicating other people's work. We don't have any control over their code, and R users are largely ignorant of best software practices.


Totally agree. I find it frustrating trying to reproduce other people's work in R. How has this situation been allowed to continue for so long? It's unacceptable, especially when used for science. It's impossible to replicate anything unless you are lucky enough to find which package version introduced the breaking changes, and even then this is something you have to do repeatedly for every code break. Even with renv, it's a library you have to install within your R environment, which is pointless. Where is a dependency solver like conda for R? Not that conda is perfect, but I've been happy with its drop-in replacement, mamba, recently.


The packages that were used in statistics were SAS, SPSS and Stata. perl is not a statistical package and has nowhere near the depth of statistical capabilities of R.

Don't forget that I also mentioned the checkpoint package in my post. You only need to know the date for that, not the version of each of the packages.

In your last paragraph I think you are referring more to software development practices than what is available through R. Simply using R or any language doesn't guarantee this.


That's a very roundabout way to solve an actual problem. In many cases you don't pin your package version to _latest_ (whatever that date is) and you need a more fine-grained solution to keeping package versions. I don't think that solves this and I don't know if you can do it with checkpoint.


Of course it is possible to screw up but if you don't update your packages and record the date that does not seem to be R's fault.


um... this is about statistics; before R, people used to finish their analyses in MATLAB


I don’t like this. Much of this is:

1. pointing out that, like every other language, base R has idiosyncrasies

2. how use of R is more complex when you're largely ignorant of the tidyverse, which is crucial for the vast majority of today's use of R

3. frustration because you’re using a language/ecosystem, that’s targeted for a few specific uses, as a general purpose programming language


> how use of R is more complex when you’re largely ignorant of the tidyverse

This.

I'm interested in non-flamewar non-religious reasons that the tidyverse is bad. He does give some. I think his complaints about inconsistency and a moving target have some validity. However, the price of not using tidyverse is (roughly) paid in the rest of the article. I would definitely not use R without it.

Read his Section 5 on the tidyverse... and see how absolutely minimal his complaints in that section are. E.g. to "purrr" his objections are "largely philosophical"... but he's complaining in the previous section about the annoyance of writing lambdas (which purrr makes even easier).

Yes, R has a big community and there's a lot of quirks in individual packages, especially less-used ones. Yes, there are packages presenting unified interfaces to other quirky outputs (e.g., broom). The necessity of this is not good. The existence of it is good.

HN readers - do you have an "up and coming" language that you think has better structured the fundamentals from R, that you hope will someday have enough capabilities you can use it instead of R? I've tried Julia, which is beautiful but the startup/compilation times were difficult to get over. Is it reasonable to hope Julia will be good for interactive usage someday? Is it already? Are there other candidates in this area?


> Yes, R has a big community and there's a lot of quirks in individual packages, especially less-used ones

Most of his examples of WTF's are from base-R. And he's definitely not wrong, as many of these have bitten me a bunch over the years.

> I'm interested in non-flamewar non-religious reasons that the tidyverse is bad.

For the very reason that it's great to use, it's a nightmare to develop with. NSE is super handy as a user, but it's an absolute nightmare to build new functions on top of (dplyr specifically). Like, I now know 2-3 different ways in which quoting/substituting etc can be done for the tidyverse, and I've had to maintain code using them a bunch of times.

It's incredibly annoying, and every time I do it I need to look up Hadley's new approach to NSE (don't get me wrong, I adore using the tidyverse, but I absolutely despise programming with it).
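
For anyone curious, this is roughly what "programming with the tidyverse" looks like under the current tidy-eval approach (one of the 2-3 ways mentioned above; my sketch, not anyone's production code):

    library(dplyr)
    summarise_by <- function(df, group, var) {
      df %>%
        group_by({{ group }}) %>%                  # "embrace" the bare column names
        summarise(mean = mean({{ var }}), n = n(), .groups = "drop")
    }
    summarise_by(mtcars, cyl, mpg)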


> HN readers - do you have an "up and coming" language that you think has better structured the fundamentals from R, that you hope will someday have enough capabilities you can use it instead of R?

Hope is the operative word here!

I'm writing a language to compete in this area. It's called Mech and I'll be releasing the first beta in October. You can think of it like Matlab + Excel. It's very fast, has default-parallel semantics for operators and functions like Matlab, reactive dataflow like Excel, and supports full interactive coding with no startup/compilation latency issues. It's meant for robots, but I've also designed it to be a better Matlab, and I think it should take on R handily. Fair warning, it's public alpha now so error messages are sparse and the happy path is narrow.

https://github.com/mech-lang/mech


> I'm interested in non-flamewar non-religious reasons that the tidyverse is bad.

Going to answer with a question: Why is tidyverse == R considered true?

I use ggplot frequently, but for data manipulation data.table is orders of magnitude more powerful. And more stable.


data.table also uses OpenMP to parallelize operations, so it tends to be much faster


Tidyverse is so much more verbose than data.table, it’s painful. I don’t see the draw to it, to be honest.
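
For a flavour of the difference, the same aggregation both ways (my sketch; which reads better is, of course, the argument):

    library(data.table); library(dplyr)
    dt <- as.data.table(mtcars)
    # data.table
    dt[mpg > 20, .(mean_hp = mean(hp)), by = cyl]
    # dplyr
    mtcars %>%
      filter(mpg > 20) %>%
      group_by(cyl) %>%
      summarise(mean_hp = mean(hp))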


Worst part of the tidyverse is learning it, and then looking up how to use specific functions. The bad documentation is mostly in the ggplot lib though.

It's a pleasure to use, though!


The “bad” ggplot documentation is mostly a function of one’s own understanding of the grammar of graphics. That’s what the “gg” in ggplot stands for.

If you don’t understand the GG, then ggplot will seem opaque, and no goodness of documentation will suffice.

I don’t mean to blame the user. Perhaps the ggplot documentation could improve by reinforcing the need to understand that or referencing it more frequently?


It's a good reading but some of the complaints are hard to understand.

For example, in 4.5.1:

  Selecting and deleting at the same time doesn’t work either. For example, data[c(-1, 5)] is an error.
What would it mean for that to work? He seems to acknowledge that "selecting and deleting at the same time" doesn't make sense in 4.11.1

  Can you guess what data[-1:5] returns? I can’t either, so don’t ever try it. If you must know, it’s actually an error.
Also in 4.11.1:

  The : operator is absolutely lovely… until it screws you. The solution is to prefer the seq() functions to using : [....] As I’ve said, seq() and its related functions usually fix this issue.
Maybe the "related functions" fix some issues but seq(a,b) is not different from a:b

In 4.11:

  Now what do you think names(foo) <- names(bar) does? Seriously, can you guess? I can think of roughly four realistic guesses. Is it even valid syntax? 
How is that surprising? Can the author also think of four realistic guesses about the effect of A[1,2] <- B[3,4] for example?

In 4.13:

  The index in a for loop uses the same environment as its caller, so loops like for(i in 1:10) will overwrite any variable called i in the parent environment and set it to 10 when the loop finishes. [...] This sounds awful, but I’ve never encountered it in practice.
Is it awful? The same happens in other languages like Python or C if I'm not mistaken.

  The plot() function has some strange defaults. For example, you need to have a plot before you can plot points [...]
I have no idea what that means. You can plot points using plot() without having a plot beforehand.

Edited to add: In 4.5.3:

  The $ operator is another case of R quietly changing your data structures.
Is it unexpected that when we extract an element from a data structure we get a different kind of data structure? Is A[1,1] another example of silently changing one data structure (matrix) to another (number)?


> Is it awful? The same happens in other languages like Python or C if I'm not mistaken.

It annoys the shit out of me in python. I much prefer perl's

    for my $x (@array) { ... }
or ES6's

    for (let x of array) { ... }
Note that given python and ruby only do function level scoping rather than block level, I can -understand- why they work the way they do even if it annoys me. R already has the necessary granular scoping to do the (IMNSHO) sensible thing so it seems like a pointless wart.

My -guess- would be that if it was intentional, it came about in R because for loops are rare enough that you want to know the last index more often than you don't because if you don't care about the index presumably you'd've written something else.

(also you could argue the ES6 version would be better written using 'const', but I've lisped sufficiently my fingers invariably generate 'let' when left to their own devices - caveat emptor)


> Here’s a challenge: Find the function that checks if "es" is in "test". You’ll be on for a while.

grepl("es", "test")


I haven't used R at all in years and only used it a couple times in passing many years ago to try it out.

I searched "R string functions", saw "grep" and wondered if there was something I was missing in the author's challenge.

Is it because it uses regular expressions that they don't consider it the correct answer, or is it because they aren't as familiar with regular expressions as some other people are, I wonder?
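
For the record, both forms work in base R; fixed = TRUE skips the regex machinery entirely:

    grepl("es", "test")                 # TRUE, pattern treated as a regex
    grepl("es", "test", fixed = TRUE)   # TRUE, plain substring match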


Personally, I feel like the biggest problem with Python for math is the lack of a native vector datatype and our subsequent reliance on NumPy, which really disrupts the elegance/terseness of working with vectors/matrices.

First, there's the constant inelegance/clutter/inefficiency of having to cast into and out of arrays and lists, even when doing basic list comprehension. R, Julia, and Matlab are all vector-based languages (I think), so you avoid having this casting as much.

Secondly, having a native vector type means you don't have to worry about the performance penalty of operating directly on arrays if an existing prebuilt method exists. Since the efficiency of NumPy comes from calling its underlying C library, you're forced to memorize and use prebuilt NumPy functions rather than just use the more obvious and elegant array manipulations. For example, rather than calculating the cumulative sum like this:

  from itertools import accumulate
  cumsum = list(accumulate([1, 2, 3, 4]))

We have to do this:

  cumsum = np.cumsum([1,2,3,4])
(There are better examples, but this is all I can think of right now).

And once you add something like PyTorch tensors on top of this, we now have an additional layer of casting/redundancy/memorization of prebuilt functions!


Excellent read. I agree with a lot (after only a cursory read). One thing the author seemingly forgives R for, by not mentioning it, is how harsh and discouraging to beginners the community was in its early days. That was my experience around 2002-2006.


I'm not sure that's true anymore. The one thing that I find interesting about the R community (being at home in Pythonland myself) is the availability of great teaching and learning resources. At least that's the case in the social sciences, where R has pretty strong adoption.


Yeah I tried to publish a package back in like 2015 and was dealt with very harshly, got banned for a week from submitting for asking questions about the process after the first attempt didn't work. It really turned me off R and frankly I haven't looked back.


R is designed for data analysis, not for general computing. Its syntax differs from that of other systems. Python's syntax also differs from other systems. Same for Matlab. And so on.

Non-uniformity imposes a burden that will be too much to bear, unless the system offers particular advantages. The fact that several systems co-exist is proof that the advantage-burden balance is favourable in each case.

There is no need to converge on a single tool. Carpenters need both saws and hammers.

In practical applications, language syntax is just part of the story. One must also consider the issue of available libraries. One thing that really stands out with R is its immense collection of well-vetted and well-documented packages. Python and Matlab -- the two main alternatives in my discipline -- fall far behind R in this respect. If there's a journal article on a new statistical technique, then there's a pretty good chance of a package written by the same author. And, if that package is on CRAN (the repository for such things), then it has undergone quite rigorous testing on several types of computer, with several versions of R.


AND! packages don't update every 3 weeks breaking things!

My deity! Someone was complaining about inconsistent syntax but doesn't recognize inconsistent dependencies?


My favorite operator is the pipe operator. When I first found out you could do a simple `ls | more` to read long outputs, it was an eye opening experience. In Clojure, we have the threading macros, `->` and `->>` that do a very similar thing. In R, we have `%>%` and now the native `|>`. Whenever a language has this operator and it is widely used, I know I am going to love it.
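
Side by side, for anyone who hasn't seen both (a small sketch):

    library(magrittr)
    mtcars %>% subset(mpg > 25) %>% nrow()   # magrittr pipe
    mtcars |>  subset(mpg > 25) |>  nrow()   # native pipe, R >= 4.1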


I credit F# with much of the popularity of the forward pipe operator. Unlike Haskell etc. which emphasize function binding (>>), idiomatic F# has pipes all over.

    [1..10]
    |> Seq.filter (fun x -> x % 2 = 0)
    |> Seq.map (fun x -> x * x * x)


Looks like I need to check out F#! Thanks!


I only briefly knew R in grad school (circa 2007) but I lived inside Stata for about 7 years, and yeah, while specific frustrations vary, the general tenor...

Then -- I once made a meme to that Oliver Stone Vietnam movie that said "This is my copy of Stata. Without me it is useless. Without it, I am useless. I must cherish it as I cherish my life..." (In the original said of a rifle.) I was good with Stata, fast and precise and never "ugh, okay, let's open a quickie notebook... there goes my morning..."


I support bioinformatics researchers and my R problem isn't the language itself but the increasingly fragile tower of packages that users cobble together.

At this point, I see R users (typically PhD students and post-docs) doing "science" in R by playing with parameters to functions in poorly-understood packages and publishing papers on which parameters are "best" for data generated from some specialty source.

A very common situation for me is to be pulled in only after a package has been created with some vague hope of fixing performance problems (which R, Rcpp, and RcppParallel make fun to do for me, but I have some C++ background for scientific computing, ymmv). It is extremely common to find that these packages contain fundamental logic errors that probably should invalidate the (already published) results but never got caught because the code ran without actually failing. I guess I'm complaining that people are using buggy packages to write more buggy packages and it just bothers me.

Library-driven development is just how the world works these days. And it should! But I'm not confident that the R bioinformatics world has the kind of guardrails I would prefer to see. I mean, I am reasonably confident tensorflow is functionally correct. Any R package that pulls in too many other R packages to begin with is probably not.

As for the language itself - I guess it is ok. I have some lisp in my background and a fair amount of love for non-traditional array languages. But I don't see much R code that seems to stick to the R "standard library" rather than pulling in a million packages to do anything . . .


> I see R users (typically PhD students and post-docs) doing "science" in R by playing with parameters to functions in poorly-understood packages and publishing papers on which parameters are "best" for data generated from some specialty source.

That's a lot of bioinformatics, and not specific to R. It is a huge issue with anything vaguely pushbutton in the bioinformatics domain.


“I don't see much R code that seems to stick to the R "standard library" rather than pulling in a million packages to do anything.”

People teach the tidyverse to new r users. It makes them think that it’s standard practice to pull in lots of unnecessary but possibly convenient packages. Simple string manipulation should not require an extra package like stringr, but for many users it does. Often, they were taught this way.


I am not a developer by profession but have been programming since the early 90s, starting with BASIC, Fortran, C++ and Matlab. I learnt JavaScript, Python and Lua as well along the way for various reasons.

I found R in 2010/11 when I was looking for a free alternative to Matlab.

Nowadays R is my go-to scripting language any time I just want to get to the results and don't care about reproducibility.

I also use Shiny as an alternative for multiuser scenarios involving spreadsheets, since I work in a financial firm where Excel and VBA still dominate most of the front office functions.

Sure, Python could be a good tool, but once you become fluent with the R ecosystem, moving to Python just feels like too much work for things that can be done with a few lines of R code.

For me, the conciseness of data.table and the ability to cook up Shiny web apps with very few lines of code is the biggest pull.


I am looking forward to reading this but I need to point out that section 1.3 Ignorance is both interesting and necessary. What 1.3 basically outlines is, "I used R for a year, but the way I used R is very different than what most R people use R for."


I find R's strengths lie in its unmatched collection of statistical libraries, but I dislike R's syntax so much that, if forced to use it, I would call an R package from Python (using RPy2), or just use a Python alternative (e.g. Plotnine).


> [discussing how c(list(1, 2, 3), LETTERS[1:5]) is not what the author would expect] To get list(1, 2, 3, LETTERS[1:5]), you must do something like x <- list(1, 2, 3); x[[4]] <- LETTERS[1:5].

The following works and it looks quite natural:

c(list(1, 2, 3), list(LETTERS[1:5]))


A minor tip: I use the hashtag #learning to annotate pieces of code which I can then revisit using a search (could be just grep '#learning' *.R | grep data.table, say) in case I'm stuck. This could work with any language, of course, but in the initial days I found it very useful with R in particular, given its idiosyncrasies. (For the oldies on here, I grew up in the era of del.icio.us, so hashtag-ging code felt like a natural-but-novel idea :-) )

Examples:

#learning : a data.frame is a list. x = df with 10 rows, 21 columns, say. as.list(x) gives you a list with 21 elements, one per column

#learning : Above = getting a row and its previous row using .I() in data.table


I am surprised no one has mentioned the awful garbage collector: https://stackoverflow.com/q/14580233/850781

The R garbage collector is imperfect in the following (not so) subtle way: it does not move objects (i.e., it does not compact memory) because of the way it interacts with C libraries. (Some other languages/implementations suffer from this too, but others, despite also having to interact with C, manage to have a compacting generational GC which does not suffer from this problem).


Honestly, probably because so many languages have non-compacting collectors that people just accept it as a trade-off - in the sense that compacting collectors are non-trivial and without lots of work can produce significantly higher GC pauses, and so doing it really well requires a lot of engineering effort that you might prefer to be spent elsewhere adding features you want more.

golang's collector isn't compacting either - though it uses per-size-class arenas for allocation so you don't end up with fragmentation bloat to nearly the same extent. Part of me wonders if simply building R against jemalloc would get a decent chunk of the same advantages.


I've recently started getting into computational archaeology and found the entire ecosystem is built around R, meaning I am now starting to learn about it. Anyone have a suggestion of the standard books/courses one should start with?

I found it pretty interesting that the alternative to R is Haskell for general CLI tools! Seeing some open issues in a popular tool for dealing with ancient DNA (aDNA) about making invalid states impossible within the type system made me genuinely laugh out loud in amazement. I didn't expect that level of technical knowledge within the world of archaeology.


We used the books Hadley Wickham has published for R courses in my stats program [1].

I supplemented the theory parts of my other courses with some of these [2] R books about using the methods instead of deriving and proving properties about them.

There are also some R studio cheat sheets [3].

[1] https://hadley.nz/

[2] https://www.routledge.com/Chapman--HallCRC-The-R-Series/book...

[3] https://www.rstudio.com/resources/cheatsheets/


Any chance of sharing which tool(s)? I know some people who're involved in making the haskell ecosystem a better place for less mainstream users and I'm pretty sure they'd want to know more about this.

(and if it turns out they already know, -I- don't, and that sounds pretty cool and fun to read up on :)


Sure! It’s the department of archaeogenetics at the Max Planck Institute for Evolutionary Anthropology

https://poseidon-framework.github.io/#/


Much appreciated, passed on in the hopes it's useful, and when it's not 0030 and I'm awake enough to actually understand any of it I'll be having a read through myself.

Cheers!


I’m a total noob but offered my development skills in trade of archaeological domain knowledge, so just getting familiar with the code base myself atm :) if you/your friends wanted any sort of intro to the team I could arrange that as I’m in a slack channel with some of them!

Fwiw, it's mostly a tool for parsing and processing this 'janno' data schema https://poseidon-framework.github.io/#/janno_details


To rephrase the famous quote attributed to G. Box,

`All languages are wrong, some are useful.`

And R is one of them.


I first learned how to code in R before moving on to Python, then some C and Go. I think a big cause of the SWE hate of R is that it's not OO programming. R is a functional language for data analysis. If you don't grok that, then I can understand why looking at it would make you barf. Going the other way, from functional to OO, caused me physical pain as well.

R is amazing for data analysis. Also, RStudio is a much more efficient solution for iteratively exploring data than Jupyter. Don't make fun of a screwdriver for not being a hammer.


I made the same transition from R to Python and I still resist using OO. I never understood why people would use it instead of functions.


Considering how loved Elixir is, functional definitely isn't a problem.


I agree with the concerns about lists. They're a poor substitute for structs in a language with static typing. By now I've accumulated lots of knowledge and helper functions.

You can dramatically simplify your life by using lists with lapply and related functions. I teach students with no previous programming experience to do some things that would otherwise be far too complex, but I also have to write a helper function to convert the output into a usable form for further analysis like plotting.
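
The pattern I have in mind looks roughly like this (my sketch): lapply over the inputs, then a small helper to flatten the list for plotting.

    # fit one model per group, keeping the results in a named list
    fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))
    # helper: collapse the list into a data frame that plotting code can use
    coefs <- do.call(rbind, lapply(names(fits), function(cyl) {
      data.frame(cyl = cyl, slope = coef(fits[[cyl]])[["wt"]])
    }))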


I think (and I usually anger at least some people when I say this) that it's wrong to see R as a 'programming language'. I mean, it looks like one, and it's Turing complete so if you use that as a criterion it is a programming language, but I think it's more useful to see it as a stats software package with a text-only user interface. Approaching it this way instead of as just another language to 'pick up' makes using it much less frustrating.


I suppose it depends on how you define a 'programming language'.

I'm not angered, I'm more wondering about the usefulness of the arbitrary line you've drawn in the sand, and even the shape of the line.

R Shiny lets you build interactive webpages with advanced GUIs; if <webdev stack> counts as a programming language, why not R?

Many people like Python because it lets you script things, and you can even make your script executable with a shebang at the start (#!/usr/bin/python) -- and while true, that isn't built into R, you can run R scripts programmatically (> Rscript myfile.R), or make this executable by putting it in a standard shell script.


I just got done making a DockerLambda in R. The lambda itself is simple. It takes a few inputs, gets data from S3/Files into dataframes, and passes it off to the real calculation. The actual math is done by another team.

I approached it like you described. I wouldn't want to do a really complex REST API in it, but as a wrapper for calculations, we've got a repeatable pattern to run them in a cost effective manner.


I've grown to really rely on R pipes, it's just how I like thinking about problems with small short lambdas. I wish python had better support (and better lambdas, without typing out 'lambda' and with tuple arguments) but for when I need to use it, toolz's pipe is pretty ok as long as you use lazy map/filter/etc

It's just so close to being as good, but not quite


this comes immediately to mind http://arrgh.tim-smith.us/


As somebody who also used to enjoy bashing R for similar reasons, I think the new preamble to that document is an important and nice addition. (Wasn't there last time I read it.)


There are a few things the author did not mention, such as RStudio Server and Shiny. If you get to know them, you will find they do certain tasks extremely well, and there is simply no equivalent of these in other data programming ecosystems.

Comparing R with others as merely a language is close to meaningless. You have to take the whole ecosystems into account.


There's no greater joy than whipping up a POC in a few hours using R Shiny and showing it to senior leaders when they were told that it will take months to get a POC ready.


“To put it plainly, R won’t change. If something about R frustrates you today, it always will.”

LMFAO. I only read the above line and the bit about how tidyverse nearly fixes all the badness that is r. Max lol.


It would help to know this person's background a bit more. Is this a former software engineer, or a manager trying to pick up a technical skill?

That info could help contextualize the entire piece.


I gave up on R immediately when installing on a Macbook was a nightmare and especially when I was already proficient in Pandas and Numpy.


OT: does anyone know what happened to rweekly.org? It hasn't been updated in a long time and I don't know a good substitute.


I should write Frustration: one year with Swift & SwiftUI or my favorite Frustration: years with PHP ;-)


For my use case, R is absolutely terrible compared to some for-profit statistical package / language. Using R feels like using an outdated, complicated and messy tool.

But guess what: it's free.


> For my use case, R is absolutely terrible compared to some for profit statistical package / language.

Which one? I've switched most of my work over to Julia, but I'd much rather use R or Python than Stata or SPSS.


What is your background? I'm assuming you either come from a programming background or at least enjoy programming and are pretty good at it. Most people I know who are either statisticians or scientists first and programmers only reluctantly love Stata and SPSS.


> What is your background?

Academic research, so more econometrics/data science work, but I have some experience with application programming that I've managed to leverage.

> Most people I know who are either statisticians or scientists first and programmers only reluctantly love Stata and SPSS

They are good to the extent that a lot of published - social science - research uses terms and methods that assume you're using one of the two. Of course, having an ok point and click interface also helps.

However, data access, aggregation, and cleaning are easily ninety percent of what's involved in even basic econometric(y) research. It is orders of magnitude easier to do all of this programmatically in R. Once you start working with larger datasets, or once performance becomes an issue, you pretty much have to transition to Python, Julia, or something similar by default.


I don't understand the use case for SPSS. My local university is training their neuroscience researchers on it, which seems so odd in 2022 with Julia or python sitting right there.


I have been using SPSS for decades and I think it is a good statistical tool for those who do not want to do much programming. It gives me something I can easily explain to and teach other statistical users (e.g. team members who hand in many papers with some regression analysis in them, have no programming experience, and do NOT want to learn much beyond the absolute minimum necessary; they are social scientists and that is it). I know R, Python, SPSS, SIR/DBMS and most of SAS can all do this. Frankly, SPSS is the only one they can use. And I believe they will still be able to use it after I am no longer in the picture. That is the use case of SPSS. There are more people in that hole than you think.


It's 80% about having a point and click interface, 10% about path dependency effects, and 10% about whether or not they have a paid license.


Teaching someone who knows a bit of Excel and very little programming how to do statistical analysis in SPSS is easy and lets you focus on the statistics.

Teaching them to do statistical analysis in Julia will involve you spending 80% of your time teaching them Julia and maybe 20% of your time teaching them statistical analysis.


> Teaching them to do statistical analysis in Julia will involve you spending 80% of your time teaching them Julia and maybe 20% of your time teaching them statistical analysis.

This works until they run into a use case that doesn't involve running various forms of regression analysis on panel data.

In the parent comment's case, I could imagine that there's an expectation that someone doing neuroscience research will eventually have to expand beyond what's possible in SPSS. In this case, it may make sense to go through the effort of teaching them how to program in Python or Julia.


I agree with you, though I wonder what's included in your definition of "do statistical analysis" ? Is it just using the stats functions as blackboxes without understanding what is going on under the hood?

I find using Python and/or R to be very helpful for teaching, since you can implement the stats procedures using primitives (prob. calculations), so you get some experience with how things work.

Sure it requires some "coding" but nothing harder than using a calculator, so I think it's worth learning.

Julia is a bit more involved (need to learn something about data types), but still would be manageable.


SPSS was originally created for social scientists and psychologists. It allows people who usually don't really have a clue what they are doing to create something that looks like science. Later on it was marketed as a predictive analytics suite for business-minded people.

From time to time, I still have to use SPSS. Again and again I'm flabbergasted at how bad this overpriced piece of software is.


Ease of use. I haven't used it personally, but I'm pretty sure you can do everything with a mouse - no need to learn actual code.


JMP and Stata are really solid tools, but I feel like the advantage stems from the fact that they help me discover new statistical methods. In the GUI I can see an option for something I've never seen before, read the documentation, and thereby expand my stats skillset. This type of discoverability is a bit harder with R/Python, just due to the nature of it being purely script-driven.


The language itself can be criticized, but when academic statisticians publish a new method, they often post an R package, too, so it has recent functionality unavailable elsewhere.


What makes R interesting is the amazing libraries for statistical ideas that are not completely run of the mill.

You almost literally can't come up empty on CRAN.


What for-profit statistical package feels better? SAS literally has 8-character name limits in places. The data input command is literally called CARDS. It feels ancient. Minitab and SPSS aren't much better in syntax with regard to scripting, which is important for reproducibility.


Maybe you can switch to Python? It's also free and has a lot of statistical packages.


Long-time R user here. Yes, many of these points are valid, but I still think R is unbeaten when it comes to speed in (tabular) data exploration. In the article you mention that you missed using data.table - a significant portion of the problems you named would be solved, or at least weakened, by using data.table. I started working with it many years ago and never looked back. It's easy, powerful and efficient to use. I also find its Python counterpart, pandas, much less intuitive to use (.loc() anyone?), although they are comparable performance-wise.
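
As a rough sketch of what that exploration speed looks like (the file and column names here are invented for illustration), a daily group-by summary is a single i/j/by expression:

    library(data.table)

    dt <- fread("trades.csv")  # hypothetical file with ts, symbol and vol columns

    # daily volume and trade count per symbol
    dt[, .(vol = sum(vol), n = .N), by = .(symbol, day = as.Date(ts))]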


>.loc() anyone?

What is wrong with .loc, in your view? Genuine question. I used to dislike it but I've been using pandas for a while and I've gotten comfortable with it, and I've forgotten the reasons I used to dislike it.


.loc works smashingly. %>%?


Well, that's the thing with pandas: which one is it? [], .loc, .iloc, . ? Why do I have to reset_index so often? I agree with the OP that R has pandas beaten when it comes to accessing data.


I will agree about [] being overloaded, but .loc and .iloc are distinct for a good reason. .loc is for operations according to the index, .iloc is for operations according to position. You have to reset_index() so often probably because you are not using the index properly. Effective use of Pandas means effective use and consideration of the indexes on your dataframes and series.

Recommended watching (32:00 onwards): https://www.youtube.com/watch?v=mWtfZaT7iSc


iloc - integer position, loc - index label

I rarely reset index -- perhaps it's a difference in familiarity? (I use R but it isn't my background; perhaps there is a forced R pattern that is a general antipattern for indexes?)


You can and should use data.table without pipes.

You will be hard pressed to use pandas without .loc and resetting your index.
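
For illustration, here is a small data.table chain on the built-in mtcars data - successive [] calls take the place of a pipe:

    library(data.table)
    dt <- as.data.table(mtcars)

    # filter, aggregate, then sort, all by chaining [] -- no %>% needed
    dt[hp > 100,
       .(mean_mpg = mean(mpg), n = .N),
       by = cyl][order(cyl)]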


I have been using R for almost 20 years now. I work on a medium-sized quant team at a large asset manager and we run several $BN off R - we mostly trade equities and vanilla derivatives. Our models are primarily statistical/econometric-based. In aggregate, we probably have about a hundred scheduled jobs associated with a variety of models and on the order of 15 shiny applications to facilitate implementation. We have an internal CRAN-like repo and everything we produce is packaged/versioned with gitlab CI/CD. We have RStudio Server at my firm and half my team uses that for development; the other half, including myself, uses emacs/ess. All of us use RConnect for scheduling & application hosting - it has its quirks, but it's excellent in a constrained IT environment.

I often chuckle when people complain about R in production and how it isn't a good general purpose programming language, my experience has been the polar opposite. You can write bad code in any language, and R is no exception, but R allows you to write so much less code and R-core is truly exceptional at backwards compatibility. Our approach to R is basically:

- Don't have a lot of dependencies, and when you do have dependencies, make sure they themselves don't have a lot of dependencies. While we do use shiny as mentioned above, our core models are very dependency light and shiny is just a basic front end.

- data.table (which was designed by quants) is a zero-dependency package that is by far the best tabular data manipulation package that has ever been created since the dawn of time. We generally work on an EC2 instance running Linux with a ton of memory. In the < .01% of cases where a dataset doesn't fit in memory (e.g. tick data), we do initial parsing with awk if file-based or SQL if DB-based and then work in R. (A sketch of the awk-into-R pattern follows after this list.)

- Check/coerce argument types and lengths on function input to catch and avoid all the quirky edge cases that drive people nuts - it's so easy! (A sketch follows after this list.)

- I hate OOP and I love that R doesn't encourage it. Mutable state, especially for non-software engineers, is the devil. Don't get me wrong, OOP has its place, but the fact that R encourages functional programming is one of the best things about it. The slight inefficiency this produces is almost never a problem.

- R is not slow at all when used correctly. Additionally, the C API is a joy to use when necessary.

- Stick to the base types: vectors, matrices, lists, environments and data.tables (the only exception). The fact that you can name, and then use names to index, all of the above is stunningly powerful. The only "objects" we really create are lightweight extensions of lists with an S3 print method (sketched after this list).

- We have an internal version of renv/packrat that creates a plain text "dependency file" for projects and we pin package versions in docker containers. RConnect doesn't use docker right now, but they do have a versioning system that works quite well in my experience.
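
A sketch of the awk-then-R pattern from the data.table bullet above - the file name, field position and threshold are invented for illustration. fread() can read from a shell command via its cmd argument, so the filter runs before anything reaches R's memory:

    library(data.table)

    # keep the header plus rows whose 5th field exceeds a threshold,
    # so R only ever sees the filtered subset
    ticks <- fread(cmd = "awk -F',' 'NR == 1 || $5 > 1000' ticks.csv")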
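
For the argument-checking bullet, a minimal base-R sketch (the function and its contract are made up):

    # fail fast with a clear error instead of letting R recycle or coerce silently
    scale_weights <- function(w, method = c("sum", "max")) {
      stopifnot(is.numeric(w), length(w) >= 1L, !anyNA(w))
      method <- match.arg(method)
      if (method == "sum") w / sum(w) else w / max(w)
    }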
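
And for the base-types bullet, a sketch of a "lightweight list with an S3 print method" (the object and its fields are invented):

    # a plain named list tagged with a class; names double as the index
    new_signal <- function(weights, asof = Sys.Date()) {
      stopifnot(is.numeric(weights), !is.null(names(weights)))
      structure(list(weights = weights, asof = asof), class = "signal")
    }

    print.signal <- function(x, ...) {
      cat("signal as of", format(x$asof), "with", length(x$weights), "names\n")
      invisible(x)
    }

    s <- new_signal(c(AAPL = 0.4, MSFT = 0.6))
    s$weights["AAPL"]  # index by name
    s                  # dispatches to print.signal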

I definitely wouldn't want to build something like a company website in R, but I wouldn't want to build that in C either. R definitely has its place as a server-side language, even outside its assumed domain of statistics.

Haters gonna hate, but the joke is on them.


Never liked R. It always felt like a cheap alternative to either the proprietary solutions (Stata and SAS) or the simply better experience of Python.


Fun fact: there exists a "Why R? Foundation" which holds yearly conferences, because after all these years the R community just cannot find a sensible reason to use the language.


I used to do R intensely. What I found, after moving to more "SE"-centric languages such as Python or C++, was that R becomes quite frustrating when you need to build something maintainable - which should basically be true of every package. As soon as S3, R6 and whatnot come into play at multiple levels, I'm better off moving to C in order to minimize interaction with R's class system and keep it only at the topmost layer.



