HN is predisposed to hate R because everyone here is coming from a "real" programming context. Their concerns are generally valid, but they should keep in mind a lot of people using R do not have a software development background and do not care that the language is not elegantly designed: they just want to get analytical work done. In that respect, R is far, far superior to Python.

Even something as simple as installing a library is a conceptual leap for these people (why wouldn't the software just come with everything needed to work?). Have you ever tried explaining the various python package and environment management options to someone with a background in Excel/SQL? Just getting a basic environment set up can be days of frustrating effort (though Anaconda is getting better with this). Compare that to R, where you install RStudio and are off to the races, with a helpful package installation GUI.

Another great example: in R, data types are pretty fungible, everything is a vector, coercing things generally "just works". In pandas, it can be very confusing that you need to explicitly turn a 1x1 dataframe into a scalar value. Same thing with Python vs R datetimes.
I understand some of this stuff is actually seen as a positive for Python in some contexts (production usage) and I agree. Just pointing out that the woke take is that both languages are good, but good at different things. If I need to run a quick analysis on a dataset, I'm grabbing R 9/10 times. If I'm building a production pipeline, I'm using Python 9/10 times. This is perfectly fine.
It's also worth noting that R becomes much more pleasurable with the Tidyverse libraries. The pipe alone makes everything more readable.
I'm also coming from more of an office setting where everything is in Excel. I've used R to reorganize and tidy up Excel files a lot. Ggplot2 (part of the Tidyverse) is also fantastic for plotting, the grammar of graphics makes it really easy to make nice and slightly complex graphs. Compared to my Matplotlib experiences, it's night and day. Though I'd expect my experience with programming to be quite different from others' though, mainly because any code I write is basically an intermediary step before the output goes back in Excel.
That said, if anyone's interested in learning R from a beginner's level, I can recommend the book R for Data Science. It's available freely at http://r4ds.had.co.nz/ and the author also wrote ggplot2 and several of the other Tidyverse libraries (and works at RStudio).
EDIT: I'm also currently writing my master's thesis in RMarkdown with the Thesisdown package. It's wonderful; it allows for using LaTeX without really knowing LaTeX, which is great for us in business school.
Tidy features (like pipes) are detrimental to performance. The best things R has going for it are data.table, ggplot, stringr, RMarkdown, RStudio, and the massive, unmatched breadth and depth of special-purpose statistics libraries. Combined, this is a formidable and highly performant toolset for data analytics workflows, and I can say with some certainty that even though "base Python" might look prettier than "base R," the combination of Python and NumPy is not necessarily more powerful, nor is its syntax necessarily more elegant. The data.table syntax is quite convenient and powerful, even if it does not produce the same "warm fuzzy" feeling that pipes might. NumPy syntax is just as clunky as anything in R, if not worse, largely because NumPy was not part of the base Python design (as opposed to languages like R and MATLAB, which were designed around data frames and matrices).
What is probably not a good idea (which the article unfortunately does) is to introduce people to R by talking about data.frame without mentioning data.table. Just as an example, the article mentions read.table, which is a very old R function that will be very slow on large files. The right answer is to use fread and data.table, and if you are new to R then get the hang of these early on so that you don't waste a lot of time using older, essentially obsolete parts of the language.
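A minimal sketch of the difference (the file "flights.csv" here is hypothetical, made up for illustration):

    library(data.table)

    # fread: multi-threaded, type-inferred read; returns a data.table
    dt <- fread("flights.csv")
    head(dt)

    # the base-R equivalent, which is noticeably slower on large files
    df <- read.table("flights.csv", header = TRUE, sep = ",")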
> Tidy features (like pipes) are detrimental to performance.
Detrimental to the runtime performance; if you happen to be reading and processing tabular data from a CSV (which is all I've ever used R for, I must admit), then you get real performance gains as a programmer. For one thing, it allows a functional style where it is much harder to introduce bugs. If someone is trying to write performant code they should be using a language with actual data structures (and maybe one that is a bit easier to parallelise than R). The vast bulk of the work done in R is not going to be time sensitive but is going to be very vulnerable to small bugs corrupting data values.
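To make the "functional style" point concrete, a typical tidyverse pipeline looks something like this (a sketch on the built-in mtcars data): every step is a pure transformation of the previous result, so there are no intermediate variables to accidentally overwrite or reuse.

    library(dplyr)

    mtcars %>%
      filter(cyl > 4) %>%                 # keep 6- and 8-cylinder cars
      group_by(cyl) %>%
      summarise(mean_hp = mean(hp),       # average horsepower per cylinder count
                n       = n()) %>%
      arrange(desc(mean_hp))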
Tidyverse, and really anything that Hadley Wickham is involved in, should be the starting point for everyone who learns R in 2018.
> languages like R and MATLAB that were designed for data frames and matrices
Personal bugbear; the vast majority of data I've used in R has been 2-dimensional, often read directly out of a relational database. It makes a lot of sense why the data structures are as they are (language designed a long time ago in a RAM-lite environment), but it is just so unpleasant to work with them. R would be vastly improved by a /single/ standard "2d data" class with some specific methods for "all the data is numeric so you can matrix multiply" and "attach metadata to a 2d structure".
There are 3 different data structures used in practice amongst the R libraries (matrix, list-of-lists, data.frame). Figuring out what a given function returns and how to access element [i,j] is just an exercise in frustration. I'm not saying a programmer can't do what I want, but I am saying that R promotes a complicated hop-step-jump approach to working with 2d data that isn't helpful to anyone - especially non-computer engineers.
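To make the complaint concrete, here is roughly what getting at "element [i, j]" looks like across those three shapes (a sketch):

    m   <- matrix(1:6, nrow = 2)              # true matrix
    df  <- data.frame(a = 1:2, b = 3:4)       # data.frame (a list of columns)
    lol <- list(list(1, 2), list(3, 4))       # list-of-lists

    m[2, 1]          # matrix: plain [i, j] indexing
    df[2, 1]         # data.frame: looks the same...
    df[2, ]          # ...but a whole row comes back as a one-row data.frame
    lol[[2]][[1]]    # list-of-lists: double-bracket hopping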
I think what you're saying is mostly on point. I wanted to share a couple possible balms for your bugbears.
For attaching metadata to anything, why not use attributes()/attr() or the tidy equivs? Isn't that what it is for?
It might not make you feel much better, but data.frame is just a special list, c.f. is.list(data.frame()). So, if you don't want to use the convenience layers for data.frame you can just pretend it is a list and reduce the ways of accessing data structures by one.
You can paper over the distinction between data.frames and matrices if it comes up for you often enough. E.g.
    `%matrix_mult%` <- function(x, y) {
      # coerce data.frames to matrices, but refuse non-numeric data
      if ("data.frame" %in% class(x)) {
        x <- as.matrix(x)
        stopifnot(all(is.numeric(x)))
      }
      if ("data.frame" %in% class(y)) {
        y <- as.matrix(y)
        stopifnot(all(is.numeric(y)))
      }
      # dimensions must conform before multiplying
      stopifnot(dim(x)[2] == dim(y)[1])
      x %*% y
    }

    d1 %matrix_mult% d2
... but I'll grant that isn't the language default.
I wrote a function once for a friend that modified the enclosing environment, and changed + so that sometimes it added two numbers together and sometimes it added two numbers and an extra 1, just to be helpful. I can sort myself out, but thanks for the thoughts.
The issue is that I learn these things /after/ R does something absolutely off the wall with its type system. And a lot of my exposure comes from using other people's libraries.
For my own work I just use tidyverse for everything. It solves all my complaints, mainly by replacing apply() with mutate(), data.frame with tibble, and getting access to the relational join commands from dplyr. I'm cool with the fact that my complaints are ultimately petty.
> For attaching metadata to anything, why not use attributes()/attr() or the tidy equivs? Isn't that what it is for?
I've never met attr before, and so am unaware of any library that uses attr to expose data to me. The usual standard as far as I can tell is to return a list.
> It might not make you feel much better, but data.frame is just a special list, c.f. is.list(data.frame()). So, if you don't want to use the convenience layers for data.frame you can just pretend it is a list and reduce the ways of accessing data structures by one.
Well, I could. But data frames have the relational model embedded into them, so all the libraries that deal with relational data use data frames or some derivative. I need that model too, most of my data is relational.
The issue is that sometimes base R decides that since the data might not be relational any more it needs to change the data structure. Famously happens in apply() returning a pure list, or dat[x, y] sometimes being a data frame or sometimes a vector depending on the value of y. It has been a while since I've run in to any of this, because as mentioned most of it was fixed up in the Tidyverse verbs and tibble (with things like its list-column thing).
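A quick illustration of that drop behaviour (a sketch at the console):

    df <- data.frame(a = 1:3, b = letters[1:3])

    class(df[, c("a", "b")])        # "data.frame" - two columns keeps the structure
    class(df[, "a"])                # "integer"    - one column silently drops to a vector
    class(df[, "a", drop = FALSE])  # "data.frame" - only if you remember to ask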
> `%matrix_mult%` <- function(x,y) { if("data.frame" %in% class(x)) { x <- as.matrix(x) stopifnot(all(is.numeric(x))) } if("data.frame" %in% class(y)) { y <- as.matrix(y) stopifnot(all(is.numeric(y))) } stopifnot(dim(x)[2] == dim(y)[1]) x %*% y }
I have absolutely no idea what that does in all possible edge cases, and to be honest the problem it solves isn't one I confront often enough to look into it.
It just bugs me that I have to use as.matrix() to tell R that my 2d data is all made up of integers, when it already knows it is 2d data (because it is a data frame) and that it is made up of integers (because data frame is a list of vectors, which can be checked to be integer vectors). I don't instinctively see why it can't be something handled in the background of the data.frame code, which already has a concept of row and column number. Having a purpose-built data type only makes sense to me in the context that at one point they used it to gain memory efficiencies.
I mean, on the surface
data %>% select(-date) %>% foreign_function()
and
data %>% select(-date) %>% as.matrix %>% foreign_function()
look really similar, but changing data types half way through is actually adding a lot of cognitive load to that one-liner, because now I have to start thinking about converting data structures in the middle of what was previously high-level data manipulation. And you get situations that really are just weird and frustrating to work through, eg, [1].
scale() for example uses attributes to hold on to the parameters used for scaling. Most packages that use attributes provide accessor functions so that the useR doesn't need to concern themselves with how the metadata are stored. I'll grant that people do tend to use lists because the access semantics are easier.
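For example (a sketch at the console):

    x <- matrix(rnorm(20), ncol = 2)
    s <- scale(x)                # centre and scale each column

    attr(s, "scaled:center")     # the column means used for centring
    attr(s, "scaled:scale")      # the column standard deviations used for scaling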
If you're in a situation where 80% of the time is spent in 20% of the code, you only have to use less expressive features in those hot-spots; you don't have to give up your pipes or whatever in places that don't contribute much to the run-time.
> Tidy features (like pipes) are detrimental to performance.
But they are some absolutely amazing features to use. After helping my wife learn R, and learning about all the dplyr features, going back to other languages sucked. C#'s LINQ is about as close as I can get to dplyr-like features in a mainstream language.
Of course R's data tables and data frames are what enable dplyr to do its magic, but wow what magic it is.
I also base this on my own experience. I typically work with 2-3 million row datasets. I found that doing certain data operations was quite slow in plyr but a lot faster in data.table. It’s possible that if I had spent time reordering my plyr pipelines and filtering out unneeded columns or rows, then it would have worked better. However, data.table doesn’t require such planning ahead and thinking about what columns/rows you need to send to the next operation in a pipeline, because multiple operations can be executed from a single data.table call, and the underlying C library is able to make optimized decisions (like dropping columns not requested in the query), similar to an in-memory SQL database. So between dealing with slow code while doing interactive analysis, and/or having to spend time hand-optimizing dplyr pipelines, I found data.table to be a significant improvement in productivity (other than the one-time effort of having to rewrite a few internal packages/scripts to use data.table instead of dplyr)
Thanks for the reference. Why don't you keep your data in a DB? I load almost anything that isn't a small atomic data frame into a RDBMS.
BTW one thing that always made me avoid DT (I even preferred sqldf before dplyr was created) was its IMHO weird syntax. I always found the syntax of (d)plyr much more convenient. ATM it seems to me that dplyr has won the contest of alternative data management libraries. I cannot remember when I last read a blog post, article, or book that preferred DT over dplyr. I'm old enough to have learned that wrt libraries, it's wise to follow the crowd.
About that article: I assume that DT uses an index for that column while dplyr does a full search. If that's really the case the result wouldn't be that much of a surprise.
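For anyone weighing the two, here is the same aggregation in both syntaxes (a sketch using the built-in mtcars data):

    library(data.table)
    library(dplyr)

    dt <- as.data.table(mtcars)

    # data.table: one call, dt[i, j, by]
    dt[hp > 100, .(mean_mpg = mean(mpg), n = .N), by = cyl]

    # dplyr: the same query as a pipeline of verbs
    mtcars %>%
      filter(hp > 100) %>%
      group_by(cyl) %>%
      summarise(mean_mpg = mean(mpg), n = n())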
Answering questions in a rapid, interactive way (while using C under the hood, so it's efficient enough to run on millions of rows):
# Given a dataset that looks like this…
> head(dt, 3)
mpg cyl disp hp drat wt qsec vs am gear carb name
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710
# What's the mean hp and wt by number of carburettors?
> dt[, list(mean(hp), mean(wt)), by=carb]
carb V1 V2
1: 4 187.0 3.8974
2: 1 86.0 2.4900
3: 2 117.2 2.8628
4: 3 180.0 3.8600
5: 6 175.0 2.7700
6: 8 335.0 3.5700
# How many Mercs are there and what's their median hp?
> dt[grepl('Merc', name), list(.N, median(hp))]
N V2
1: 7 123
# Non-Mercs?
> dt[!grepl('Merc', name), list(.N, median(hp))]
N V2
1: 25 113
# N observations and avg hp and wt per {num. cylinders and num. carburettors}
> dcast(dt, cyl + carb ~ ., value.var=c("hp", "wt"), fun.aggregate=list(mean, length))
cyl carb hp_mean wt_mean hp_length wt_length
1: 4 1 77.4 2.151000 5 5
2: 4 2 87.0 2.398000 6 6
3: 6 1 107.5 3.337500 2 2
4: 6 4 116.5 3.093750 4 4
5: 6 6 175.0 2.770000 1 1
6: 8 2 162.5 3.560000 4 4
7: 8 3 180.0 3.860000 3 3
8: 8 4 234.0 4.433167 6 6
9: 8 8 335.0 3.570000 1 1
I used slightly verbose syntax so that it is (hopefully) clear even to non-R users.
You can see that the interactivity is great at helping you compose answers step-by-step, molding the data as you go, especially when you combine with tools like plot.ly to also visualize results.
What a lot of people don't get is that this kind of code is what R is optimized for, not general purpose programming (even though it can totally do it). While I don't use R myself, I did work on R tooling, and saw plenty of real world scripts - and most of them looked like what you posted, just with a lot more lines, and (if you're lucky) comments - but very little structure.
I still think R has an atrocious design as a programming language (although it also has its beautiful side - like when you discover that literally everything in the language is a function call, even all the control structures and function definitions!). It can be optimized for this sort of thing, while still having a more regular syntax and fewer gotchas. The problem is that in its niche, it's already "good enough", and it is entrenched through libraries and existing code - so any contender can't just be better, it has to be much better.
Completely agree. dplyr is nice enough but the verbose style gets old fast when you're trying to use it in an interactive fashion. imo data.table is the fastest way to explore data across any language, period.
I strongly agree, having worked quite a bit in several languages including Python/NumPy/Pandas, MATLAB, C, C++, C#, even Perl ... I am not sure about Julia, but last time I looked at it, the language designers seemed to be coming from a MATLAB type domain (number crunching) as opposed to an R type domain (data crunching), and so Julia seemed to have a solid matrix/vector type system and syntax, but was missing a data.table style type system / syntax.
Julia v0.7-alpha dropped and it has a new system for missing data handling. JuliaDB and DataFrames are two tabular data stores (the first of which is parallel and allows out-of-core for big data). This has changed pretty dramatically over the last year.
Plus I don't have to remember a lot of function names and what order to input vars to the functions. I just have to remember the data.table index syntax and I can do a lot of stuff. I'm sure I can do dplyr once I learn the functions, but the data.table syntax seems very simple and elegant to me.
No, you are wrong. R is terrible, and especially so for non-professional programmers, and it is an absolute disaster for the applications where it routinely gets used, namely statistics for scientific applications. The reason is its strong tendency to fail silently (and, with RStudio, to frequently keep going even when it does fail.) As a result, people get garbage results without realizing, and if they're unlucky, these results are similar enough to real results that they get put somewhere important. Source: I'm a CS grad working with biologists; I've corrected errors in the R code of PhD'd statisticians, in "serious" contexts.
Scientific applications require things to fail hard and often, to aggressively fail whenever anything is potentially behaving incorrectly. R does the exact opposite of that in several different, pernicious ways. IMHO, Python is more dangerous than a scientific computing language should be, but at least it will stop when it hits an error. R has undoubtedly cost humanity millions of dollars in wasted research costs and caused untold confusion, from otherwise perfectly-performed studies reporting corrupted statistical results. The world would be a noticeably better place without it.
I simply cannot articulate my opinion about R without sounding grossly hyperbolic. I'm sad that HN, a place which is typically enlightened in the ways of the programming arts, is so confused what this article is on about. If we tolerate such blatantly hostile design in something as important as the language of scientific statistics, where do we expect to get?
Have you ever worked with other major statistical packages? Have you ever caught people doing data munging in Excel? R fails far less silently than the credible alternatives. Source: I've been around the academic block and seen many types of horrors.
It's unfortunate that you've gotten to a _terrible_ feeling about R without realizing that many of the 'silent' failures are easily configured away (some examples, https://github.com/hadley/strict). That R isn't noisy about things that CS majors might think it should be by default is, BTW, entirely appropriate. Many of what one might call 'silent' failure modes in R are for the express purposes of making exploratory data analysis easier... and that was one of the original purposes for R.
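For example, a couple of base-R options already make a session much noisier (a sketch; the second one applies to pre-R-4.0 sessions):

    # Promote every warning to a hard error, so e.g. as.numeric("abc")
    # stops the script instead of silently producing NA with a warning.
    options(warn = 2)

    # Pre-R-4.0: stop character columns from silently becoming factors.
    options(stringsAsFactors = FALSE)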
That's too bad, I wish R could be better and easier for these people, but I don't think it warrants your hyperbole. I can point to many of my own anecdotes where R has saved millions of dollars by empowering analysts to conduct data exploration and modeling that would have been vastly more complex undertakings using any other tool. They seem to handle the silent failures just fine (usually by double checking their results before presenting them). Poor rigor and coding practices in academia are practically a meme at this point. You really want to lay all of that at the feet of R? Suggest a code review step for their publishing process or use a different tool. R is certainly not perfect but the idea that "The world would be a noticeably better place without it" is silly.
R is fundamentally flawed. It tries to merge two highly conflicting goals: a productive analytics environment and a programming language.
To do the first really well means automating away many of the issues that would crop up in the second allowing R to 'just work'. Because of that nothing beats R for getting to an answer as fast as possible (not even Python) at the cost of making it more difficult to productionise a solution in pure R.
Given its huge popularity and free-nature the benefits clearly outweigh the costs by a large factor.
Not sure I buy this. R is a language with parser and interpreter. Parser spits out an AST and interpreter evaluates nodes according to rules. This is the same in every other sane language. AST is pretty much the same in every language. There is no reason R’s parser and language can’t be replaced with something sane.
I agree that R shouldn't be used in production, but R is great for prototyping different analytical models before porting them over to Python or another language.
Same here and I think that’s exactly how its meant to be used.
Even so, if you want to use R as the production system, you shouldn't port over the jumbled spaghetti code that an iterative analysis produces, if only for your own sanity's sake. A rewrite is always required, at which point: hello Python.
While I kind of want to agree with you, I just don't see a better alternative. Do you really want biochemists to have to deal with the horrors of C compilation? In production code I'm very glad my makefile tells clang to fail on absolutely anything, but is that the best we can do? Other commenters have pointed out ways to avoid dangerous things like integer division, but if you think R is hostile then please offer a tenable alternative. The only ones I can think of are python and Matlab, and both are even worse for the intended use.
Yes, R is not my preferred language for anything heavy-duty, but I would guess ~95% of R usage is on datasets small enough to open in Excel, and that is where the language truly shines (aside from being fairly friendly to non-programmers).
So yes, there are some problems with R, but what are your proposed improvements? Because if I have to analyze a .csv quickly, I'm going for R most of the time.
I have very quick flows for data processing: load data, make long form, add metadata as categories, plot many things with seaborn one-liners. I use JupyterLab and treat it like a full lab notebook, including headers, introduction, conclusion, discussion. Works very well for me.
Julia is by far my favorite language; I used it all throughout grad school for my research. The problem is that it doesn't have enough of a network effect in industry. I've begrudgingly switched to Python for my day-to-day work.
They're in the run-up to their first stable release at the moment (I think they're aiming for August, but could be wrong about that). I can't speak for popularity, but development is certainly going strong.
That's ironic, because I find Julia's documentation to be the second-clearest documentation I've seen (after Elixir's). Notation-wise, Julia is the most comfortably close to mathematics (APL is closer, but it's a write-only language). I'm not a working mathematician, though I did graduate with a rather theory-based math degree.
The documentation is fine but IMO written more for developers. We do need more mathematical-based introductions which introduce the right packages for working mathematicians. I am a working mathematician myself and find Julia to be the perfect language for it because its abstraction is on actions instead of on data representations which fits things like functional analysis extremely well.
Things that are OO-based like C++ and Python are pretty bad at representing math because they put forward an idea of the actual representation (the object) as what matters, instead of the actions it performs (the function overloads). This may be good for some disciplines, but in a mathematical algorithm I really don't care what kind of matrix you gave me for `A`, I just want you to do the efficient `Ax=b` solve and have the action of the solver choose the appropriate method to abstract away the data. In Python you'd have to tell it to use the SciPy banded matrix solver, in Julia your generic ODE solver will automatically use the banded matrix solver when it's a banded matrix. This then allows for a composability where the user overloads the primitive operations on their type, and your generic algorithm works on any data representation. This matches the workflow of math where an algorithm is proven on L2 functions, not on functions represented with column-wise indexing and ...
I've written a custom GF256 data type and used Julia's builtin matrix solves (note: required monkey patching in Julia <~ 0.6 because there were one and zero literals in the builtin solver) to do Reed Solomon erasure coding... It's glorious.
You're right and wrong: R is a disaster when you want to write programs as you would in a real programming language. R is an excellent choice for what it is used most of the time by these people whose education/training isn't related to programming: interactive analysis of data and (maybe) writing prototypes.
My brief encounter with R led me to the exact same conclusion - that the R culture does not value correctness. That is not a characteristic I value in a development culture.
Case in point, the bug I raised about TZ handling (which is also an example of silent failure):
Silent failure and continuing to run on errors are common in interpreted languages. SAS has similar issues, most RDBMSs will continue to process queries after failures. It’s something you need to explicitly guard against.
Are you sure about "most RDBMSs"? With the exception of SQLite and older versions of MySQL, all the databases that I've used are strict and fail the query immediately on error, will generally prevent silently dropping or truncating data, etc.
I'm one of the original authors of Presto, a distributed SQL engine for analytics on big data. From the beginning, we've been careful to follow the SQL standard and do everything possible to either return the correct answer or fail the query. For example, an addition or sum aggregation on an integer will fail on overflow rather than silently wrapping.
Returning an incorrect answer or silently corrupting your data is the worst thing a database can do.
What I mean is that if you run in batch mode they’ll fail a query and happily run the next. Generally, depending on the client, you need to handle begin/commit/rollback blocks yourself. This is pretty common in scripting languages. Unlike, for example Java, where an unhandled error will terminate the process.
You haven't been working with scientists very long, have you? I'm guessing you're also only a very recent CS grad. You're criticizing a language based on the behavior of certain people who use the language, rather than criticizing the language itself.
For many years before you mounted your high horse, scientists were writing equally shitty code in Perl. When they've moved on from R, they'll write shitty code in some other language.
The thing to keep in mind is that, from the point of view of someone who works with data, R isn't a programming language. It's a statistical software package that has a programming language. Its competitors are things like Minitab, SPSS, Stata, and JMP, all of which used to be entirely menu-driven. R was a genuine innovation when it was first introduced.
Now it's certainly showing its age and the limits of its design, but it's still best in class for a certain kind of user. We could do better for software development, but it's not clear that doing so would actually make data analysis easier.
> Their concerns are generally valid, but they should keep in mind a lot of people using R do not have a software development background and do not care that the language is not elegantly designed
To me, the opposite is true. People with no CS background would benefit the most from a simple design.
> in R, data types are pretty fungible, everything is a vector, coercing things generally "just works".
Things just work until they don't, and then you need to understand all the weirdness of R.
I don't know what's the typical experience of a non-programmer with R, but as a programmer, I had some headache trying to understand R semantics (apparently I'm not the only one [1]).
I have a lot of experience teaching non-programmers R. Most of them come from an Excel/SQL background. I have found the amount of weirdness that presents a real problem is very low. And when it does get weird, I usually advise we just brush it under the rug and use a different method to accomplish the same thing. This probably sounds horrifying, but it's really not. Most of the code people write in R is not like other programming languages, it's rarely bound for anything other than the end user's laptop, if it's even saved in the first place.
I use R at least a couple times a week. It gets the job done and I will be forever grateful for the tidyverse.
That said, R can be goddamn frustrating at times because of the way the documentation is written. It would be nice to simply be able to query about a function and get a cogent help file that explains THE BASICS of how to use the function for the most common use-case(s). Instead, the help files try to be "canonical" and front-load a bunch of useless technical detail-- like that something is an "S3" object. Still haven't figured out what that really means and, I expect, that knowing something is "S3" will NEVER help me out when I am in a jam and need a little help to do something simple because I forgot some data manipulation detail.
Instead, I end up googling all the time, connecting the dots all over the internet to get very simple stuff done. At least now we have stackoverflow which, as vicious as it is, seems like Mister Rogers' Neighborhood compared to the old R mailing list.
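For what it's worth, the "S3" in those help files amounts to very little: a class attribute on an ordinary object, plus generics that dispatch on it. A minimal sketch (the "mymodel" class here is made up for illustration):

    # An S3 "object" is just data with a class attribute stuck on it.
    m <- list(coef = c(1.5, -2), r2 = 0.83)
    class(m) <- "mymodel"

    # Generics like print() look for a method named <generic>.<class>.
    print.mymodel <- function(x, ...) {
      cat("my model with R^2 =", x$r2, "\n")
    }

    m            # auto-printing now calls print.mymodel()
    unclass(m)   # strip the class and it is a plain list again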
Yes a couple of years ago I did this project where I needed to get a ton of analysis done, typically with plots and tables as output. Did this in python and it was this huge mess with pandas and matplotlib. Re-did it in R with data.table and ggplot2 and it was just ridiculously easier, and I could expand upon the code much more easily, plus the output was much prettier.
You might be right. Whereas programmers tend to regard languages as something to be understood, package managers are something they just use ("if it works, it works"). Typical R users (i.e. data analysts) regard languages a bit like programmers regard package managers.
> Even something as simple as installing a library is a conceptual leap for these people (why wouldn't the software just come with everything needed to work?).
> Have you ever tried explaining the various python package and environment management options to someone with a background in Excel/SQL?
I don't understand the difficulty I've often seen voiced against this. Why would a newbie or someone who just wants to get analytical work done need anything beyond installing Python and doing `pip install library`? It's certainly orders of magnitude easier and faster than, say, using a C library. The only trouble I can see a newbie running into is if they want to install a library which doesn't have precompiled wheels and they need some dependencies to build it, but that's rarely an issue for popular packages.
Pip install needs root on my Ubuntu install, my lab's and university's old Red Hat servers, and my Windows Subsystem for Linux install. I've had to install Anaconda Python to get any real work done on all three systems. Anaconda works fine for me, but I've not even had to think about anything to install packages in R.
Ubuntu doesn't ship with pip or virtualenv. In fact it ships with a version of Python where the built-in equivalent to virtualenv, pyvenv, is explicitly disabled.
So you have to install extra Python packages, as root. You have to have that Python experience that guides you to install as few of them as you can, just enough so you can get started with a virtualenv, so you don't end up relying on your system Python environment.
And this is really hard to explain to people who aren't deeply familiar with Python. "Never use sudo to install Python packages! Oh, you got errors. We obviously meant use sudo for two particular packages and never again after that."
In the terrible case where you don't have root, you have to ignore Ubuntu's version of Python and compile it yourself from scratch. Hope the right development libraries are installed!
Maybe I'm wrong and there's a method I've overlooked. If there is: please show me how to install a Python package on a fresh installation of Ubuntu 16.04, without ever using sudo, and I will happily spread the good news.
That sounds like a major problem with Ubuntu, rather than with Python or pip.
On Windows, meanwhile, the standard Python installer gets all this set up properly in like three clicks. Better yet, because it installs per-user by default, "pip install" just works. And if you still choose to install it globally, it will fail, but it will tell you exactly what you need to do to make it work:
Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: ...
Consider using the `--user` option or check the permissions.
One can't help but wonder how we ended up in a situation where the most popular Linux distro somehow does Python worse than Windows.
Don't despair, in the Anaconda installed with visual studio (now a default) you can't update or install packages without being admin! And if you install Anaconda again it merges the start menu entries and you can't tell which is which...
Eh, that has always been the case for Windows vs Linux, that you don't have to compile anything yourself because there is always an installer that will deploy precompiled binaries for whatever you want to install (except for when there isn't, because nobody has compiled it for Windows, at which point you're in deeper shit) (or except when something installs itself but doesn't update your envars, so you have to do it yourself, which kind of defeats the purpose of the whole "installer" thing).
Iiish. For small projects or when you want to get development versions etc that are not in a distro's repos it's pretty common to have to do a make-configure.
Then again, with Python in particular, I have often had errors either with pip-install, or after "successful" installation, for various reasons.
In this case, we were talking about Python itself. I don't see any particular reason why most people should need to build it themselves, whether on Windows or on Linux. Packages are another matter, but here the issue is the way Python itself is packaged on Ubuntu.
Not on a personal computer, no, but the vast majority of managed systems won't let you install anything outside of your home directory. Of course you could install using `pip install --user` but you will inevitably run into problems when something you install locally needs an updated version of something installed on the system.
Makes it fun when running on a VM in the cloud which only has a root user. Docker becomes almost essential to preventing errant Python scripts fudging up the system.
While you're right that it's bad advice, it also highlights the problem with pip that these less experienced people have. The ideal way to deal with Python packages is virtualenvs, but setting up a virtualenv, and then activating it every time you want to use it (or setting up tools to do it for you) is an incredibly huge headache for less experienced people to deal with. R doesn't require that whatsoever.
Neither language requires an isolated dev environment, but it can help with avoiding headaches. As python has things like virtualenv and buildout, fortunately R has 'packrat' available, which provides a similar isolated/reproducible dev environment solution.
You can certainly update multiple packages at once using pip. Just use a requirements.txt file, which you should be doing anyway if you're using multiple packages (or just want to be able to reproduce your environment).
>> Why would a newbie or someone who just wants to get analytical work done need anything beyond installing Python and doing `pip install library`? It's certainly orders of magnitude easier and faster than, say, using a C library.
Except when it isn't. For instance, because some wheel fails to build because you're lacking the VC++ redistributable (or it's not where pip thinks it should be):
C:\Users\YeGoblynQueenne\Documents\Python> pip install -U spacy
Collecting spacy
Downloading spacy-1.2.0.tar.gz (2.5MB)
100% |################################| 2.5MB 316kB/s
Collecting numpy>=1.7 (from spacy)
Downloading numpy-1.11.2-cp27-none-win_amd64.whl (7.4MB)
100% |################################| 7.4MB 143kB/s
Collecting murmurhash<0.27,>=0.26 (from spacy)
Downloading murmurhash-0.26.4-cp27-none-win_amd64.whl
Collecting cymem<1.32,>=1.30 (from spacy)
Downloading cymem-1.31.2-cp27-none-win_amd64.whl
Collecting preshed<0.47.0,>=0.46.0 (from spacy)
Downloading preshed-0.46.4-cp27-none-win_amd64.whl (55kB)
100% |################################| 61kB 777kB/s
Collecting thinc<5.1.0,>=5.0.0 (from spacy)
Downloading thinc-5.0.8-cp27-none-win_amd64.whl (361kB)
100% |################################| 368kB 747kB/s
Collecting plac (from spacy)
Downloading plac-0.9.6-py2.py3-none-any.whl
Requirement already up-to-date: six in c:\program files\anaconda2\lib\site-packages (from spacy)
Requirement already up-to-date: cloudpickle in c:\program files\anaconda2\lib\site-packages (from spacy)
Collecting pathlib (from spacy)
Downloading pathlib-1.0.1.tar.gz (49kB)
100% |################################| 51kB 800kB/s
Collecting sputnik<0.10.0,>=0.9.2 (from spacy)
Downloading sputnik-0.9.3-py2.py3-none-any.whl
Collecting ujson>=1.35 (from spacy)
Downloading ujson-1.35.tar.gz (192kB)
100% |################################| 194kB 639kB/s
Collecting semver (from sputnik<0.10.0,>=0.9.2->spacy)
Downloading semver-2.7.2.tar.gz
Building wheels for collected packages: spacy, pathlib, ujson, semver
Running setup.py bdist_wheel for spacy ... error
Complete output from command "c:\program files\anaconda2\python.exe" -u -c "import setuptools, tokenize;__file__='c:\\users\\yegobl~1\\appdata\\local\\temp\\pip-build-7o0roa\\spacy\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'ex
ec'))" bdist_wheel -d c:\users\yegobl~1\appdata\local\temp\tmpypkonqpip-wheel- --python-tag cp27:
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-2.7
creating build\lib.win-amd64-2.7\spacy
copying spacy\about.py -> build\lib.win-amd64-2.7\spacy
[217 lines truncated for brevity]
copying spacy\tests\sun.tokens -> build\lib.win-amd64-2.7\spacy\tests
running build_ext
building 'spacy.parts_of_speech' extension
error: Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat). Get it from http://aka.ms/vcpython27
----------------------------------------
Failed building wheel for spacy
Running setup.py clean for spacy
Running setup.py bdist_wheel for pathlib ... done
Stored in directory: C:\Users\YeGoblynQueenne\AppData\Local\pip\Cache\wheels\2a\23\a5\d8803db5d631e9f391fe6defe982a238bf5483062eeb34e841
Running setup.py bdist_wheel for ujson ... error
Complete output from command "c:\program files\anaconda2\python.exe" -u -c "import setuptools, tokenize;__file__='c:\\users\\yegobl~1\\appdata\\local\\temp\\pip-build-7o0roa\\ujson\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'ex
ec'))" bdist_wheel -d c:\users\yegobl~1\appdata\local\temp\tmp8wtgikpip-wheel- --python-tag cp27:
running bdist_wheel
running build
running build_ext
building 'ujson' extension
error: Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat). Get it from http://aka.ms/vcpython27
----------------------------------------
Failed building wheel for ujson
Running setup.py clean for ujson
Running setup.py bdist_wheel for semver ... done
Stored in directory: C:\Users\YeGoblynQueenne\AppData\Local\pip\Cache\wheels\d6\df\b6\0b318a7402342c6edca8a05ffbe8342fbe05e7d730a64db6e6
Successfully built pathlib semver
Failed to build spacy ujson
Installing collected packages: numpy, murmurhash, cymem, preshed, thinc, plac, pathlib, semver, sputnik, ujson, spacy
Found existing installation: numpy 1.11.0
Uninstalling numpy-1.11.0:
Successfully uninstalled numpy-1.11.0
Running setup.py install for ujson ... error
Complete output from command "c:\program files\anaconda2\python.exe" -u -c "import setuptools, tokenize;__file__='c:\\users\\yegobl~1\\appdata\\local\\temp\\pip-build-7o0roa\\ujson\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, '
exec'))" install --record c:\users\yegobl~1\appdata\local\temp\pip-ibtvwu-record\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_ext
building 'ujson' extension
error: Microsoft Visual C++ 9.0 is required (Unable to find vcvarsall.bat). Get it from http://aka.ms/vcpython27
----------------------------------------
Command ""c:\program files\anaconda2\python.exe" -u -c "import setuptools, tokenize;__file__='c:\\users\\yegobl~1\\appdata\\local\\temp\\pip-build-7o0roa\\ujson\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --recor
d c:\users\yegobl~1\appdata\local\temp\pip-ibtvwu-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\yegobl~1\appdata\local\temp\pip-build-7o0roa\ujson\
Now that's newbie scary.
Note that this is just one case where I was trying to install one particular package. I got a couple more examples like this in my installation diary, notably one when I tried to install matplotlib, this time on Windows Subsystem for Linux, a.k.a. Ubuntu, and hit a conda bug that meant I had to use an older version of QT until upstream fixed it, and other fun times like that.
> install RStudio and are off to the races, with a helpful package installation GUI.
Unless the package needs a native component, like libcurl of a particular version, in which case it can turn into a couple of hours of blindly trying everything you can think of.
> Another great example: in R, data types are pretty fungible, everything is a vector,
Unless it's a data frame or factor or string or S3, S4, or R5 object, or a couple of other things.
And the documentation will tell you the reference paper that you can read and some completely impractical example.
> Their concerns are generally valid, but they should keep in mind a lot of people using R do not have a software development background and do not care that the language is not elegantly designed: they just want to get analytical work done.
This implies we strive for good design in languages just because it appeases some ideal we have about how languages should be. But really we strive for good design in languages because it makes them more powerful, more expressive, easier to use, etc. Sure, maybe Python doesn't have all the right abstractions to be perfectly suited to statistical tasks, whereas R has more natural abstractions for that kind of stuff. But that doesn't mean that R doesn't also have many objectively bad design decisions even for statistical uses.
As a datapoint in agreement.
Some of the biologists I work with love R. I had one tell me that it's like how they think, and they weren't a Python fan.
I think RStudio (an R-based IDE that turns it kinda into a more Excel-like experience), where you can inspect the data in memory (including matrix data) and make graphs, is what really helps bring people into the R language. And with a set of instructions anyone can go load the analysis packages and do their data analysis.
Compare this to Python, where they have to go to the Unix shell, set up the environment, and load the libraries. When they come back, they have to reset everything to get back to where they started.
Anaconda and Jupyter are a much friendlier environment than writing .py files directly, much like RStudio. It lacks the integrated debugging features, though VS Code does provide some rudimentary assistance. I'd say it's superior overall, especially in regard to getting help and documentation.
Another feature for this audience is the philosophy that functions shouldn't have side effects. You can still do (several types) of object oriented programming in R, but it does take away some of the ways in which non-programmers shoot themselves in the foot.
I've come to really like the way environments work in R, as well.
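A tiny illustration of what the no-side-effects point buys a non-programmer (a sketch): function arguments behave as copies, so a function can't quietly mangle the caller's data.

    clean <- function(df) {
      df$x <- df$x * 100   # modifies the function's own copy only
      df
    }

    d <- data.frame(x = 1:3)
    cleaned <- clean(d)

    d$x        # still 1 2 3 - the original is untouched
    cleaned$x  # 100 200 300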
Meh, S3 is nice and lightweight for a very particular kind of analysis interoperability. S4 isn't super useful IMHO. RC is very well thought out, and I've heard good things about R6. It might not be sane language design, but it works well for designing very different kinds of analytical procedures.
Good point about the package management, but I disagree with your argument. Non-Computer Scientists seem to have a much easier time with Python than R, anecdotally.
I think the reason is that R is not just a badly designed language, but in particular its design is inconsistent. That’s as confusing to newcomers as it is to people who care about PL design.
I used R for almost a decade. Last year I switched to Python and Jupyter, never looked back. Can’t recommend the switch highly enough. R has great stats packages, but struggling with the language is just not worth it.
I came from a "real" programming background. This was pre-Tidyverse. Learning R thoroughly was the best thing I could have ever done for my proglang understanding as all the weird things R had meant that every time I learned something new I could say "Oh, it's like X in R".
HN is predisposed to hate R because everyone here is coming from a "real" programming context.
There is that. Matlab has the same problem.
One problem with "real programming languages" is that programmers who grew up with C don't see any need for built-in multidimensional arrays. This is one reason FORTRAN is still around, and why array work is straightforward in Matlab and R.
Anaconda is becoming to python what Chrome is to browsers, particularly as Jupyter matures. Drop it in, and a huge amount of what you want to do is ready to go. Sure, there's lots of libraries/extensions available, but most of the time you can do real work with a default userland, non-privileged install.
And there is literally no equivalent to dplyr and ggplot2 in Python. Those alone can make a huge difference in how many lines you need to write to do something.
ggplot2 has plotnine (http://plotnine.readthedocs.io), which has a nearly identical API. And though it's not perfect, I've found you can get closer to dplyr with JS-style method chaining on Pandas.
I actually totally agree with this. I learned how to program in R and found it to be quite wonderful to use as a noob. As you say, shit just works. If you think you should be able to do an operation, you typically can. To this day I still prefer cleaning and doing proof of concept analyses in R rather than Python. It's so much easier than having to fuck with pandas and numpy.
One thing common across most of the "real" programming languages makes them unfit for data work: 0-based indexing.
It is just ridiculous to call the first row in a data set the 0th row, and the last row the (n-1)th row. It does not make any sense for data-analysis work.
I'm not sure I understand this, and I'm genuinely interested in why it would be.
I find zero indexing logical: zero is the first natural number and is thus a fine candidate for being the first ordinal.
In my experience most mathematical series lose nothing in terms of elegance or readability by being indexed from zero instead of using more traditional indexing from one.
Generally I've found you carry around fewer n +/- 1 type expressions when you index from 1. Also, most applied math papers I've read index from 1 and that makes implementing them a lot easier.
Generally 1 is considered the first natural number, except by Bourbaki.
The reason for this is that the set {1...n} has order n, but the set {0...n} has order n+1, so you get lots of off-by-one inelegancies or errors when the order is important. It's better to be explicit at the set level when you need a {0...n-1} set, because usually the order gets passed around to later expressions rather than the set elements, so there's less algebra.
Zero indexing is great when your index is an offset, as it is for true arrays.
You find zero indexing logical most likely because you learned programming on languages which are zero-based. But most of the rest of the population, including statisticians, for which R is the intended audience, likely start at one and aren't used to OB1 errors.
It seems needlessly confusing to me to refer to the first number in a series as the 0th number. 0-based indexing is only good for offset counting, which is very much based on having a mental model based on pointer arithmetic for a number sequence.
Yeah years start at 0. But that's because it measures the offset from the beginning of the calendar. You can similarly expand this to all distance based measurements. But that is completely different from counting, which shouldn't be conflated with distances. People always say the first of some sequence and only people who care about 0-based indexing tries to spread the 0th of some sequence meme.
My ideal calendar has 12 30 day months, days 0 to 29.
If it's 10th June and you have an appointment for 0 August, that's 50 days from now.
At the end of the year, a 5 or 6 day 'month' called Holiday. 30th December becomes Christmas Day (observed) and 0 January is still New Years. New Years Eve is either 5 or 6 Holiday.
I think this comment reflects the fact that a lot of pen and pencil linear algebra/stats/econometrics all uses one-based indexing.
There are a few times where I’ve had to formally write out a zero-based indexing scheme of a given expression because python indexing can seem so weird in this case. For example, “lag zero” just sounds like a funny way of talking about lag 1 (to me). Of course, if you’re predicting y_t+1 then it sort of makes sense that y_t[+0] would be paired with B_0.
Then there is the whole thing about how, say, range(5) will return 0,1,2,3,4 and NOT 5.
All of this makes sense once you use python for a while but if you spend most of your time writing with pencil then it will probably take some adjustment.
This is really stupid. Language decisions coming from a background native to them are always superior. This isn't a matter of "all opinions are equal". They never are. R is horrendous, plain and simple.
My company has tons of python code producing reports with reportlab, making UIs with PyQt5, as well as a multitude of small scripts to interact with MySQL.
We’ve been nothing but happy with Python in the years of using it.
Mainly because production pipelines care a great deal about performance (speed) and Python is generally considered to have worse performance in comparison to compiled languages (Java or C++.) Depending on your pipeline that may not be a big deal.
"I 've done it so it's not so bad" is not a very good argument. I've done a significant data science project in Prolog (with R for the plotting btw) but that doesn't mean Prolog is the go-to language for data science :)
As a long-time R user, I agree with all of these complaints. The language itself is ugly and actively tries to get in your way.
I'll add that concepts like data frames are not really intrinsic, and you get needless complexities like "length", "nrow", "dim", each of which does the wrong thing in 90% of the scenarios of interest. The confusion of lvalues is another strange quirk -- a <- 0; length(a) <- 20 is totally valid, and you get things like class(a) <- 'foo' being preferred over the equivalent a$class <- 'foo'. It has all sorts of odd concepts between lists and data.frames -- the double-bracket syntax, etc. The object model is very confusing, though most people seem to have converged on the S3 system, which is the oldest one.
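To spell out the lvalue quirk (a sketch at the console):

    a <- 0
    length(a) <- 5     # "assigning to length" silently pads with NAs
    a                  # 0 NA NA NA NA

    class(a) <- "foo"  # class<- is also a replacement function
    attributes(a)      # $class is just an attribute on the vector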
If you discipline yourself to learning "the good parts", especially by learning either data.tables or tidyverse or becoming a master of split/lapply/aggregate/ave, then it is very powerful. The modelling tools and plotting (both base graphics and ggplot2) are excellent.
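The base-R "good parts" workflow being described looks something like this (a sketch on mtcars):

    # aggregate: grouped summaries without any extra packages
    aggregate(hp ~ cyl, data = mtcars, FUN = mean)

    # split + lapply: the general-purpose version of the same idea
    lapply(split(mtcars$hp, mtcars$cyl), mean)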
I'd love to see a NeoR arise at some point that fixes the strange historical inconsistencies (like what happens when you refer to vec[0], as noted by the author) in non-backward compatible ways.
What is needlessly complex about length, nrow, and dim and why would they ever give you the wrong thing? length always gives the length of an object, i.e. the number of elements in it[1]. nrow will always give the number of rows, and dim will always give the dimensions e.g. for a data.frame with 3 rows and 4 columns dim(df) is (3, 4).
[1]That is, top-level elements. If a list L contains 3 vectors, each with 9 elements, length(L) is 3, not 9 or 9*3 = 27
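Concretely (a quick sketch):

    df <- data.frame(a = 1:3, b = letters[1:3], c = 4:6, d = 7:9)

    length(df)  # 4  - a data.frame is a list of columns, so this counts columns
    nrow(df)    # 3  - number of rows
    dim(df)     # 3 4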
Sometimes a loop is vastly more performant if you count the amount of time it takes to get the "idiomatic" way working and working in a way that allows easy troubleshooting.
This is a stupendous example of someone going overboard on their criticisms in order to grandstand.
R may not be the most "beautiful" language in a general perspective, but it certainly is more beautiful than Python when it comes to actual data analysis. There is nothing in R that is as ugly as even the best implemented pandas, numpy, and matplotlib code. All of the options in Python, which is generally pointed to as the "superior" language to R, feel tacked on and hackish.
The real story behind most of the complaints is that they come from software developers who only rarely need to do data analysis that would require R, and therefore use it infrequently and mistake their unfamiliarity with the language for the language being bad.
I also groaned at the part where the author struggled to google questions about R because of its "stupid name". I have literally never, ever had issues Googling anything about R, the same way I haven't ever had issues finding answers to my questions about "Python". The author is grasping at straws, and this is a programming blog's equivalent of clickbait.
R is a poor name, whether you can google it or not. The name can get lost in the minefield of text on the internet. Just because you never had any issues googling R does not make it any better. I have had many issues googling R, and it always makes me second-guess whether a thread is about the R language at all. On SO, I have to check if R is tagged.
R is a terrible name and it is not up for a debate. Whenever you name a product, company or in this case a language as a letter "R", you're literally asking for trouble.
Just to be fair, C is also a horrible name. On the other end of the spectrum - Julia and Rust are excellent names for a programming language because they're unique in the context of programming.
>Just to be fair, C is also a horrible name. On the other end of the spectrum - Julia and Rust are excellent names for a programming language because they're unique in the context of programming.
Funny enough I most often get wrong results when googling something Rust related, because there's a town called Rust (Germany), so Google pushes the location based results up, and rust is also, well, oxidized metal, so sometimes I get DIY pages as a result.
I love pretty much every other design decision about Rust, but it's the one "hard" to Google language in my experience.
Funny story, I once worked extensively with early Julia in a scientific setting where the HR director's name was Julia Lang, and we joked that if someone checked packets they would think I was stalking her.
> Index vectors like a[1] … a[4]. All indexing in R is base-one. Note that no error is thrown if you try to access a[0]; it always returns an atomic vector of the same type but of length zero, written like numeric(0)
That's serious WTF right there.
In general a lot of the complaints revolve around the language making error handling unnecessarily difficult which is something that will drive me up the wall with a language. I'm a fairly defensive programmer and if your language is fighting me when I'm trying to do error checking I'm not going to be happy. I can kind of understand the thinking of "just write a perl script to verify/reformat your data before passing it to R", but that doesn't help me find my own errors.
It's not WTF at all once you actually understand why it works that way and the benefits it provides. Subsetting in R allows for any index to be requested without erroring: a[0] returns an empty vector (that's what "numeric(0)" is: a numeric vector of length zero), and an out-of-range index like a[90000] (in an array that doesn't have 90000 elements) returns NA. This makes it easier to select multiple elements at the same time or select a range of elements; if you try to access one that doesn't exist, you just get an empty or missing value rather than an error.
You can also access values by negative indexes, which have another special purpose in R that make it easier to quickly manipulate/analyze the data in the array, rather than causing an error.
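A quick sketch of that behaviour (toy vector, values made up for illustration):

    a <- c(10, 20, 30)
    a[0]     # numeric(0): a zero-length vector, not an error
    a[2:3]   # 20 30: range selection
    a[-1]    # 20 30: a negative index drops that element instead of erroring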
It's just like how SQL won't throw an error if you try to do a SELECT...WHERE ID=non_existent_value; instead it will just not return any rows (or return NULL, depending on the exact query). But you wouldn't call SQL bad, you would just acknowledge that it serves different purposes and acts differently than something like Java or C.
Your comment is a great example of what the parent commenter was talking about: experienced "programmers" tend to dislike R simply because they are unfamiliar with it, and it therefore does things that they do not expect. But again, that doesn't mean the language is bad, it just means the users should probably become more familiar with their tools rather than trying to use a wrench to bang on a nail.
It has different behavior when you access past the end of the array. That makes error checking more difficult, since you have to test for both error conditions. 1-based indexing is a mistake, but compounding that by changing the error conditions is the WTF.
Having 1 based indexing is a design flaw that R shares with SQL. SQL's error handling also leaves much to be desired.
One time I was struggling with some odd R behavior of the sort described by the author. I asked my local R expert. He told me how to fix my program, but I protested that none of it made any sense, even when explained. He didn't disagree, he just laughed and said "don't worry about it."
That works great for him, he can "not worry about it" and things work because he knows all the quirks.
If I just "don't worry about it" my programs don't work for mysterious reasons.
It seems likely that R could have been designed to have the same strengths without having so many weird and arbitrary quirks.
Many of the arbitrary quirks started out for the sake of backward compatibility with S. All that being said, yeah, you could probably design a 'modern R' without the weird and arbitrary quirks.
I think many of the gotchas and annoying parts of base R are solved by using tools from the tidyverse: http://github.com/tidyverse. For example, the pain of needing to specify `stringsAsFactors=FALSE` is solved in the tibble package by setting a sensible default.
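For example (a minimal sketch; the column values are made up):

    df1 <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)  # the classic base default: x becomes a factor
    df2 <- tibble::tibble(x = c("a", "b"))                       # tibble never coerces: x stays character
    class(df1$x)  # "factor"
    class(df2$x)  # "character"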
At any rate, at least it's not Pandas and matplotlib...
> At any rate, at least it's not Pandas and matplotlib...
There's a certain yin and yang to the space, isn't there? You get to choose between a hacky language that has pretty good tooling built on top of it, or a pretty good language with hacky tooling built on top of it.
I think that Python is probably winning because being a decent language gives you a decent escape hatch, whereas no amount of great libraries can save you from having to go through the bizarro language.
That said, R may be bizarro, but at least, once you learn it, it's predictable. Whereas I'm not sure even Pandas really knows whether a given call to .loc will copy or refer to the original data.
Pandas has a lot of defensive copying. I love Pandas, but I think it demands a lot from the user. When I started using it I was new to traditional coding (i.e. knowing anything about data structures) and had come from R. Over time, as I've learned a real amount about legitimate data structures, I've become far, far better with Pandas.
I can't stand the non-standard evaluation of the tidyverse. It works great for writing one-off scripts, but as soon as you start trying to put it into functions or your own package it's just not worth the pain of quosures and the tidyeval nonsense that changes every 6 months.
I used to feel similarly, but I think it's much more stable than it was even a year ago; `!!`, `enquo`, and `:=` are good enough for the vast majority of users who want to write their own NSE functions now.
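A minimal sketch of the kind of wrapper I mean (the function and column names are made up; dplyr re-exports enquo and !!):

    library(dplyr)

    mean_by <- function(data, group, value) {
      group <- enquo(group)   # capture the unevaluated column reference
      value <- enquo(value)
      data %>%
        group_by(!!group) %>%                         # unquote it where dplyr expects a column
        summarise(avg = mean(!!value, na.rm = TRUE))
    }

    mean_by(mtcars, cyl, mpg)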
Beware of adding that to your .Rprofile though. The author makes a good argument for why that's a bad idea: your code will execute differently on someone else's machine.
That’s true. It may create more problems than it solves. I don’t change the default value myself, setting stringsAsFactors=FALSE when required is good enough for me.
I learned R coming from Java, Node, PHP and Python, and I love it!!! It is awful as an application-development language, but it was never designed for that purpose. It was designed for STATISTICS. Try to achieve advanced statistics with your traditional software engineer's preferred language and see which language you hate then. The only tricky R concepts for newbies to learn are recycling, formulas and vectorized functions. Add RevoScaleR to R and it kicks major ass when dealing with big-data manipulation. Oh yes, big time!!!
I use R a lot and I have to say some of these comments are weird.
1. R and Lisp are hardly alike, even if R was inspired by it. It's like saying Erlang and Prolog are very similar. If you want to learn FP, do it in Erlang, Lisp, Haskell, etc. Don't do it in R; it's half baked.
2. R syntax is ugly, with warts. But built-in data types like the data frame, the factor, and the NA (missing) value make this language much better than many languages out there for dealing with data. Subsetting a dataframe is a breeze even in base R.
3. There are many, many advanced statistical packages only in R. GLMnet was in R for 4-5 years before someone decided to port it to Python. You can argue that there might be an alternative package, but the statisticians who created the ridge, elastic net, etc. methods made GLMnet. There are many statisticians out there who just implement their latest method in R. If you want to learn a subject in statistics, there is probably a book out there, and it'll have an R package and code to go along with it. Next to that will be SAS. There are very few stats books with Python packages. You want to learn Bayesian statistics? Social network analysis? There's a book for it with R code and a package to go with it. Good luck finding one in Python for these subfields of statistics. There's a book on Bayesian hierarchical analysis in ecology, and that book is in R.
4. ggplot2 is amazing for static graphics. R doesn't have good dynamic graphics, and I'm kind of meh on Shiny. If you hate the syntax, you may learn to appreciate it by reading about it from the creator: https://www.r-bloggers.com/a-simple-introduction-to-the-grap...
> R and Lisp are hardly alike, even if R was inspired by it. It's like saying Erlang and Prolog are very similar. If you want to learn FP, do it in Erlang, Lisp, Haskell, etc. Don't do it in R; it's half baked.
They are very alike in the underlying core design, not in how you use them.
In R, everything is an expression, and every expression is a function call. Even things like assignments, if/else, or function definitions themselves, are function calls, with C-like syntactic sugar on top. You don't have to use that sugar, though! And all those function calls are represented as "pairlists", which is to say, linked lists. Exactly like an S-expr would - first element is the name being invoked, and the rest are arguments. And you can do all the same things with them - construct them at runtime, or modify existing ones, macro-style.
So in that sense, R is actually pretty much just Lisp with lazy argument evaluation (which makes special forms unnecessary, since they can all be done as functions), and syntax sugar on top. Where it really deviates is the data/object model, with arrays and auto-vectorization everywhere.
R certainly has a lispish code-as-data element to it, but it seems like it has some serious flaws. Don't most lisps have functions and macros as separate constructs? R has functions, but with some mucking around you can make them do macro-type stuff. Then people write these half-function, half-macro things (e.g. "non-standard evaluation") that tend to break composability, either totally or sometimes only in edge cases.
Lisps make that distinction because they need it. In R, you can do everything with functions, because arguments can be lazily evaluated, or you can even get the syntax tree used for that argument at the call site instead. So in R, a macro is just a function.
And yes, it's easy to break stuff that way. Just as easy as it is with macros (esp. non-hygienic ones).
I'm not saying it's a better way to do things. It trades performance for fewer primitives (and hence a simpler language structure). But the use of lazy evaluation is pervasive in R in general, so it's a conscious design decision that they made.
Do you have an example of what R would look like without the C-like syntactic sugar? It doesn't need to be complex, I'm just intrigued about what it might look like.
Sure! If you want to experiment with this, it's pretty easy to "reverse engineer" that original form. Just use quote, and convert to a list (you need to do that because expressions will pretty print by default using the same sugar!), to see the internal structure:
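For example, with a simple if:
> as.list(quote(if (1 > 2) cat(3) else cat(4)))
[[1]]
`if`
[[2]]
1 > 2
[[3]]
cat(3)
[[4]]
cat(4)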
And to make sure that it really does evaluate only the correct branch:
> `if`(1 > 2, cat(3), cat(4))
4
Now something more interesting:
> f <- as.list(quote(function(x, y=1) { x + y }));
> f
[[1]]
`function`
[[2]]
[[2]]$x
[[2]]$y
[1] 1
[[3]]
{
x + y
}
[[4]]
function(x, y=1) { x + y }
This last entry is probably confusing, because it looks recursive. However, it's not the function itself - it's the srcref (basically, metadata about where the code came from, used e.g. by debugger to report line numbers) - it just pretty-printed itself like the function it is for. We can ignore it, though. Otherwise there are two arguments here - first one is a pairlist with named elements, one for each argument, and values are the default values for those arguments (if present). Second argument is the function body, which is itself an expression. We can look at that:
> as.list(f[[3]])
[[1]]
`{`
[[2]]
x + y
So {} is itself a function! And x+y works as you'd expect:
> as.list(f[[3]][[2]])
[[1]]
`+`
[[2]]
x
[[3]]
y
Now let's try to do the same ourselves. One catch here is that function() expects the first argument to be a list itself, rather than an expression that evaluates to a list. So we can't do this:
> `function`(pairlist(x=1, y=2), quote({x + y}))
Error: invalid formal argument list for "function"
Because the first argument is not itself a pairlist, but a promise of one. So we need to construct the call, thereby evaluating the arguments in advance, and then eval it. Here's the first take, ignoring the function body:
> eval(call("function", pairlist(x=1, y=2), quote({x + y})))
function (x = 1, y = 2)
{
x + y
}
The body we can just rewrite as plain calls:
> eval(call("function", pairlist(x=1, y=2), quote(`{`(`+`(x, y)))))
function (x = 1, y = 2)
{
x + y
}
You might have noticed that I've cheated a bit here by giving each argument a default value - the original didn't have one for the first argument. It's because we need to somehow get a "missing" bit on a list element for that to work, and this makes it a great deal more convoluted - R has an easy way to check for it, but not to set it, other than by omitting arguments in function calls. The easiest way to get it is to quote() a call with one, and then just pull the pairlist out of the expression tree.
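Something like this (a quick sketch of that workaround):
> args <- as.list(quote(function(x, y=1) NULL))[[2]]
> eval(call("function", args, quote(`{`(`+`(x, y)))))
function (x, y = 1)
{
x + y
}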
Absolutely agree with #3. R is by far the most accepted and expected implementation language for statisticians. I remember reading a blog post where the author had submitted a manuscript to a statistics journal that ended up being rejected, in part, because he had implemented the code in Julia instead of R.
I use python for most things, but there are so many packages that can only be found in R (especially in the bioinformatics world), so it becomes a necessity.
>A R factor is a sequence type much like a character atomic vector except that the values of the factor are constrained to a set of string values, called “levels”. For example, if you have a table of measurements of some widgets and each row corresponds to a single measurement of a single widget, you could have a factor-typed column called measurement.type containing the values “length”, “width”, “height”, “weight”, and “hue”, with the corresponding numeric measurements stored in a “value” column.
This is a very bad example of what factors are for in R, because it makes it seem like factors are for defining variables or keys in key value pairs. You can use them for that, but it isn't the intended use. A better example would be:
suppose you were comparing the amount of sugar in fruits based on several growing locations, and you had three columns:
| Fruit | Location | Density (g/L) |
Fruit would be a factor variable (let's say it takes the possibilities of apple, banana, orange), and Location could be too, if it were a discrete set of possibilities (as opposed to lat/lon coords).
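Something like this (the location names and numbers are made up):

    sugar <- data.frame(
      Fruit    = factor(c("apple", "banana", "orange", "apple")),
      Location = factor(c("north", "south", "north", "south")),
      Density  = c(105, 98, 112, 101)   # g/L
    )
    levels(sugar$Fruit)     # "apple" "banana" "orange"
    levels(sugar$Location)  # "north" "south"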
This author seems to forget that R was built for working with data in an analytical setting, unlike all of the languages he's comparing it to. It has creeped into other areas, but that seems to be because in the hands of a skilled user it is far easier to implement a data analysis solution. I'm sure someone will come in and say how much better pandas is, but on the small datasets, I'll stick with R, especially with how brittle and buggy matplotlib is.
>> This is a very bad example of what factors are for in R, because it makes it seem like factors are for defining variables or keys in key value pairs
Do you have a reference to where Hadley et al. suggest using factors in a key-value system? I'm reading Wickham's books at the moment and have not seen this assertion. Indeed, I believe he would not state this, as he explains the utility of factors explicitly:
A factor is a vector that can contain only predefined values, and is used to store categorical data... Factors are useful when you know the possible values a variable may take, even if you don't see all values in a given dataset...
It was my interpretation that the original article quote was referring to a tidy schema, but I could be incorrect. (The gather() function of tidyr names its parameters key and value as well, and the function is described as "Gather columns into key-value pairs": http://tidyr.tidyverse.org/reference/gather.html)
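A tiny sketch of what I mean, using the article's measurement.type / value layout (the widget data is made up):

    library(tidyr)
    wide <- data.frame(widget = c("w1", "w2"), length = c(10, 12), width = c(3, 4))
    gather(wide, key = "measurement.type", value = "value", length, width)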
If you are interested in this topic and haven't seen Advanced R, I'd recommend taking a look - the book explains why those functions have key-value pairs as parameter names. Note that the functions you cite aren't related to factors.
I don't understand why HN hates R. HN loves lisp, and R as a language shares a much greater affinity with lisp languages than python or Go do. The language was born out of the original authors reading SICP (as statisticians). Sure, many of the users of R molded it to look like what they were used to (S), but that just highlights the powerful metaprogramming capabilities of the language.
> HN hates R. HN loves lisp, and R as a language shares a much greater affinity with lisp languages than python or Go do
HN was started by someone who wrote a book on, and flipped a startup using, Lisp. Python and Go are both used a lot by a buyer of startups, and HN exists to deliver startups, or their IP, to those buyers. R is more of a language for helping its users do data analysis, perhaps in corporate offices, and hence doesn't have as much use for HN's business purpose. Submissions on Python and Go are more likely to stick to HN's front page.
I use lisp and R. While R evolved from lisp, which lets me understand how it does certain things, I don't know if I'd describe it as more like lisp than python.
Indeed, the analogy I use to describe to friends why I have a strong emotional distaste for R is the following:
Imagine you grew up as a heterosexual male. In your early years, you have fond memories of a young girl whom you had a fling with.
She drops off your radar, and you run into her 30 years later. She's gotten breast implants, botched her face with plastic surgery, and went through a rather traumatic divorce and reinvention of herself.
To your friends, who lived on an island where there were no women, were kept in basements, and were regularly beaten by other stats programs, she might even be beautiful, and she certainly gives them the attention their base desires crave.
To you, she'll always be a mangled shadow of her former self and what could have been...
I won't speak for HN, but here's a tongue in cheek summary of my experience with R.
I write a script. It doesn't work. I don't know why. I look at the error message, and then google for 30 minutes to understand what it really means - which parts of the code broke, why, how to fix them. Because none of the 3 things (which, why, how) is easy to get to.
OK, I fix it, having learned something new (like that there are infinite special cases with almost any functions).
I commit it to repo, go for coffee. In the afternoon, a colleague asks how to run that code. Well, it was a simple script, half a page, what's the problem?
I take a look, and on their machine it doesn't run. We don't know why. An hour later we discover she has some R profile file with a setting that changes behavior of some standard library... and she also has different encoding set as default, and so on, and so forth... whatever. I don't know why runtime environment encoding changes behavior of code that only deals with numbers, but hey! It's interesting at least.
We fix it, we are happy.
A few days later I run the script again. It works. The result doesn't look right though. It's mostly zeroes. Hmm.
I run it a few more times, playing around with input, trying to figure out what's up.
OK, after a few minutes I realize there's lots of red flashing by on my screen when the script runs - just so fast I barely see it.
It turns out half the code isn't really running; the script just ignores it (errors do NOT stop the code from running) and keeps going. It produces partial output, happily announcing it finished.
That is the most serious mindfuck. Everything is OK, says the prompt, here's your 1 megabyte result of the calculation, oh, just don't look at the numbers, because I haven't really run any of the code... I couldn't find one of the functions.
I sit there wondering. Which is worse: the fact that every time I try launching the script something else is happening, or the fact that the runtime environment by default will return garbage with NO warning at the end (which is the only thing you see on screen) but with a million warnings in between (which you won't see unless you have really good reflexes...).
Which is worse?
I decided at some point that I want a language to fail, and to always give me the same result. An error, an exception: this should kill the program and shout as loud as possible "Won't give you anything". Also, I want code that ran yesterday to run today, and to run on my colleague's machine, and on a newer version of R. This was never our experience.
Sounds like you aren't running your scripts as scripts. If you source a script or run it via Rscript, it will halt when it hits a failure (unless you've changed a default). Copying and pasting into the REPL will hide errors like you describe.
The other part is that it sounds like you don't have a standardized R environment. I admit that R's tooling there isn't the best, but there are options, e.g. {packrat} & {lockbox}... or better yet a Docker image.
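On the first point, a minimal sketch (a hypothetical script, default options assumed):

    # fail.R
    message("starting")
    stop("bad input")              # run via `Rscript fail.R`, execution halts here with a non-zero exit
    write.csv(mtcars, "out.csv")   # never reached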
R is much more lisp-like than Python: more functional, more emphasis on DSLs and metaprogramming, and the community is far more inviting to newcomers; it reminds me a lot of Racket in that regard.
Yes! People ask me what the best R programming book I’ve ever read is, and I like to say “How to Design Programs.”
I keep thinking someday I’ll try and build some of the missing stats infrastructure in Racket. The problem at present is time and that this stuff is real work!
Having used Python, JSL, Julia, R and Matlab, I agree with most of the criticisms of R. R is an extremely ugly language. It seems to have been created by people who wear capris and uggs (both at the same time). But R has incredible packages, especially the work done by Hadley Wickham. ggplot2 is beautiful. It is utterly gorgeous. It is what Ted Baker is to the capri guys that designed the language itself.
My personal hack to deal with the unbearable ugliness of R is to use rpy2 and call R packages from Python --- at least writing some boilerplate code in Python makes me happier than having to write it in R.
ggplot2 produces beautiful graphs. I don't think it's beautiful as a package -- the syntax is strange and reflects an earlier evolution of the ideas that went into the tidyverse.
Notably the use of + instead of chaining operators, the use of a custom "ggproto" object system instead of S3 (which makes extensibility a nightmare), and the superfluous presence of the aes() function (rendered unnecessary by better lazy evaluation tricks not really well-explored at the time).
> the superfluous presence of the aes() function (rendered unnecessary by better lazy evaluation tricks not really well-explored at the time)
I find this statement interesting - the aes() function helped distinguish attributes that were bound to values as opposed to being bound to constants. What would you propose as an alternative if the aes() function were eliminated?
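For reference, this is the distinction aes() currently draws (built-in mtcars data, arbitrary aesthetics chosen for illustration):

    library(ggplot2)
    ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +  # inside aes(): bound to columns
      geom_point(size = 3)                                        # outside aes(): a plain constant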
EDIT:
Looking at ggvis [1], the presumptive successor to ggplot2, it seems that two new syntactic features will be used to distinguish constant vs. bound data.
I seem to be in the extreme minority opinion that ggplot2 has facilitated the creation of millions of ugly charts. The default theming hurts my eyes. Much improvement can come from just adding + theme_bw() or + theme_minimal() as the last layer.
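For example (built-in data, just to show the theme swap):

    library(ggplot2)
    p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
    p                    # default grey theme
    p + theme_minimal()  # or + theme_bw(): much quieter defaults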
But it is flexible, and it's fun. I have enjoyed following others' examples of recreating chart themes mimicking the Economist and FiveThirtyEight. I think the latter often does employ ggplot2 with heavy customization to create some nice-looking visualizations.
If you don't like that syntax, use another library. But the way ggplot2 lets you code graphics like that is amazing to me and many other people. I've tried SAS and Matlab, and ggplot2 is the best.
> There are subtle differences and some authorities prefer <- for assignment but I’m increasingly convinced that they are wrong; = is always safer. R doesn’t allow expressions of the form if (a = testSomething()) but does allow if (a <- testSomething()).
...I am confused here. If you're testing for equality, R requires you to use == and not =. If you try to test for equality with =, it throws an error instead of treating it as an assignment. That's good. But who is trying to test for equality with <-?
He's saying to use = instead of <- because R won't let you assign to a variable there. But that's because it assumes you mistyped the equality operator. The only reason you need safety there is that it's easy to forget that == is the equality operator, not =. It's not easy to confuse == and <-.
The latter silently assigns the value you meant to compare against, and the if then tests that value - which in the typical case is truthy - so the branch runs as if the comparison had succeeded. This can be extremely difficult to catch and detect, especially for people who aren't software developers. They aren't writing unit tests.
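A small sketch of the three spellings (toy values):

    x <- 0
    if (x == 5) cat("equal\n")   # correct comparison: nothing printed
    if (x <- 5) cat("oops\n")    # typo: assigns 5 to x, the condition is truthy, prints "oops"
    # if (x = 5) ...             # R refuses this outright with a parse error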
If you have tried R and found it painful I can’t say enough good things about the “R for Data Science” book. Great overview of Tidyverse, ggplot. After 50 pages I was further along than my previous 5 years of googling and cursing.
There are a bunch of odd non-standard syntax choices in this tutorial. For example, the author ends statements with semicolons. R does allow equals-sign assignment (although style guides prefer the stupid arrow syntax). The author mentions Bioconductor, the... second biggest package repository for the language?
I clicked because I was a programmer for 15 years before I used R, and I have subsequently developed and shipped R packages, so I feel like I'm in a pretty good position to get the visceral, cathartic, "argh" the writer here was going for.
the "stupid arrow syntax" is not nearly so stupid, and I speak here as a 10+ year Python developer, as the misuse of the equal sign that every C-style language seems to think is ok. The = sign meant something long before computer programming, and it is something the developer community ought to be ashamed of that it is being used for "change this thing to that". /rant
Well, the "=" operator is eas to type since it's on pretty much every keyboard and probably has been for ever. And it's faster than typing a two-symbol operator like := or <- so.
If you want "=" to (more or less) mean what it used to mean "long before computer programming", try Prolog.
?- a = a.
true.
?- a = b.
false.
?- A = a.
A = a.
?- A = a, A = b.
false.
?- A = a, A = B.
A = B, B = a.
?- A == B.
false.
?- A = B, A == B.
A = B.
Nice? Most programmers would pull their hair out by the roots at all this :)
(hint: "=" is not assignment, neither is it equality and "==" is just a stricter version thereof).
I would willingly put up with the stupid arrow syntax if I could use ← for it, and with the (gratuitously confusing to people coming from other languages) use of [[ ]] for an element of a list rather than a sublist if I could use ⟦ ⟧.
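For anyone who hasn't hit that distinction (toy list):

    lst <- list(a = 1, b = "two")
    lst[1]     # single bracket: a sublist, list(a = 1)
    lst[[1]]   # double bracket: the element itself, 1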
R was originally not much more than a bunch of Scheme macros if I understand the history right. Suspect <- may have just been an infix macro for define at one point.
Of course I probably would have preferred just keeping the lispy syntax of scheme without all the infix stuff!
The second time, I had a purpose and found enough code to copy to achieve it. I was rewarded.
The third time, I had a more complex problem demanding use of JSON from Elasticsearch, and found that the two packages out there in git are basically orphanware, use dplyr in extremely confusing ways, and offer little or no advantage over simplistic HTTP fetching and direct-to-JSON parsing. Which is a huge shame, because the idea of an Elasticsearch abstraction is very attractive. But "it just didn't work out of the box".
I am very clear that I am an R "consumer", not an R developer. But at this point, absent Shiny and a GUI, I think that Python and NumPy have as much to offer me, basically.
Some people say the syntax is FP-friendly. I have been trying to learn FP in Haskell, and I think R is about the worst notation you could invent to sell FP.
Great guide. R syntax is awful no matter how you slice it, and it would have been the tool of choice before Python could walk. Still, it is very powerful and great to have around.
I use R most of the time, and I find R notebooks very data-exploration friendly. They make it easy to go back and forth, just like a Jupyter notebook. Producing HTML files from RMarkdown files is also analysis friendly.
99% of the time I use the tidyverse with no noticeable impact on performance. For that occasional 1%, I must admit the data.table package works out really well. Tidyverse pipes are so unixy that it's easy to transition to commands such as cut, head, sort and column if needed, without any mental contortion.
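A typical pipeline, just to show the shape (mtcars as a stand-in dataset):

    library(dplyr)
    mtcars %>%
      select(mpg, cyl, hp) %>%   # ~ cut
      arrange(desc(hp)) %>%      # ~ sort
      head(5)                    # ~ head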
I have used Python occasionally, and with method chaining it can almost simulate the dplyr-like syntax. However, it is hard to find some obscure statistical test out of the box, which is easy in R.
There are some highly productive researchers who have taken the time to write well-constructed and very usable packages in R, married with meticulous documentation. The same individuals or groups of 1-3 people have also maintained said packages for 5 years or more, and regularly respond in person to queries. Two examples are ggplot2 and limma.
To me, this is all the encouragement I need to use R.
Some reading this will claim any criticism is the fault of the critic. Others will jump in claiming it is all revealing the emperor's clothing and promoting an alternate religion.
A few will deconstruct the criticisms and look for the small documentation or even language changes to solve something.
Hmmmmm, okay, after 20 minutes of squinting at the script I see there's a lowercase letter in the header name of the last column, somewhere in the middle of the 200 lines... now we can move on to debugging the next "Error"...
I don't use R to write programs. I wrangle data, run analysis, and plot results.
It has a great number of solid stat packages for mixed modeling, clustering, ordination, etc. That is why I use R.
I thought I was the only one who hated R.
Guess not. I gave up trying to make a simple representation when I couldn't find a tutorial on how to import data into R without using fucking Excel.
R worshippers: "To a man with a hammer, everything looks like a nail." Interactively manipulating tables and producing visualizations is not what software engineers/programmers on HN would call "programming". Try learning a compiled programming language like C and creating a tool or three.
R haters: "It doesn't matter whether the cat is black or white, as long as it catches mice." Remember, R was designed by statisticians for statisticians -- nothing more, nothing less. Try manipulating a tabular dataset and producing visualizations in R, and see how easy and painless it is. There's a good reason behind all the hype.
"See and despair: x<-y (assignment) is parsed differently than x < -y (comparison)!"
And that guy talks about computer science? "<-" is a single token, what is so hard about that? It is a lexing issue, not a parsing issue!
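Spelled out (toy values):

    x <- 1
    y <- 5
    x<-y     # one token '<-': assignment, x is now 5
    x < -y   # two tokens '<' and '-': comparison, 5 < -5, FALSE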
Surely one cannot accuse OCaml of being written by laymen:
# type r = { mutable x : int };;
type r = { mutable x : int; }
# let y = { x=10 };;
val y : r = {x = 10}
# y.x <- 100;;
- : unit = ()
# y.x < -10;;
- : bool = false
I don't hate R as much as I hate universities insisting on the mandatory use of R for all comp-stat courses. This, to my mind, verges on a civil rights infringement. Bear in mind we aren't talking about a private institution such as a company, where when you sign on as a dev for a paycheck, you do so voluntarily, knowing that the company uses X language and you won't have a choice in the matter. In a public university, students - especially grad students, who go on to do research and write papers - must have a choice in what they use. If I don't like R, no matter what the reasons behind my dislike, I should have a choice. Let me use numpy, scipy.stats and pandas. How does that bother you? First of all, you should be teaching statistical concepts, not syntax - so this whole notion that graders & TAs don't know Python is nonsense. Let them grade my numerical answers, not the R syntax used to get that answer. And at that point, how does it matter if I use R or Python to get my answers? Sorry, had to get that off my chest (and don't even get me started about SAS).
Well, the purported civil rights issues aside (which is a patently absurd point of argument on its face) – if you’re at the point where you’ve accumulated enough experience with both environments to form strong preferences for one over the other, that’s a decent sign that said course might just not be a good fit for you. I mean, professors and TAs are pretty overloaded already, and asking them to redesign course materials to accommodate additional environments is a tall order, especially for introductory courses.
Because then students start using 3 or 4 different languages, can't figure something out, and come bitching to the TA to tell them what they did wrong, and now the TA needs to know 5 different ways to implement the same thing.
Ironically, you complain about wanting to learn statistical concepts, and not a language, but by focusing on a single language, they can minimize the amount of time spent on agonizing over lines of code trying to get to the same answer, and instead focus on the actual statistical problem.
It's also important to note that there are lots of potential pitfalls for people who don't know about them. For example, scikit-learn is the most popular Python package out there for most data analysis work. But I would bet my ass that at least 50+% of its users don't realize that the logistic regression implementation applies L2 regularization by default, and in fact there is no (non-hackish) way of fitting unpenalized logistic regression. So you will be getting completely different answers than someone who is using R, SAS, SPSS, etc. And the only way to know this is to understand how the function is implemented in every language involved.
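For reference, the unpenalized fit an R user typically gets (the data frame and column names here are placeholders):

    # plain maximum-likelihood logistic regression, no penalty term;
    # scikit-learn's LogisticRegression adds an L2 penalty unless you work around it
    fit <- glm(outcome ~ x1 + x2, data = df, family = binomial)
    summary(fit)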
That's fine on one condition. If you get the wrong answer, you just get 0 and no feedback.
With a library and language I know, I can look through it, advise on what went wrong, and see how close you got. If I have to start supporting every language in the world, that's simply not reasonable.
There are running jokes amongst stats students about those who resort to using SPSS and EViews. It's mostly because of the GUI as most of the students use SAS (proprietary and expensive) as well as R.