Wow, people here are being pretty hard on this article. I'm sure everyone here on HN has read their share of programming language flamewars, and this one doesn't come close. It's pretty mild.
I also think it gets something right - it does feel different to use Python and R, and the reason may be rooted in how these languages arrived at data science. Python, as the article points out repeatedly, is a general programming language that scientists liked using for numerical computing, so slowly it acquired a billion libraries for data analysis, stats, machine learning, and so forth. R was created by scientists and statisticians specifically to do stats and analysis, but in order to be useful, it needed to acquire the full capabilities of a programming language.
What feels most natural to you often depends on what direction you come from. If you're a programmer taking on data science, you may gravitate toward Python. If you're a statistician getting deeper into code, R may feel more natural. It would behoove you to learn both.
What on earth is wrong with that? I suppose that for a few people on HN, this might be a bit repetitive. But otherwise, I'd recommend this article as a relatively non-combative piece explaining the different languages, especially for someone getting started.
It's been a few years since I've done real R work, but my general impression was that "core R" - i.e., the R that is explained in the R manual - is actually a bit deprecated. It's not the correct way to use R. The correct way is to use the Hadley tools (ggplot, dplyr, etc.).
These tools are grafted onto R - but they seem to have a completely different design philosophy. I actually don't know why they're in R and not Python or C++ or whatever other language - but they form a set that is very easy to work with and produces results really quickly (especially in combination with RStudio).
So the design principles behind R (or I guess the S language) kind of become irrelevant.
R is explicitly designed around S-expressions and as such lends itself to domain-specific languages like these. The choice was not an accident, either on the part of R's originators (Robert Gentleman and Ross Ihaka) or on Hadley's.
Guido has explicitly stated that he does not want Python to be "more lispy", e.g. with regard to lambdas (*). Thus I've seen many people, even at, say, Stanford, Harvard, and Cambridge, going back to R from Python. Sometimes no general-purpose language best suits a workflow, and a DSL works better. That is where lispy languages hold an advantage.
Use the right tool for the job, imho, but I fucking hate people who mix the two within a project intended for wide public release. Worst of both worlds, again imho.
(*) Apparently functional constructs such as iterators and generators are OK, though. Wtf, Guido.
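To make the footnote concrete, here's the distinction in a few lines of Python (nothing exotic, just standard language behavior):

    # Python's lambda is restricted to a single expression -- no statements:
    square = lambda x: x * x             # fine
    # bad = lambda x: (y = x * x)        # SyntaxError: assignment isn't an expression

    # Generators, by contrast, are fully embraced:
    def naturals():
        n = 0
        while True:
            yield n
            n += 1

    evens = (n for n in naturals() if n % 2 == 0)   # lazy, composable
    print(next(evens), next(evens))                 # 0 2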
Python is similar. Base Python is terrible for data analysis, but a large number of useful packages have grown up around numpy (scipy, pandas, theano, keras, etc.), which has similar conventions to R.
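To illustrate what "similar conventions to R" means in practice, compare a base-Python loop with the vectorized numpy equivalent (toy data, purely for illustration):

    import numpy as np

    xs = [1.0, 2.0, 3.0, 4.0]

    # Base Python: explicit loop, manual mean
    mean = sum(xs) / len(xs)
    centered = [x - mean for x in xs]

    # numpy: whole-vector operations, much like R's vectors
    arr = np.array(xs)
    centered = arr - arr.mean()     # broadcasting, no loop
    print(centered)                 # [-1.5 -0.5  0.5  1.5]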
The "standard" R libraries cover most of the stuff. There are a couple of things that are hard to do. Hadley's libraries are very popular, but saying that R base is "deprecated" or somehow "not correct" is completely wrong.
Also, all these libraries are somewhat tied to R data structures and often take advantage of R's peculiar evaluation rules for function arguments, which means they're not that easy to port to other languages.
R is good for tabular data and Python is good for text/image/nontabular data. And there's nothing wrong with knowing and using both languages.
Likewise, the world will not end if you use Python pandas for tabular manipulation or various bespoke R packages for nontabular manipulation.
This isn't the battle that people should be fighting. It's not even a religious argument like web development stacks, where a language can eke out better benchmarks. And as others note, this very article concludes that they both have their advantages.
No kidding. If you want to be a "data scientist" (whatever that means), it's good to have a scripting language (Perl, Python, Ruby), a math language (Matlab/Octave, R), and a fast language (probably C). You can torture any one of these to fill the others' roles, but usually it's easier to use the best tool for each job.
"Need to know" might be a stretch but you should be familiar with it.
There have been several times when I've come across something in R or Matlab that I wanted to do in Python, and it's easier to port code/processes over if you have that awareness.
I'm a thousand times better at Python than R/Matlab, but being familiar with them has helped me a lot.
You will be a better data scientist if you spend some time exploring the extensive R documentation and lots and lots of real life examples.
On the other hand, R and Python (not to forget LaTeX) are easy to connect, so there is no need to choose only one of them. You can also call R from Lisp, of course.
At the risk of ironically causing another religious framework war, I never said pandas was bad/inadequate for tabular data, but R/tidyverse has its perks in that area.
This attitude is common for people who only want to learn ONE way of doing things. They buy the illusion that there is a silver bullet for all computing (or data analysis) tasks. In fact computing is the art of using the right tools and languages to solve particular problems.
Both are nice and pleasant guys to work with, unlike a lot of the drooling idiots that act like this "rivalry" is some sort of football game. Funny how master craftsmen don't often blame their tools, they sharpen them instead.
Any time someone blames their tools for their own inadequacies, show them this video of Kelly Slater surfing better on an overturned table than most of us can surf on a 7-foot three-fin board: https://m.youtube.com/watch?v=XQ4owd3yQ_4. Up your game instead!
Edit: hacker news doesn't use that part of markdown
I agree - it's about not blaming the tools, but also following Vonnegut's rule: Goddamnit, you've got to be kind.
Some of our other FOSS luminaries have chosen a different interaction model. Sometimes that other approach to technical project leadership is touted as necessary. I'm sure Hadley and Wes suffer their fair share of fools, and they generally seem to do so with kindness.
Rightly so. If you start out framing a comparison of strengths as a battle for mindshare then you essentially claim that the two languages are mutually exclusive and that really isn't the case.
> Yes, Python makes preprocessing easy, but that doesn’t mean you can’t use R if you need to clean up your data. You can use any language. In fact, in many cases, it’s structurally unsound to mix up your data purification routines with your analysis routines. It’s better to separate them. And if you’re going to separate them, why not use any language you like? That may indeed be Python, but it could be Java, C, or even assembly code. Or maybe even you want to preprocess your data within the database or some other storage layer. R doesn’t care.
I'm guessing he's being facetious when he suggests that assembly language is even a consideration when thinking about preprocessing data. But why would you even include it in the list?
All these articles always say that Python has better preprocessing compared with R. Where? I find tidyverse/data.table much more elegant than pandas, scipy et al. The only thing I like more in Python is how it handles streams/generators.
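Since streams/generators came up, a minimal sketch of the kind of lazy pipeline Python makes pleasant (the file name and row format are invented for the example):

    def read_lines(path):
        # Lazily yield stripped lines; nothing is read until iteration starts
        with open(path) as f:
            for line in f:
                yield line.strip()

    def parse(lines):
        # Yield (key, value) pairs, skipping malformed rows
        for line in lines:
            parts = line.split(",")
            if len(parts) != 2:
                continue
            try:
                yield parts[0], float(parts[1])
            except ValueError:
                continue

    # Constant memory even for huge files
    for key, value in parse(read_lines("measurements.csv")):
        print(key, value)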
I use both R and Python extensively, and was also confused for a long time about why the conventional wisdom is that one of Python's strengths is data cleaning and preprocessing. I too always found data cleaning to actually be one of R's strongest features. My hypothesis: I think there is an inconsistency in what is meant by "cleaning/processing" between stats- and CS-oriented people. To the latter, it often just means turning unstructured or tree-like data (JSON/XML) into a tabular format. As a statistician, I think of it as a much larger set of operations; specifically those supported by packages such as dplyr, tidyr, and stringr... I'm not sure if that interpretation is correct, but Python does excel in the XML/JSON => CSV step, and R is great at all the processing that happens after that.
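To make that split concrete, the "CS-style" step in pandas looks something like this (pd.json_normalize is the modern name; older pandas versions expose it as pandas.io.json.json_normalize):

    import pandas as pd

    # Tree-like input: the kind of thing Python handles well
    records = [
        {"id": 1, "user": {"name": "Ann", "city": "Oslo"}},
        {"id": 2, "user": {"name": "Ben", "city": "Kyiv"}},
    ]

    # Flatten the nested structure into a table (the JSON => tabular step)
    df = pd.json_normalize(records)
    print(df)   # nested keys become 'user.name', 'user.city' columns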
Ya, completely agree with this (having also used both extensively). dplyr can also connect to a remote SQL server so the data don't have to be local. Maybe pandas does this now too, but in my experience SQL connections were generally more painful in Python.
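For comparison, the pandas route via SQLAlchemy (the connection string and table are placeholders); note that unlike dplyr's lazy remote tables, pandas pulls the full result set locally:

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholder connection string -- substitute real server details
    engine = create_engine("postgresql://user:password@host:5432/mydb")

    # The query runs server-side; only the result comes back as a DataFrame
    df = pd.read_sql(
        "SELECT region, SUM(sales) AS total FROM orders GROUP BY region",
        engine,
    )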
The last time I preprocessed with Python I only used csv.reader. The client had sent a tree structure as a two-column table, so in one pass I output SQL for the recursive table, with a bit of if-second-column-empty-then-figure-out-depth-of-first-column logic.
Maybe R would work, but I'm familiar with Python. So Python stayed out of my way
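A rough reconstruction of that one-pass logic, with the column layout and depth rule guessed for illustration (this is not the actual client code):

    import csv

    with open("tree.csv", newline="") as f:
        depth = 0
        for row in csv.reader(f):
            if len(row) != 2:
                continue                      # skip blank/malformed lines
            name, parent = row
            if not parent:
                # Second column empty: infer depth from the first column,
                # e.g. by counting leading indentation (a guessed rule)
                depth = len(name) - len(name.lstrip())
            print(f"INSERT INTO nodes (name, depth) VALUES ('{name}', {depth});")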
Having recently completed a data analysis project I'd say the biggest thing that made me choose R was its ability to make really pretty graphs easily.
I have a lot of experience with Python doing all sorts of programming. This was my first R program and I don't regret it. The libraries in R (ggplot2) for making pretty graphs are much better than anything I could find for Python.
ggplot2 no longer has the pretty-pictures field to itself. Python's matplotlib does great graphs too. JavaScript's d3.js does something similar. And there are other libraries as well.
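For reference, a minimal matplotlib sketch; it won't settle any aesthetic arguments, but the API is serviceable:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0, 10, 200)
    fig, ax = plt.subplots()
    ax.plot(x, np.sin(x), label="sin(x)")
    ax.set_xlabel("x")
    ax.set_ylabel("value")
    ax.legend()
    fig.savefig("sine.png", dpi=150)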
For static images to put into a PDF, maybe it doesn't matter what you use, but when you see the ease of creating interactive data exploration graphs on a web server using d3.js and its ecosystem, you will be pleasantly surprised.
"The first stage of data aggregation can be accomplished with Python. Then the data is fed into R, which applies the well-tested, optimized statistical analysis routines built into the language. It’s as if R is a library for Python. Or maybe Python is a preprocessing library for R."
I really like this approach, actually. Taking advantage of the strengths of both languages.
We often take this approach at my company. The heavy lifting of feature extraction from raw data (wearables in our case) is done by Python/numpy models. The population-level stuff is then often handled in R by data scientists with more of a maths/stats background than an engineering one.
This is one of the worst articles I've ever read. It's literally (figurative literal here) creating a shitshow out of a molehill of non-combative people who mostly get along with each other, but get a little pissy if you bone one of their wives, which is totally theoretical because none of them actually have wives.
In the "both of these are awesome, thx for your input infoworld" camp, does anyone know of an equivalent of purrr for pandas/ python? I've been digging the pandas/scikit/numpy/numba stack recently but a friend was showing me the most beautiful data manipulation R code the other day, written in purrr.
Use both because the real tool that you are using for data analysis is the computer, not the programming language. R and Python are both just parts of the toolset.
Nowadays there are lots of ways to combine the two, from rpy2 to Orange3 to Jupyter and the Beaker Notebook. Notably, the last two let you use Groovy, Java, Scala, and a host of other languages as well. Apache Taverna also plays in this space of integrating multiple tools with different strengths to do a job.
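As a taste of the first option, a minimal rpy2 sketch (assumes R and the rpy2 package are installed):

    import rpy2.robjects as ro

    # Evaluate R code in-process from Python
    mean_val = ro.r("mean(c(1, 2, 3, 4))")[0]
    print(mean_val)    # 2.5

    # Or grab an R function object directly and call it
    rnorm = ro.r["rnorm"]
    samples = rnorm(5)       # an R numeric vector of length 5
    print(list(samples))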
R will likely never be eclipsed by anything because it has such a broad and deep collection of statistical libraries. But Python won't go away because it is a great tool for general purpose computing and even hardcore stats heads have a lot of general purpose computing problems to deal with.
It is sad to read R code that copies files, gets data from S3 buckets, runs SQL queries, and so on. So much of it is crudely hacked together, and even the libraries that support this are shoddily built. The best of both worlds is to use Python for pre- and post-processing, but R for the stats libraries (CRAN, Bioconductor).
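On the Python side, the S3 step really is painless with boto3 (bucket and key names below are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # Pull the raw data down, clean it in Python, then hand it to R for stats
    s3.download_file("my-data-bucket", "raw/measurements.csv", "measurements.csv")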
For lots of S3 wrangling the best tool is a Java library called JetS3t, and using a language like Groovy or Scala makes it easy to tame. And Groovy is integrated deeply into Jenkins, which has evolved beyond a CI tool into a general-purpose dashboard for managing and running "jobs". Works great for big data stuff that is not purely Map/Reduce.
Beaker Notebook is leading the charge by integrating seamless conversion of data frames between languages so that you can write a script in two or three languages at the same time, building on the strengths of each one.
If you stick with just one language then expect the next generation of data scientists to leap far beyond you in a few years. A sea change is coming.
> the last two let you use Groovy, Java, Scala and a host of other languages as well
Neither Python nor R runs on the JVM, so if you end up using Java, Scala, Kotlin, etc., then you've decided to open that JVM can of worms, which is another huge pile of tradeoffs.
> JetS3t, and using a language like Groovy or Scala makes it easy to tame. And Groovy is integrated deeply into Jenkins which has evolved beyond a CI tool into a general purpose dashboard
If you end up there, know that only a subset of Apache Groovy is used by Jenkins, e.g. Groovy collections methods aren't supported. Each step along the "native Python or R" -> "Java on JVM" -> "Scala or Groovy" -> "Jenkins as dashboard" decision process entails some cost-benefit tradeoffs which need to be assessed.
I've been using R intensively for almost two decades, and Python for about half that time. I enjoy using them both and think they're great languages. I don't think it's an either-or thing, because they both have something to offer.
At the same time, they also have a lot of weaknesses, most of which are summarized by the Julia benchmarks (https://julialang.org/benchmarks/). You can criticize these particular benchmarks, but similar patterns emerge in lots of other benchmarks.
R was never meant to do the heavy lifting it's doing today. Ihaka sort of lamented this fact for a while, and then got ignored as people went on to use it anyway.
Sure, you can wrap things around low-level C/C++/Fortran in either language, but eventually if you find yourself getting into nitty-gritty stuff, the computation and/or memory use of R and Python becomes a problem. It also complicates a task to rely on juggling two platforms at the same time.
Julia is new, but it reminds me a lot of R in its early stages. I started using R when it was in beta because it offered something new, and Julia has a similar feel at the moment. Maybe Julia will die away but it doesn't seem that way to me at the moment. I've seen lots of prospects come and go, and none of them had the same traction as Julia.
If anything will stem the growth of Julia, it probably will be Python. JavaScript saw a lot of performance gains after Google and other players invested heavily in it as part of the browser and mobile ecosystem. It seems like Python is getting similar investments now with ML/DL, and I wouldn't be surprised if Google et al. started dumping tons of resources into PyPy or something, in the same way you saw JavaScript implementations getting that investment. At the same time, if you look at benchmarks of PyPy, it seems like you might get to the same level as JavaScript, which isn't the same as Julia (or C++, which is maintaining its relevance, or Rust or Go, which are growing and relevant).
I guess my point is if a student asked me, sure, I'd recommend they prioritize R or Python first, but I would also explain Julia to them and recommend they become familiar with that as well.
Yup. It's already a great, fast, enjoyable language with a couple of excellent libraries, and it's maturing: Apparently release 0.6 is around the corner, and the next one might be 1.0.
Maybe I am the only one, but I find numpy is like alcohol. You may feel exhilarated writing numpy code, but the resulting code is often very difficult to read.
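To illustrate the complaint, compare a dense numpy one-liner with the same computation spelled out:

    import numpy as np

    a = np.random.rand(100, 3)

    # Terse: normalize each row to unit length in one line
    b = a / np.linalg.norm(a, axis=1, keepdims=True)

    # The same thing, spelled out -- easier to read six months later
    row_lengths = np.sqrt((a ** 2).sum(axis=1))   # length of each row vector
    b = a / row_lengths[:, np.newaxis]            # divide each row by its length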
Oh God, the new-age politically correct "battles" where in the end you all hug, sit down to sing kumbaya, and try to please all readers by claiming everyone is actually a winner in this "battle"...
If you are too afraid to actually analyze a situation and give your opinion, then just don't write about it and spare us all the time it takes us to read it.
> Indeed, it is a variant of S with lexical scoping to make large code bases cleaner
I have no clue where he's getting this.
R has 3-4 ways to make a class, btw. The code base isn't cleaner. Most packages that need speed are coded in faster languages, so R is glue for packages.
The code base is decent overall but I think Python is much better.
> Python does everything any language can do
I want it to preemptively stop processes the way Erlang can, but it can't. So this is wrong.
> The Python world has been trying to catch up lately by working with existing IDEs like Eclipse or Visual Studio.
Python has an RStudio equivalent; it's Rodeo... this guy. He even mentions it later on, which contradicts his previous statement.
I think the article is an unorganized brain dump. Maybe he just needs to reorganize his thoughts.