Advantages of Using R Notebooks for Data Analysis Instead of Jupyter Notebooks (minimaxir.com)
167 points by minimaxir on June 9, 2017 | hide | past | favorite | 70 comments



Once I got serious about this stuff, I started using Emacs with Org Mode and ESS. It has most of the features listed here. It's just a text file, magit is incredible, you can export to HTML and there are lots of other formats available as ox plugins (e.g. MediaWiki), excellent LaTeX support, Org Mode is incredible for organizing large analyses and managing todos, you can glue together anything from anywhere (R, Python, shell scripts, Spark clusters, SQL, remote processes, etc.), and so on. Considering I code in addition to doing data analysis, I can reuse all my coding stuff to do data analysis too. I can take notes in Org Mode during a meeting, and then afterwards do some analysis directly on those meeting notes, export to HTML/PDF, and send it to a colleague.

It is missing the quick backtick interpolation that you get with R-Markdown, and some of the nice UI stuff like inline Shiny graphs and clickable tables (but it is easy to output Org-format tables that can be piped into other languages).


The disadvantage is collaboration; I don't know of anyone that uses a non-Emacs Org implementation, and adopting an editor/religion/lifestyle just to collaborate on a project is a huge ask. I find that in Emacs, ESS + polymode works great for rmarkdown files, providing many of the benefits, in a file that will also be interpretable in Rstudio.

Reproducible research must be reproducible by the unenlightened, after all!


IMHO R notebooks, jupyter and their ilk are essentially emacs for the masses.. which is neither good nor bad, but is.


Yup. Lots of people who rave about the benefits of notebooks are describing the benefits of a REPL.


Disagree. A REPL does not make for a log of what you did - which is a big deal if you're revisiting/-running data analysis workflows you haven't touched in a long time.


Using a terminal is enough to do that for any REPL: run "script".


You'd hope that in both cases, source code would.


Yes, ESS is fantastic. The only thing I envy about Rstudio is the GUI to save plots.


ESS is great, BUT its autocomplete is nonexistent for the tidyverse, which is a deal breaker for those who work within it. It also takes setup to work with shiny and knitr, whereas Rstudio just works. On the other hand, I'm not sure ESS has anything that Rstudio doesn't offer.


The editing capabilities of Emacs are simply unmatched, for one thing. By comparison, writing code in Rstudio with its built-in editor feels like you're using some clunky, primitive tool :)


This will probably change (I hope!) with the adoption of the Language Server Protocol. I personally find Rstudio's suggestions to be a little odd (though I appreciate the fact that it actually works).


ESS already has symbol completion. Am I missing something?


My two main gripes with the current ESS AC implementation are that (if I've understood it) it relies on an active R process (e.g. it won't suggest the name of a function you've just written unless an active REPL has evaluated it), and that it doesn't work at all with pipes, which means it fails for ggplot2 and dplyr style coding (which is really all I write).


I see what you mean. Yes, AC needs an active process. Personally, I use dabbrev-expand for things that don't autocomplete via the standard AC mechanism.


My only complaint about org mode is that nobody seems to want to make caching work with noweb references even though it's an old issue.


I switched from Python to R about 3 years ago. I missed iPython (Now Jupyter) for a long time. Then I just got attached to RStudio.

I have tried R in Jupyter a few times and it was nice, but the advantages of R Notebooks are just awesome. Git playing nice is the best advantage.

I'm still baffled by the religious Python vs. R debate and the smack talk claiming that "serious" work is done in Python. R works best for me.


why do many say 'serious' work gets done in Python? R is great for linear models, but I find it tedious for many other things such as machine learning. However, I wouldn't classify that as 'serious', just that I find one performs better for different tasks.

it's already been said, but I do NLP a lot. R handles text poorly. humans use a lot of text.

tensorflow, neural networks, etc. are better in Python

between pandas, list comprehensions, the python collections library, sklearn, and spyder, I feel I have a lot of power at my fingertips and it's easy to do most of the machine learning I want.

importing a package takes a meaningful amount of time in R. Several seconds, that is just unacceptable.

it's a personal matter, but R has syntax that gets on my nerves. python list: a = [1,2,3]; R vector: a = c(1,2,3). perhaps it's because I used other languages before, but my fingers are more adept at hitting [ which requires no shift compared to (. some people love curly braces and lots of parentheses in if/for statements, I appreciate them not being there.

I have to fight with R on scientific notation, always copy - pasting into my code: options(scipen=999)

that said, spyder is buggy, and R studio is fantastic. I still haven't come across a good python IDE that is par with R studio.

edit: I forgot to say, I feel pyspark is far superior to sparkr. last I saw, sparkr only worked with a VERY old version of spark. I don't even think that version is supported anymore. this is a bit of a big deal to me



Yes, sparklyr is very good for using Spark in R (I have another post detailing that: http://minimaxir.com/2017/01/amazon-spark/ )


> that said, spyder is buggy, and R studio is fantastic. I still haven't come across a good python IDE that is par with R studio.

It's certainly taken some time investment, but after bouncing around all the editors for both, with some config, Emacs (with ESS for R and anaconda mode for Python) is the best environment I've found for both languages.


I'm not a python or R dev (Scala dev actually, whose day job consists substantially of re-factoring python and R code written by data scientists and data analysts to run on the JVM on prod servers). Sure, python is easier to grok for up-stream ETL/data processing, but that's commodity work (or it should be anyway) and not a solid basis to compare R vs python. R has far more packages than the "scientific" python portion of pypi, and for certain domains the quality (and quantity) of the packages in R makes it the better choice; examples: signal processing (or any more-than-routine time-series analysis), seismic interpretation, finance, experimental design, chemometrics, etc. And with strict use of the data.table package--coercing data frames to data.tables and using that package's syntax to manipulate your data--R is very fast. Ignore the smack and leave those folks to their "serious" work.


Out of curiosity, what was the motivation for switching from Python to R for analysis, was there a particular R package that you were looking to use?


I can't speak for the top comment, but I started learning R and realized that a lot of the primitives which are exposed in python as libraries are just primitives in R and are thus more natural to use in subtle ways. Once you start thinking in R you think in data and statistics rather than how you deal with data and statistics within a language. This doesn't mean one is actually better than the other, unless you want to do generic programming things alongside your math-oriented code, in which case python is probably better. But I find this change of mental state to be useful when focusing on the problems R is supposed to solve.


R has some really crazy metaprogramming facilities. This might sound strange coming from Python, which is already very dynamic - but R adds arbitrary infix operators, code-as-data, and environments (as in, collections of bindings, as used by variables and closures) as first class objects.

On top of that, in R, argument passing in function calls is call-by-name-and-lazy-value - meaning that for every argument, the function can either just treat it as a simple value (same semantics as normal pass-by-value, except evaluation is deferred until the first use), or it can obtain the entire expression used at the point of the call, and try to creatively interpret it.

This all makes it possible to do really impressive things with syntax that are implemented as pure libraries, with no changes to the main language.
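
To make the contrast with Python concrete, here is a rough sketch of what deferred argument evaluation and code-as-data look like when they have to be spelled out by hand. This is only an analogy, not how R works internally: R creates promises for arguments implicitly and exposes the call expression through functions like substitute(), whereas in Python the caller has to opt in. The helper names (lazy_arg, maybe_use, expensive) are made up for illustration.

```python
import ast


def lazy_arg(thunk):
    """Wrap a zero-argument callable so it is evaluated at most once, on first use."""
    cache = []

    def force():
        if not cache:
            cache.append(thunk())
        return cache[0]

    return force


calls = []


def expensive():
    # Record each real evaluation so we can see how often it happens.
    calls.append("evaluated")
    return 42


def maybe_use(flag, value):
    # `value` is a promise-like thunk; it is only forced if this branch needs it.
    return value() if flag else None


promise = lazy_arg(expensive)
unused = maybe_use(False, promise)     # expensive() never runs here
used = maybe_use(True, promise)        # forced on first use
used_again = maybe_use(True, promise)  # cached, not evaluated a second time

# Code as data: parse an expression into a tree, inspect it, then evaluate it.
tree = ast.parse("x + y * 2", mode="eval")
names = sorted(node.id for node in ast.walk(tree) if isinstance(node, ast.Name))
result = eval(compile(tree, "<expr>", "eval"), {"x": 1, "y": 3})
```

The point of the comparison is that in R the callee decides whether to treat an argument as a value or as an expression, with no cooperation needed from the caller, which is what lets libraries like dplyr and ggplot2 build their own mini-syntax.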


Quite true. R was originally built on top of Lisp and it shows in many places, such as in the metaprogramming stuff that you mention.

Overall R seems a little weird at first but the more you get to know the language the more you realise it's actually pretty well thought out.


It's true, the more I do in R, the more I wish that it had remained scheme compatible (originally, R was built on a scheme if I remember correctly).

My someday project is `#lang arcket` for Racket, which would allow people to use existing R code, and mix with Racket, with appropriate data.frame data structures and whatnot.


The problem will likely be similar to that with alternative Python implementations - because so many existing libraries in the ecosystem are written in C, an implementation that is not ABI-compatible with them is not attractive to most existing users.


No fundamental reason I know of that the C libraries for R could be used with Racket as well. After all, both R and Racket are able to do the FFI dance. Could the hypothetical R-in-Racket implementation sufficiently mimic the R FFI as to allow all R code to run without modification? Probably, maybe?


That's exactly the problem. R has a very specific API for its extensions - it's not just vanilla C, it's stuff like SEXP.

Although now that I think more about it, it's not quite as bad as Python, because the data structures are mostly opaque pointers (at least until someone uses USE_RINTERNALS - which people do sometimes, even though they're not supposed to), so all operations have to be done via functions, where you can map them accordingly.

You'd also need to emulate R's object lifetime management scheme with Rf_protect etc; but that shouldn't be too difficult, either.

Some more reading on all this:

https://cran.r-project.org/doc/manuals/r-release/R-exts.html...

http://adv-r.had.co.nz/C-interface.html


Oh, yeah, now that you mention it I have seen the SEXP and protect / unprotect stuff before. Maybe a hybrid approach of porting some of the core stuff / popular libraries to Racket's FFI would be more ideal if one were to do this for real.

Maybe aiming for "mostly compatible", with some porting work for a handful of the more popular non-R (C, Rcpp) packages, would yield a better result in the end.


* could not be used


you should see Julia. There's a lot of syntactic sugar that borrows from the best of other languages, for example the pipe |> operator from elixir and do...end block syntax from ruby.

Full unicode is supported. Unicode pi is implemented to mean the pure mathematical entity, so at compile-time it is turned into a memory reference to the most exact possible value.

The metaprogramming in julia is so good I wrote a verilog DSL that transpiles specially written julia into compilable and verifiable verilog - in 3 days.


Yep, I'm aware of Julia. I hope it takes off - it certainly looks a lot better thought out than R, which is very idiosyncratic. So far, though, it's still a fledgling.


I had a similar story. I used R for statistics at college, but only base R, and it is verbose even for basic data manipulation. The scripts I made for my older data blog posts are effectively incomprehensible to me now.

I ended up learning how to use Python/pandas/IPython because I had had enough and wanted a second option on how to do data analysis.

Then the R package dplyr was released in 2013, alleviating most annoyances I had with R. dplyr/ggplot2 alone are strong reasons to stick with the R ecosystem. (not that Python is bad/worse; as I mention in the post, both ecosystems are worth knowing)


Same story. I use data.table and ggplot2, with a couple of dplyr functions, for pretty much all of my plotting and analysis now.


I use both, but R for interactive analysis and reporting, Python for data transformations (ETL).

While the syntax of Python is "cleaner" for backend scripts, R feels more straightforward when working with dataframes (dplyr) resulting in things to report on. The syntax for ggplot2 fits the same category.

As much as having one language for both categories would be nice, using both today seems like a better option.


The thing that makes me sad about R is textmining. TM makes me sad, strings-as-factors makes me very sad. But maybe I'll try tidytext...


Yeah, Python is way, way better for text. And I say that as a long-time R user. R really doesn't like things that can't be represented as datasets.


"No cell block output is ever truncated. Accidentally printing an entire 100,000+ row table to a Jupyter Notebook is a mistake you only make once." Hah, sadly this is not the case for me.


Jupyter now has output rate limitations (I pushed for them to add them), though I think they may be off by default. I also implemented something better (I think), which is an output buffer, for CoCalc's version of collaborative Jupyter notebooks. Instead of rate limiting, CoCalc saves only the last part of the output, discarding earlier output, and provides a link to get it.
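
For anyone curious what the "keep only the last part" idea looks like, here is a minimal sketch using a bounded deque. It is not CoCalc's or Jupyter's actual implementation; the TailBuffer class and its method names are hypothetical, and a real version would presumably store the discarded part somewhere so that a link can retrieve it, where this sketch only counts it.

```python
from collections import deque


class TailBuffer:
    """Keep only the most recent `max_lines` lines of output (hypothetical helper)."""

    def __init__(self, max_lines):
        self._lines = deque(maxlen=max_lines)
        self.dropped = 0

    def write_line(self, line):
        # A full deque silently evicts its oldest entry on append, so count it first.
        if len(self._lines) == self._lines.maxlen:
            self.dropped += 1
        self._lines.append(line)

    def render(self):
        # Prefix a short notice when earlier output was discarded.
        header = [f"[{self.dropped} earlier lines omitted]"] if self.dropped else []
        return header + list(self._lines)


buf = TailBuffer(max_lines=3)
for i in range(100_000):
    buf.write_line(f"row {i}")
shown = buf.render()
```

Memory stays bounded no matter how large the accidentally printed table is, which is the main difference from rate limiting: the cell still finishes and you still see its final lines.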


Wow, CoCalc is really cool.

I've often found myself wishing for a collaborative environment like that... and there it is.

Thank you!


Yeah that sounds like something that should at least be possible to fix pretty easily using css max height and an overflow=scroll setting.


Sage deserves a mention: http://www.sagemath.org/


The big question in data science is: should I spend more time learning Python or R?

The answer is always: math


One of the best things for R is the mailing list. At least when I was learning stats and R, the knowledge both of how to do stuff in R and what math to use when was phenomenal. If gentle people didn't answer in time, Prof Brian Ripley from Oxford would answer early morning British time and explain why your question was wrong and what the math meant and what you really meant to do and why, and then give 3 lines of R to do it.


Which I find leads you to Python, then Julia, then Racket. /s


Can I say how damn good notebooks (any notebooks) are for data exposition compared to traditional coding environments?

I'm more familiar with Jupyter than R Notebooks. I'd second the point about version control in Jupyter being.. hard. There isn't really a good pattern for it yet.
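
One common workaround for the version control problem is to strip outputs before committing, so that diffs only show changes to the code and prose. The sketch below assumes the nbformat 4 layout (a top-level "cells" list where code cells carry "outputs" and "execution_count"); strip_outputs is a made-up helper name, and the example notebook is hand-written for illustration.

```python
import json


def strip_outputs(notebook):
    """Return a copy of an nbformat-4-style notebook dict with outputs cleared."""
    cleaned = json.loads(json.dumps(notebook))  # cheap deep copy of JSON-only data
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny hand-written notebook: one markdown cell and one executed code cell.
example = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 7,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

stripped = strip_outputs(example)
```

The obvious cost is that the committed file no longer shows results, which is exactly the trade-off R Notebooks avoid by keeping the source as plain text and the rendered output in a separate file.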

I would note that I believe the latest version of Jupyter has prettier tables though!

Edit: Also, matplotlib makes me sad. Surely there could be something better which abandons it completely?


Re: matplotlib

You have other options like bokeh and plotly


And for options utilizing grammar of graphics (really worth knowing, even if you're not coming from the R world) you have ggplot and plotnine, which are both sort of ports of ggplot2 from R.


> "grammar of graphics (really worth knowing, even if you're not coming from the R world)"

Can you explain why? I've never gotten the appeal. Besides the concept, the classic implementation (ggplot) does not make nice graphs in my opinion. To me these look wrong... I guess I'm not quite sure why. There is something about it too cookie cutter, cartoonish and information-light: http://r4stats.com/examples/graphics-ggplot2/


ggplot has many built-in themes if you dislike the defaults. theme_minimal() is a newer one which is close to the FiveThirtyEight style and it works great.


It looks better, but I still don't really get what is so great about ggplot. With base R I can pretty much make a chart look however I want based on the context. If it is just something quick and easy for me to check something (where axis labels don't really matter etc), it is like one line of code too.


Can someone here recommend a good practical statistics book? Something with modern methods, but explained sufficiently in depth?


For statistical programming, since we're talking about R, I strongly recommend R for Data Science (http://r4ds.had.co.nz) by Hadley Wickham (who created a large amount of the R packages that are very commonly used [tidyverse] and incidentally also now works for RStudio)

A good book on statistical theory is harder to come by, though.


Introduction to Statistical Learning is free and quite good: http://www-bcf.usc.edu/~gareth/ISL/

Follow it up with Elements of Statistical Learning by three of the same authors for more advanced stuff.


I can thoroughly recommend Elements of Statistical Learning.

It won't teach you much about theoretical statistics, or even things like experiment design, but you will learn a LOT about regression, classification and model fitting which is what everyone seems to want to be able to do these days.


Perhaps not quite on topic, but introductions to statistics which take a Bayesian approach are starting to exist. Like http://xcelab.net/rm/statistical-rethinking/ or perhaps https://github.com/equinn1/MTH225_Spring2016 .


Each field has their own "good practical statistics book". I work in finance and so recommend Fabozzi. It's good, but so are many other foundational texts. Your requirement for practicality necessarily negates a one true answer.


I am currently going through this free book : https://www.openintro.org/stat/textbook.php?stat_book=os


It will depend a lot on your field, but a solid grasp of fundamental probability theory should be applicable everywhere.

I think this is an excellent overview [1]. Learning probability from a measure theory angle is more difficult to grok compared to the frequentist approach everyone is more familiar with, but I found it much more enjoyable. (I learnt the usual way from doing computer science undergrad, but now re-doing it more rigorously for masters in financial engineering)

[1]: http://www.math.uah.edu/stat/index.html


What can you do with the measure theoretic foundations vs the traditional approach? I know graduate classes typically take the measure theoretic approach, but it's never been explained to me why.


To be honest after one semester, I can't eloquently state it without rambling for a few hours, so I'll direct to this excellent HN comment, which does a much better job than I could:

https://news.ycombinator.com/item?id=14286604


What's your background and what exactly do you mean by modern methods? An Introduction to Statistical Learning is good and you can download the pdf: http://www-bcf.usc.edu/~gareth/ISL/ it assumes you have a pretty decent background in mathematics though.


Can someone explain in what sense these notebooks are "reproducible" to a greater extent than just a .py or R file? I'm not that familiar with them. Do they have key metadata or something?


Writing a bunch of scripts can quickly become a mess. I was working on some twitter analysis for a project, and not really worrying about the code because I didn't intend for it to be used again, and it quickly became a mess of "run this script, then run that script on the generated file, then use this shell command to process the file, then run the final analysis step on that file, then clean up all the intermediates". Not to mention, say, "one-time" data cleanup through the shell / REPL that runs into problems months down the line when you want to update the data set. And, of course, invariably none of this is documented. Notebooks don't force you to organize your code and write documentation, but they strongly encourage it.



Really cool article. It still seems like jupyter is the better longterm option because it offers so many different kernels.


The team working on Spyder (the closest Python alternative to RStudio) has had something like R Notebooks on their roadmap for a while now, but it keeps being pushed into the future.

I wish they would use RStudio for a while and understand just how important the feature is for someone using Python for research.


Closest alternative to RStudio (in principle, if not in practice): http://rodeo.yhat.com


I don't see this replacing Jupyter notebooks any time soon, as they simply are better.



