Boost Your Data Munging with R (jangorecki.github.io)
81 points by michaelsbradley on July 22, 2016 | 36 comments



While data.table does have impressive benchmarks, the rise of dplyr in the years since, with its significantly more readable and concise syntax, makes life a lot easier for data sets that are not super large. (https://cran.rstudio.com/web/packages/dplyr/vignettes/introd...)

The vignette linked above describes some compatibility between dplyr syntax and data.table, but I admit I have not tested that.
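
To illustrate the readability argument, here's a pipeline in the style of that vignette (my own toy example using the nycflights13 data the vignette is built around, not the vignette's exact code):

    library(dplyr)
    library(nycflights13)

    # which carriers have the worst average departure delays?
    flights %>%
      filter(!is.na(dep_delay)) %>%
      group_by(carrier) %>%
      summarise(mean_delay = mean(dep_delay), n = n()) %>%
      arrange(desc(mean_delay))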

Nowadays, if you are working with multigigabyte datasets, it may be worth looking into Spark/SparkR (https://spark.apache.org/docs/latest/sparkr.html) for manipulating big data in a scalable manner instead, as there is feature parity with data.table + analytical tools [GLM] as a bonus. (The syntax is still messy, to my annoyance)
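
For a flavour of what that looks like, a rough SparkR sketch (Spark 2.x-style session API; the S3 path and column names are made up):

    library(SparkR)
    sparkR.session()   # Spark 2.x entry point; 1.x used sparkR.init() instead

    # hypothetical parquet dataset already sitting in S3
    df <- read.df("s3://my-bucket/events.parquet", source = "parquet")

    agg <- summarize(groupBy(df, df$user),
                     total = sum(df$amount))
    head(arrange(agg, desc(agg$total)))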


You can rent an r3.8xlarge instance with 244GB RAM on AWS for $2.66/hr. If that isn't enough, an x1.32xlarge instance with ~2TB RAM is $13.34/hr. Assuming your data is already in AWS, of course. Of course, those machines also have a crap ton of CPU power (I'm explicitly not saying cores because I'm mystified by what vCPU really means) which you will be hard-pressed to take advantage of, but if it's RAM you need, they're super-cheap.

If you prefer Azure, a G5 instance with 448GB RAM is available for $9.65/hr.

Not a lot of need for Spark/SparkR for the vast majority of data sets given the cheapness of compute these days.


The CPU power would not be too hard to utilize, if one's task is amenable to fairly coarse-grained parallelization, which is rather easy to implement in R scripts with the help of foreach[1] and doFuture[2].

Such an approach worked to great effect for me recently, when I needed to perform a `zoo::rollapply`[3] across a time series with tens of millions of rows. The speedup when throwing more cores (7, on my laptop) at it is roughly linear. If I ever need to scale up the analysis to hundreds of millions of rows, the 128 vCPUs of an x1 EC2 instance would be well worth the $$/hour.

[1] https://cran.r-project.org/web/packages/foreach/index.html

[2] https://cran.r-project.org/web/packages/doFuture/index.html

[3] http://www.rdocumentation.org/topics/1505094
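
For anyone curious, the shape of that approach is roughly the following. This is a sketch under my own assumptions, not the actual analysis; the window width and statistic are placeholders.

    library(zoo)
    library(foreach)
    library(doFuture)

    registerDoFuture()               # route %dopar% through the future framework
    plan(multisession, workers = 7)  # seven local workers, as on the laptop above

    x     <- rnorm(1e6)   # stand-in for the real multi-million-row series
    width <- 500          # placeholder window size
    f     <- sd           # placeholder window statistic

    chunks <- parallel::splitIndices(length(x), 7)  # contiguous index blocks, one per worker

    res <- foreach(i = chunks, .combine = c, .packages = "zoo") %dopar% {
      # pad each chunk backwards by (width - 1) points so every window that
      # *ends* inside the chunk is complete; the first chunk has nothing to pad
      lo <- max(1L, min(i) - width + 1L)
      rollapply(x[lo:max(i)], width = width, FUN = f, align = "right")
    }
    # res should match rollapply(x, width, f, align = "right") on the full series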


Shameless self promotion: my `slide_apply` function shows 5-10x improvement over `zoo::rollapply` (despite no Rcpp call, somehow...)

https://gist.github.com/stillmatic/fadfd3269b900e1fd7ee

If your function has a well-defined rolling form, e.g. a rolling mean or standard deviation, you should use a pre-optimized function for that. The `caTools` package has a bunch of useful ones.
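
For example (my own toy comparison, not from the gist above), a rolling mean via caTools versus the generic rollapply:

    library(zoo)
    library(caTools)

    x <- rnorm(1e6)
    k <- 101

    slow <- rollapply(x, width = k, FUN = mean, align = "center", fill = NA)
    fast <- runmean(x, k = k, endrule = "NA", align = "center")   # optimized C routine

    # same numbers up to floating-point error, but runmean is far faster
    all.equal(as.numeric(slow), fast)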


Thanks! As it happens, my function doesn't conform to any pre-optimized facilities, though I wish it did. At some point, I may look into writing an optimized variation, with the help of Rcpp, or even using the (inspiring!) Fortran approach of quantmod/xts.


vCPU is a hyperthread


The syntax is very concise, and admittedly cryptic, but what is messy about dt[i, j, by]?

data.table transformed the way I do analysis. Honestly, probably 75% of my lines are dt[] calls. It allows me to routinely analyze 500M-row data sets interactively on a powerful workstation.
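
For anyone who hasn't used it, the dt[i, j, by] form packs filtering, computation, and grouping into one call (toy example with made-up columns):

    library(data.table)

    dt <- data.table(user   = sample(letters, 1e6, TRUE),
                     amount = runif(1e6),
                     year   = sample(2010:2016, 1e6, TRUE))

    # i: which rows, j: what to compute, by: the grouping
    dt[year >= 2014,
       .(total = sum(amount), n = .N),
       by = user]

    # update by reference (no copy of the table), grouped by year
    dt[, amount_z := (amount - mean(amount)) / sd(amount), by = year]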


How much memory does a 500m row data set use in data.table?


While I like and use dplyr from time to time, I wish it could handle POSIXlt datetimes. Also, it seems to barf on some data frames just because the df contains a column type dplyr doesn't support, even if that column isn't used in the processing.


Just use lubridate; it's a lot less of a headache.
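
The usual workaround (a sketch, assuming the timestamps arrive as strings) is to parse to POSIXct with lubridate before the data frame ever hits dplyr:

    library(dplyr)
    library(lubridate)

    df <- data.frame(id = 1:3,
                     stamp = c("2016-07-22 10:00:00",
                               "2016-07-22 11:30:00",
                               "2016-07-22 12:45:00"))

    df %>%
      mutate(stamp = ymd_hms(stamp)) %>%   # ymd_hms() returns POSIXct, which dplyr handles fine
      filter(stamp > ymd_hms("2016-07-22 11:00:00"))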

My beef is with the new tibble-ish format of dplyr. The format doesn't support lists as cells, unlike data frames. Right now, `tbl_df` is a good in-between, but since dplyr is converging on the tibble/feather format, it's gonna be rough.


Nowadays I almost always use both data.table and dplyr together in my projects. It's a really expressive combination.


Do you use data.table syntax, or do you use dplyr with the tbl_dt class to abstract that away?

I was a data.table acolyte for years, but after being forced to learn dplyr I can't imagine going back. I'm hoping if I need update-by-reference in the future I can use tbl_dt but I haven't played with it yet.


If I'm using them together, I'll start off a pipeline with a big DT expression, then pipe it through a few dplyr verbs as a kind of "post processing."

I find that some tasks are frustrating with one package but easy with the other, so more frequently I just have some "dplyr lines" and some "data.table lines" in my scripts.

And the compatibility between the two has improved a lot, so you don't get so many "invalid .internal.selfref" warnings or as much unexpected behavior when you combine them in pipelines.
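
Concretely, that pattern looks something like this (illustrative columns and numbers):

    library(data.table)
    library(dplyr)

    dt <- data.table(region = sample(c("N", "S", "E", "W"), 1e6, TRUE),
                     sales  = runif(1e6, 0, 100))

    dt[, .(total = sum(sales), n = .N), by = region] %>%  # heavy lifting in data.table
      arrange(desc(total)) %>%                            # light dplyr post-processing
      mutate(share = total / sum(total))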


Do any of these support out-of-memory data? IIRC dplyr can use a db as a backend, but I could be wrong. Anyway, I wonder: why not just use a db and SQL for more demanding data munging?
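
For reference, the db backend usage I'm thinking of looks roughly like this with 2016-era dplyr (sketch from memory; the file and table names are made up):

    library(dplyr)

    db    <- src_sqlite("sales.sqlite3")   # hypothetical existing SQLite file
    sales <- tbl(db, "sales")              # lazy reference; nothing is loaded into RAM

    sales %>%
      filter(year == 2016) %>%
      group_by(region) %>%
      summarise(total = sum(amount)) %>%
      collect()   # the query runs inside the db; only the small result comes back to R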


Presumably that depends at least on the type of data and whether you have a suitable database to use. As another approach, for R there's an interface to parallel NetCDF in the pbdR system that you might use for distributed processing on an HPC system. There doesn't seem to be anything like pbdR for Python, for instance.


I routinely do my data munging in SQL/Hive/Pig. I think R is more of a complement to that: mainly for data munging that needs programmatic manipulation and, of course, for advanced statistical analysis.


Understanding how to convert wide-format data to long-format data (a process called 'reshaping') is very important. Certain statistical tests cannot be performed unless the data is in long format.

The article mentions reshape2. I also love tidyr. It's much easier to use than reshape2 when dealing with a large number of variables (i.e., columns).


I find tidyr more frustrating than useful a lot of the time.


Interesting, I'll have to watch out for that then. I found it more helpful than reshape2 when trying to convert a dataset with 55 variables from wide to long format. 8 of those variables needed to be collapsed into 1 variable. With reshape2, I had to identify all the ones I didn't want in the final result; with tidyr, I just needed to identify the ones I wanted. Hence it was easier with tidyr.
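
A toy version of that difference (made-up column names, and far fewer columns than my real 55): with reshape2 you name the id columns you're keeping fixed, with tidyr you name only the columns you're collapsing.

    library(reshape2)
    library(tidyr)

    wide <- data.frame(id = 1:3,
                       score_2014 = rnorm(3),
                       score_2015 = rnorm(3),
                       score_2016 = rnorm(3))

    # reshape2: list the columns that stay fixed (id.vars); everything else gets melted
    long_r2 <- melt(wide, id.vars = "id",
                    variable.name = "year", value.name = "score")

    # tidyr: list only the columns you want collapsed into key/value pairs
    long_tr <- gather(wide, key = "year", value = "score", score_2014:score_2016)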


I went from barely using data.table to only using data.table for basically everything in less than a few years. I think this is the trend given it's faster than basically everything: https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A...


In Python you have pandas, and Python is great for text processing in general. Is there anything more that R provides for data munging?


It's a bit lame, but I am so in love with RStudio... I can't think of a Python IDE that comes close for data science. Simple things like being able to see all the entities you've created in your session and being able to jump back and forth through plots are really nice.


Since the comment from the new account mentioning it may be modded away, I'll say (as a recent user, not affiliated with them in any way) that Rodeo is a fairly robust integrated analysis environment, a more immediate kind of Jupyter notebook. They recently added a native dataframe viz, too.

http://blog.yhat.com/posts/rodeo-2.0.6-dataframes.html



Also a silly reason: for me the text rendering in Rodeo on a Mac without a retina display is a lot worse than the text rendering in RStudio.


I am with you on this. While I feel like I should eventually move fully into Python for data, since it's a more multipurpose language, I can never seem to make the transition because Python does not have an RStudio equivalent.

I have tried yhat's Rodeo and also yhat's port of ggplot for Python, but it just isn't there yet in my opinion.


For Python you have something similar, though not quite as good, with Spyder (https://github.com/spyder-ide).

I use both heavily in my daily work.


I switched from Python/pandas to R because I was more interested in applying algorithms/methods than in developing them. Python is fine if you want to develop an algorithm or implement a particular method. But if you are more interested in applying them, the availability of R packages to do so is hands-down the best. Anything you may want to do, there is most probably already an R package for it.


This really strongly depends on your industry/focus, and what your end goal is. Some industries are completely dominated (library/community wise) by one or the other. I used to do finance work and strongly preferred python, so I tried hard to use it. (This was also before/right when pandas was out.) I was constantly plagued by needing some fitting routine e.g. for a vector GARCH model, and there was a package in R just sitting there. I was able to get very good mileage out of the RPy2 interface, though.

On the flip side, I've done a fair swath of work in the machine learning arena, and in particular the deep learning topics before it was called 'deep learning', and it was nearly hopeless to use anything except Python or MATLAB (both strongly tied in with C++/CUDA libraries). I think this is still largely the case.

As others have mentioned here, for me the biggest pull away from R was that it's not general purpose. Hard to ship someone R code. Hard to throw a web framework in front of your code. Hard to build a rich desktop GUI on top of it. I know you can do most of those things in R, but last time I dealt with it there was a massive ravine in usability/maturity. I'd also be insincere if I didn't admit that the pervasive R coding style just drove.me.fucking.crazy. That and I found that while there was a mindbogglingly large pile of libraries, documentation was usually very lacking.

/rant


One way to get R more production worthy is to use things like PMML to scale up models.


Purpose-built syntax. Pandas feels like a hack by comparison.


One of the aspects of data munging is visualization, which is much easier in R with ggplot. Matplotlib is painful and always requires more than double the code to accomplish the same thing.
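
For instance (my own trivial example), a grouped scatter with per-group fits is a few lines of ggplot2:

    library(ggplot2)

    # grouped scatter plus a linear fit per group, using the built-in mtcars data
    ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
      geom_point() +
      geom_smooth(method = "lm", se = FALSE)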


There are a lot more options in Python for data viz than matplotlib. pandas has several built-ins as well, so in many cases simple plots on existing data frames can be one-liners.


R is generally better for data science, so it's nice to be able to use 1 ecosystem for a whole project. But yes, I too have in the past used python for data munging and R for machine learning.


R has a robust package ecosystem for data analysis, even better than Python's suite.

ETL is just one step of the pipeline.


The only good solution to ETL/munging/wrangling is streaming. Loading everything into memory is not a good idea.




