
dplyr and related packages use the existing R data frame class. (A "tibble" is just a regular R data frame under the hood.) This means that they inherit all the performance characteristics of regular R data frames. data.table is a completely separate implementation of a data structure that is functionally similar to a data frame but designed from the ground up for efficiency, though with some compromises, such as eschewing R's typical copy-on-modify paradigm. There are other more subtle reasons for the differences, but that's the absolute simplest explanation.

Supposedly you can use data.tables with dplyr, but I haven't experimented with it in depth.




> data.table is a completely separate implementation of a data structure that is functionally similar to a data frame but designed from the ground up for efficiency, though with some compromises, such as eschewing R's typical copy-on-modify paradigm.

This is totally false. data.table inherits from data.frame. Sure, it has some extra attributes that a tibble doesn’t, but the way classing works in R is so absurdly lightweight that the difference is meaningless in comparison. Both tibble and data.table are data.frames at their core, which are just lists of equal-length vectors. You can pass a data.table wherever you pass a data.frame.
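A quick way to check the inheritance claim yourself, assuming the data.table and tibble packages are installed. `col_types` is a made-up helper standing in for "any function written with only data.frame in mind":

```r
library(data.table)
library(tibble)

dt <- as.data.table(iris)
tb <- as_tibble(iris)

class(dt)                    # "data.table" "data.frame"
class(tb)                    # "tbl_df" "tbl" "data.frame"
inherits(dt, "data.frame")   # TRUE

# Hypothetical helper that only knows about data.frame
col_types <- function(df) {
  stopifnot(is.data.frame(df))
  vapply(df, function(col) class(col)[1], character(1))
}

# The data.table passes straight through and gives the same answer
identical(col_types(iris), col_types(dt))   # TRUE
```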


Thank you for the correction. I knew that tibbles were essentially just data frames with an extra class attribute, but for some reason I didn't realize this was also true of data.table. I think I assumed that data.table's reference semantics couldn't be implemented on top of the existing data frame class, but I guess I was wrong about that. Unfortunately it's too late for me to edit my original comment.
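For anyone who wants to see the reference semantics sitting on top of the data.frame class, here is a minimal sketch (assuming data.table is installed; the variable names are only for illustration). Plain assignment does not copy a data.table, so `:=` through an alias changes the original, while the same pattern on a data.frame triggers copy-on-modify:

```r
library(data.table)

# data.frame: modifying the alias copies, the original is untouched
df  <- data.frame(x = 1:3)
df2 <- df
df2$y <- df2$x * 2
"y" %in% names(df)     # FALSE

# data.table: `:=` updates by reference, so the alias and the original agree
dt  <- data.table(x = 1:3)
dt2 <- dt
dt2[, y := x * 2]
"y" %in% names(dt)     # TRUE

# copy() is the explicit way to opt out of sharing
dt3 <- copy(dt)
dt3[, z := 0L]
"z" %in% names(dt)     # FALSE

inherits(dt, "data.frame")   # still TRUE
```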


Tibbles are not just data frames with an extra class attribute. For one, they don't have row names. Second, consider this example, which shows how treating tibbles as data frames can be dangerous:

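    library(tibble)

    df_iris <- iris
    tb_iris <- as_tibble(iris)

    # Count distinct values in one column, selected with `[`
    nunique <- function(x, colname) length(unique(x[, colname]))

    # data.frame: `[` drops a single column to a plain vector
    nunique(df_iris, "Species")
    #> [1] 3

    # tibble: `[` keeps a one-column tibble, so length() counts columns
    nunique(tb_iris, "Species")
    #> [1] 1

The R-package-devel mailing list had a long discussion about this too: https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...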
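One way around the difference, as a sketch: select the column with `[[` instead of `[`, since `[[` returns the underlying vector for both classes. `nunique_safe` is just an illustrative name, not something from either package:

```r
library(tibble)

# `[[` extracts the column vector from a data.frame and a tibble alike
nunique_safe <- function(x, colname) length(unique(x[[colname]]))

nunique_safe(iris, "Species")              # 3
nunique_safe(as_tibble(iris), "Species")   # 3
```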


Ok, fine, to be more precise, tibbles and data frames and data tables are all implemented as R lists whose elements are vectors which form the columns of the table. And also `is.data.frame` currently returns TRUE for all of them, whether or not that is ultimately correct.
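Roughly what that looks like in a session, assuming tibble and data.table are installed (the `tables` list is only there to run the same checks on all three):

```r
library(data.table)
library(tibble)

tables <- list(
  data.frame = iris,
  tibble     = as_tibble(iris),
  data.table = as.data.table(iris)
)

sapply(tables, typeof)          # "list" for all three
sapply(tables, is.data.frame)   # TRUE for all three

# Every column within a table has the same length
sapply(tables, function(tbl) length(unique(lengths(tbl))) == 1)
```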


dtplyr, the dplyr backend for data.table, is still IMHO not great, and will often break in subtle and not-so-subtle ways. tidytable is, I think, a much more interesting implementation, and gets close to the same speeds.
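For context, this is roughly what the dtplyr workflow looks like, assuming dplyr, dtplyr and data.table are installed. It is only a sketch of the basic pattern and says nothing about the edge cases where it breaks:

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Wrap a data.table so dplyr verbs are translated lazily
lazy_iris <- lazy_dt(as.data.table(iris))

query <- lazy_iris %>%
  group_by(Species) %>%
  summarise(mean_petal = mean(Petal.Length))

show_query(query)            # prints the generated data.table call

# Nothing is computed until the result is collected
result <- as_tibble(query)
```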


Hmm, this looks very interesting! I've ended up preferring dplyr for its expressiveness in spite of the speed difference, so this might be a nice compromise for when dplyr gets too slow.


Oh, I know that, I use it daily and I’ve read some of its source code. I’m just astonished that the best-performing data frame library in the world is developed in R and it outperforms engines written with million/billion dollar companies behind it.


data.table is written primarily in C. But R happens to have a very good package system and a very good interface to C code.

And Matt Dowle has bled for that C code.


I feel like some of it has to do with the way R's generics work, being Lisp-based and making use of promises. It allows for nice syntax and code while interfacing with the C backend.
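A small base-R sketch of the promise part, as I understand it: arguments arrive unevaluated, so a function can grab the expression with `substitute()` and evaluate it against the table's columns. `filter_rows` is a made-up toy, not how data.table or dplyr actually implement it:

```r
# Toy row filter: the caller writes a bare expression over column names
filter_rows <- function(df, condition) {
  expr <- substitute(condition)            # capture the unevaluated argument
  keep <- eval(expr, df, parent.frame())   # columns first, then the caller's scope
  df[keep, , drop = FALSE]
}

big_setosa <- filter_rows(iris, Species == "setosa" & Sepal.Length > 5.5)
```

That same trick is what lets an interface accept `Species == "setosa"` without quoting or `iris$` prefixes.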



