dplyr and related packages use the existing R data frame class. (A "tibble" is j...

creddit · on March 14, 2021

> data.table is a completely separate implementation of a data structure that is functionally similar to a data frame but designed from the ground up for efficiency, though with some compromises, such as eschewing R's typical copy-on-modify paradigm.

This is totally false. data.table inherits from data.frame. Sure, it has some extra attributes that a tibble doesn’t but the way classing works in R is so absurdly lightweight, that’s meaningless in comparison. Both tibble and data.table are data.frames at their core which are just lists of equal length vectors. You can pass a data.table wherever you pass a data.frame.

rcthompson · on March 14, 2021

Thank you for the correction. I knew that tibbles were essentially just data frames with an extra class attribute, but for some reason I didn't realize this was also true of data.table. I think assumed that data.table's reference semantics couldn't be implemented on top of the existing data frame class, but I guess I'm wrong about that. Unfortunately it's too late for me to edit my original comment.

_fnhr · on March 14, 2021

Tibbles are not just data frames with extra class attribute. For one - they don't have row names. Second, consider this example, demonstrating how treating tibbles as data frames can be dangerous:

    df_iris <- iris
    tb_iris <- tibble(iris)

    nunique <- function(x, colname) length(unique(x[,colname]))

    nunique(df_iris, "Species")
    > 3

    nunique(tb_iris, "Species")
    > 1

R-devel mailing list had a long discussion about this too: https://stat.ethz.ch/pipermail/r-package-devel/2017q3/001896...

rcthompson · on March 14, 2021

Ok, fine, to be more precise, tibbles and data frames and data tables are all implemented as R lists whose elements are vectors which form the columns of the table. And also `is.data.frame` currently returns TRUE for all of them, whether or not that is ultimately correct.

lordgroff · on March 14, 2021

dtplyr, the dplyr backend for data table is still IMHO not great, and will often break in subtle and not so subtle ways. Tidytable is, I think, a much more interesting implementation, and gets close to the same speeds.

rcthompson · on March 14, 2021

Hmm, this looks very interesting! I've ended up preferring dplyr for it's expressiveness in spite of the speed difference, so this might be a nice compromise for when dplyr gets too slow.

orhmeh09 · on March 14, 2021

Oh, I know that, I use it daily and I’ve read some of its source code. I’m just astonished that the best-performing data frame library in the world is developed in R and it outperforms engines written with million/billion dollar companies behind it.

hugh-avherald · on March 14, 2021

data.table is written primarily in C. But R happens to have a very good package system and a very good interface to C code.

And Matt Dowle has bled for that C code.

dm319 · on March 14, 2021

I feel like some of it is to do with the way R's generics work - being lisp-based and making use of promises. It allows for nice syntax / code while interfacing the C backend.