
Even I had that question, especially with people talking about using SFrame as the underlying structure for Julia. But then I saw this:

"Arrow's cross platform and cross system strengths will enable Python and R to become first-class languages across the entire Big Data stack," said Wes McKinney, creator of Pandas.

Code committers to Apache Arrow include developers from Apache Big Data projects Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu (incubating), Parquet, Phoenix, Spark, and Storm as well as established and emerging Open Source projects such as Pandas and Ibis.

This is pretty much nuke-from-orbit.




> This is pretty much nuke-from-orbit

That analogy might imply overkill, which would actually highlight the tactical advantages of the SFrame approach for, say, processing a month's worth of daily-generated 1-10GB SQLite files.


Well, no. It's the same situation as git vs. Mercurial: one may be better than the other (as you mentioned), but it really doesn't matter.

If Pandas and other high-profile products are endorsing it (and may adopt it), it's going to be very hard for 99% of people to choose something else.


Python's Dask out-of-core dataframe can also do that.


Dask's out-of-core dataframes are just a thin wrapper around pandas dataframes (aided by the recent improvement in pandas that releases the GIL for a number of operations).


Uh, no they are not. They lazily scale pandas to on-disk and distributed files.

http://dask.pydata.org/en/latest/dataframe.html

"Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads."

http://blaze.pydata.org/blog/2015/09/08/reddit-comments/
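
To make that concrete, here's a minimal sketch of the kind of thing the docs describe (the file glob and column names are made up):

    import dask.dataframe as dd

    # Builds a lazy task graph over many CSVs; nothing is loaded into memory yet.
    df = dd.read_csv("data/2015-*.csv")

    # Same API as pandas; the work is split across partitions and run with
    # multiple threads only when .compute() is called.
    result = df.groupby("user_id").value.mean().compute()
    print(result)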


Why doesn't Pandas have anything to save the entire workspace to disk (like .RData)? There are all these cool file formats - Castra, HDF5, even vanilla pickle - but I don't see anything that does a one-shot save of the workspace (something like Dill).

Is this an antipattern for Pandas?
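
The closest I've managed is just pickling a dict of the frames I want back next session - a rough stand-in, not a real pandas feature (the names here are made up):

    import pickle
    import pandas as pd

    df1 = pd.DataFrame({"a": [1, 2, 3]})
    df2 = pd.DataFrame({"b": [4.0, 5.0]})

    # "Save the workspace": pick the objects by name and dump them together.
    with open("workspace.pkl", "wb") as f:
        pickle.dump({"df1": df1, "df2": df2}, f)

    # Next session: load and unpack.
    with open("workspace.pkl", "rb") as f:
        restored = pickle.load(f)
    df1, df2 = restored["df1"], restored["df2"]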


You haven't refuted anything I said. Internally, dask dataframe operations sit on top of pandas dataframes. All dask does is automatically handle the chunking into in-memory pandas dataframes and interpret dask workflows as a series of pandas operations.
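
Easy enough to check - every partition you pull out of a dask dataframe comes back as a plain pandas DataFrame (the path is illustrative):

    import dask.dataframe as dd

    ddf = dd.read_csv("data/2015-*.csv")

    # Materialize a single chunk: it's an ordinary pandas DataFrame.
    first_chunk = ddf.get_partition(0).compute()
    print(type(first_chunk))  # <class 'pandas.core.frame.DataFrame'>

    # map_partitions hands each in-memory pandas chunk to your function.
    print(ddf.map_partitions(len).compute())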



