
Even I had that question, especially with people talking about using SFrame as the underlying structure for Julia. But then I saw this:

"Arrow's cross platform and cross system strengths will enable Python and R to become first-class languages across the entire Big Data stack," said Wes McKinney, creator of Pandas.

Code committers to Apache Arrow include developers from Apache Big Data projects Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu (incubating), Parquet, Phoenix, Spark, and Storm as well as established and emerging Open Source projects such as Pandas and Ibis.

This is pretty much nuke-from-orbit.




> This is pretty much nuke-from-orbit

That analogy might imply overkill, which would actually highlight the tactical advantages of the SFrame approach for, say, processing a month's worth of daily-generated 1-10GB SQLite files.


Well, no. It's the same situation as git vs. Mercurial: one may be better than the other (as you mentioned), but it really doesn't matter.

If Pandas and other high-profile products are endorsing it (and may adopt it), it's going to be very hard for 99% of people to choose something else.


Python's Dask out-of-core dataframe can also do that.


Dask's out-of-core dataframes are just a thin wrapper around pandas dataframes (aided by the recent improvement in pandas that releases the GIL for a number of operations).


Uh, no they are not. They lazily scale pandas to on-disk and distributed files.

http://dask.pydata.org/en/latest/dataframe.html

"Dask dataframes look and feel like pandas dataframes, but operate on datasets larger than memory using multiple threads."

http://blaze.pydata.org/blog/2015/09/08/reddit-comments/
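
To make that concrete, here's a minimal sketch of the kind of thing the docs describe (the file glob and column names are made up):

    import dask.dataframe as dd

    # Builds a lazy task graph over many CSVs; nothing is loaded into memory yet.
    df = dd.read_csv("data/2015-*.csv")

    # Same API as pandas; the work is split across partitions and run with
    # multiple threads only when .compute() is called.
    result = df.groupby("user_id").value.mean().compute()
    print(result)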


Why doesn't Pandas have anything to save the entire workspace to disk (like .RData)? There are all these cool file formats - Castra, HDF5, even vanilla pickle - but I don't see anything that does a one-shot save of the workspace (something like Dill).

Is this an antipattern for Pandas?
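
The closest I've managed is just pickling a dict of the frames I want back next session - a rough stand-in, not a real pandas feature (the names here are made up):

    import pickle
    import pandas as pd

    df1 = pd.DataFrame({"a": [1, 2, 3]})
    df2 = pd.DataFrame({"b": [4.0, 5.0]})

    # "Save the workspace": pick the objects by name and dump them together.
    with open("workspace.pkl", "wb") as f:
        pickle.dump({"df1": df1, "df2": df2}, f)

    # Next session: load and unpack.
    with open("workspace.pkl", "rb") as f:
        restored = pickle.load(f)
    df1, df2 = restored["df1"], restored["df2"]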


You haven't refuted anything I said. Internally, dask dataframe operations sit on top of pandas dataframes. All dask does is automatically handle the chunking into in-memory pandas dataframes and interpret dask workflows as a series of pandas operations.
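
Easy enough to check - every partition you pull out of a dask dataframe comes back as a plain pandas DataFrame (the path is illustrative):

    import dask.dataframe as dd

    ddf = dd.read_csv("data/2015-*.csv")

    # Materialize a single chunk: it's an ordinary pandas DataFrame.
    first_chunk = ddf.get_partition(0).compute()
    print(type(first_chunk))  # <class 'pandas.core.frame.DataFrame'>

    # map_partitions hands each in-memory pandas chunk to your function.
    print(ddf.map_partitions(len).compute())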



