Pandas 0.4 (Python data analysis library) released (sourceforge.net)
84 points by wesm on Sept 12, 2011 | 11 comments



This looks really well done.

I wonder what the motivation is to do this when R is so mature (especially in the availability of specialized packages), and available through RPy.


Here's an article I wrote about some of my motivations on the Python side of things: http://news.ycombinator.com/item?id=2790762

The bigger-picture reason "why not R" is that R is not very suitable for building production systems. I started building this library while working for AQR, a quant hedge fund, and needed to have statistical computing building blocks integrated with a much larger system. R is a mediocre programming language and has very weak general-purpose libraries, though it does have amazingly good data visualization and mature statistics libraries. Using R as a black box (e.g. via RPy, Rcpp, or RJava) is a good idea in theory, but recovering from and dealing with errors/exceptions with real-world data is a very thorny problem. Plus, maintaining a big pile of R code is kind of a nightmare (believe me, been there, done that!).


On a related note, I've said exactly the same thing about Matlab to people before. It (Matlab) is good for getting the algorithms and calculations correct, but as a programming language, it's pretty terrible.


Exactly. I use scipy, and only use R when forced to, mainly because I want to know what's going on, and write my analyses in a real language.


I've had to do this with SAS and perl.

Looking forward to checking this out.


I also want to point out that doing statistics in Python is currently a chicken-and-egg problem. You need the foundation of statistical data structures and algorithms to enable people to implement their models and research. NumPy and SciPy by themselves just don't compare with base R. So I've been trying to a) fill that gap and b) build tools that are considerably better than R's. I use R myself, but I have no intention of "copying" or "replicating" R; I'd rather figure out what the fundamental data manipulation / statistical computing challenges are and build tools that address them.


Claiming "DataFrame" to be two-dimensional almost caused me to discount the entire project. On the contrary, it seems like they are actually multi-dimensional (since they are essentially OLAP relations) and you should advertise them as such. That they can be viewed as two-dimensional tables is incidental.


Thinking of DataFrame as a 2D structure just aids mental visualization. With hierarchical indexing they can be arbitrarily high-dimensional, so maybe I should sell it a bit more like that. In an OLAP setting the "sparse" format may often be better than the "dense" (truly N-D) format.

It would be an interesting avenue to pursue building a "big data" on-disk OLAP engine with pandas-like semantics (e.g. expressing groupby operations with the same syntax but operating on big data on disk or across a cluster of computation nodes).
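For readers unfamiliar with the groupby syntax mentioned here, a minimal in-memory sketch follows. It assumes a recent pandas release (the API may differ from the 0.4 release discussed in this thread), and the column names and numbers are invented for illustration. It says nothing about how an on-disk or clustered engine would actually be built.

```python
import pandas as pd

# Hypothetical toy data; the column names and values are made up.
trades = pd.DataFrame(
    {
        "desk": ["rates", "rates", "credit", "credit", "credit"],
        "notional": [10.0, 5.0, 2.0, 3.0, 4.0],
    }
)

# Split the rows by "desk", then sum "notional" within each group.
totals = trades.groupby("desk")["notional"].sum()

print(totals)
```

The idea floated above is that this same expression could, in principle, be evaluated against data that does not fit in memory.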


No, they are multidimensional even without hierarchical indexing. e.g.:

x  y  z  value
0  0  0  32
0  0  1  64
0  1  0  23
0  1  1  3.14
1  0  0  4.3
etc.

This is essentially a 3D cube of data, no hierarchical indexing involved. The benefit of hierarchical indexing is that you can wrap your spatial dimensions into a single real dimension for e.g. code abstraction.
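A rough sketch of both views of that data, assuming a recent pandas release (calls such as `set_index` and `unstack` may not match the 0.4 API discussed in this thread). The five rows are the ones quoted above. The variable names are invented for illustration.

```python
import pandas as pd

# The five example rows, in "long" (one record per cell) form.
records = pd.DataFrame(
    {
        "x": [0, 0, 0, 0, 1],
        "y": [0, 0, 1, 1, 0],
        "z": [0, 1, 0, 1, 0],
        "value": [32, 64, 23, 3.14, 4.3],
    }
)

# Fold the three coordinate columns into a single hierarchical index.
cube = records.set_index(["x", "y", "z"])["value"]

# Look up one cell of the cube by its (x, y, z) coordinates.
cell = cube.loc[(0, 1, 1)]

# Pivot the z level into columns for a denser 2D view.
# Cells absent from the long form show up as NaN.
dense = cube.unstack("z")

print(cell)
print(dense)
```

The long form stores only the cells that exist, while the unstacked form has to pad the missing ones. That is one way to read the "sparse" versus "dense" distinction made earlier in the thread.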

I have actually been developing a similar library for OCaml (even with hierarchical indexing). It is good to see our libraries share many of the same ideas! I wonder though, have you considered GPU acceleration? AFAIK neither Matlab nor R does this natively yet.


You can also catch wesm talk about the pandas package at AOL HQ in New York this Wednesday. I'll be there, ready to find an alternative to R.

http://www.meetup.com/nyhackr/events/28880161/?hidePromoBar=...


Thanks for your work on this, it looks great. I'd love to use python end-to-end. Any plans to support some parallelism?



