Python for Data Analysis, 3rd Edition (wesmckinney.com)
271 points by mariuz on July 2, 2022 | 45 comments



The first edition of this book was a game changer. I still have my paperback copy but haven't turned to it in years; it's probably largely out of date, but I look at it fondly. Glad to see the material updated with new APIs and Python 3.10. The open-access version is very nice because it can easily be used for teaching and reference.

I would like to see some of Wes' cautionary tales included in the book. Pandas feels so magical at times that you can use it to solve anything, and that can create problems. Ideas like this one would make a good appendix:

https://wesmckinney.com/blog/apache-arrow-pandas-internals/

Lots of tools in the data space; it feels like choosing JS web front ends. I use pandas quite a lot, but less frequently these days: as my data has gotten larger, I turn to Spark more. I'm also impressed by the work DuckDB is doing.


I switched from pandas to polars[0]. I'm happy with the performance gain. You still need Spark/dask on "big" data, but polars gives you more room before reaching for those solutions.

[0] -- https://www.pola.rs/


Glad it works for you, but this sounds like bad advice for most people... why change to an outlier toolchain for some percentage increase in performance and raw data capacity, when you admit yourself that many big data sets still won't fit? No computer on my desk has less than 32GB RAM now -- pandas works well with that.


I think that neither "outlier toolchain" nor "some percentage increase" is fair. This benchmark[0] shows a significant speedup while lowering memory needs. You still need to reach for dask/spark for really big data, where you need a cluster of beefy computers for your tasks.

If you use an r5d.24xlarge-like[1] instance, you can skip spark/dask for most workflows, as 768 GB is plenty. On top of that, polars will efficiently use the 96 available cores when computing your joins, group-bys, etc.

Also, polars is getting more and more popular[2].

[0] -- https://h2oai.github.io/db-benchmark/
[1] -- https://aws.amazon.com/fr/ec2/instance-types/c6a/
[2] -- https://star-history.com/#pola-rs/polars&Date


It's not so much larger data as it is analyst velocity.

Polars runs orders of magnitude faster than pandas, which means EDA can be completed more quickly.


I think it depends on whether you use it for operations or for data analysis. Speed is only one concern, and it may not always be the most relevant one.

A statistician/data scientist wrangling data and making plots wouldn't care whether loading a CSV file takes one second or one microsecond, because they may only do it a handful of times per project.

A data engineer has different requirements and expectations. They may need to implement an operational component that processes CSV files billions of times a day.

If your use case is the latter, then pandas is probably not for you.


1 s vs 1 ms is not a great comparison.

Polars excels when pandas operations take 30 seconds or a minute to complete. Bringing that time down to the second or millisecond mark is really amazing.


I love pandas and work with quite small datasets for EDA (10^3 to 10^6 most of the time), but even then I run into slowdowns. I don't really mind, as I'm pursuing my own research rather than satisfying an employer/client, and often figuring out why something is slow turns into a useful learning experience (the canonical rite of passage for new pandas users is using lambda functions with df.apply instead of looping).
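
To make that rite of passage concrete, here's a minimal sketch of the usual progression (the data and column names are illustrative):

  import numpy as np
  import pandas as pd

  df = pd.DataFrame({"a": np.random.rand(1_000_000),
                     "b": np.random.rand(1_000_000)})

  # Slow: an explicit Python loop over rows.
  out = [row.a + row.b for row in df.itertuples()]

  # Still slow: df.apply with a lambda (row-by-row under the hood).
  out = df.apply(lambda row: row["a"] + row["b"], axis=1)

  # Fast: vectorized column arithmetic, delegated to numpy.
  out = df["a"] + df["b"]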

I've definitely procrastinated doing some analyses or turning prototypes into dashboards because of the potential for small slowdowns to turn into big slowdowns, so it's nice to have other options available. I'm very interested in Dask but have also been apprehensive about doing something stupid and incurring a huge bill by failing to think through my problem sufficiently.


Some percentage? I haven't used polars, but I ditched pandas after observing a 20x speedup from just rewriting a piece of code to use plain csv.reader. Pandas is unacceptably slow for medium data, period.
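
For illustration, the kind of rewrite being described (hypothetical file name and column position):

  import csv

  # A simple per-row pass with the stdlib csv module, instead of
  # loading the whole file through pd.read_csv.
  total = 0.0
  with open("data.csv", newline="") as f:
      reader = csv.reader(f)
      next(reader)  # skip the header row
      for row in reader:
          total += float(row[2])  # assumes the value of interest is column 3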


Having had the pleasure of profiling pandas code, the main bottleneck I observed was the handling of different types and edge cases.


Polars looks interesting. I've also come across modin[0], which lets you keep your existing pandas syntax and just change the import statement to speed up pandas using either Ray or Dask.

[0] https://modin.readthedocs.io/en/stable/

I don’t have much experience with this though.
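
For reference, the swap the modin docs advertise is a one-line change (the file name here is hypothetical):

  # Same pandas syntax, different import; modin partitions the work
  # across cores via Ray or Dask.
  import modin.pandas as pd

  df = pd.read_csv("big.csv")       # hypothetical file
  result = df.groupby("key").sum()  # regular pandas-style calls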


Does polars have N-D labelled arrays, and if so, can it perform computations on them quickly? I've been thinking of moving from pandas to xarray[0], but might consider polars too if it has some of that functionality.

[0] https://xarray.dev/


polars only has 1D columns; it's columnar, just like pandas.

IME xarray and pandas have to be used together. Neither can do what the other does. (Well, some people try to use pandas for stuff that should be numpy or xarray tasks.)


Addendum: Polars doesn't even have an index, so no multi-index either. I haven't gotten deep enough into polars to understand why, or what that breaks, but it feels wrong to replace pandas with something without an index.


Indexes don't really do anything useful that's not covered by .group_by() and .over(). There is no loss of functionality.
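
A minimal sketch of the replacement idioms (toy data; note that older polars versions spell it .groupby()):

  import polars as pl

  df = pl.DataFrame({"group": ["a", "a", "b"], "x": [1, 2, 3]})

  # Per-group aggregate, where pandas might use a groupby/index:
  agg = df.group_by("group").agg(pl.col("x").mean())

  # Broadcast a group statistic back to rows, like groupby(...).transform:
  out = df.with_columns(pl.col("x").mean().over("group").alias("x_mean"))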


Right, it's still possible, just less close to hand and never automatic. Series also become less useful when they are just a sequence, instead of an almost-mapping (with an index) or a sparse sequence (with a subset of a larger index).


No experience with polars, but I've had quite a positive experience with xarray.

I still use pandas for tabular data, but anytime I have to deal with ND data, xarray is a lifesaver. No more dealing with raw, unlabeled 5-D numpy arrays, trying to remember which dimension is what.
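
A small sketch of what that looks like (made-up dimensions and sizes):

  import numpy as np
  import xarray as xr

  da = xr.DataArray(np.random.rand(4, 3, 2),
                    dims=("time", "lat", "lon"),
                    coords={"lat": [10.0, 20.0, 30.0],
                            "lon": [100.0, 110.0]})

  # Select and reduce by name instead of remembering axis positions:
  subset = da.sel(lat=20.0)
  mean_over_time = da.mean(dim="time")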


Related:

Python for Data Analysis – A Critical Line-By-Line Review - https://news.ycombinator.com/item?id=15786977 - Nov 2017 (3 comments)

Ask HN: What are the best resources to learn Python for Data Analysis - https://news.ycombinator.com/item?id=12775775 - Oct 2016 (15 comments)

Python for Data Analysis - https://news.ycombinator.com/item?id=4697665 - Oct 2012 (28 comments)

Escaping The Walled Garden of Enterprise Analytics Using R and Python - https://news.ycombinator.com/item?id=4624186 - Oct 2012 (1 comment)

Python for Data Analysis (new O'Reilly book from creator of Pandas) - https://news.ycombinator.com/item?id=4020187 - May 2012 (57 comments)

Also:

Wes McKinney, the developer of Pandas - https://news.ycombinator.com/item?id=15878616 - Dec 2017 (50 comments)


Very nice to have an open access version. The paper version is really expensive if your salary isn't in dollars. I regretted buying the print Brazilian edition: O'Reilly is my favorite technical publisher, but the non-English editions are crap. My book didn't have an index, making it useless as a reference.


Still waiting for Wes to admit that the pandas API is a mess and support the development and adoption of a more dplyr-like Python library.

Pandas was a great step on the path to making Python a decent data analysis language. Wes is smarter than me; I never could have built it. But it's time to move on.


If you read my slide decks over the last 7 years or so (while I've been working actively on Arrow and sibling projects like Ibis), you'll see I've been saying exactly this.

See e.g. https://ibis-project.org/
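
For a flavor of the ibis style, a minimal sketch (assumes the duckdb backend is installed and a hypothetical "events" table with user_id and amount columns):

  import ibis

  # Expressions are built lazily and compiled to the backend.
  con = ibis.duckdb.connect()   # in-memory duckdb
  t = con.table("events")       # hypothetical table
  expr = t.group_by("user_id").aggregate(total=t.amount.sum())
  df = expr.execute()           # runs on duckdb, returns a pandas DataFrame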


Hey--I maintain a port of dplyr to Python, called siuba[1]!

Right now it supports a decent number of verbs + SQL generation. I tried to break down why R users find pandas difficult in an RStudioConf talk last year[2].

Between siuba and tools like polars and duckdb, I'm hopeful that someone hits the data analysis sweet spot for Python in the next couple of years.

[1]: http://github.com/machow/siuba

[2]: https://youtu.be/w4Mi0u4urbQ
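
For a quick taste of the syntax, a sketch using the mtcars sample data that siuba ships with:

  from siuba import _, group_by, summarize
  from siuba.data import mtcars

  # dplyr-style pipes over a regular pandas DataFrame:
  result = (mtcars
      >> group_by(_.cyl)
      >> summarize(avg_mpg=_.mpg.mean()))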


I learned data analysis the Hadley/Tidyverse way and constantly struggle when working with pandas. I'll try siuba this week at work.

Just one question: this runs over pandas? Is it possible to see the generated pandas syntax, the way dbplyr shows the SQL query?


Yeah, it runs over pandas or SQL databases! Here's an example querying SQL:

https://siuba.readthedocs.io/en/latest/intro.html#Working-wi...

You can use the verbs collect() and show_query() like in dbplyr.
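
Roughly like this (a sketch assuming a sqlalchemy engine and a hypothetical "events" table):

  from sqlalchemy import create_engine
  from siuba.sql import LazyTbl
  from siuba import _, group_by, summarize, show_query, collect

  engine = create_engine("sqlite:///:memory:")  # pretend "events" exists here
  tbl = LazyTbl(engine, "events")               # hypothetical table
  q = tbl >> group_by(_.user) >> summarize(avg=_.amount.mean())
  q >> show_query()    # print the generated SQL, like dbplyr
  df = q >> collect()  # execute and return a pandas DataFrame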

If you DM me on twitter, I'd love to set up time to hear about your work / walk through siuba!

https://twitter.com/chowthedog


Why wait, when you can read years-old posts on his personal blog about lessons learned (mistakes made) in this iteration of Pandas?

You're "waiting" for him to personally call you or something?


He did personally reply in the thread above, which was really gracious of him.


He has been saying exactly this for many years and has led many efforts to improve both the implementation and the API via projects like arrow and pyarrow.

Also, pandas was purpose-built for a pretty specific domain (financial time series and cross-sectional data analysis) at a time when the Python ecosystem was much younger and very different from today.

It's not his fault it became so wildly successful! (Actually, it was; it's a great piece of software. :))


I think that's part of the point of his current project, Arrow. Specifically, one of its goals is to implement the low-level computations while enabling more competition in the space of data analysis API design.
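
A tiny illustration of that division of labor, using standard pyarrow calls:

  import pyarrow as pa
  import pyarrow.compute as pc

  # Arrow provides the columnar memory format plus compute kernels;
  # higher-level APIs (pandas, polars, ibis, ...) can be built on top.
  arr = pa.array([1, 2, 3, None])
  print(pc.sum(arr))          # null-aware compute kernel -> 6
  tbl = pa.table({"x": arr})  # columnar table sharing the same buffers
  print(tbl.schema)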


Anyone have a suggestion for a good online course on Python programming? I want to focus on Python the language and on getting solid at programming, not necessarily a particular library.

I'm pretty rusty at programming, having last formally studied algorithms, data structures, and OOP in C++ 15 years ago in an undergrad comp sci program.

I’ve done mostly sql coding and database work ever since but need to level up my skill set.


I use Python extensively for my analysis projects, and while Pandas is my go-to library for many things, I feel it's just very slow. I know that doing stuff in numpy instead speeds things up considerably, even before other optimizations, but are there any other libraries out there similar to Pandas but made for higher performance?


Curious: where do you experience slowness with Pandas?

Usually slowness in Pandas comes from doing non-vectorized things.

Whenever you catch yourself writing a loop in Pandas you know you've gone the wrong way.

I too use Python extensively (with Pandas, among other things) and usually Pandas is perfectly fine. (I haven't gone over 64GB memory usage yet.)

I consider Pandas to be Numpy with benefits (methods), since a Pandas dataframe is basically just a collection of Numpy arrays. Those are about as close to C-style arrays as you can get in Python.

The only practical problems have been regressions among supporting libraries, such as pyarrow not playing nice with numpy when working with parquet files.


Slowness in pandas mostly comes from people forgetting to use df.values before running their algorithm.
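
i.e., something like this (illustrative data):

  import numpy as np
  import pandas as pd

  df = pd.DataFrame(np.random.rand(100_000, 3), columns=list("abc"))

  X = df.values                      # or df.to_numpy(); drops the label machinery
  norms = np.linalg.norm(X, axis=1)  # plain numpy from here on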


Polars and Modin support multiple cores and have other optimizations. Modin supports most of the Pandas API. The PySpark pandas API supports multiple cores and clusters.


polars, but it's still developing.


You can save so much time and effort by using Google Colab notebooks rather than setting up Python on your own machine (as is recommended in this guide).


You save a little bit of copy-paste that takes a few minutes at most. And if you do set up locally, you can work directly with your own files, work offline, control the hardware, etc. I think it's simpler than this guide makes out, too, as the guide tries to minimise the amount of disk space used, which is often unnecessary. Installing Anaconda instead of Miniconda would get you pretty much set up in one step, plus a single copy-paste step if you want all the packages the book uses.


Worth mentioning Jupyter Lite in this context too.

Warning: This link will open a Jupyter notebook in your browser:

  https://jupyter.org/try-jupyter/lab/

It's worked pretty smoothly for me so far. I can't vouch for how it handles big data sets or obscure libraries, but it seems like a pretty good starting point for those who are learning Python. It has become how I prefer to share simple notebooks with colleagues too.

But either of these options is a nice way to get a beginner past the Python installation process. Another is WinPython, which is my preferred environment for local installation.


Downloading and installing Anaconda is pretty much painless and gives you more flexibility (and better responsiveness) than Colab.


Not sure about Colab, but it's important to note that Anaconda is not free for commercial use.


They keep making confusing statements about this; I thought you only needed a license for CI/CD-style usage of their repository?


Miniconda is free and is great for running jupyter.


I think there may be many a use case where the data being operated on cannot be shared with third parties.


Absolutely true, but learning things the hard way was worth it to me. Plus, I am old-fashioned enough to like doing things on my own hardware and to not necessarily want to share my data/code every time, for reasons of security or modesty (as in: embarrassingly basic). I do like what Colab offers and appreciate having all that processing power/infrastructure available.


Is there anything like this online that could run stuff written for pygame? I know lots of beginners start with Scratch, but having some kind of gaming in the browser for Python would be nice.


Although this is true, I find these online notebooks awfully slow.



