Numerical computation libraries: numpy, scipy, ML libraries like sklearn, pytorc...

KeplerBoy · on Nov 3, 2021

Those teams still wouldn't switch because they wouldn't gain any performance from switching. After all, as far as they are concerned, CPython is just calling highly optimized, already-compiled C, Fortran, or accelerator-specific (CUDA, ROCm, TPU) code.

lumost · on Nov 3, 2021

Many, if not most, data science processes end up slowing down when the data inevitably has to move back into Python, or when a Python function has to be invoked on some data.

A significant performance improvement in Python would benefit many data-science-related tasks.
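
A rough, standard-library-only sketch of that effect follows. The data, the `scale` rule, and both function names are made up for illustration; `array` merely stands in for a compiled numeric container. The first total stays inside C code, while the second calls a Python function once per element. Timings will vary by machine and interpreter, so none are quoted here.

```python
import timeit
from array import array

# Hypothetical data: 200,000 doubles held in a compact C-backed array.
values = array("d", range(200_000))


def scale(x):
    # Stand-in for an arbitrary Python-level rule applied per element.
    return x * 2.0


def total_in_c():
    # The summation loop runs inside the C implementation of sum().
    return sum(values) * 2.0


def total_via_python_callback():
    # Every element triggers a Python-level function call to scale().
    return sum(scale(v) for v in values)


if __name__ == "__main__":
    for fn in (total_in_c, total_via_python_callback):
        seconds = timeit.timeit(fn, number=5)
        print(f"{fn.__name__}: {seconds:.3f}s for 5 runs")
```

Both functions return the same number; only the amount of work done in interpreted Python differs.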

Filligree · on Nov 3, 2021

Very much this. Anyone who does machine learning will notice their CPU sitting at 100% of one core a significant fraction of the time.

It doesn't matter how fast your GPU is; Python and the GIL are the bottleneck.
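
A minimal sketch of the GIL part of this, assuming a standard GIL-enabled CPython build (free-threaded builds and other interpreters may behave differently). The `busy` function and the job sizes are hypothetical. On such a build one would typically expect the threaded run to take roughly as long as the sequential one, because only one thread executes Python bytecode at a time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def busy(n):
    # CPU-bound pure-Python loop; it holds the GIL while it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_sequential(jobs):
    # Run each job one after another on the calling thread.
    return [busy(n) for n in jobs]


def run_threaded(jobs, workers=4):
    # Run the same jobs on a thread pool; results keep the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(busy, jobs))


if __name__ == "__main__":
    jobs = [500_000] * 4
    for fn in (run_sequential, run_threaded):
        start = time.perf_counter()
        fn(jobs)
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```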

nerdponx · on Nov 5, 2021

This is very true, especially when pre-processing text and other unstructured data. It ends up being a lot of loops, string manipulation, and dict lookups.
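
For a sense of what that kind of code looks like, here is a small hypothetical example (the stopword list and `count_tokens` are invented for illustration). It is nothing but loops, string methods, and dict lookups, so none of it is offloaded to a compiled numerical library.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of"}


def count_tokens(documents):
    """Count cleaned, lowercased tokens across an iterable of strings."""
    counts = {}
    for doc in documents:
        for raw in doc.split():
            # String manipulation: trim punctuation, normalize case.
            token = raw.strip(".,!?;:\"'()").lower()
            if not token or token in STOPWORDS:
                continue
            # Dict lookup and update for every surviving token.
            counts[token] = counts.get(token, 0) + 1
    return counts


if __name__ == "__main__":
    print(count_tokens(["The cat sat.", "A cat, the hat!"]))
```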

Fortunately, with a tool like DVC or even Make, you usually don't have to (or want to) put that code in the same script as the actual machine learning part. So you can theoretically run the former with PyPy and the latter with CPython, if you really need to maximize both.
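
One way to picture the split is two stages that only talk through a file, so each can be launched by whichever interpreter suits it. This is a sketch under assumptions: the function names, the JSON layout, and the file name are hypothetical, and the Make or DVC wiring that would invoke each stage is left out. `platform.python_implementation()` is used only to record which interpreter produced the file.

```python
import json
import platform
import tempfile
from pathlib import Path


def preprocess(raw_lines, out_path):
    """Stage 1: pure-Python text cleanup, a candidate for running under PyPy."""
    records = [{"tokens": line.lower().split()} for line in raw_lines]
    payload = {
        "produced_by": platform.python_implementation(),
        "records": records,
    }
    Path(out_path).write_text(json.dumps(payload), encoding="utf-8")


def load_features(in_path):
    """Stage 2: would run under CPython, next to the ML libraries."""
    payload = json.loads(Path(in_path).read_text(encoding="utf-8"))
    # Toy feature: number of tokens per record.
    return [len(record["tokens"]) for record in payload["records"]]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        handoff = Path(tmp) / "features.json"
        preprocess(["Hello World", "PyPy and CPython"], handoff)
        print(load_features(handoff))
```

In a real project the two functions would live in separate scripts, with the handoff file as the only dependency between them.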

agravier · on Nov 3, 2021

I would. I find that lots of data transformation and non-deep modelling still happens in Python, e.g. string processing, JSON munching, business rules (if-this-then-remove), etc.

nerdponx · on Nov 5, 2021

Having spent a lot of time on data science teams, rewriting hot sections of text processing code in Cython to obtain acceptable performance, I can tell you that I would have gladly switched away from CPython specifically for those tasks. If you're using Conda, it's almost trivial to have a PyPy environment alongside a CPython environment in the same project. You run the data processing scripts/notebooks with the former and the machine learning stuff with the latter.

But my post was more oriented towards non-data-science uses of Python, like writing an API server or a web crawler or a TUI application. I think the "serve a prediction from a PyTorch model" part threw off the conversation a bit!