PyPy: NumPy funding and status update (morepypy.blogspot.com)
92 points by kingkilr on Oct 12, 2011 | 17 comments



Maybe it's just me, but reimplementing NumPy on the PyPy platform in 6 months with an estimated 1000 hours of work seems extraordinarily ambitious. I don't mean to pooh-pooh this effort (which I would like to see happen), but the general feeling I've gotten from folks in the scientific Python community is that introducing a JIT is not going to be a magical solution for our performance problems, especially considering that many scientific Python programmers are already programming very close to the metal with Cython (or wrapping Fortran 90 with f2py). I didn't come up with this-- I'm just rehashing conversations I had at PyCodeConf last week. Other members of the SciPy community have some fairly different ideas about building a new architecture for array computing (http://conference.scipy.org/scipy2011/slides/wang_metagraph....), i.e. building a dynamic fusing compiler ("stream fusion") for array expressions.

However, having a numpy-lite in PyPy would let a lot of people who are currently well served by the current version of NumPy switch to PyPy, and thus benefit from the rest of their Python code being a lot faster.

A lot of people have asked me recently if PyPy would help me with my library, pandas. My answer so far has been "even if NumPy worked on PyPy, probably not all that much". It'd be cool if I were proved wrong :)


Even with no speedup, the main benefit would be the ability to use pypy for the rest of the "supporting" python code.


Exactly. Numpy is fast enough. It's the code calling it that is slow.


NumPy is not fast enough, that's the problem (http://technicaldiscovery.blogspot.com/2011/07/speeding-up-p...). In scientific applications, the code calling it is rarely the bottleneck-- if it is, you might be doing something wrong. The biggest bottlenecks I encounter are a) computation and b) data serialization / deserialization (especially if a database of some kind is involved).
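To make this concrete, here's the kind of thing I mean (a rough sketch, not a benchmark): every operator in an all-numpy expression makes its own pass over memory and allocates a full-size temporary.

    import numpy as np

    a = np.random.rand(10000000)
    b = np.random.rand(10000000)

    # Evaluated one operator at a time: numpy allocates a full
    # temporary for 2*a, another for 3*b, another for their sum,
    # and so on -- several passes over memory for one expression.
    result = 2*a + 3*b - a*b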


Theano is also very interesting with regard to array computing. I've used it for very fast convolutional-neural-network training on GPU, but it can do many other things (automatic symbolic differentiation, code generation, etc.): http://deeplearning.net/software/theano/
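As a taste, the canonical example from Theano's tutorial: symbolically differentiate an expression, then compile it into a fast callable.

    import theano
    import theano.tensor as T

    x = T.dscalar('x')            # symbolic scalar
    y = x ** 2                    # symbolic expression
    gy = T.grad(y, x)             # symbolic derivative: 2*x
    f = theano.function([x], gy)  # compiles to native code
    print(f(4.0))                 # 8.0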

I do really love the PyPy work for 'creating a faster Python'. I have a lot of scripts in Python that do parsing and then some work with numpy. These would hugely benefit from this.


I hadn't heard about metagraph. Sounds useful. Where's the source?

As I understand it, the implementation in pypy is lazy by default and only "forces" a result when it's needed. So (again, IIRC) it can potentially avoid intermediates, like numexpr (http://code.google.com/p/numexpr/) does.
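numexpr's whole trick is doing that explicitly; a small sketch using its evaluate() API:

    import numpy as np
    import numexpr as ne

    a = np.random.rand(1000000)
    b = np.random.rand(1000000)

    r1 = 2*a + 3*b                 # plain numpy: temporaries for 2*a and 3*b
    r2 = ne.evaluate("2*a + 3*b")  # one blocked pass, no full-size temporaries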


The current code only implements a fairly trivial array structure in python, and it does not seem to implement any expression laziness. But pypy should obviously make it easier to try this kind of thing compared to the current numpy.


It does implement lazy evaluation of array expressions, and uses the JIT to compile them on the fly to assembler. Having an assembler generator that's not too bad helps immensely with such efforts.
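Roughly the idea, as a toy sketch in plain Python (the names are made up for illustration, this is not pypy's actual code -- and the real thing JIT-compiles the fused loop to assembler instead of interpreting it):

    class Lazy(object):
        """Toy lazy array: ops build a tree; force() runs one fused loop."""
        def __init__(self, data=None, op=None, args=()):
            self.data, self.op, self.args = data, op, args

        def __add__(self, other):
            return Lazy(op=lambda x, y: x + y, args=(self, other))

        def __mul__(self, other):
            return Lazy(op=lambda x, y: x * y, args=(self, other))

        def get(self, i):
            if self.data is not None:
                return self.data[i]
            return self.op(*[node.get(i) for node in self.args])

        def force(self, n):
            # a single pass over the inputs, no intermediate arrays
            return [self.get(i) for i in range(n)]

    a = Lazy(data=[1.0, 2.0, 3.0])
    b = Lazy(data=[4.0, 5.0, 6.0])
    expr = a * b + a        # nothing computed yet
    print(expr.force(3))    # [5.0, 12.0, 21.0]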


This is great news. There are so many great packages that depend on NumPy support.


The strategy followed by pypy of reimplementing numpy from scratch makes it rather unlikely that it will support packages depending on numpy, because so many of them depend on numpy's implementation details.


That's what they said about reimplementing Python ;)


Fair enough.

But you could also argue that this reinforces my point, as few people use pypy instead of python.


recursion overload!

That reinforces the goal of porting NumPy (and PyPy's track record at achieving such ports). That is, all the people not using pypy because it lacks numpy will, after this port, have the option to.


There is no question about the value of numpy on top of pypy. The issue is whether a reimplementation from scratch is the best way to achieve that.


There are two blog posts on why:

http://morepypy.blogspot.com/2011/05/numpy-in-pypy-status-an...
http://morepypy.blogspot.com/2011/05/numpy-follow-up.html

It's not possible to do cool stuff - like parallelizing expressions, etc. - by reusing the existing code. The architecture as it is now can already score 2x wins over the original numpy with array expressions, and we expect it to only get better with SSE and more parallelizing. This requires reimplementing numpy.
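For example, for out = 2*a + 3*b - a*b, lazy evaluation lets us emit the equivalent of one fused loop (sketched in plain Python here; in practice it's generated assembler), instead of one loop plus one temporary per operator:

    # Hypothetical fused loop for: out = 2*a + 3*b - a*b
    def fused(a, b, out):
        for i in range(len(out)):
            out[i] = 2.0*a[i] + 3.0*b[i] - a[i]*b[i]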


This is again a different argument. I understand it is more fun, more rewarding and more challenging to implement a new array module on top of pypy. But it is seriously doubtful that it is the best way forward to make pypy usable for libraries which depend on numpy.


What about Scipy? Will Pypy support Scipy soon, too? Without Scipy, I don't need Numpy.



