Automatic SIMD vectorization support in PyPy

Fede_V · on Oct 21, 2015

PyPy is an absolutely amazing piece of technology and the core developers are brilliant people. However, their main use case is not numeric Python.

Their framework is (theoretically) able to trace any Python operation, no matter how dynamic, and speed it up. This means that if you want to speed up pure Python code, PyPy is really the only game in town.

The downside is that when scientists use Python, Python is used as a beautiful API on top of very optimized code written in Fortran or C. If you want to write numerically heavy code in Python, you are much, much better off using numba. numba is much less ambitious than PyPy: it handles a small subset of Python (basically, NumPy) but it is very, very fast and very efficient at speeding up pure NumPy code. In my experience, 100x speedups (over pure NumPy code) are not that uncommon.
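To make the numba point concrete, here is a minimal sketch of the kind of loop it targets. It is not from the comment above: the function name `sum_of_squares` and the helper `make_array` are made up for illustration, and the fallback branch exists only so the snippet still runs where numba is not installed. No speedup figure is implied by it.

```python
# Sketch only: `sum_of_squares` and `make_array` are hypothetical names.
try:
    import numpy as np
    from numba import njit  # compiles the decorated function in nopython mode

    def make_array(n):
        return np.arange(n, dtype=np.float64)
except ImportError:
    # Assumption: without numba/NumPy we fall back to plain Python so the
    # example still runs (just without any JIT compilation).
    def njit(func):
        return func

    def make_array(n):
        return [float(i) for i in range(n)]


@njit
def sum_of_squares(values):
    # An explicit loop over a numeric array: slow in the plain interpreter,
    # but the style of code a numeric JIT is designed to compile.
    total = 0.0
    for i in range(len(values)):
        total += values[i] * values[i]
    return total


print(sum_of_squares(make_array(10)))
```

The first call includes compilation time when numba is present, so any timing comparison should measure a second call.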

The founder of Continuum (Travis Oliphant) wrote blog posts about his technical vision for a Python JIT (http://technicaldiscovery.blogspot.it/2012/08/numba-and-llvm... and http://technicaldiscovery.blogspot.it/2012/07/more-pypy-disc...). Basically, the Continuum team made a big bet that a very efficient JIT that targets only numerical Python would be more useful than a generic JIT that can theoretically handle all of Python. For my use case (scientific coding), numba is far superior.

maattdd · on Oct 21, 2015

As you stated, targeting numerical code (array access and primitive operations) versus general code (with modern features such as virtual dispatch, varargs, generics or even metaprogramming) is completely different in terms of difficulty and solutions.

It is indeed fairly easy to create a highly performant JIT for numerical operations on N-dimensional arrays (vectors, matrices, etc.), and depending on your field, those operations might represent 99% of the execution time.

For example, I wrote a really simple JIT compiler for Matlab which sometimes performs better than raw C (and it's backed by the LLVM vectorizer to generate the correct SIMD assembly). Link to the master's thesis if you are interested: https://www.dropbox.com/s/caz7d4d08xhbwcu/thesis.pdf?dl=0

fijal · on Oct 21, 2015

I'm not really going to argue with you. If you can make numba perform, great! We're aiming at providing a middle ground: situations where the code is heavy on numerics, but complex enough that numba either can't handle it at all or is not performing very well. That covers some use cases and, from what you're saying, very likely not yours, but that's ok. We don't have to cater to everyone and there is a place for tools like pypy (also in numerics) and tools like numba :-)

mangecoeur · on Oct 21, 2015

PyPy is really impressive, I love the idea of getting all these optimisations (SIMD, STM) for free... but as you say, numerical work means numpy, scipy, pandas, which don't work with PyPy. Even if the NumPyPy project were able to fully match the NumPy API, you still have a lot of large projects like Pandas that depend on the C API. It would be stupid to copy everything. Perhaps something can be worked out between pypy, cffi, and cython.

In the short term Numba is much more practical for numerics. In the longer term Pyston looks promising: it's actually similar to Numba in that it also uses LLVM, so I imagine there could be synergy between the two...

sitkack · on Oct 21, 2015

NumPy is a protocol that dictates the layout of multidimensional arrays with really fast Fortran code that knows that layout. What needs to be copied is that memory layout protocol so that we get n:m sharing instead of n^2 duplication.
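A rough sketch of what "layout" means here: a strided array is described by a shape, an item size, and per-axis byte strides, and anything that agrees on those can share the same buffer. The helper names below (`c_order_strides`, `fortran_order_strides`, `byte_offset`) are invented for this illustration and are not part of NumPy.

```python
def c_order_strides(shape, itemsize):
    """Byte strides for a row-major (C-order) array: last axis varies fastest."""
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


def fortran_order_strides(shape, itemsize):
    """Byte strides for a column-major (Fortran-order) array: first axis varies fastest."""
    strides = []
    step = itemsize
    for extent in shape:
        strides.append(step)
        step *= extent
    return tuple(strides)


def byte_offset(index, strides):
    """Offset of one element from the start of the buffer."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3, 4)  # arbitrary example shape, 8-byte items
print(c_order_strides(shape, 8))        # (96, 32, 8)
print(fortran_order_strides(shape, 8))  # (8, 16, 48)
print(byte_offset((1, 2, 3), c_order_strides(shape, 8)))  # 184
```

The same shape/strides description is what Python's buffer protocol exposes through `memoryview`, which is one way different implementations can agree on a layout without copying each other's code.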

mangecoeur · on Oct 21, 2015

My point was that even if you copy the numpy protocol, you still have huge projects that depend on c-extensions that you wouldn't want to port.

Pandas is one, others include scikit-learn, scikit-image, Astropy, Bioinformatics libraries, stats libraries, etc... which all have heavy C/Cython use and depend to varying degrees on the Python C-api. Porting NumPy barely scratches the surface of scientific python.

dalke · on Oct 21, 2015

Biopython runs on pypy. It also runs under Jython. While it has C use, it is not "heavy" C use.

Going on a tangent, and though I realize it's a lost battle, I wish people would stop saying that NumPy is the base of scientific programming in Python. As Biopython shows, it isn't required for at least some of bioinformatics.

My own research[1] deals with chemical graphs, and NumPy/SciPy/etc. are nearly irrelevant to that research.

[1] For example, given a set of 100 structures, what is the largest substructure (based on the number of bonds) which is in at least 90 of the structures?

Alphasite_ · on Oct 21, 2015

Last I heard there were efforts to support the c api in pypy: https://bitbucket.org/pypy/compatibility/wiki/c-api

pjmlp · on Oct 21, 2015

The great thing is that Python JIT research is happening on multiple fronts, and maybe someday in the very far future CPython won't matter any longer (except for legacy deployments).

ngoldbaum · on Oct 21, 2015

All of the operations in the plots are being calculated using tiny arrays. Even 128 elements is pretty small. Additionally, I'd be interested to know how they compiled NumPy, and which BLAS implementation they linked against.

jcranmer · on Oct 21, 2015

The graphs are a travesty of detail, and I suspect that the more you poke it, the worse the results get. There's a definite downward trend in speedup as you increase vector size, and that is completely different from what you would naively expect (at larger vector sizes, you spend more time in the embarrassingly-parallelizable kernel)--so this suggests that the baseline itself is not scalar code but vector code.

So what the graphs end up highlighting is that CPython (NumPy?) has considerable overhead on tiny calls that don't really matter for gross performance. As you say, 128 elements is tiny: that's where I'd consider starting a graph, not ending it. At 4 elements, you're looking at about 40 clock cycles to do a simple vector-add in the scalar case (I think x86 L1 cache hits are ~3 clock cycles and there are two load/store units; if these numbers are wrong, my estimate is off). This also means that there's absolutely no measurement of the overhead of the JIT autovectorizing operation (!), which is quite significant because the usual resistance to adding autovectorizers in JITs [1] is that they're way too slow for the speedup they give.

[1] As far as I'm aware, the only other JIT of note that includes an autovectorizer is the Hotspot JVM. The approach taken by other people (e.g., .NET CLR) is to effectively expose a SIMD primitive in the bytecode and get compilers to target those via static autovectorization instead of autovectorizing at runtime.

lqdc13 · on Oct 21, 2015

The demo should really be 128 thousand elements or 128 million.

128 elements is still useful for when you have an inner loop that has a small vector operation. Most of the time this is not the case though.

mistercow · on Oct 21, 2015

> This also means that there's absolutely no measurement of the overhead of the JIT autovectorizing operation

While I definitely agree that overhead should be included in the article, you wouldn't expect to see it in those graphs, because it's a one-off cost, right?

repsilat · on Oct 21, 2015

> at larger vector sizes, you spend more time in the embarrassingly-parallelizable kernel

Vectors are only fixed-width, so even though it is an "embarrassingly parallel" problem, you still only expect it to asymptote towards a fixed speedup. Moreover, the bigger the matrix, the more likely you are to see cache size and memory bandwidth effects in your performance numbers, meaning the SIMD could be less of a win in the limit.
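A toy model of that asymptote, with made-up constants: assume each call pays a fixed overhead, each element costs one unit in scalar code, and SIMD processes `lanes` elements per unit. None of these numbers come from the article, and the model deliberately ignores the cache and memory-bandwidth effects mentioned above.

```python
def simd_speedup(n, overhead=100.0, per_element=1.0, lanes=4):
    """Speedup of vectorized over scalar code for n elements (toy model).

    All default constants are hypothetical and in arbitrary time units.
    """
    scalar_time = overhead + n * per_element
    vector_time = overhead + (n / lanes) * per_element
    return scalar_time / vector_time


for n in (4, 128, 10**4, 10**6):
    # The ratio climbs with n but never exceeds the lane count.
    print(n, round(simd_speedup(n), 3))
```

Under these assumptions the speedup grows with `n` and levels off just under `lanes`, which is the fixed ceiling being described.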

zurn · on Oct 21, 2015

My guess is that since NumPy with CPython is all about calling into optimized native-code kernels, there's not much PyPy can do to win on big arrays. With small vectors the C FFI overhead would dominate on CPython and PyPy has an advantage. This would be more apparent if they also provided graphs with absolute numbers and not just speedups relative to classic NumPy.
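That guess can be sketched as a simple cost model. Everything here is an assumption for illustration: both implementations are given the same per-element cost, and the per-call overheads (50 vs. 5 arbitrary units) are invented, not measured.

```python
def relative_speedup(n, cpython_call_overhead=50.0,
                     pypy_call_overhead=5.0, per_element=1.0):
    """Ratio of CPython+NumPy time to PyPy time for one call on n elements.

    Hypothetical constants: only the fixed per-call overhead differs.
    """
    cpython_time = cpython_call_overhead + n * per_element
    pypy_time = pypy_call_overhead + n * per_element
    return cpython_time / pypy_time


for n in (4, 128, 10**6):
    # Large ratio for tiny arrays, shrinking toward 1.0 as n grows.
    print(n, round(relative_speedup(n), 3))
```

In this model the relative speedup is largest for tiny arrays and decays toward 1.0 as the kernel time dominates, which would produce the downward trend noted earlier in the thread. Absolute timings would be needed to check whether this is what is actually happening.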

leecb · on Oct 21, 2015

They have decided to rewrite the C parts of NumPy in RPython, essentially working towards a pure Python version of Numpy.

There are a lot more details sprinkled across several pages: http://morepypy.blogspot.mx/2013/11/numpy-status-update.html http://pypy.org/numpydonate.html

fulafel · on Oct 21, 2015

Yes, so PyPy can win on small arrays because there is no per-call C FFI overhead in calling into RPython-Numpy. The smaller the operations, the greater the win over CPython.

amit_m · on Oct 21, 2015

Can anyone explain the version numbers? PyPy 2.6.1 compared to 15.11, which will eventually become 4.0.0...?

sirn · on Oct 21, 2015

AFAIK, they were considering changing the versioning scheme because the old scheme is becoming too confusing.

The old scheme (PyPy 2.x) resembles Python 2 versioning too much (PyPy 2.6 vs. Python 2.7) and may give the impression that the PyPy version corresponds to the Python version it implements (when in fact it does not). Say, if the next release were PyPy 2.7, some might assume it is the PyPy implementation of Python 2.7 (even though PyPy 2.6 already implements Python 2.7). The situation is just as confusing with Python 3 (PyPy3 2.6 for Python 3.2).

I think PyPy's initial plan was to use YY.MM for versioning, so the tentative version was 15.11, but now it looks like they decided to keep the old scheme with a major version > 3 instead (so the next release is PyPy 4.0.0 and PyPy3 4.0.0).
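For anyone unsure which number is which on their own install, a small sketch: `sys.version_info` is the Python language version, while PyPy additionally exposes its own release number as `sys.pypy_version_info`. The function name `describe_interpreter` is just for this example.

```python
import platform
import sys


def describe_interpreter():
    """Report the implementation release and the Python version it implements."""
    language = ".".join(str(part) for part in sys.version_info[:3])
    # Only PyPy defines sys.pypy_version_info; other interpreters do not.
    pypy_info = getattr(sys, "pypy_version_info", None)
    if pypy_info is None:
        return "%s implementing Python %s" % (
            platform.python_implementation(), language)
    release = ".".join(str(part) for part in pypy_info[:3])
    return "PyPy %s implementing Python %s" % (release, language)


print(describe_interpreter())
```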

maxerickson · on Oct 21, 2015

2.6.1 is the previous release version. 15.11 is just the year.month of the next release version; apparently they are moving to a new numbering scheme.

Not sure about 4.0, but I think it was the working number for the next release (to avoid confusion with Python 3, I guess they call that pypy3 and have a version number also).

frozenport · on Oct 21, 2015

Does CPython 3 do any better?