Thoughts on porting NumPy to PyPy (technicaldiscovery.blogspot.com)
150 points by johndcook on Oct 16, 2011 | 32 comments



I couldn't agree more. This is related to another recent HN posting (http://news.ycombinator.com/item?id=3104598) and discussions at PyCodeConf in Miami. There's a general feeling (which I've experienced in many of my interactions with non-scientific Python programmers) that the folks working on pure Python don't really "get" the scientific Python community. We would all understand each other better if the PyPy folks or the Python core developers spent a year, say, working at Enthought on scientific Python consulting projects, or working on a core project that uses NumPy/SciPy like scikit-learn, matplotlib, statsmodels, theano, pandas, or many, many others. One example: it may seem like a small annoyance, but having a matrix multiplication infix operator would actually be a huge help, yet the idea has met a great deal of resistance from core Python as being "too domain specific". That strikes me as very short-sighted, as I think Python is well poised to make waves in the data analysis, statistics, and high performance computing ecosystem. Having used Python to build large systems for financial applications, I am acutely aware of how Python is being used in that industry and some of the ways that it needs to adapt to be more relevant. So bravo, Travis, for speaking up!

Also, his point that Cython (http://cython.org) tends to be ignored in the broader discussion about performance computing in Python is especially striking when you consider how it has revolutionized the way scipythonistas (myself included) have sped up their code over the last 2-3 years.


I spoke with Travis on Friday at Enthought and he opened my eyes to some of the possible problems of the pypy-numpy effort (which I've started discussing: http://mail.python.org/pipermail/pypy-dev/2011-October/00860... ). I hadn't realised that the port might exclude use of the rest of SciPy; this strikes me as a massive missed opportunity if it comes to pass. Perhaps a few other knowledgeable folk could post in the pypy-dev thread so we can reasonably understand the limitations and possibilities of the various approaches?


I don't really know what you mean by saying that supporting numpy on pypy "excludes" scipy (other than in the short term). If you mean you assumed that the pypy team would port all of scipy as well as numpy, that seems like an unrealistic expectation for an initial port!

An initial port seems like the only way forward from the point of view of the pypy team. I think it is unrealistic to expect the pypy team to take on the work of changing numpy so that it is more friendly to alternative implementations. I would certainly expect them to be involved in the discussion though.

The article also seems to miss that there is work ongoing to bring pypy support to Cython.


Well, the big problem to solve with SciPy is the fact that there is more C, C++, and Fortran in the SciPy codebase than Python (http://www.ohloh.net/p/scipy). Part of why Python has succeeded in scientific computing is integration of legacy codebases (via f2py or C extensions or...). There are probably man-years of work involved in devising a solution, which by the time it's complete may be basically irrelevant or, worse, may fragment the community. I personally think that going down this rabbit hole (porting 10 years of scientific Python libraries to PyPy) would amount to an exercise in vanity rather than producing the kinds of revolutionary changes to array-oriented computing that need to happen soon to deal with the large-scale data processing challenges of the present and future. Having recently used GPUs to speed up statistical inference algorithms by a factor of 50 or more, I am not that motivated by a JIT beating C in some cases (as Travis wrote: "C speed is the wrong target"). Many in the SciPy community are convinced that NumPy will not provide the computational foundation that we need going forward, and they are going to step up and start building the next-generation NumPy (or whatever it's going to be called). We'd rather have more of the smartest computer scientists in the Python community focused on that problem (building more sophisticated data processing pipelines for use in Python) than on speeding up Python code that, by my estimation, doesn't matter that much.


Have you looked at Theano ( http://deeplearning.net/software/theano/ ) ? It is a Python-based JIT for GPUs. Using Python you can build the computation pipeline symbolically, and the formulas are automatically converted to GPU code and scheduled as deemed fit (this can be extended to multiple GPUs, and could theoretically scale to an even higher level).

I think this is a promising idea for the future of array-oriented computing, as it can make use of one more level of parallelism / scaling than the current Numpy paradigm, which is limited to one operation at a time and the user providing the ordering of operations.
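For a sense of what that symbolic style looks like, here is a minimal sketch (assuming a working Theano install; the variable names are just illustrative):

    import theano
    import theano.tensor as T

    # Symbolic matrices; no data is attached yet, only a computation graph.
    X = T.dmatrix('X')
    Y = T.dmatrix('Y')
    Z = T.dot(X, Y) + 1.0        # the expression is built symbolically

    # Compilation happens here; Theano generates C (and GPU code, if the
    # device is configured for it) for the whole graph.
    f = theano.function([X, Y], Z)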


AFAIUI, what you're missing is that PyPy can (and often does) interface directly to C libraries, from RPython. So the prospect of re-implementing those specialized codebases isn't a real issue: only the CPython-API based wrappers would need re-implementing. I believe those are a small part of the total code you mention.
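As a rough illustration of the idea (ctypes and the system libm here are just stand-ins for "wrapping a C library without the CPython extension API"; PyPy's RPython layer actually uses its own FFI):

    import ctypes
    import ctypes.util

    # Load the system math library; any shared C library would do.
    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    libm.cos.restype = ctypes.c_double
    libm.cos.argtypes = [ctypes.c_double]

    print(libm.cos(0.0))   # 1.0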


But those packages do not just depend on C libraries. They also depend on the numpy C API. If emulating the C API of CPython is too much non-fun work, I would expect the same to be true of the numpy C API.


'excludes' scipy in that the numpy C API won't be available, so there's a massive hill to climb (as best I understand it) to support most of scipy. Re. getting the pypy team to do the whole port - agree that'd be entirely unrealistic! Re. the initial port - since there is more than one way forward and I want to financially support the project (having pledged £600 earlier), I'd like to understand the risks and benefits of the options.


Note that we, as PyPy, deliberately don't take part in the language design discussions. We won't implement the array infix operator precisely because other people make those decisions.


Why not? Don't you think the language design discussion would be improved by your knowledge?


> There were a lot of young people there who understood a lot more about making web-pages and working at cool web start-ups than solving partial differential equations with arrays.

I can especially relate to this, even though I'm 26. Admittedly, I work in a microcosm, where I'm the youngest, least educated, and least experienced in my group, despite having a BS in Physics and 4 years of research in particle astrophysics.

Luckily we do a little less with matrix operations, but I agree that Python, specifically PyPy, has so much potential to become the scientific computing standard, and that the Python community should really push for it. In a world with less Python, there's so much pain involved on a regular basis: setting up software like ROOT, switching between FFT or plotting libraries, installing Octave or Matlab to play nicely with some bash script, and dealing with OS discrepancies in getting someone else's code to run. It sucks.

If PyPy can displace that with near-C performance, the world would be a much better place.


I love it when someone writes an articulate, non-inflammatory blog post with reasonable suggestions on how to improve things. It's sad how rare this is around here these days.


I missed the specific suggestions on how the pypy guys could improve things - other than "do a lot of work on numpy itself as well so that it better supports multiple implementations" (paraphrase obviously). That really needs to be the side of the equation that the numpy team take hold of.


I don't think Travis is suggesting that the pypy people do all the work, but rather that they contribute to the discussion to help the numpy community as a whole find a way forward that works for both pypy's and numpy's own needs.


I think they are always around but just don't make it to the front page as the link bait articles get the attention.


I just don't see adding a special matrix infix operator to Python happening. It is too specialized, and yet matrices are not special enough that they must have their own operator. Isn't this what operator overloading is for?

I suspect the real problem is NumPy's type system implementation, since data-types are not visible as different Python types.


If you spent a day doing some serious linear algebra in Python you might change your tune. * (multiplication) between NumPy arrays by default does element-wise multiplication (potentially with broadcasting), which is the desired default behavior. In R, for example, you can define custom infix operators so that

a * b

is elementwise but

a %*% b

is matrix multiplication. If you write down a complicated linear algebra expression, something like

A.T ! (B.T ! C ! B).T ! A

would be a lot friendlier to scientists than the current:

dot(A.T, dot(dot(B.T, dot(C, B)).T, A))

You might just say "well suck it up" but I've got to say that doing linear algebra in Matlab is a lot easier because the linear algebra that I do with pen and paper looks pretty much exactly the same as the corresponding code. On the other hand, Matlab is super clunky compared with NumPy at doing APL-style array processing with broadcasting operations, etc. In my work I tend to do more of the latter and less of the former but whenever I implement something with a lot of matrix multiplications it takes me a lot longer in Python to get things right.
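To make the contrast concrete, here's a small snippet using nothing beyond plain NumPy:

    import numpy as np

    A = np.array([[1., 2.], [3., 4.]])
    B = np.array([[0., 1.], [1., 0.]])
    C = np.eye(2)

    A * B           # element-wise (Hadamard) product, broadcasting rules apply
    np.dot(A, B)    # matrix product

    # The pen-and-paper expression A' (B' C B)' A has to be spelled out as
    # nested calls:
    result = np.dot(A.T, np.dot(np.dot(B.T, np.dot(C, B)).T, A))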

Anyway, the point is: non-scientific Python folks need to take a walk in our shoes to gain an understanding of the challenges we face on an ongoing basis.

I'm having a hard time understanding your last statement. NumPy data types (dtypes) simply tell the ndarray how to interpret the block of data associated with it (the # of bytes per item, shape, and strides).
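For illustration, all of those pieces are visible on an ordinary array (plain NumPy, nothing exotic):

    import numpy as np

    a = np.arange(6, dtype=np.float64).reshape(2, 3)
    a.dtype.itemsize   # 8 bytes per element
    a.shape            # (2, 3)
    a.strides          # (24, 8): bytes to step to the next row / next column
    a.view(np.int64)   # the same memory block, reinterpreted as 64-bit ints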


I actually agree with you. I much prefer operator overloading to unwieldy chained function calls. But it might be tough to convince the larger Python community to add it, since it is only useful to a small part of the community.

The last comment about dtypes is that the size (shape in NumPy) should be part of the type; certainly the element type should be. A 1-D vector of doubles should not be the same Python type as a 3-D array of characters or a 500x600 matrix. This creates havoc in a dynamic language. When I started using NumPy, I once spent two days tracking down a bug caused by "*" multiplying two arrays instead of two matrices. Perhaps size is too much for a dynamic language to carry in the type, but surely dimension and element type should be reflected in the Python type. It absolutely does not help that the documentation is littered with type-objects and object-objects; I appreciate this is how the C implementation is, but for a beginner NumPy user, it is more than a little confusing.
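The kind of bug I mean is easy to reproduce, since np.matrix overloads * as a matrix product while plain arrays multiply element-wise (illustrative snippet only):

    import numpy as np

    a = np.array([[1., 2.], [3., 4.]])
    b = np.array([[5., 6.], [7., 8.]])

    a * b                        # element-wise: [[ 5., 12.], [21., 32.]]
    np.matrix(a) * np.matrix(b)  # matrix product: [[19., 22.], [43., 50.]]
    # Both lines run without complaint, so mixing the two types up fails silently.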

I have a lot of respect for the people who designed and built NumPy; for multi-dimensional arrays I don't see a better approach. But too many C implementation details seep into the Python interface, and types and operator overloading are only the beginning of the problems. I am having second thoughts about the suitability of dynamic languages for large-scale, high-performance computing. Type annotations could help a lot, but I see exactly zero interest in that for NumPy.


I see your point, and it is why I tend to like operator overloading in languages.

However, Python is a general purpose language, and scientific computing is a single domain. Python is used in many different domains. Consider that there are many changes that individual communities would like; if Python granted all of those requests, the language would be a mess. That's the challenge in designing a general purpose programming language.


> Python is a general purpose language, and scientific computing is a single domain.

I think this is shortsighted to the point of ignorance. "Scientific computing" here really means "performance-critical numerical computation on regular arrays". Basically all of the new things that people are doing with computers in the last five years and the next five years — machine learning and other statistics, software-defined radio, audio synthesis, real-time video processing, cool visual effects, speech recognition, machine vision, and 3-D rendering, and arguably Bitcoin — consist largely of performance-critical numerical computation on regular arrays. It's what GPUs are for. Five years ago, Numeric or NumPy was probably the best way to do that for a wide range of things, although a lot of people still use Matlab instead, and R deserves at least a mention. Today it's not clear. Five years from now there will be something much better than current NumPy, and it could be a better version of NumPy or it could be R or Matlab or Octave or something.

In short, "scientific computing" is not a single domain, but a set of capabilities increasingly important in many different domains.


I do research in high performance computing. I am well aware of the importance of performance critical, number crunching code that operates on large arrays. I wrote an entire dissertation based around better ways of expressing large computations on dense vectors and matrices for new multicores.

I am, however, self-aware enough to recognize that what I care about is a subset of what everyone in computing cares about. Scientific computing may be important in multiple places, but it is still a single domain. An important domain, sure. But Python is still used in many places where such concerns are not important. Arguments about the utility of number crunching on dense vectors and matrices are great. But the attitude I've seen in this thread is "the domain I care about is so important that my concerns should be elevated above the concerns of other domains." That is not going to fly when it comes to changing a general purpose language used in many domains.


I'm sure you know enormously more than I do about how to do dense-array number crunching. My claim, though, is that it's being applied much more widely than previously, not that there are new and better ways of doing it; so there are fewer and fewer places where such concerns are not important.

In the 1960s, there were those who claimed that the benefit of recursion was not worth the complexity it added to programming languages, except in certain special domains. In the 1970s and 1980s, we had the same argument about depth-first search (SNOBOL's and Prolog's backtracking feature). It turned out that the proponents of recursion were right, and the proponents of backtracking were probably wrong. (Although the backtracking feature of SNOBOL's child Icon directly inspired Python's generators, implementationally they're maybe more similar to CLU iterators.)

Now, when it comes to matrix and vector manipulation, should we imagine it as a single domain, or as a feature that's useful across a wide range of domains?

I'm arguing for the latter.


And I'm arguing that's still one concern among many when it comes to a language used in so many different places. The arguments here are, I think, unsympathetic to the difficulties of designing and maintaining a general purpose programming language, in particular one with the design goals of Python.


If we define "general purpose" as "equally mediocre in all domains", then what you say makes sense.

If we define "general purpose" as "works well in as many domains as possible", then I'd argue that the correct response to an easily-remedied weakness in an important domain should be to address that weakness.

As you imply, the challenge here is to decide which requests to grant and which to deny, but the mere fact that a request is "domain-specific" (ignoring, for the moment, the questionable idea that _linear algebra_ is domain-specific) should not be enough to rule it out.

(As an aside, scientific / numeric Python is, I believe, one of its two or three most important application areas, and perhaps the most important historical factor in its success: that community has been championing Python since the days when people were using Perl for the web, or for anything else. Notice, for example, that Travis mentions working on SciPy in 1999, and the original "Numeric" package was written in 1995.)


Matrix multiplication isn't specific to scientific computing, it's a pretty common and large part of mathematics, and math is pretty general purpose.


I guess it would be possible to implement with a with-hack...

    with numpy.doing_matrix:
        matrixy_multiply_goodness = m1 * m2
    array_multiply_goodness = a1 * a2
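Something along those lines can more or less be faked today with a wrapper type; numpy.doing_matrix doesn't exist, so every name in this sketch is hypothetical:

    import threading
    from contextlib import contextmanager

    import numpy as np

    _state = threading.local()

    @contextmanager
    def doing_matrix():
        # Flip a thread-local flag for the duration of the with-block.
        _state.matrix_mode = True
        try:
            yield
        finally:
            _state.matrix_mode = False

    class Arr(np.ndarray):
        def __mul__(self, other):
            if getattr(_state, "matrix_mode", False):
                return np.dot(self, other)          # matrix product inside the block
            return np.ndarray.__mul__(self, other)  # element-wise otherwise

    def arr(data):
        # View an ordinary array through the wrapper type.
        return np.asarray(data, dtype=float).view(Arr)

    m1, m2 = arr([[1., 2.], [3., 4.]]), arr([[0., 1.], [1., 0.]])
    with doing_matrix():
        matrixy_multiply_goodness = m1 * m2   # np.dot under the hood
    array_multiply_goodness = m1 * m2         # back to element-wise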


Operator overloading doesn't solve the problem, because you frequently mix elementwise operations with matrix multiplication and division. Converting types between numpy arrays and matrices gets cumbersome.


I'm just wondering how much you were hurt by the ellipsis built-in. After all, the ellipsis built-in was (at least for many years) of absolutely no use to anyone other than those using certain libraries for scientific computing. The rest of us had to put up with the pain of having yet one more line in the documentation. Frankly, for me it didn't hurt much.
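For anyone who hasn't run into it, the Ellipsis built-in is what lets NumPy users write slices like a[..., 0]; a tiny illustration:

    import numpy as np

    a = np.zeros((2, 3, 4, 5))
    a[..., 0].shape     # (2, 3, 4): "..." stands in for all the leading axes
    a[0, ..., 0].shape  # (3, 4)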

As a general rule, "Special cases aren't special enough to break the rules", but like everything in the Zen of Python it is a balance. If a very large community of Python users (and the scientific users ARE a large and important subset of Python users) say they would benefit significantly from this change, then perhaps this is the exception -- especially since the "cost" (in additional complexity) is fairly small.


Perhaps also see my comment nearby about how "scientific computing" isn't really a narrow domain any more. It could add ammunition to your argument.


If numpy-c were to be called over ctypes, it'd be slow. The reason PyPy's numpy reimplementation is so insanely fast is the integration with the JIT.

A more plausible scenario would be both CPython and PyPy using numpy-py to call into numpy-c and numpy-pypy respectively.


Off-topic but anyone else seeing every instance of "fi" in the text replaced with a slightly overlapping AV? (I tried to cut and paste it but it pasted as just "fi").

I'm on Ubuntu Ocelot & Firefox 7.

nevermind: I'm seeing it on various websites that use ligatures. Possibly because I'm half-way through an Ubuntu upgrade.


I was going to do a Cython lightning talk, but the organizers indicated that we ran out of time slots, so I could only give one talk. Half of the audience at PyCodeConf indicated they were aware of the Cython project, although perhaps it would have been useful regardless...



