Yeah, that's the thing with silly examples (like my Mandelbrot program). For real code, I frequently need to write low level loops which aren't easily expressed as parallel operations. If numpy doesn't have it, or if you can't figure out how to parallelize it, you're screwed.
Moreover, for some very common things in signal processing, like working with evens/odds or left/right parts of an array, the parallel numpy operation will create lots of temporary arrays and copies.
And for what it's worth, your version of mandel should work with PyPy. So you can have your cake and eat it too.
EDIT: I should add the reason my code is "strange" is because I wrote it so I could do a one-to-one comparison with other languages which don't have builtin complex numbers. Maybe I should've cleaned that up before posting.
> Moreover, for some very common things in signal processing, like working with evens/odds or left/right parts of an array, the parallel numpy operation will create lots of temporary arrays and copies.
IIUC, this isn't correct: given an ndarray A, `A[::2, 1::2]` will provide a (no-copy) view of the even rows and odd columns of A. Same with `A[:len(A)//2]` to get only the first half of A (note the integer division `//` in Python 3).
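A quick sanity check with `np.shares_memory` (the array and names here are just illustrative) showing that basic slicing returns views, not copies:

```python
import numpy as np

# Illustrative array; any shape behaves the same way.
A = np.arange(16, dtype=float).reshape(4, 4)

evens = A[::2, 1::2]    # even rows, odd columns: a strided view
half = A[:len(A) // 2]  # first half of the rows: also a view

# Neither slice copied any data.
print(np.shares_memory(A, evens))  # True
print(np.shares_memory(A, half))   # True

# Writing through a view mutates A in place.
evens[0, 0] = -1.0
print(A[0, 1])  # -1.0
```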
> And for what it's worth, your version of mandel should work with PyPy. So you can have your cake and eat it too.
Indeed, most of the scipy stack works with PyPy, it's great.
Unnecessary temporary arrays are definitely a major source of inefficiency when working with NumPy, but recent versions of NumPy go to heroic lengths (via Python reference counting) to avoid them in many cases:
https://github.com/numpy/numpy/blob/v1.18.3/numpy/core/src/m...
So in this case, NumPy would actually only make one temporary copy, effectively translating the loop into the following:
for j in range(255):
    u = z**2  # create a new squared array
    u += c    # add in-place
    z = u     # replace the old array
This gets rid of temporary arrays, but it still isn't optimal if z is large. Memory locality means it's faster to apply an elementwise operation like z**2 + c in a single pass over the data, rather than in two separate passes.
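For what it's worth, the elided pattern can also be written out explicitly with NumPy's out= arguments, double-buffering so that nothing is allocated per iteration (a sketch; step_inplace and the loop count are just illustrative):

```python
import numpy as np

def step_inplace(z, c, n=255):
    # Double-buffered version of z = z**2 + c: np.multiply writes into
    # a preallocated array and the add is in-place, so after the one
    # initial allocation no temporaries are created at all.
    u = np.empty_like(z)
    for _ in range(n):
        np.multiply(z, z, out=u)  # u = z*z, written into the spare buffer
        u += c                    # in-place add
        z, u = u, z               # swap buffers instead of allocating
    return z
```

Note that this still makes two passes over memory per iteration (one for the multiply, one for the add), which is exactly the locality problem described above.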
Explicitly unrolling loopy code (e.g., in pypy or Numba) is one easy way to achieve this, but you have to write more code.
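For example, the fused single-pass version of the update step is just the naive loop, written in plain Python here so PyPy can JIT it (Numba would want essentially the same code under @njit; the function name and list inputs are mine):

```python
def iterate(z, c, n=255):
    # Fused loop: each element of z is read once and written once per
    # iteration, with no intermediate arrays at all.
    for _ in range(n):
        for i in range(len(z)):
            z[i] = z[i] * z[i] + c[i]
    return z

# Works on plain Python lists of complex numbers:
zs = [0.1 + 0.1j, 0.2 - 0.05j]
cs = [0.05 + 0j, -0.1 + 0.02j]
iterate(zs, cs, n=3)
```

(A real mandel would also track when each point escapes, but the update is where the per-pass cost lives.)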
Gotcha. I feel like I remember some numpy or scipy way of creating complex ufunc ops and applying them simultaneously, but maybe I'm misremembering or thinking np.vectorize was fast?