Copying data is wasteful, mutating data is dangerous (pythonspeed.com)
230 points by feross on Jan 9, 2020 | 153 comments



I'm not super familiar with interior mutability, but I don't think this is the same "interior mutability" as used in Rust.

Interior mutability:

A, immutable, points to B.

B, immutable, points to C.

C, mutable, points to D.

D, immutable, points to E.

Even though A and B are immutable -- B points to C mutably, so you can modify its contents. Some "interior" part of A (and of B) is mutable. You can't change which C is pointed to by B (because B is immutable, or rather we only have an immutable reference to it), but it's open season on the contents of C.
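A rough Python analogue, using a tuple as the immutable outer layer and a list as the mutable interior (only an approximation, since Python has no compiler-enforced immutability):

    # A is an immutable tuple; the list C it contains is mutable.
    c = ["contents of C"]
    a = (c,)                 # rebinding a[0] would raise TypeError...
    a[0].append("mutated")   # ...but C's contents are open season
    print(a)                 # (['contents of C', 'mutated'],)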

What this article describes:

We have 2 operations.

We are passed in Array A.

We can either:

- Apply both operations to A, modifying it even though the "owner" may not want this.

- Copy before each operation (or as part of them), which leaves A untouched, but requires triple the memory usage.

- Copy before (or as part of) the first operation only. The second operation can then be applied in place on the copy. This leaves A untouched and only requires double the memory.

They're totally different concepts. I don't know what you'd call the second concept -- I think the important part is that A remains untouched, so... "Avoiding side effects while reducing memory usage", or "How to make performant pure functions" ?
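In NumPy terms, the third option looks something like this sketch (names are illustrative, loosely following the article's normalize example):

    import numpy as np

    def normalized(a: np.ndarray) -> np.ndarray:
        low, high = a.min(), a.max()
        out = a - low        # the only copy: the first operation allocates it
        out /= high - low    # the second operation mutates that copy in place
        return out           # caller's array untouched; peak memory ~2x, not 3x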


The second notion, where it's clear that the memory space can safely be reused, is why there's interest in linear and affine type systems. These systems describe types whose values can be read at most once (affine), or, for linear types, must be read exactly once.

On the surface that can seem a bit stupid, but the practical outcome is that compilers can determine when an array's contents can be mutated in place rather than overwritten.

A variable you read over and over can be thought of theoretically as one you duplicate, then read one copy of. Since the act of reading the copy destroys it, a compiler implements this by merely reading the value directly.

Passing an array into a subroutine however means the compiler either copies it, or forbids further access after passing it in. If it's the latter, the memory space is available for re-use inside the subroutine.

Rust uses affine types in its borrow system to provide memory safety. Functional language research is looking at linear and affine types for a variety of applications, including performance.
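A toy sketch of the "forbid further access after passing it in" idea, checked at runtime rather than by a type system (the Once wrapper is hypothetical, in Python to match the article):

    import numpy as np

    class Once:
        """Toy affine cell: the wrapped value can be taken at most once."""
        def __init__(self, value):
            self._value = value

        def take(self):
            if self._value is None:
                raise RuntimeError("value already consumed")
            value, self._value = self._value, None
            return value

    def scaled(arr: Once, k: float) -> Once:
        a = arr.take()   # the caller can no longer reach the buffer...
        a *= k           # ...so mutating it in place is unobservable
        return Once(a)

    data = Once(np.arange(4, dtype=float))
    data = scaled(data, 2.0)
    print(data.take())   # [0. 2. 4. 6.]; a second take() would raise

Rust's borrow checker gives the same guarantee statically, with no runtime cost.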


For the read case, do you mean that in nonbranching code, a read can be considered a destroy+create that cancel out? How is that legal if the reads are also writes into other variables, or copied into an array/tuple and returned to a caller? That seems a clear violation of linearity.

Also, that doesn't seem true for branching structures, but I guess technically in an immutable language loop iterations are recursive functions.


That's also the impression I have, the article really has nothing to do with Rust's interior mutability. And Rust wouldn't have the issue in the first place because whether it takes the parameter by reference, mutable reference or value would clearly explain the "contract" to the caller.

It does feel a bit closer to Clojure's transients, which I guess is what it's referring to.

However there are also a number of differences with Clojure:

* Clojure can not completely prevent the caller modifying callee data (at least using "normal" Clojure collections, which are persistent)

* creating and reifying transients is O(1)

* the operation is a fairly trivial mapping of #(/ (- % low) (- high low)), which is likely what you'd have done in Clojure


This definitely made me think of Clojure transients. For reference: https://clojure.org/reference/transients

> If a pure function mutates some local data in order to produce an immutable return value, is that ok?

The idea in Clojure is that data is immutable by default but there are function-local mutation APIs specifically for optimization. The contract between caller and callee is still the same, a function never mutates your data, but transients provide improved performance and memory characteristics invisible to the caller.
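In Python terms, the transient pattern looks roughly like this sketch, with a list as the function-local mutable builder and a tuple as the immutable value handed back (Clojure's actual transients are O(1) to create and reify, unlike these copies):

    def normalized(data: tuple) -> tuple:
        buf = list(data)      # function-local mutable "transient"
        low, high = min(buf), max(buf)
        for i, v in enumerate(buf):
            buf[i] = (v - low) / (high - low)
        return tuple(buf)     # reified: the caller only ever sees immutable values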


What is callee data that a caller can access while the callee exists? Is this multithreaded code?


I don't know if it was autocorrect screwing up or brainfart / fucky fingers (likely the latter, I fear), but what I meant to write is that Clojure can completely prevent the callee from modifying caller data, since it uses persistent collections, and persistent collections are naturally immutable.

Sadly by the time I realised the screwup I couldn't edit the comment anymore (one of the really annoying things about HN alongside stripping random characters from comments and the half-assed markup).


Yeah, this is just "don't create large temporaries unnecessarily". I'm not sure it warrants a name.


It does if you want research done so compilers can implement results in this space and optimise away unnecessary copies without introducing errors.


This class of automatic, best-effort optimizations is actually bad!

Say the compiler discovers an optimization like "we don't have to copy this giant array". Now either you care or you don't. If you don't care, it was wasted work. If you do care, now you have some fragile, implicit set of preconditions to maintain the optimization. How do you document these, how do you test it?

Instead, in a high performance language, we should:

1. Make the language do the algorithmic heavy lifting. Rust's annoyingly explicit affine type system enables its deterministic memory management and various safety guarantees. Its generics are explicitly monomorphic, which enable aggressive inlining. etc.

2. The compiler's optimizer sweeps up just the details. Register allocation, constant propagation, inlining, all the classics. But not inter-procedural stream-fusing tail-calling Haskell-heroism.

The idea is to make the important optimizations explicit, and allow the compiler to make things faster on the margins and avoid the obviously wasted work.


Explicitness is nice when you are building spacecraft, but it has a real human cost. I'd rather not write all that stuff, and let my static or dynamic analyzer warn me about potential hotspots and non-optimizable code.

Best-effort is fine if it's run in an ecosystem where code changes are tested before deployment, which is everywhere performance matters.


Let's say someone includes a line

48362/37277295;

That's a no-op. (Some rational value you're not using - it's not even assigned to a variable name).

The compiler might warn you about it.

But do you think it should be allowed to just remove it - not even create the op?

Regardless of what the standard says, I'd expect the answer to be "sure."

What do you think?


I also see this as a bit of a premature optimization. The triple memory usage one may be significantly faster than the in-place one -- and the memory usage is temporary so unless you're in a constrained environment that's not a concern.

Avoiding unintentional or unexpected side effects is a good thing. Pure functions are great. Do whatever you can to keep your visualization helper thing pure and avoid bugs.

Then, once it works, if you have performance issues, do some profiling.

If the profiling shows this helper method as a hotspot, then it's worth optimizing.

At that point, you'd want to follow standard Python, numpy, and pandas optimization strategies -- which I'm not familiar with.

Aimless optimization like this is a waste of time. Good program structure (data is available when you need it without tonnes of calls), the right choice of data structures and algorithms (hashmap vs list), the right choice of underlying tools (Julia vs Python) and targeted optimization (profiling) are the ways to go.


The slowest thing CPUs do by far is access memory. I don't see any possible way that, for the given example, using more memory would ever be faster. At best 3*N all fits in cache and it's not as bad, but even then it's still at least one more cache miss, which means it will still be slower even if it's not by much. Temporary memory is the worst type of memory. It's cache misses & pipeline stalls for no point.

"Premature optimization" is not an excuse to write obviously bad code. Fixing this simple thing now is much faster, and simpler, than digging a big hole, jumping in it, and then hoping you can find something to pull yourself back out later.

In the same way it's important to scatter about asserts and verify parameter arguments do in fact conform to the documented contract, it's also important to always keep an eye out for openly wasteful code.
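A quick way to test the claim on your own machine (plain NumPy; absolute numbers will vary with array size and cache):

    import timeit
    import numpy as np

    a = np.random.rand(10_000_000)
    low, high = a.min(), a.max()

    def two_temporaries():
        return (a - low) / (high - low)   # allocates two new arrays

    def one_temporary():
        out = a - low                     # allocates one new array...
        out /= high - low                 # ...then mutates it in place
        return out

    print(timeit.timeit(two_temporaries, number=20))
    print(timeit.timeit(one_temporary, number=20))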


> it's also important to always keep an eye out for openly wasteful code.

No, that's the point. It's really not. If your "openly wasteful code" does its job well and is readable and maintainable, it DOES NOT MATTER that it is "wasteful".

If, on the other hand, the wasteful code causes your program to execute in 60 minutes instead of 10 seconds every time you run it, then it does matter and you should optimize it.

Why is this such a difficult concept?


If you're trying to use Knuth's quote from 1974 as your justification for that concept then you've misunderstood his argument and should read the full thing, not the single phrase.

This article is not advocating premature optimization as Knuth was discussing nor does the article's advice result in the evil that Knuth was warning against.

If you have some other evidence or paper justifying why we can ignore obvious inefficiencies please provide a source. That's a rather radical claim to make, though, which may be why you find people struggle with the concept.


My experience finds that the in-place optimization can indeed speed up numpy code. Even if you're not memory constrained, fewer arrays in play means less memory bandwidth needed.

>> Many NumPy APIs include an out keyword argument, allowing you to write the results to an existing array, often including the original one.

I'd wager that saving memory or memory bandwidth is the impetus for expanding the API to include this option.
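For example, the whole normalization can be done with zero new arrays when the caller opts in:

    import numpy as np

    a = np.linspace(5.0, 15.0, num=8)
    low, high = a.min(), a.max()
    np.subtract(a, low, out=a)        # a -= low, no allocation
    np.divide(a, high - low, out=a)   # a /= (high - low), still no allocation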


The article starts with "You have a large chunk of data — a NumPy array, or a Pandas DataFrame — and you need to do a series of operations on it."

So it's suggesting to use these techniques when you already know that you are working with a large chunk of data, either because it was obvious from the beginning or because profiling highlighted it. In scientific computing it is common to store gigabytes of data in a single array. In fact it is common that even the compromise-solution suggested in this article, which leads to 2x memory instead of 3x, would be too much extra overhead.

But the compromise solution is nice if you can afford 2x memory overhead, so I upvoted the article.


I think the easier way to think about Rust's interior mutability is via an example: (Cell is a mechanism for interior mutability)

  struct S {
     a: Cell<u64> 
  }

  struct T {
     b: u64 
  }
To mutate field `b` on `T`, we need a mutable object or reference. However, we can mutate field `a` on `S` via an immutable reference. Typically this is used to encapsulate some kind of interior state that isn't visible outside of the struct's methods, and helps in certain situations where using a mutable reference might cause some hair pulling due to the borrow checker


> B points to C mutably

I think you mean that B points to D mutably because C is mutable. The pointing from B to C is locked in place because of the immutability of B. However, C is free to change its pointers to some other D.


Reminds me of copy-on-write. An operation can reuse its input buffer if it holds the sole reference to it. In the scenario you described, you manually ensure and write this behavior for the rest of the pipeline.


Original author here: to reduce confusion I changed the article to use a new term, "hidden mutability".


Thanks, that seems like a much better term.

With Rust-style interior mutability, any code accessing the value will see that it was mutated, but here the caller still sees a purely functional call, since the mutation is hidden.


I've heard the term "immutable facade" before for this type of thing (though granted, usually while talking about a class interface rather than standalone functions).


As a game developer working in managed languages (aka Unity/C#) this problem is one of my biggest headaches. I was hoping the article would provide divine insight; unfortunately the recommended solution doesn't really solve the problem (as I see it).

Whilst 2n array alloc is better than 3n, both are creating short lived memory objects. Do that in a hot code path and suddenly you've got tremendous garbage pressure your garbage collector is going to choke on.

One can optimise this by calculating results into a cached array, but that creates opportunities for bugs (results of previous caches persisting). I would dearly love to see a solution that allows me to code in a functional style without sacrificing performance.
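Concretely, the cached-array approach looks like this sketch (NumPy terms for consistency with the article; the Normalizer class is hypothetical), with the staleness hazard called out:

    import numpy as np

    class Normalizer:
        """Reuses one scratch buffer across calls. Not thread-safe, and the
        returned array is overwritten by the next call; exactly the
        stale-result bug class mentioned above."""
        def __init__(self, size: int):
            self._buf = np.empty(size)

        def normalized(self, a: np.ndarray) -> np.ndarray:
            low, high = a.min(), a.max()
            np.subtract(a, low, out=self._buf)   # writes into the cached buffer
            self._buf /= high - low
            return self._buf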


Arena style allocation helps. You still get the allocations, but GC is cheap. IMO both Java and the CLR are in desperate need of multiple heaps with separate GCs to solve these problems. One big heap simply doesn't scale well. Erlang is the only example I know of that does this right. I know that the golang GC is low-latency, but I'm not familiar with whether it can sustain high allocation rates.


Ponylang does as well. I always wonder why we can't have multiple heaps. The primary argument against GCs is the stop-the-world pause. Unrelated high-throughput code can interfere with latency-critical code, and therefore you have to write the whole program as if it were latency critical.


Nope.

Golang still uses one global heap; it just allows for stack based allocation in situations it can do it (avoiding putting stuff on the global heap), so copies _can_ be cheap re: GC.


The CLR also allows stack allocation, and many of the newer .NET Core libraries have been built with a focus on eliminating unnecessary allocations.


Some of .NET Core's biggest advances in stack allocation (and resulting performance boosts) likely won't make it into Unity until around the release of .NET 5, though. It should be interesting to see how Unity adapts when the tools show up.


>results of previous caches persisting

You clear the buffer or capture it inside an optional type. When you're done with the buffer you can reset it to the default state to be used again.

Most of the issues are logical bugs though, not flaws with the pattern. You just need to reason about the computation that you're doing to preallocate exactly how much data you need outside your hot loops.

You can read some audio source code if you want, this is an extremely common pattern. Preallocate for your computation up front by exactly how much you need (and often with cache-friendly data structures) and don't get crafty reusing buffers where you don't need to.


Are you familiar with data-oriented design? The game-developer equivalent of the array-processing scientific programmers do.

I suspect if you look at how they do things you'd find some domain-specific design patterns.

https://en.wikipedia.org/wiki/Data-oriented_design


Is there any further reading on this in the scientific programming space?



On certain use cases, you can use a pool of objects: long-lived objects drawn from and returned to a pool, reusing the memory efficiently with minimal overhead. Of course, you need to be able to size such a pool to begin with.

Disclaimer: I have no experience with game development whatsoever!
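A minimal sketch of the pool idea (in Python/NumPy terms to match the article; shapes and counts are made up):

    import numpy as np

    class BufferPool:
        """A fixed set of long-lived, same-shape buffers, reused instead of allocated."""
        def __init__(self, shape, count: int):
            self._free = [np.empty(shape) for _ in range(count)]

        def acquire(self) -> np.ndarray:
            return self._free.pop()    # raises IndexError if the pool is exhausted

        def release(self, buf: np.ndarray) -> None:
            self._free.append(buf)     # the caller must stop using buf after this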


The only language I know that solved this is Rust. The borrow checker assures that you don't give away a mutable reference to your array and then use it somewhere else.

E.g. if you have a function that mutates and one that doesn't, you can't accidentally use the mutated array, since the borrow checker won't allow you to use it afterwards.

But if you still want to use the mutated version you have to explicitly return the mutable reference to the caller, making it immediately obvious what you're dealing with.


Lua coder here, I'm also used to the problem of constructing GCed objects in hot loops. In my experience, many problems have solutions that are really hard to generalize.

In the example of the blog post, you can reduce the resulting data to the following:

- The original array (already exists)

- The minimum (Integer, not GCed)¹

- The maximum (Integer, not GCed)¹

- A normalize(int, int, int):int function (can be re-used)²

None of the above give the GC any additional work, but passing them around is way more annoying than just making a normalized copy of the array; and abstraction into a class that encapsulates all of them would mean you have at least one GCed object again, which you don't want.

¹ I don't know how exactly this works in C#, but in Lua integer values aren't garbage-collected, as copying them around directly is no more expensive than copying a reference, and they're immutable anyway. I assume it's the same or at least very similar in C#.

² Instinctively, I want to turn this into a closure that just takes one value and closes the min and max values, but that'd be another GCed object, which we don't want in hot loops.


Imo you have to allocate something else if you are going with an immutable paradigm. You can do clever sharing of data structures, but ultimately if you're modifying the whole thing, you're modifying the whole thing and sharing can only go so far. It seems like further advances in this space either require giving the compiler a better understanding of what is "actually" shared and in what way (i.e. Rust and knowing when mutations are externally visible), or making copying cheaper.

Mutability is always (at least as current computer architectures are concerned) going to be faster in a general sense.


You may be interested in the `im` crate for Rust, which implements efficient copy-on-write structural sharing, falling back to in-place mutation when there are no other outstanding references.

https://docs.rs/im/14.1.0/im/


> It seems like further advances in this space either require giving the compiler a better understanding of what is "actually" shared

In the context of arrays, it helps when programmers use higher-order combinators/functions, e.g. maps.

There is a mutable way to do a map, and an immutable way, and which should be used in the compiled program depends on analysis of where the variable is used.


> both are creating short lived memory objects

Might also cause some other weird side effects. Some older versions of C# (or more specifically the .net CLR) had some issues with the heap space for large objects when it got used by external code. I believe the reason was that data allocated by external libraries had to be placed sequentially on the heap due to some marshalling constraints (or maybe the LOH always allocates sequentially?). After a while the heap got fragmented, and you could get into a situation where all memory checks told you that you could still allocate another large chunk, but the subsequent allocation crashed in a not so graceful manner since the memory wasn't available in sequence.

Took quite a time to even understand what was happening. I didn't really follow up to how the CLR-developers solved this issue. This was ages ago and has now been fixed. Certainly an interesting problem.

C# has this awesome using statement, so you can at least indirectly influence when garbage collection occurs and ensure objects are as short-lived as possible.

I don't think there is a golden way to Rome. Personally I think keeping the lifetime of any object as short as possible is still the way to go, and it helped with our problem. The GC of C# is aggressive towards young objects that it assumes to be short-lived. But I would never recommend writing code with the GC in mind until it starts to be an issue. Which might be right at the start for game development, but otherwise I don't really care too much about keeping temporary copies.


`using` is for resource management scopes (disposal), not garbage collection.

Some disposal actions will dereference objects (this.thing = null) and thus add garbage and therefore garbage pressure for a GC run to clean up, so it is an indirect influence, but it is very indirect, and not what `using` and `IDisposable` are meant to do.

You shouldn't rely on `using` for GC management, and making things that don't manage resources (such as file handles, sockets, etc) `IDisposable` isn't a good way to ensure they are short-lived (and doesn't add any additional signal to the GC that the objects are short-lived; some `IDisposable` objects are incredibly long-lived).

Better tips to managing short-lived objects are to keep them as small as possible, avoid unnecessary references to/from other objects, avoid accidental captures into lambda closures, and consider if stack allocation makes sense for your scenario.


> I would dearly love to see a solution that allows me to code in a functional style without sacrificing performance.

Uniqueness types would make mutation easier to track: you'd still need to write in an imperative style, but with less to worry about.

I think the "functional-style" stuff in this area uses compilers (i.e. accelerate or futhark) that can analyze whether copying is necessary because the programmer uses various high-level functional combinators. I think J does something like this as well (?)


If you have precise reference counts (which is often achievable in array languages because they don't have cycles), then you can dynamically do efficient in-place updates if the array you are updating has a count of 1.

Uniqueness types are only necessary if you want to make the in-place update a static guarantee (which can be crucial if the code is going to run in an environment where allocation is not possible).


Linear typing is obviously ideal, but you can also version expensive-to-clone objects, and use a struct to reference them with a pointer and version number.

That way if you mess up your algorithm and attempt to access the old version of an object from a stale reference, you at least get an exception.

If you use a struct it makes the pointers fat but there's no allocation.


If you need the highest performance, you're probably always going to be stuck with some ugliness.

If the code is isolated to some specific logical module, like the game AI, you can bury it in a private helper method that at least hides the details from callers.


Just use a bool parameter `inplace` (or a reference `out`), defaulting to false, but passing it as true for the "hot path" code that needs in-place modification. You get performance, non-duplicated code, and an API style that data-science and ML people are already used to from Numpy, Pandas etc.


If you need to send a temporary array through a number of operations (normalise, then visualise, in this case), you could also set up the stream of operations and send the elements of the array through it one at a time, or in short batches.


Streaming and lazy evaluation is definitely a solution here. But do python libraries support it well?


You can't have true lazy evaluation because you still need the maximum and minimum to normalise them. But that's just two fields that you need only once.

I have no idea whether python or any other language supports this out of the box, but I would expect there are libraries that do this, and if not, it shouldn't be too hard to write this yourself. The main thing is probably that you need to separate the normalisation parameters from the normalisation itself, and after that, you can simply do something like (in javascript):

  array.forEach(e => visualise(normalise(e, min, max)));
Of course visualisation is probably going to involve a new datastructure that's going to hold this data in some way, so there's still going to be another copy of that array living somewhere. That's unavoidable.
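In Python, the same shape is a generator: two O(1)-memory passes for the parameters, then one streaming pass (assuming data is a sequence that can be traversed more than once; visualise is the article's hypothetical consumer):

    def normalized_stream(data):
        low, high = min(data), max(data)   # two passes, O(1) extra memory
        for x in data:                     # third pass streams one element at a time
            yield (x - low) / (high - low)

    for value in normalized_stream(data):
        visualise(value)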


Do game devs write custom allocators?


The simplest allocation/deallocation pattern is to delete everything after the unit of work [0] has finished. The downside is that the "unit of work" has to be defined by your program.

[0] HTTP requests, video frames, etc


Almost always (for big budget ones).


If you can't find a good implementation for an interface perhaps the problem is the interface. Inline the function, refactor, then express the result via re-usable components. Ultimately it is a problem of code re-use vs performance. If you want divine insight it is probably to use lazy evaluation, so for the example in the article express normalize as a generator.


> But mutation is a cognitive burden: you need to think much harder about what your code is doing.

One way to alleviate the problem is to use naming convention.

  normalize(a)       # mutates in place. note: no return value.
  b = normalized(a)  # no mutation. returns a new instance. note the adjective.
Python does this in the standard library

  a.sort()
  b = sorted(a)


I like Ruby's way of declaring whether a method mutates the object or not: an exclamation mark at the end tells you it modifies the receiver.

a = a.sort or a.sort!


I like the idea of adding an exclamation or some other mark to the individual parameters in the method declaration since a method might mutate some but not all parameters.


Julia has this pattern as well.


in-place operations give me the heebie-jeebies.

I think this is something that is slightly less obvious when dealing with classes. I see stuff like this in academic code all the time:

  class SomeDataObj:
    def __init__(self, data):
      self.data = data
    
    def normalize(self):
      self.data = normalize(self.data)
I'd much prefer there to be a variable `self.data_normalized` which is initialized to None, or even better a getter `getNormalizedData()` that computes/caches the normalized data.
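Something like this sketch, assuming a pure, copying normalize() as in the article:

    class SomeDataObj:
        def __init__(self, data):
            self.data = data
            self._normalized = None        # cache for the derived value

        def get_normalized_data(self):
            if self._normalized is None:   # computed once, on first request
                self._normalized = normalize(self.data)
            return self._normalized        # self.data itself is never touched

(In modern Python, functools.cached_property expresses the same pattern.)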


The pythonic way would be to have sort() and sorted() which either sort in place or return a sorted copy. Best of both worlds.


Command Query Separation is a very good general idea in programming that should be followed more often. Sadly it went out of fashion in 90s when everybody got hyped about OOP.


I write and maintain a generic multidimensional array package for Go, and I ran into the same problems as the OP.

The solution was to dictate that the API is copying. But also provide ways to allow the user to manage their own memory.

e.g.

    a := tensor.New(tensor.WithShape(2,2), tensor.Of(Float64))
    b, err := a.Apply(sigmoid)                      // copies
    c, err := a.Apply(sigmoid, tensor.WithReuse(a)) // mutates
Docs: https://godoc.org/gorgonia.org/tensor#Dense.Apply


It’s convention to do this in Julia as well, every package I’ve used that mutates marks the explicitly mutating function with a ! at the end.


This is consistent with Ruby. It often includes both in the core. There's .gsub() and .gsub!() for text substitution in the String class, for example, or .select() and .select!() for the Hash class.


This is interesting to know, because Julia uses exactly the same approach[1] (now I know where it comes from). To me it seems to be the ideal solution: use the in-place normalize!(X) where performance is important, otherwise use the more convenient allocating variant normalize(X). The exclamation mark makes it clear that the array is modified.

[1] https://docs.julialang.org/en/v1/base/collections/#Base.sum!


>(now I know where it comes from)

Both Julia and Ruby got this from Scheme and it might be even older than that.


Unfortunately it isn't completely consistent, for example Array#delete and String#concat


Well, yes. Julia here is consistent with something in Ruby. In my experience, one of the three major complaints about Ruby is how inconsistent it can be within itself.

The other two major ones are the number of core ways to do the same thing, and the slowness of the interpreter. The first is a matter of taste. The second is a technical issue with the interpreter mostly independent of the language, and/or a question of fitness for specific projects.


I use this convention whenever I can

I spent far too much time yesterday figuring out why nothing was being replaced, when I called replace on a string in Javascript


Or better yet only have normalize(), mark it [[nodiscard]] and it will be pretty much impossible to fuck up.


I'm surprised no-one has mentioned Haskell's ST library yet. It hits this exact use-case, and comes with some pretty sweet safeguards from the type system.

https://wiki.haskell.org/Monad/ST

It allows you to mutate arrays and other stuff "on the inside" like this article suggests, but it's also able to wrap up the operation so it appears pure from the outside, so the caller doesn't have to guess or rely on naming conventions to figure out the mutability.

It's your choice on whether you want to do the copy on the way in ('thawing'), on the way out ('freezing'), or both, by using the 'in-place' variants of thaw and freeze.


Exactly. The only problem is that you need a language with a sufficiently powerful type system that can express both phantom and existential types.


> The only problem is that you need a language with a sufficiently powerful type system

why is that a problem?


Because everyone and their grandma can whip up a quick Python script after a few days of practice, whereas learning Haskell is a much bigger investment?


Problem in the sense that most typed languages don't have a sufficiently powerful/expressive type system.


This seems like a problem that would be suited for an optimizing compiler.

You could imagine some hypothetical language in which all objects are immutable, as far as the programmer is concerned - however the compiler is allowed to reuse objects behind the scenes if there is provably no way that a side-effect could be introduced.

E.g., in such a language, the first variant would be the only legal form to write the normalizing function:

  low = array1.min()
  high = array1.max()
  array2 = array1 - low
  array3 = array2 / (high - low)
  return array3
However, this could be safely turned into the following during compilation:

  low = array1.min()
  high = array1.max()
  array2 = array1 - low
  array2 /= (high - low)
  return array2
... because array2 is not visible outside the function and therefore could not cause side-effects.


"You could imagine some hypothetical language in which all objects are immutable, as far as the programmer is concerned - however the compiler is allowed to reuse objects behind the scenes if there is provably no way that a side-effect could be introduced."

There are functional languages that do this. Like many other optimizations, you get less than your intuition might naturally expect, but it does work.

A trivial case is "tail call optimization", which most immediately involves not creating a literal in-memory stack record for recursive invocations, but can also easily involve the compiler re-using the memory for the arguments. IIRC Erlang will do that as much as it can in tight recursive calls.


> You could imagine some hypothetical language in which all objects are immutable, as far as the programmer is concerned - however the compiler is allowed to reuse objects behind the scenes if there is provably no way that a side-effect could be introduced.

Erlang/BEAM fits your hypothetical to some degree. Erlang the language has immutable variables; however, BEAM, the virtual machine, has mutable registers. Whether or not a given code sample results in copy or mutation depends on what information is available to the compiler, of course.


It is theoretically possible that a library like numpy or pandas could do this directly. That is, if you take each of these transformation operations and store them, but don't actually run them until needed (i.e., when someone accesses the individual array values), you only need to make one copy, and the programmer doesn't have to worry about this at all.

I was doing something similar for SQLAlchemy recently. SQLAlchemy is all about capturing Python operators and operands into data structures that represent intent which can be invoked later.

edit: here is a demonstration: https://gist.github.com/zzzeek/caa4a7ed94f326fbbc031acecb9d7...

edit edit: it is actually possible with numpy by using the Dask library: https://dask.org/
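A toy version of the record-now-run-later idea (not the Dask or SQLAlchemy API, just the shape of the thing):

    import numpy as np

    class Deferred:
        """Records in-place operations, then applies them all to a single copy."""
        def __init__(self, array):
            self._array, self._ops = array, []

        def map(self, op):
            self._ops.append(op)       # nothing runs yet
            return self

        def compute(self):
            out = self._array.copy()   # the only copy, however long the chain
            for op in self._ops:
                op(out)                # each recorded op mutates `out` in place
            return out

    data = np.arange(6, dtype=float)
    result = (Deferred(data)
              .map(lambda a: np.subtract(a, a.min(), out=a))
              .map(lambda a: np.divide(a, a.max(), out=a))
              .compute())              # `data` itself is never modified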


The problem with the library doing this is you end up with the same issue as in Haskell: you accumulate thunks, and both memory use and the location of CPU hotspots become much harder to predict and troubleshoot.


so...turn off the "thunking" feature while you are debugging.


That won't help when you're debugging the performance implications of thunking.


Comparing memory / call counts / time spent with thunking turned on vs. off would give a good idea of it.

I can say for my side of things thunking is a super huge performance gain, as it allows for quick gathering of expression intent and then the computation side on the other side of a cache.


You can use Futhark; it compiles to PyOpenCL and then can be called by python fairly fluently.

Accelerate is a nice example of an array library with a JIT backend.


Alas, CPython isn’t an optimizing compiler, it faithfully compiles code into bytecode which is then faithfully executed.

I don’t know the state of NumPy on PyPy since I don’t use PyPy.


It does have a peephole optimizer.


numba.jit might be able to optimise the extra copies away though, I'm reasonably sure working with numpy arrays is a pretty significant use case.


There's another project, Pythran, that does that.

I'm not sure if it's doing inlining before the laziness analysis, which would be ideal.

[1]: https://pythran.readthedocs.io/en/latest/INTERNAL.html#lazyn...


This is rather dumb. The only reason there is an intermediate array in the original copy-based code is that the normalization is badly written. After you obtain max and min, normalization is a single map() function, not two.
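With plain Python values, that single pass is just a comprehension (for NumPy you'd want a fused/vectorized form instead, e.g. via numexpr, since a Python-level loop over a large array is slow):

    low, high = min(data), max(data)
    normalized = [(x - low) / (high - low) for x in data]   # one pass, one new list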

As for the overall problem: always choose immutability and then make it faster by first taking advantage of optimization patterns that immutability allows you to use safely. In a case like this, doing the normalization lazily is what springs to mind.

If memory allocation and de-allocation is an issue, use more explicit memory management. Though in my experience it is rarely the actual problem. There's basically no chance that it is an issue here; Python interpreters aren't stupid enough to give back memory to the OS immediately after a variable falls out of scope.


Do you have any links to some examples or help on how to do this exactly?

Reason being that in our group we have a ton of normalization code that looks exactly like the 'bad' example from this post. As we use relatively large datasets (4D MRI images), memory is a bit scarce...


Pytorch has a built in normalize function: https://pytorch.org/docs/master/_modules/torchvision/transfo... It just subtracts the mean and divides by standard deviation, both in place.


How I would summarize the article is that mutation should be confined to internal objects that have not been published yet or are temporaries that are thrown away.

Theoretically, you can make every API mutate data. If the user didn't want to mutate data, they can explicitly make a copy first. Whereas if there only exists an immutable API to return new objects, then it's hard or impossible to get the benefits of mutating in place.

As far as I know, only Rust, C++, and C give support at the type-checking level to denote which functions will modify the interior of their arguments. This way, you can distinguish behaviors easily - for example in Rust:

  impl Vector {
      // Reads this vector and returns a new, normalized one.
      pub fn normalized(&self) -> Vector { ... }
      
      // Reads and modifies this vector in place. (Named differently because
      // Rust doesn't overload methods on the receiver type.)
      pub fn normalize(&mut self) { ... }
  }
The article gave this subtly flawed example in Python:

  def normalize_in_place(array: numpy.ndarray):
      low = array.min()
      high = array.max()
      array -= low
      array /= high - low
    
  def visualize(array: numpy.ndarray):
      normalize_in_place(array)
      plot_graph(array)
  
  data = generate_data()
  if DEBUG_MODE:
      visualize(data)
  do_something(data)
In Rust, it would be a type error because visualize() takes a reference but normalize_in_place() takes a mutable reference:

  fn normalize_in_place(array: &mut numpy::ndarray) {
      let low = array.min();
      let high = array.max();
      *array -= low;
      *array /= high - low;
  }
  
  fn visualize(array: &numpy::ndarray) {
      normalize_in_place(array);  // ERROR
      plot_graph(array);
  }
  
  let data: numpy::ndarray = generate_data();
  if DEBUG_MODE {
    visualize(&data);
  }
  do_something(&data);


> As far as I know, only Rust, C++, and C give support at the type-checking level to denote which functions will modify the interior of their arguments.

1. C and C++ allow casting away constness, which may or may not be UB depending on how the parameter is defined

2. Swift also provides this ability to an extent: struct parameters have to be flagged `inout` to be mutable

I'm sure there are others.


Rust can throw away constness in unsafe blocks, which carries the same caveat emptor that const_cast carries.


Only in certain circumstances, though.


Casting away constness in C++ is something that is virtually never done, and has to be done very explicitly.


We must have very different notions of "virtually never": https://github.com/search?q=const_cast&type=Code

> has to be done very explicitly.

That doesn't make it visible to the caller.


That is how strings work in Delphi/Pascal.

All strings have a reference count. When you change a string and the ref count is not one, it is copied before the change. If the ref count is one, it is not copied.


Seems like "both libraries make copies of the data, which means you're using even more RAM" is an assumption (that copying data leads to more RAM usage). I'm guessing the libraries (haven't used them myself, nor Python, so not sure how feasible this is) could be implemented like a persistent data structure (https://en.wikipedia.org/wiki/Persistent_data_structure) and therefore get both immutability (which leads to fewer bugs, no mutability) and not that high RAM usage. Clojure is a famous example of something that implements persistent data structures.


> can be implemented like a persistent data structure

you don't get much benefit from persistent structures if every element of an array is modified, like in the examples in the post.


If the data structure has a reference count of one, you can safely mutate the data structure instead. Many persistent immutable data structure libraries use this as an additional optimization.


I missed that, sorry. Thanks for clarifying!


There are also data structures such as zippers which are immutable but minimize the amount of data you need to copy, at the cost of potentially using more space overall. Zippers are cool because they can be generalized to many different types of data structures as long as they can be expressed as algebraic data types.


If your language allows mutation:

1. Immutable first.

2. Data structures should use partial copying [1].

3. Data structures should be lazy by default.

4. When you've profiled a bottleneck and identified memory allocation and copying as the culprit, use a mutable data structure internal to the operation (copy your lazy immutable data structure once into the mutable one, taking only the portion of your lazy structure that you need, operate on the mutable structure in place, then return an immutable, lazy copy of that structure out of your operation).

This will isolate the need to reason about mutation to the performance-critical parts of your application/library, and let you still reason about the rest of your program with referential transparency, while avoiding performance bottlenecks.

Never use a mutable structure and a mutable reference at the same time. You don't need both.

1. https://www.google.com/url?sa=t&source=web&rct=j&url=https:/...


In other languages this is less of an issue, because function signatures can include information about whether the argument can be modified or not. For example, in C++:

    // Definitely doesn't modify 'array', because 'array' is copied.
    vector<int> Normalize(vector<int> array);
    // Almost certainly doesn't modify 'array'.
    vector<int> Normalize(const vector<int>& array);
    // Probably *does* modify 'array' (otherwise author would have used
    // const reference instead of pointer.)
    void Normalize(vector<int>* array);
In Python every function uses the 3rd approach, so you have to signal to the user in some other way whether the argument will be modified or not. In Ruby and some other languages the convention is to append "!" to the function name, but in Python most people seem to just mention it in the docstring.


There is some convention in Python - for example `reverse` and `sort` are mutating, whereas `reversed` and `sorted` are not.

However, I don’t know if there is a good reason why the mutating versions are methods and the non-mutating ones are standalone functions.


Why does this blog post have a recaptcha?

In addition, I just finished reading the Rust book last night, and I don't believe this is what interior mutability is (at least as used in Rust).


I'm using ConvertKit for newsletter signups, and choices are either:

1. Validate emails with recaptcha (which is why you see that thing in the corner).

2. No email validation at all, so random people get spammed with confirm-you-want-to-subscribe emails.

I am not super-happy with having to make this choice :(


For the email box, presumably he doesn’t want bots entering emails.


I see no recaptcha.



> To reduce memory usage, you can use in-place operations like +=

btw, I think it's a design flaw in python that `a += b` is not always equivalent to `a = a + b`.


As someone who has never used Python seriously I have to ask... why not?


As an optimisation, you can override += separately from +. A pretty well known oddity related to this is:

    >>> a = ([0],) # a tuple (immutable) containing a list (mutable)
    >>> a[0] = a[0] + [1]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'tuple' object does not support item assignment
    >>> a
    ([0],)
    >>> a[0] += [1]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'tuple' object does not support item assignment
    >>> a
    ([0, 1],)
Though to be fair, technically nothing prevents "+" from mutating either operand (it'd just get you tarred and feathered).


Wait...did it throw an exception, yet still add the item to the list?

EDIT: Tried it myself. Sure enough, it throws the exception, but still adds the item. This seems like a bug.

EDIT Part 2: I understand now. `a[0] += [1]` is internally translated to `a[0] = a[0].__iadd__([1])`. The right-hand side is interpreted, mutating a[0] in-place as expected. But then the re-assignment happens, which throws the exception since you're trying to re-assign a value in a tuple, which is immutable.


You got it, that’s why it’s so odd.


It's because of Python's distinction (enforced mainly by convention) between references and value types.

If you are using a value type, `__iadd__`[1] can return a new value. So:

   x = 5
   y = x
   x += 3
   y is not x and y == 5
But if you're using a reference type, it is allowed to mutate the reference and return the original reference.

   x = [1, 2]
   y = x
   x += [3, 4]
   y is x and y == [1, 2, 3, 4]
[1]: https://docs.python.org/3/reference/datamodel.html#object.__...


Python defines += as the __iadd__ method, which is sometimes mutating, depending on whether the object being assigned to is mutable or not.


Because I once learned that `a += b` is just a shortcut for `a = a + b`, just syntactic sugar. Now I have to constantly remind myself that that's not the case in Python.

edit: I might have misunderstood your question. If you meant "why is it not equivalent?", please see falkaer's answer.


It's not equivalent but honestly this gotcha with lists will only show up in bad code and interview quizzes (i.e. bad code).


They aren't equivalent in C++ either.


My C++ is a bit rusty. Does it manifest itself in the standard library as well? The problem with Python (in my view) is that this behaviour is implemented in the standard library and therefore propagates to the 3rd-party libraries by convention.


> Does it manifest itself in the standard library as well?

Yes. For example for std::string `a = a + b;` will (at least notionally) create a temporary string and then use the assignment operator. `a += b;` will not do this.


The problem and solution are broadly fine, but having done a bit of performance critical work, I usually dislike when a function allocates without a) me asking, b) me having the option to tell it not to.

I like passing in pre-allocated containers to house the results of my computation. That way I have not only predictable memory consumption, but I can also avoid costly allocation (and de-allocation) in tight loops. Having worked with Go a bit, there were times when I used this (like passing in byte slices to readers), but oftentimes you couldn't just tell a function not to allocate (which it did for safety), because you and only you knew it'd be safe to modify the structure in place.

The bottom line is that people should not be "always copy/always use pure functions" or "always use mutable structures, because performance!!!", but be aware of what the upsides and downsides of each are.


Better pattern is to follow the style of Numpy or Pandas and other libs (that OP also mentioned, but then failed to apply in his own examples), letting the caller/user of your functions choose, if/when they want better performance, to opt in to in-place modification, like have either:

    def normalize(data, inplace=False):
        if not inplace:
            data = data.copy()
        low, high = data.min(), data.max()
        data -= low
        data /= high - low
        return data
or:

    def normalize(data, out=None):
        low, high = data.min(), data.max()
        if out is None:
            out = data.copy()
        out[...] = data    # ensure out holds the input (no-op if out is data)
        out -= low
        out /= high - low
        return out
Please, do follow this pattern for any libraries you release, having value-semantics-by-default to prevent shooting yourself in the foot + opt-in in-place mutation option for better memory performance is the right thing, and it's quite easy too with Python, Numpy and Pandas!


Julia has this as well, functions with a '!' at the end mutate the data, functions without make a copy.

So you have the functions 'filter' and 'filter!', 'sort' and 'sort!' etc.


Even if your API provides both choices, you can use the pattern described in the article to make the copying variation more memory efficient.


Scala's collections API also uses some mutable internal data structures locally (and temporarily) within some functions that present a pure API.

List[T] is one of the classic immutable data structures, and a List[T] cannot have its contents modified. But within the prependedAll method, it can create a mutable list builder (ListBuffer) to do its work building the new list that it will return.

So, if you follow this (contrived) example in your IDE, calling this:

  val f = List(4, 5, 6).prependedAll[Int](Array(1, 2, 3))
will take you inside StrictOptimizedSeqOps, which does this:

  override def prependedAll[B >: A](prefix: IterableOnce[B]): CC[B] = {
    val b = iterableFactory.newBuilder[B]
    b ++= prefix
    b ++= this
    b.result()
  }
The mutable data structure doesn't escape the function, so to the caller it remains referentially transparent.


There's a 3rd option - keeping a log of modifications and applying them all destructively to just one temporary copy, only when the result is needed.

    A = normalized(X*b - M*N*k); //creates a pipeline
    A.calculate(); //calculates all the stages destructively on the new copy


I think one way numerical computation APIs can offer optimizations of this sort is to expose a chaining API.

For eg:

chain(my_large_array).op1().op2().op3().result()

Using this, the chain() start could copy the input data once, but thereafter do a series of in place mutations in op1, op2 and op3 to improve performance.

Do numpy APIs provide such facilities?


Rust iterators combine like that if the operations are per-element (iter + map + collect).

For optimizing complex access patterns you'd need a language like Halide.


Or just use an immutable language or library and you'll have the best of both worlds. Unless you're writing an operating system. But the example was given in Python so I guess it's perfectly fine.

The idea of immutable data structures is just like strings in Java or other modern languages: they seem like reference types but work like value/primitive types. The JVM maintains a constant pool so that all "foo"s in the same virtual machine point to the same reference. Immutable data structures work in the same way.

On top of that, there's more underlying sharing going on. List(1,2,3) might actually share the same List(1,2) with List(1,2,4) to minimize the waste.


This seems more like "defensive copying" than "interior mutability" to me.

Ideally you'd have a compiler that could perform optimizing magic to detect when the copy isn't needed, but that's not going to happen in CPython.


This is why you want a two-tiered API: High level and low level. The high level API chooses a static trade-off in performance, size, etc for the 80% case, and the low level API gives more explicit control over what's going on.

In this case, you could create the function normalized() for the 80% case, and normalize_in_place() for users who need to optimize this path (choosing to take on the extra responsibilities that come with it).


I think Futhark approaches this with uniqueness types. A bit strange (maybe that's just me?) but it works well with arrays.

Also it compiles to Python nicely.


What about lazy evaluation though? That could get the function's 2*A memory down to just A while still keeping the time O(n)¹

Of course, maybe the author just chose to limit this blog post to this one solution, but it wouldn't have hurt to at least mention some further optimizations :)

¹ And some O(1) data like the min and max values of the Array


Specifically on the point about Pandas, last time I checked the inplace flag did not save any memory. Under the hood, it creates a temporary copy which is copied back to the original dataframe, and then the temporary gets removed next garbage collection cycle. If that is no longer the case I would love to know about it though!


Matlab gives you these semantics by default. IMO, it's a big reason why it has been so successful with non-CS programmers. Mutation is often more intuitive, but as long as you break your code into functions, you automatically limit the scope of mutability bugs that can occur.


Isn't this solved perfectly with an affine or linear type-aware optimizing compiler? Where all data is immutable unless the types prove to the compiler that there are no mutation conflicts. I believe Rust implements this and Idris 2 does so as well with quantitative type theory.


Rust would be very different and the problem wouldn't exist in the first place, because the function would signal its behaviour and intent through taking the parameter by reference, mutable reference or value.


That's what I meant


This is just a straightforward memory optimisation; nothing to do with interior mutability.


Another option is to have a data structure that you have frozen; you then unfreeze it for the hot path and freeze it again when you exit.
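NumPy can express this directly with the writeable flag (re-enabling it requires that the array own its data):

    import numpy as np

    a = np.arange(5, dtype=float)
    a.flags.writeable = False   # freeze: any in-place write now raises ValueError
    # ... hand the array around safely ...
    a.flags.writeable = True    # unfreeze for the hot path
    a *= 2.0                    # mutate in place
    a.flags.writeable = False   # freeze again on the way out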


Does the Numba JIT do this? This definitely seems like a compiler optimization that I want to mechanically run on my code


Don’t functional data structures basically “diff” data, which is a reasonable compromise?


I have been mutating data for decades without any unexpected behavior gremlins.


Another option is for the language's runtime to use copy-on-write (COW):

https://en.wikipedia.org/wiki/Copy-on-write

So basically every buffer can be made immutable and then only copied when written to, using something like refcounts and diff trees to internally track mutations from an immutable parent buffer. Languages like Clojure do this to manage state. I believe Redux also does this internally, although I've only studied it and haven't had a chance to use it professionally.

In practice, this looks like a language where everything is value-based instead of reference-based.

So like, in C# where you pass objects by reference, a COW language passes everything by const value. Then an imperative language statically analyzes the code to detect mutation (my own theory is that this can never be accomplished with 100% certainty, at least not with a simple implementation). So basically buffers act like Git instead of byte arrays, creating a new snapshot with every mutation.

Or functional languages can disallow mutation altogether, or reduce the code into its most minimal representation and handle mutation at special breaks in execution (like monads). I'm grossly oversimplifying all of this and probably using the wrong terminology, but am more than 50% confident that I can derive what I'm talking about, so will just go with it hah.

I think that the complex imperative implementation I'm talking about is probably Rust. The simple functional implementation would look like a pure immutable Lisp running within the Actor model, handling state only through IO, and dropping monads altogether. The catch being that complex functional code might have so many breaks in execution that it would practically be happening on every single line and begin to look like an imperative language with all its flaws. This is the primary reason why I shy away from impure functional languages like Haskell, because I don't think they have solved this core issue of mutation clearly enough (not to mention, I think functional language syntax is generally write-only and not readable by others or yourself in 6 months hahah). As far as I'm concerned, this is still an open problem.

In another life, I want to make an imperative language that's primarily immutable, that uses higher order functions (or better yet, statically analyzes code for side effects and converts foreach loops into higher order functions automagically), and transpiles to a purely immutable Lisp. It would be associative array-based and look like Javascript. The idea being that we could write code in the human-readable imperative style and let the computer distill it down to its pure functional equivalent.


> not to mention, I think functional language syntax is generally write-only and not readable by others or yourself in 6 months

This is rather tangential, and sorry to hijack your message to get on my soapbox, but I and many others who program in Haskell and a dynamic language (Python in my case) find that Haskell code we've written is far easier to come back to than code we've written in the dynamic language.


Ya I tend to agree actually, but in the real world I keep getting scolded for being too "functional". I think that it might come down to the easy != simple argument, which the whole CS industry is still grappling with in other areas as well.


I'm also thinking of an imperative language that's primarily immutable and uses higher-order functions (and some other techniques to get performance). I think this is the future of programming and can be taken much further than current languages.

<shameless self promotion> I'm working on a language for this. Right now it uses LLVM to compile to native and uses HAMT (see Clojure) to store data. It looks similar to Javascript. https://github.com/Floydlang/floyd

There are no references in Floyd, only values. This opens possibilities of great performance without complex/risk analysis of ptrs and aliasing.

The next phase is to create an intermediate AST (equivalent to your idea about compile-to-LISP) that lets programmer/tool tweak how the data structures map to the HW.

I'd love to pick your brain about these ideas!


> The idea being that we could write code in the human-readable imperative style and let the computer distill it down to its pure functional equivalent

Here we are, folks pulling their hair out for decades over how to most efficiently trans/compile higher-order FP idioms to minimal-overhead imperative code --- and you want to go the other way around!

Kidding aside, the "functional sub-language" of cell-lang.net has a feature like this, functions are pure but you get the imperative sugars (loops, branches) while preserving purity guarantees like referential transparency.


For reference, Redux doesn't "do anything" internally. It only calls the reducer function _you_ provide, and you are supposed to follow the rules of immutable updates inside that reducer:

https://redux.js.org/recipes/structuring-reducers/immutable-...


Copy on write?



