Rust: Dropping heavy things in another thread can make your code 10000x faster (abramov.io)
481 points by timooo on May 30, 2020 | 273 comments



Some important things I think people should note before blindly commenting:

* The example code is obviously contrived. The real gist is that massive deallocations in the UI thread cause lag, which the example code proves. That very thing can easily happen in the real world.

* I didn't see any difference on my machine between a debug build and a release build.

* The example is performing 1 _million_ deallocations. That's why it's so pathological. It's not just a "large" vector. It's a vector of 1 million vectors. While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon, and which would likely suffer the same performance penalty.

* Rust is not copying anything, nor duplicating the structures here. In the example code the structures would be moved, not copied, which costs nothing. The deallocation is taking up 99% of the time.

* As an aside, compilers have used the trick of not free-ing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical, we do deallocation trickery in the real world.
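For anyone curious what that exit-path trick looks like in Rust, here's a minimal sketch (the data structure and function names are made up purely for illustration):

    use std::mem;

    fn do_work(_data: &[Vec<String>]) {
        // stand-in for the compiler's actual work
    }

    fn main() {
        // Build a compiler-sized pile of tiny allocations.
        let huge: Vec<Vec<String>> = (0..1_000_000)
            .map(|i| vec![i.to_string()])
            .collect();

        do_work(&huge);

        // Skip the ~1 million destructor calls on exit; the OS reclaims
        // the whole address space when the process terminates anyway.
        mem::forget(huge);
    }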


>As an aside, compilers have used the trick of not free-ing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical, we do deallocation trickery in the real world.

This reminds me of the exploding ultimate GC technique [1]:

> on-board software for a missile...chief software engineer said "Of course it leaks". ... They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits its target or at the end of its flight, the ultimate in garbage collection is performed without programmer intervention.

[1]: https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...


Yes, Java also has a garbage collector that does nothing, called the Epsilon GC, intended for short-lived programs and as a reference for garbage collector benchmarks.[0]

[0]: https://blogs.oracle.com/javamagazine/epsilon-the-jdks-do-no...


The bumper sticker is “Ada programmers prove the algorithm terminates.”


I guess the bigger question is why they use dynamic memory allocation in the first place.


>As an aside, compilers have used the trick of not free-ing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical, we do deallocation trickery in the real world.

And then someone tries to use your compiler as a service (code analysis, change-triggered compilation) and it's a dead end.


Well, then that’s not the original use case anymore, and it’ll have to be re-engineered. In the meantime it may have been used for years and the perf difference may have saved many developer-years collectively across its user base. Surely you’re not suggesting that the compiler developers should be prematurely optimizing for future use cases that they may not even have envisioned.


Avoiding leaks is not an optimisation, it's a matter of correctness. Not freeing memory is an optimisation based on a very shortsighted assumption, one that is not practical for any new language (modern languages are expected to come with language server support).


We used precisely this optimization in [sorbet](https://sorbet.org), a brand-new type checker for Ruby, which also contains a high-performance LSP server.

We wrote the entire thing (and tested, using ASAN and fuzzers and other techniques) to avoid leaking memory, and then strategically inserted [the equivalent of a rust `mem::forget`](https://github.com/sorbet/sorbet/blob/0aae56e73c7680ec6053b3...) into the end of the `main` driver during standalone mode, to avoid calling those destructors when we're about to exit anyways.

This optimization is definitely still relevant for new systems today.


Correctness means adherence to the spec, not some contrived absolute truth.


But the spec usually has some implicit assumptions. Usually it's "app doesn't leak memory" in the same way nobody explicitly specifies "result of an addition of natural numbers should match ...".

We don't go around saying "oh, you didn't want modulo 5 arithmetic? You should've put that in the spec, not rely on some contrived absolute truth".


Okay, but we're talking about an application, a piece of software whose primary purpose is to run for a short period of time, parse a text file and transform it. It will be run many, many times a day by developers whose time is expensive. Language servers are a very new concept in terms of being part of the day-to-day toolchains for most developers. Trading garbage collection for compile time was absolutely a winning strategy for compiler writers. It made them money.


Language servers go back all the way to Xerox PARC workstations and Lisp machines.

C++ had its very first language servers via Lucid's Energize C++ and Visual Age for C++ v4.

Here is the 1993 video and related paper from Lucid.

https://www.youtube.com/watch?v=pQQTScuApWk

http://www.dreamsongs.com/Cadillac.html

And some information regarding VA; unfortunately most of it is missing from the online world.

https://www.ecomstation.it/pido2/home/esterni/vac40os2.pdf (codestore)

http://www.edm2.com/index.php/VisualAge_C%2B%2B_4.0_Review

It is also why Delphi, VB and C++ Builder were already such a pleasure to use versus the Makefile and vi world of UNIX.


"We made dodgy choices to be competitive" is a valid explanation. They don't start being dodgy when new use cases come around and technical debt bites you in the butt.

I see not freeing as a clever bug/workaround with nice positive side effects. Not as a clever solution.


But the memory doesn't leak, it gets freed when the program exits.

The user doesn't care that each allocation is paired with another call to free it up. They just care if the program runs quickly and doesn't use too much memory overall.


Unless the system doesn't have enough memory to allocate for large compilation units and starts swapping allocated/dead memory and slowing down massively. Unless the user runs a CI where as many things should be compiling at the same time as possible and slowly freeing memory in 3 concurrent processes is better than running only 1 at a time. Unless they have spare cores, but no spare memory. Unless ...

> They just care ...

It all depends on who "they" are and what they want to achieve.


It’s usually the opposite! Correctness work usually has the implicit assumption of infinite memory.


> nobody explicitly specifies "result of an addition of natural numbers should match ..."

Yes they do. In most programming languages, by default integer arithmetic is modulo 2^64 (at best). If you want arbitrary precision arithmetic, you have to explicitly specify that.


Why do you say that? Even if you call free immediately after a piece of memory is no longer needed, malloc won’t release that immediately anyway.

If this is incorrect, then every modern malloc implementation is incorrect.


You have not provided any refutation of the OP's argument.


I am suggesting they apply good practices. I'd never imagine that compilers were actually doing what was stated -- sounds awful.

I understand it's tradeoffs and we all have real-world limitations to contend with -- but again, of all the corners that could be cut that's exactly the one I didn't imagine they would.

Nasty.


Of the three compilers I've worked on in-depth, only one of them had a "normal" memory management scheme.

One of them was unburdened by any thought of freeing stuff, and relied entirely on the application exiting for cleanup. This was very convenient to work with, and never ended up posing an issue.

Another used a series of allocation arenas, where certain arenas would be cleared at certain points in the compiler pipeline. This made for both speedy alloc/freeing and avoided leaks, since you weren't at risk of "forgetting" a data structure. It was also a major headache to keep track of exactly what the longest lifetime of a long-lived data structure might be, and to pick an arena that won't be cleared in the meantime. Unfortunately the programs compiled with this compiler were large enough that we certainly couldn't have gotten away with just leaking memory; we sometimes OOMed as-is!

The third used standard C++ memory management. This compiler was quite simple, and the vast majority of its data used stack-based lifetimes. For a more complex compiler this would've become a headache.

I think that all of these compilers chose the correct allocation strategy for what they were doing. "Good practices" aren't as universal as we might like to believe, they depend entirely on the context in which a tool is designed to operate. And yes, we can guard to some extent against that context changing, but for the most part that's why we keep getting paid.


Judging by the downvotes I am getting (with zero explanation as to why), I'd say that many misunderstood the "good practices" part and took offence, as if I said there's only one good practice.

I was taught -- including at the start of my career when I used exclusively C/C++ (about 18.5y ago) -- to take care of all resources I was using and not rely on runtimes.

I understand and appreciate different usages but to me doing a proper cleanup was the sane default for most programmers. And that's all I was saying.

Obviously, as one digs deeper in a specialised area where more and more efficiency is demanded then they have to reach for tools that most of us wouldn't normally. That's quite normal and was always interesting for me to read about.


I didn't downvote, but asides like "sounds awful" and "nasty" definitely read as judgemental. Your other comments on this thread make you seem much more open minded than this one did.


Can you articulate why it's a bad practice? If it works better than alternatives and it's documented, not really sure what the issue is.

I don't think it's even that uncommon. I believe some HFT firms run Java with a huge amount of RAM and GC disabled, and get around it by just rebooting the software occasionally.

To me writing software like that is fair game, I don't see the point in being dogmatic about "how things should be done".


This thread has a lot of branches, so I am just picking one for this general reply:

I recall reading somewhere, years ago, that some OSes couldn't be relied upon to release unfreed memory when a process terminated. In those contexts, fastidious freeing would be important even in short-lived processes.

In my own C code I tend to free everything so that I get a clean trace from Valgrind and don't risk masking legitimate bugs, but I typically write long-running daemons.


It is bad practice if your code cannot turn on deallocation for debugging purposes or library-like usage.


Mostly because I look at it from the angle of one-off / general purpose / CLI programs. If one such has to run for 10-30 seconds and its memory just keeps growing and growing with the idea of throwing it all away at the end and letting the OS handle it, it might become disruptive for other programs on the machine.

For specialised apps and servers it's of course a perfectly good practice.


Deallocation at the end of a program's execution can substantially add to its runtime, and it's entirely wasted work. Skipping it is a much more common strategy than you might think.


You are right, I indeed didn't know it was that common.

But still, in a world where languages and runtimes are also judged by their ability to run in lambda/serverless setups, I'd think this practice will start being obsolete, wouldn't you think?

(What I mean is that I imagine that any serverless function that runs in severely constrained and measured environments like the AWS Lambda would gain a significant edge over the competition if it did an eager cleanup. Should allow more of them to work in parallel?)


I don't understand; a lambda function executes and closes, and any leak gets handled by the lambda runtime (by freeing the whole lambda function) -- exactly the same as running a CLI tool. It seems to me that the serverless context is actually exactly where you'd employ this strategy: the programs by definition cannot be long-lived, so unless they generate so much garbage as to OOM in that timeline, there is no need to deallocate anything yourself.

The leak to watch out for is reduced to the persisted data (eg in S3) between executions


Yeah, this is what I was saying -- that it could produce too much garbage if no manual free-ing of memory is done.

I suppose, judging by the downvotes, that some find that perspective naive so I'll just cut it here because no productive discussion is happening. :(


the purpose/pattern is to ignore deallocation when the program finishes -- not during runtime. If you have 30MB of space, and your program will only ever use 20MB, why bother?

You only need to care when you're at risk of OOM'ing during runtime. If you're generating too much garbage, you either need to start cleaning up after yourself, or you need to generate less of it -- but it doesn't matter which choice you take from the perspective of the OS or the server-less function.

The only reason I can think of for serverless to impact the validity of this pattern is its lower memory capacity -- but otherwise it's the same constraints. And anyways, it's probably not nice to generate a few GB of garbage even if the user has the space.


Most compilers are not designed to run in daemon mode; specifically, it's a non-issue since their startup is normally fast. And the compiler and the runtime are different things.


I realise that, but nowadays language servers are a pretty normal practice in no small amount of areas.


You mean like LSP? Why would a compiler and a language server be integrated in the first place? Clearly an LSP server wouldn't use a no-freeing strategy, but there's no reason why it would cause issues with a compiler. A compiler must terminate, assuming parsing isn't Turing complete (like Perl) and that the source code is finite. This is why it's okay for it to leak memory.


> Why would a compiler and a language server be integrated in the first place?

Because a language server needs to do a lot of the same work as a compiler.

This is an eventual hope for rustc. For now, the latest language server and it share a bunch of libraries, but language servers are effectively compiler frontends.


Yep, I guess I conflated compilers and LSP. I imagined compilers would have daemon modes. Seems like I was wrong.


> I'd never imagine that compilers were actually doing what was stated -- sounds awful.

And how many millions of iterations have been done successfully in that "awful" system?

The very fact that you never imagined it I think says a lot.


Well, I have been taught to take good manual care of all used resources. It is kind of a cognitive shock when you see all your training hand-waved away with "let the OS handle my mess of allocated objects that I'll never call `free` on". It's kind of disappointing on some level. :)

As I acknowledged in other comments of mine downthread, I understand that different situations require different tradeoffs. It's just that forgoing memory deallocation wasn't one of them in my head.


In a world where processes can fork-and-exec, nothing about "as a service" changes that. The compiler would just be reinvoked as needed. Converting it into a persistent process breaks a lot more than just allocation optimizations.


But you want to share state and only update state incrementally on edit to get any reasonable level of performance for stuff like language server code analysis.


I'm not sure what the performance overhead would be, but that state can easily be stored on disk.


I'd bet it's a lot harder to rework the compiler to update its state in place than it is to rework it to free memory.


As distasteful as leaky code is, is it that bad to run it in a separate process? You get a bit more robustness against crashes as well.


You want to be able to incrementally update state to get performance out of incremental code analysis (e.g. a language server implementation for an IDE).


There are more ways to go than that: AST fragment caching, intermediate representation fragment caching and so on. Incremental updates fit some languages better than others.


One of my “favorite” snags in perf analysis is that periodicity in allocations can misattribute the cost of allocations to the wrong function.

If I allocate just enough memory, but not too much, then pauses for defragmentation of free space may be costed to the code that calls me.

A solution to this that I’ve seen in soft real time systems is to amortize cleanups across all allocations. Every allocation performs n steps of a cleanup process prior to receiving a block of memory. In which case most of the bad actors have to pay part of the cost of memory overhead.

Might be good for Rust to try something in that general realm; the cleanup side may be easier to tack on. On free, set a ceiling for operations and queue what is left. That would at least shave the peaks.
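A rough sketch of what that amortized scheme could look like in Rust (a hypothetical helper, not an existing allocator API): each allocation "tick" retires a bounded number of queued frees, so no single call pays the whole cleanup bill.

    use std::any::Any;
    use std::collections::VecDeque;

    /// Hypothetical: values handed to `defer` are dropped a few at a time,
    /// paid for incrementally by later calls to `tick`.
    struct AmortizedDropper {
        pending: VecDeque<Box<dyn Any>>,
        budget: usize, // max number of drops performed per tick
    }

    impl AmortizedDropper {
        fn new(budget: usize) -> Self {
            AmortizedDropper { pending: VecDeque::new(), budget }
        }

        fn defer(&mut self, value: Box<dyn Any>) {
            self.pending.push_back(value);
        }

        /// Call this on every allocation: each pop drops one deferred value.
        fn tick(&mut self) {
            for _ in 0..self.budget {
                if self.pending.pop_front().is_none() {
                    break;
                }
            }
        }
    }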


Doing unrelated cleanup sounds like flushing CPU cache per every allocation.


This post is already about unrelated cleanup, so I'm not sure what other escape hatch you imagine people taking. Do you have a suggestion?

You can tell that it's unrelated cleanup because if it were related, then the cost of freeing wouldn't be noteworthy. It would be cache hot because you would be visiting it for the second time. In which case we'd be talking about why you are scanning a giant object on an event loop in the first place. That's not what's at issue. What's at issue is that you've been handed this great bomb of uncached data from someone else and now you're stuck doing the janitorial work.

Freeing an object of arbitrary size is effectively an unbounded operation. Cache invalidation has a very high cost, sure, but it's still bounded.

Putting a limit on the amount of work you do, you could stop before purging the entire cache. You could use a smaller limit on doing work for someone else and control invalidation there, too.


This deallocation trick is neat but in C and C++ you could use a memory pool to do this.

In theory, you could also use a memory pool in Rust but I think the standard library uses malloc without some way of overriding this behaviour.


You can also use the typed-arena crate[0] or roll your own if you're feeling like cracking open unsafe.

[0] https://crates.io/crates/typed-arena
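A tiny usage sketch, in case it helps:

    use typed_arena::Arena;

    fn main() {
        let arena: Arena<String> = Arena::new();

        // Allocations are just pointer bumps within the arena's chunks.
        let a = arena.alloc(String::from("hello"));
        let b = arena.alloc(String::from("world"));
        println!("{} {}", a, b);

        // Everything is dropped together when `arena` goes out of scope here.
    }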


Typed-arena is a bit limited, because all the allocated values must be of the same type. bumpalo[0] removes that restriction, but it also doesn't run destructors, which can lead to memory and other resource leaks. Another worry I have with bumpalo is that I don't think it's been reviewed thoroughly for unsoundness.

Still, bumpalo has been used to great effect in dodrio[1], a React-like library for Rust with really good performance.

[0]: https://crates.io/crates/bumpalo

[1]: https://github.com/fitzgen/dodrio
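For comparison, a minimal bumpalo sketch; note the caveat above, destructors of arena-allocated values are not run:

    use bumpalo::Bump;

    fn main() {
        let bump = Bump::new();

        // Values of different types can share the same bump arena.
        let n = bump.alloc(42u64);
        let msg = bump.alloc_str("hello from the arena");
        println!("{} {}", n, msg);

        // Dropping `bump` releases the backing chunks in one shot,
        // but Drop impls of the allocated values are never invoked.
    }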


You can change the global allocator in any Rust project. You can write your own easily enough, or use one like jemalloc.


Sure; but when you’re using arenas and things like that, you usually want different objects allocated into different pools (or with different lifetime properties). Rust only lets you pick one allocator for the entire process, so you can’t specify “all the children of this data structure go in arena A, and this other allocation goes into a traditional heap”.

It’s more awkward, but I much prefer Zig’s approach here where everything that allocates takes an allocator as a parameter. Usually the allocator is specified at compile time - in which case zig can generate identical code to the rust compiler. But when you want the flexibility, it’s there.

Aside from compilers, this is heavily used in video games where there’s often a lot of objects that get allocated per frame, and can be discarded all together. And in that case rust’s lifetime tracking would be a huge asset. The dovecot email server (C) also makes superb use of a mix of memory containers for different tasks. Given how messy email parsing is, dovecot is an absolute pleasure to read.


Not in an abstract way, but you can do this. The compiler has (had? It’s been a while) two different major arenas, in my understanding. There’s just no non-global allocator trait yet, so it’s not easy to abstract over.


>everything that allocates takes the allocator as a parameter

This is also the C++ approach


As someone that is interested in the topic and in Zig's "provide your own allocator" approach I have a question: would it be possible to make an allocator wrapper that moves values to be deallocated to a different thread?

As far as I know it would require both Rust's borrowing semantics and Zig's architectural choice.

From purely my own fan-boy perspective, the Zig approach is something that I would have really liked for Rust to adopt (I have no idea which one came first chronologically).


It is not an allocator wrapper, but https://docs.rs/defer-drop/1.0.1/defer_drop/index.html
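That crate wraps the idea up nicely; the underlying mechanism is roughly a long-lived thread that receives values over a channel and lets them drop there. A hand-rolled sketch of that idea (not the defer-drop API):

    use std::any::Any;
    use std::sync::mpsc::{channel, Sender};
    use std::thread;

    // Hypothetical helper: anything sent down this channel has its
    // destructor run on the background thread, not on the sender's.
    fn spawn_dropper() -> Sender<Box<dyn Any + Send>> {
        let (tx, rx) = channel::<Box<dyn Any + Send>>();
        thread::spawn(move || {
            for garbage in rx {
                drop(garbage); // the expensive part happens here
            }
        });
        tx
    }

    fn main() {
        let dropper = spawn_dropper();

        let huge: Vec<Vec<String>> = (0..1_000_000)
            .map(|i| vec![i.to_string()])
            .collect();

        // Ship the value off instead of paying for a million frees here.
        dropper.send(Box::new(huge)).unwrap();
    }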


I don't get this. Usually I either don't care about how stuff is allocated, or when I care I want to be specific about what allocator should be used where. In the second situation just setting a global allocator wouldn't cut it most of the time, I would want support for customization when I instantiate a data structure.


I think with LD_PRELOAD it is always possible to override the allocator?


That wouldn’t help. The point is to use local allocators which have extra information about the data structures and usage of the memory.

This is routinely done in medium to large C++ programs for different reasons (performance, debuggability...).


Only if it is dynamically linked


In a toy raytracer I once wrote in C++, switching from malloc to custom memory pools for small fixed-size objects was a big performance boost. Making free() a noop was another big performance boost, both for deallocation and allocation. Turns out sequentially handing out memory from a big chunk of memory is much easier than keeping track of and reusing empty slots, and it keeps sequentially allocated objects in sequential memory locations.


jemalloc is the answer to the question you did not ask


In Rust you could just call `mem::forget` on whatever heavy thing you're no longer using before it would get dropped, but then the programmer is effectively responsible for that memory leak not becoming a problematic leak during refactors.

Edit: this will also break any code that relies on Drop being called for clean up, but that is already a "suspect"/incorrect pattern because there are no assurances that it will ever run.


> There are no assurances that it will ever run.

Yes and no. Whenever control leaves a code block, Rust automatically calls the drop() method of all values still owned by that block. There is no guarantee that control will exit every block (cf. Turing), but a moderately exceptional circumstance needs to occur for this not to happen, like an infinite loop.


There's also no guarantee the object won't be moved elsewhere, including into a context which never gets freed (for example, you can construct mem::forget using entirely safe code by forming a cycle of reference-counted boxes). That said, Rust has support for such a concept through the Pin trait (which essentially guarantees an object will not get moved until it is 'Unpinned').


Isn't that a problem for the RAII approach?


Yeah, I like Apple’s (Next’s) approach of pool allocation for each run through the event loop. Defer dealloc, drop pool at the end.


Apple's pools are for helping manage reference counts of returned objects (via autorelease) but aren't doing pool allocation (as any of those objects could escape, so you can't do the fast thing of just deallocating the pool). The normal memory allocator is used while in the scope of an autorelease pool.


True, I was thinking the same after I posted. Maybe I’m thinking more of Apache’s approach and how it gets some of the benefits of a fork-per-request server without the overhead.


right, so if you do

    @autoreleasepool {
        [[[SomeObject alloc] init] autorelease];
    } // SomeObject is released after exiting this scope

    // inside main runloop, not inside @autoreleasepool {}
    [[[AnotherObject alloc] init] autorelease];
    // AnotherObject will be released at the end of the main runloop

    [[NotAutoreleased alloc] init]; // will leak past the runloop iteration

is that understanding correct?


Unless your destructors do more than deallocation, in which case you will leak whatever other resource you're managing.


A pool can invoke destructors when it is cleared. Might take a bit of overhead (if the pool is to support arbitrary classes), but you could retain the fast pointer-bump allocation.


Not Rust, C++, but in case anybody wants an example: https://github.com/eclipse/omr/commit/fd99ca42fdbed76cc00e68...

TR::Region is the slab allocator used by the JIT in OpenJ9/OMR. The linked commit adds functionality for calling destructors of arbitrary types allocated in the Region.


Interesting, thanks. Seems to require use of a common base class though, i.e. to use a class with this pool, you have to inherit from Destructable, or else create a subclass that does. Seems ugly. Perhaps that's the only way it can be done in C++ though.


It does create an object with that base class for the allocator but returns an object of the type you want from the allocator, using a template (Region::Instance<T>). So the API user only sees the types it's interested in.


Even just keeping a free list and deallocating its elements at an idle time is probably cheaper and faster than spawning a thread.


You can easily use a custom global allocator in Rust:

    #[global_allocator]
    static GLOBAL: MyAllocator = MyAllocator;
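A minimal self-contained version of that, just forwarding to the system allocator as a stand-in for whatever pooling logic you actually want:

    use std::alloc::{GlobalAlloc, Layout, System};

    struct MyAllocator;

    // Forward everything to the system allocator; a real custom allocator
    // would manage its own pools in these two methods.
    unsafe impl GlobalAlloc for MyAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            unsafe { System.alloc(layout) }
        }
        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            unsafe { System.dealloc(ptr, layout) }
        }
    }

    #[global_allocator]
    static GLOBAL: MyAllocator = MyAllocator;

    fn main() {
        // Every heap allocation in the program now goes through MyAllocator.
        let v = vec![1, 2, 3];
        println!("{:?}", v);
    }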


In C++/WinRT the same approach is taken, because you cannot just use a memory pool for COM.


Isn’t a Rust “move” implemented as a bitwise copy (i.e. a memcpy call)? I see people claiming a move has no cost but I’m not sure that is true.


EDIT: See kevincox's reply. Rust will bitwise copy the containing type, which is typically very cheap. For example, if you move a String, it will copy the String struct, which contains a couple of pointers and a length (or something along those lines). Importantly, it will not copy the underlying char array.

I was thinking of the following code, where I believe the assignment to y is actually free. Though apparently this isn't called a "move".

    let x = <<large owned type like [char; 1000]>>;
    let y = x;
More info: https://doc.rust-lang.org/rust-by-example/scope/move.html


Semantically it is a bitwise copy.

In the example from the article it is probably actually a copy because the value was originally on the parent thread's stack, which will be reused after the function returns, so the value will need to be copied to the new thread's stack.

However, it is important to note that it isn't a deep/recursive bitwise copy. It just needs to copy the HashMap itself (which is probably a handful of words).

So yes, it is doing a bitwise copy, but this is also very cheap. It will be much, much cheaper than spawning the thread.


That's not exactly true. In C++ terms, the example code is moving a value:

  std::map<...> foo;
  someFunction(std::move(foo));
And not moving a pointer like:

  std::unique_ptr<std::map<...>> foo = ...;
  someFunction(std::move(foo));
So it copies sizeof(std::map<...>), not a pointer.


Not necessarily, std::map's move constructor is free to move just the internal pointers and bookkeeping data; only dumb implementations would move the memory blocks where the map data is actually stored, as per ISO C++ requirements.


The comment originally said it was the cost of a pointer copy, and I was pointing out it's not a pointer. E.g., sizeof(std::map<std::string,std::string>) is 48, which is the "pointers and bookkeeping" you mention.


Ah, sorry for the misinterpretation. Should have paid more attention.


What is bit-wise copied is the pointer to the memory.

I.e. a `HashMap` struct, or `Vec` struct don't directly contain the data.

For example the `Vec` is defined internally as something similar to:

`struct Vec<T> { data: *mut [T], capacity: usize, len: usize, marker: PhantomData<T> }`

(Slightly simplified, not actual Vec type).

So a move of a Vec copies at most 3 usizes (24 bytes on 64-bit systems); similar things apply for a HashMap.

Additionally the copy can often be elided through compiler optimizations.

As an interesting side note, a new empty Vec/HashMap will not actually allocate any memory; only once elements get added will it start doing so. This is why the example creates vecs of vecs of length 1; otherwise it wouldn't need to do "number of elements" free calls.


So the Vec type is not storing data on the stack, like C++'s std::vector can?


Exactly, neither Rust's Vec nor Rust's String (which is actually a Vec in disguise) store data within the struct itself, it's always stored in an external allocation. This is intentional, to avoid extra branches on whether the data is inline or not. If you have a lot of small vectors, and using less memory is more important for your use case than avoiding the extra branches, there are alternatives like https://crates.io/crates/smallvec that you can use.
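A quick sketch of the smallvec alternative, for anyone who hasn't seen it:

    use smallvec::{smallvec, SmallVec};

    fn main() {
        // Up to 4 elements are stored inline, with no heap allocation at all.
        let mut v: SmallVec<[u32; 4]> = smallvec![1, 2, 3];
        v.push(4); // still inline
        v.push(5); // now it spills to the heap, like a normal Vec
        println!("{:?}, spilled: {}", v, v.spilled());
    }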


Semantically, it is a bit wise copy, yes.

However, these copies can often be elided by optimizations.


> As an aside, compilers have used the trick of not free-ing data structures before

In Clang, this flag is `-Xclang -disable-free`. Not from a Jedi...


I don't know if Rust can do it (unsafe?) but in C and C++, I sometimes end up writing a custom allocator. It is often one of the most significant optimizations.

For example, I had the “million strings” problem once, literally millions. The solution was to put every string into a single large buffer and the pointers in another buffer. Not only could I deallocate everything at once, but I also saved a bit of RAM by not aligning (alignment is not needed for strings).
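You can express the same layout in safe Rust without touching the allocator: one big buffer plus an (offset, len) index, so teardown is just two frees. A sketch (the names are made up):

    /// Hypothetical string pool: all the bytes live in one String,
    /// and lookups go through a small (offset, len) index.
    struct StringPool {
        data: String,
        index: Vec<(usize, usize)>,
    }

    impl StringPool {
        fn new() -> Self {
            StringPool { data: String::new(), index: Vec::new() }
        }

        /// Appends a string and returns its id.
        fn push(&mut self, s: &str) -> usize {
            let offset = self.data.len();
            self.data.push_str(s);
            self.index.push((offset, s.len()));
            self.index.len() - 1
        }

        fn get(&self, id: usize) -> &str {
            let (offset, len) = self.index[id];
            &self.data[offset..offset + len]
        }
    }

    fn main() {
        let mut pool = StringPool::new();
        let id = pool.push("one of a million strings");
        assert_eq!(pool.get(id), "one of a million strings");
        // Dropping `pool` is two deallocations, not millions.
    }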


> That very thing can easily happen in the real world.

Only if badly designed. That is why it is contrived!

> While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon

A program dealing with a million elements of any kind should not be performing naive allocations to begin with.

> we do deallocation trickery in the real world

Skipping deallocations is an optimization, not a design pattern.

In other words, the code needs to keep the ability to perform the deallocation for debugging, testing, usage as a library, etc.


This is the standard problem with tracing data structures to free them. You frequently run into it with systems based on malloc/free or reference counting. The underlying problem is that freeing the structure takes time proportional to the number of pointers in the structure it has to chase.

Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant.

There's actually a lot to be said for Rust in that the ownership system lets you transfer freeing responsibility off-thread safely and cheaply so it doesn't block the critical path.

But overall, there's nothing really unexpected here, if you're familiar with memory management.


One of my favorite papers by Bacon et al expands on this intuition that garbage collection and reference counting are opposite tradeoffs in many respects, and gives a formal theory for it. My views on gc/rc haven't been the same since.

http://researcher.watson.ibm.com/researcher/files/us-bacon/B...


That's a great paper, but one important thing to point out is that some production-grade tracing GCs are on the sophisticated end of that paper, while almost all reference counting GCs are on the rather primitive end. Given the same amount of effort, it's easier to get a reasonable result with reference-counting, but there are industrial-strength tracing GCs out there that have had a lot of effort put into them.


"Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant."

Takes time proportional to the live set times the number of GC runs that happen while the objects are alive. In other words, the longer the objects live, the more GC runs have to scan that object (assuming there is enough activity to trigger the GC), and the worse GC looks.


This is most decidedly not true for generational GCs, and for concurrent GCs, the tracing work happens asynchronously and in parallel, on other cores, not taking time on the main thread.


You still have to scan the oldest generation at some point; otherwise (assuming some activity) you'll be leaking memory there over time.

Moving the work to another core doesn't really change my statement.

Aside: I wonder if it's feasible to switch to reference counting for the oldest generation? That would move the problem back to deallocation. I'm not sure if that's a good trade-off, but it would be interesting.


It's still true for generational GC, but the costs are more complicated. Even objects in the oldest generation are periodically scanned in most implementations (though there are some, like Clozure CL's GC, that never scan the old generation automatically).

But generational GCs do improve massively on this problem.


> the ownership system lets you transfer freeing responsibility off-thread safely and cheaply in order to not have it block the critical path

This can also trivially be done in other languages. Atomically append your pointer to a queue of "large things that need to be freed" and move on as though you had actually called free.

Within a particularly time sensitive loop you can even opt to place pointers into a preallocated array locally. Then once per loop iteration swap that array with the thread handling the deallocations for you. It eats up a bit of CPU time but can significantly reduce latency.


OP said safely; what you're describing isn't safe in, say, C++ in the same sense that it is in Rust.


Isn't that essentially a tautology? Manual memory management and threading in such languages lacks safety guarantees to begin with.


A lot of C++ code depends on deallocation order for correctness. Like a destructor may want to say bye-bye to a pointed-to-object, and if you reverse order of deallocation, that pointer may be dangling.

Consider this code

    {
        Window a;
        ClickHandler* b = new ClickHandler(&a);
        delete b;
    }
Let's say b tries to deregister itself when it's deleted. This code will work as written. But if you defer the deletion of b, then stack allocated Window a may already be gone.


I never meant to imply that you could use such tricks without considering the lifetimes of references. Manual memory management already requires being mindful of those though so it's nothing unique to the described situation. Yes Rust offers an advantage here, but it's the same one it offers everywhere else so it doesn't seem relevant to me.

That being said, it seems like most cases where such an approach is worthwhile involve large nested data structures that involve lots of pointer chasing to traverse. All cases I've encountered so far were bulk stores that didn't involve actively interacting with external objects.


Replying to myself: I guess you could run the destructors in order, but keep the memory around a bit longer.


I've not worked with any language thus far without automatic garbage collection, so this was definitely a neat read for me. It sounds rather elegant.


It's worth popping the hood and getting your fingers dirty. C was written in an era where memory was a scarce and precious resource to be grudgingly used only if absolutely necessary.


Please note that C is quite a few years older than the first GC languages. LISP1.5, ALGOL-68 and APL all had garbage collectors before C even existed.

Not to say that it's not worth it to learn manual memory management as well, but it is important not to think that GCs are a fancy modern tool, and that greybeards would never touch one. There were greybeards using punch cards and programming in a GC language, with output printed out on paper.


> This is the standard problem with tracing data structures to free them. You frequently run into it with systems based on malloc/free or reference counting. The underlying problem is that freeing the structure takes time proportional to the number of pointers in the structure it has to chase.

That doesn't seem to make intuitive sense. A GC has the same problem.

A garbage collector has to traverse the data structure in a similar way to determine whether it (and its embedded keys and values) is part of the live set or not, and to invoke finalizers. You're beginning your comparison after the mark step, which isn't a fair assessment since what Rust is doing is akin to both the mark and sweep phases.

The only way to drop an extensively nested structure like this any faster than traversing it would be an arena allocator, and forgetting about the entire arena.

The difference between a GC and this kind of memory management is that the GC does the traversal later, at some point, non-deterministically. Rust allows you to decide between deallocating it in place, immediately, or deferring it to a different thread.


I said generational/compacting collector. You're talking about a mark and sweep collector.

A generational/compacting collector traverses pointers from the live roots, and copies everything it finds to the start of its memory space, and then declares the rest unused. If there is 1GB of unused memory, it's irrelevant. Only the things that can be reached are even examined.

As I said, this has the opposite problem. When the live set becomes huge, this can drag performance. When the live set is small, it doesn't matter how much garbage it produces, performance is fast.


How are finalizers invoked if the structure isn't traversed? Would it just be optimized away if none of the objects have finalizers? Hence my suggestion about arena allocators being a better point of comparison.


Java is an example of a language with a generational copy collector by default. Most objects in Java don't have a finalizer, since after all the main point of, for example, destructors in C++ is to make sure you don't leak memory, which the GC solves. But when the `finalize` method is used, it causes significant overhead.

> Objects with finalizers (those that have a non-trivial finalize() method) have significant overhead compared to objects without finalizers, and should be used sparingly. Finalizeable objects are both slower to allocate and slower to collect. At allocation time, the JVM must register any finalizeable objects with the garbage collector, and (at least in the HotSpot JVM implementation) finalizeable objects must follow a slower allocation path than most other objects. Similarly, finalizeable objects are slower to collect, too. It takes at least two garbage collection cycles (in the best case) before a finalizeable object can be reclaimed, and the garbage collector has to do extra work to invoke the finalizer. [1]

Sure, you're technically correct that if the objects all had finalizers that did the same thing as C++ destructors, it would be equivalent, but because of the existence of a GC we don't have to do any work for most objects. A GC is equivalent to an arena allocator in this sense.

Another point is the C++/Rust pattern of each object recursively freeing the objects it owns presumably leads to slower deallocation, because in the general case it involves pointer following and non-local access.

[1] https://www.ibm.com/developerworks/java/library/j-jtp01274/i...


Which is why finalizers in Java have been officially deprecated [1] and might be removed altogether in a future release.

[1]: https://docs.oracle.com/en/java/javase/14/docs/api/java.base...


Unfortunately, the alternative they recommend (Cleaner) only exists since Java 9, so it cannot be used by libraries that wish to remain compatible with Java 8 (which, in my experience, is most or all of them; so far, I haven't seen any Java library which has dropped compatibility with Java 8, and many only recently dropped compatibility with Java 6 or 7).


The actual alternative is PhantomReference, which exists since ancient times. Cleaner simply provides safe way to use PhantomReferences, which are rather tricky to use (by Java standards).


Destructors in C++ aren't just for making sure you don't leak memory. They are used for many lifetime-controlled things such as: 1. general resource cleanup (file handle, database connection, etc.) using RAII (Resource Acquisition Is Initialization); 2. tracing function entry/exit.


Apparently, that doesn't work in Rust: https://news.ycombinator.com/item?id=23363647


It does work in Rust; you just cannot rely on Drop for memory safety. If you mem::forget a struct that holds onto some other resource, then all that means is that you're committing that resource to the lifetime of the process. We usually call that a leak, but it can be intentional.


By that argument, it doesn't work in C++ either because std::unique_ptr::release() exists.


In the JVM the equivalent to RAII is implemented with try-with-resource/Autocloseable instead.


In my experience, try-with-resources is inferior to RAII, since it's very easy to forget (that is, it's very easy to do "Foo x = bar();" instead of "try (Foo x = bar()) { ... }"), leading to resource leaks. More than once I have added a finalizer to an AutoCloseable class just to warn loudly if close() hasn't been called; unfortunately, there's no way that I know of to suppress the finalizer once close() is called, so the GC still has to do all the work to call the finalizer even when the class was used correctly.


> after all the main point of, for example, destructors in C++ is to make sure you don't leak memory, which the GC solves

No, C++ destructors are used for finalizers, not memory management.

Memory deallocation is a particular use case for finalizers which is avoided when performance is a concern.

> Another point is the C++/Rust pattern of each object recursively freeing the objects it owns presumably leads to slower deallocation, because in the general case it involves pointer following and non-local access.

No, a program that does pointer chasing and has to deallocate many small allocations is badly designed. If you are going to do that, using a GC language would be much better.


> No, C++ destructors are used for finalizers, not memory management.

In the original comment I may have overstated this. I was ignoring the other uses of destructors because the context of the discussion was memory management. But memory management is a huge portion of what destructors do in C++. Consider a vector of strings (`vector<string>`). The destructor deallocates the memory for all the strings, then deallocates the memory for the vector.

> No, a program that does pointer chasing and has to deallocate many small allocations is badly designed. If you are going to do that, using a GC language would be much better.

How do you deallocate all of the nodes in a binary tree? As far as I can tell, you either have to pointer chase or use a custom allocation strategy. At some point with the second option, you're basically creating an ad-hoc garbage collector.

But I think we may be in vigorous agreement here, since my comment was about the general tradeoff between garbage collection and tracing data structures. I was trying to defend the original assertion that with tracing, "freeing the structure takes time proportional to the number of pointers in the structure it has to chase" while "garbage collection [or at least copy collection] takes time proportional to the live set".


> Consider a vector of strings (`vector<string>`). The destructor deallocates the memory for all the strings, then deallocates the memory for the vector.

A vector (or binary tree as you mentioned later) of heap allocated elements in a performance critical path is a likely candidate to be redesigned if possible.

> As far as I can tell, you either have to pointer chase or use a custom allocation strategy. At some point with the second option, you're basically creating an ad-hoc garbage collector.

A custom, ad-hoc strategy is the way to go if you need them dynamically allocated and performance at the same time, yeah.

I wouldn't call that a GC, though. A GC usually refers to a global solution.

> But I think we may be in vigorous agreement here

Sounds like it!


> No, a program that does pointer chasing and has to deallocate many small allocations is badly designed. If you are going to do that, using a GC language would be much better.

It is debatable whether using it is good or bad design, but at least the C++ std lib does offer data structures which require exactly this kind of deallocation: std::list and std::forward_list. And, given the cache characteristics of array vs linked list implementations, I would guess that most uses of std::list occur for huge lists that get written to often, as that seems to be the only case where the big-O advantage of list insert/delete would actually materialize into any performance benefit over a std::vector.


Maintain a collection of objects with finalizers associated with the generation; when an object is moved out of the generation, move the finalizer record with it. Then, just before the generation is discarded, run the finalizers.

If the finalizers do something stupid like resurrect the object, have the runtime system notify someone with the authority to go beat the programmer with a stick.


Interestingly, for the Rust GC I'm working on, both Drop and finalizers will be supported.

Do you happen to offer beat the programmer with a stick as a service?


Unfortunately, no, but you are going to need it.

I was fooling with a cycle-collected reference counter for Rust for a while, but eventually gave up because of the difficulty in handling finalizers.


Many languages allow a class to avoid declaring a finalizer, for just this reason.

The JVM is notable in this regard. And thus, most classes that compile to Java bytecode offer no-finalizer semantics.


My understanding is that finalizers are special cased for a generational GC, have a tendency to introduce significant overhead, and (as with any GC scheme) generally run at unpredictable times. My impression is that RAII idioms are strongly discouraged in conjunction with most GC ecosystems.


Actually, RAII idioms are generally encouraged, but finalizers are not. For the rare few classes that hold resources in GC languages, the Resource is usually Acquired at Initialization as well, and released in a special method (IDisposable.Dispose() in C#, AutoCloseable.close() in Java). There is also usually some special syntax for automatically calling this method when the object exits some scope, though that scope usually has to be declared by the user (using(), try-with-resources, with() in Python etc.).

The biggest difference I know of is that the 'holds-resources' property does not propagate automatically like it does in C++. It's not that hard to always remember to call using(file = new File()) [...]. However, it's much easier to forget that you have a File field in your class which you initialize in the constructor, and so your class must itself be declared IDisposable/AutoCloseable etc.


This is a really good point. I had completely overlooked various deterministic resource cleanup idioms as being equivalent to RAII in all but name.


Even mark-and-sweep collectors only mark starting from GC roots, they don't mark through unreachable objects.


> A garbage collector has to traverse the data structure in a similar way to determine whether it (and it's embedded keys and values) are part of the live set or not

Yes, but in practice tracing in a tracing GC is done concurrently and with the help of GC barriers that don't require synchronization and so are generally cheaper than the common mechanisms for reference-counting GC.

> and to invoke finalizers

As others have said, finalizers are very uncommon and, in fact, have been deprecated in Java.


No, they are actually fundamentally wrong. GCs never scan garbage - they only scan objects that are referenced from a GC root.

Note that the problem appears in a different place: if your large structure is actually not garbage, then every GC pass will have to scan it to see what other objects it is keeping alive.


Not every GC pass. Some modern GCs (like OpenJDK's default GC, G1) track mutation and maintain "remembered sets." They only need to scan objects that have changed since the last scan (they're actually even more efficient than that, but this is a simplification). In addition, when this scanning is required, it is performed concurrently to the application, so the technique of moving the work to a separate thread is done automatically.


> The only way to drop an extensively nested structure like this any faster than traversing it would be an arena allocator, and forgetting about the entire arena.

Isn't that incompatible with RAII though?


You can handle this in Rust pretty neatly with lifetimes. There's a bunch of crates that do this. [1]

[1] https://crates.io/crates/typed-arena


That's a neat library but as far as I can tell it doesn't avoid any traversal or cleanup code. It appears to delay the cleanup so it all happens at once. That's certainly useful, but if you have RAII the traversal still has to happen at some point.


It avoids it if your type has no-op drop implementation. So if you use typed-arena for objects which don't own resources, they all get dropped in one massive deallocation and don't have to be traversed.

EDIT: and then I noticed that you mentioned RAII... Right, if the object own some sort of resources that doesn't apply.


No worries. And to clarify, in context the point is that traversal fundamentally can't be avoided in the case of RAII. This defeats (what I see as) the primary use case of an arena allocator - deallocating an arbitrarily large chunk of contiguous memory in O(1) time regardless of object count.

Of course this is all somewhat tangential to the original topic of generational GCs, where the RAII idiom also has significant negative impacts. The performance characteristics would otherwise be O(n) based on the live set and thus similar to an arena allocator in terms of the ability to dispose of an arbitrarily large number of objects efficiently.


I'm not sure it's fair to say it defeats that use case. It only defeats that use case if you need RAII. Furthermore, in my experience, an arena is most useful when you allocate lots of small objects, which is the case least likely to need RAII.


I think we're violently agreeing?

RAII is the only use case I was speaking of there. Way back up thread the original post I responded to was talking about arena allocators in comparison to a generational GC and finalizers. The point is that a generational GC doesn't handle finalizers well, but neither does an arena allocator!


Yes. We are violently agreeing


Yes. Some libraries, like typed-arena (and Rust's built-in Vec), will traverse the structure and call the drop implementations, and others, like bumpalo, will just leak the resources.


> That doesn't seem to make intuitive sense. A GC has the same problem.

> A garbage collector has to traverse the data structure in a similar way to determine whether it (and it's embedded keys and values) are part of the live set or not, and to invoke finalizers.

All garbage collectors start from live objects and only scan those. Then, whatever objects they have not scanned get collected. In copying collectors (like most generational ones), this means that garbage is never touched.

In the mark-and-sweep algorithms, the mark phase still never touches the unreachable objects. However, the sweep phase does need to return those objects to the free list, so it will have to walk them. It will still not do it the same way as malloc/free, as it can walk the heap in order and free unmarked objects as it encounters them, no need to follow pointers, so it may still have better cache performance.

Finalizers introduce extra difficulty, but still the behavior is fundamentally different. What usually happens is that objects which have finalizers are usually remembered in a special list which acts as a GC root itself. When they are only reachable from that list, they get marked so that the finalize will run (usually on a special Finalizer thread). When the Finalizer is finished, and assuming the object was not resurrected, they get removed from the Finalizer list, and now they are not reachable from anywhere at all, so the next GC will finally clean them up. Usually, there is also some API for user code to mark a Finalizable object as 'finalized', which essentially removed it from the Finalizer list early and allows it to be collected as normal, without going through the above process.

And yes, having a large number of finalizable objects in your memory is usually considered a very bad idea. Generally, they are only recommended as a fail-safe measure: you are supposed to do explicit cleaning, but as a fail-safe, to avoid your program crashing in production if a connection or file leak was missed, you also have the Finalizer to throw buckets of water out of your boat (but you should really notice that it is happening and plug that leak, rather then relying on the bucketeer).


Finalizers/destructors do not work well in garbage collected languages, for that very reason.


Usually in a background thread ;)


Indeed. The difference is this is happening deterministically, in place (with an optional deferral). A garbage collector has all sorts of different trade-offs.


The title is slightly wrong: it's not going to make your code faster, it's going to reduce latency on the given thread.

It may be a net win if this is the UI thread of a desktop app, but overall it will come at a performance cost, because modern allocators have thread-local memory pools and now you're moving away from them. And if you're running your code on a NUMA system (most servers nowadays), when moving from one thread to another you can end up freeing non-local memory instead of local memory. Also, you won't have any backpressure on your allocations, and you are susceptible to running out of memory (especially because your deallocations now occur more slowly than they should).

Main takeaway: if you use it blindly it's an anti-pattern, but it can be a good idea in its niche: the UI thread of a GUI.


Yes it’ll reduce latency, but doesn’t it also increase parallelism? A single-threaded program ought to improve overall, unless the extra overhead you mentioned dominates. A parallel program might improve or not.

I think if you wanted to do deferred destruction right, ideally you’d mod an allocator to have functions like (alloc_local, alloc_global, free_now, free_deferred) to avoid exhausting memory. Traits could make this ergonomic.

Also I admit I don’t understand why “you won’t have any backpressure on your allocations,” shouldn’t deferred destruction give you more backpressure if anything? I am probably confused.


> Also I admit I don’t understand why “you won’t have any backpressure on your allocations,” shouldn’t deferred destruction give you more backpressure if anything? I am probably confused.

I think the point is that, if the same thread is doing both allocation and de-allocation, the thread is naturally prevented from allocating too much by the work it must do to de-allocate. If you move the de-allocation to another thread, your first thread may now be allocating like crazy, and the de-allocation thread may not be able to keep up.

In a real GC system, this is not that much of a problem, as the allocator and de-allocator can work with each other (if the allocator can't allocate any more memory, it will generally pause until the de-allocator can provide more memory before failing). But in this naive implementation, the allocator thread can exhaust all available memory and fail, even though there are a lot of objects waiting in the de-allocation queue.


Ah, I see what you mean, thanks!


Yes, that's a great summary!


this could make your code faster by providing more consistent control flow: the main thread is always doing Work and your gc threads are always cleaning up dead objects. this provides fewer, better-predicted branches and code that's more likely to stay in the icache.

most gc based environments use dedicated threads for gc and finalizers, this is one reason to do so

edit: to be more specific:

your normal flow is to alloc at the top of your function, and at the bottom you dealloc. so in basically every case you are paying the cost of deallocs, but if the alloc is conditional the dealloc is now also conditional which is more branches to predict. the dealloc is probably also handled by functions so you have jumps/calls eating up branch prediction table space

in the gc/offloaded dealloc scenario, your deallocs on the work thread are no longer conditional because you're just handing addresses off to the gc. if your gc is STW you've added 'if (stop_requested) stop()' branches throughout your workload, but those are effectively 0-cost because stop_requested is always false (when it's true, the cost of the mispredict has no significance because your thread is about to suspend). the gc thread is always doing the same thing or waiting, and again when it's about to wait a branch mispredict cost has no significance.


Just be careful, because moving heavy things to be dropped to another thread can change the semantics of the program. For instance, consider what happens if within that heavy thing you had a BufWriter: unless its buffer is empty, dropping it writes the buffer, so now your file is being written and closed in a random moment in the future, instead of being guaranteed to have been sent to the kernel and closed when the function returns.

And it can even be worse if it's holding a limited resource, like a file descriptor or a database connection. That is, I wouldn't recommend using this trick unless you're sure that the only thing the "heavy thing" is holding is memory (and even then, keep in mind that memory can also be a limited resource).
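
To make the BufWriter point concrete, here's a minimal sketch (the file name and function are made up) of doing the flush explicitly, so the final write and any error it produces happen right here rather than inside a deferred drop at some unpredictable later time:

    use std::fs::File;
    use std::io::{BufWriter, Write};

    fn write_report(lines: &[String]) -> std::io::Result<()> {
        let mut out = BufWriter::new(File::create("report.txt")?);
        for line in lines {
            writeln!(out, "{}", line)?;
        }
        // Flush explicitly instead of relying on Drop for the final write.
        out.flush()?;
        Ok(())
    }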


I only know a very little rust, but since it's generally a good practice to never defer writing (or other side effects) to an ambiguous future point in time - with memory allocations as the only plausible exception - is there any way in rust to make sure one doesn't accidentally move complex objects with drop side-effects into other threads?

Granted, the way the type system works you usually know the type of a variable quite well, but could this happen with opaque types?

I'm very much out of my depth, but it felt like one of those things that could really bite you if you are unaware, as happened with finalizers in Java decades ago.


> I only know a very little rust, but since it's generally a good practice to never defer writing (or other side effects) to an ambiguous future point in time - with memory allocations as the only plausible exception - is there any way in rust to make sure one doesn't accidentally move complex objects with drop side-effects into other threads?

If you're the one creating the structure, you could opt it out of Send, that'd make it… not sendable. So it wouldn't be able to cross thread-boundaries. For instance Rc is !Send, you simply can not send it across a thread-boundary (because it's a non-threadsafe reference-counting handle).

If you don't control the type, then you'd have to wrap it (newtype pattern) or remember to manually mem::drop it. The latter would obviously have no safety whatsoever, the former you might be able to lint for I guess, though even that is limited or complicated (because of type inference the problematic type might never get explicitly mentioned).
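
For illustration, a minimal sketch of the wrapper idea (names are made up): embedding a PhantomData of a !Send type such as a raw pointer makes the wrapper itself !Send on stable Rust, so it can't be handed to a drop-offloading thread by accident:

    use std::marker::PhantomData;

    struct MainThreadOnly<T> {
        inner: T,
        // Raw pointers are !Send, so this zero-sized marker makes the whole
        // wrapper !Send without changing the runtime representation.
        _not_send: PhantomData<*const ()>,
    }

    impl<T> MainThreadOnly<T> {
        fn new(inner: T) -> Self {
            MainThreadOnly { inner, _not_send: PhantomData }
        }
        fn get(&self) -> &T {
            &self.inner
        }
    }

    fn main() {
        let v = MainThreadOnly::new(vec![1u8, 2, 3]);
        println!("{}", v.get().len());
        // This line would fail to compile, because MainThreadOnly<_> is not Send:
        // std::thread::spawn(move || drop(v));
    }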


Considering that writing files can also block the process you probably don't want to have that in your latency-sensitive parts either, so you'll have to optimize that one way or another anyway.

For the more general problem you can also dedicate more threads to the task or apply backpressure.


A while ago I stumbled over a proposal to move a shared pointer (this was C++ code) to a thread in order to trigger the freeing of a legacy data structure there (the multi-thousand delete calls caused the watchdog of the main thread to fail). However, keeping the shared pointer reference in the main thread for too long meant the "clean-up" thread could run while the main thread still held the shared pointer, resulting in a chance of the "clean-up" thread doing nothing and the main thread still locking up. People here got taught to use shared pointers to prevent memory management errors, but they can really cause a lot of unexpected non-determinism when used blindly.


shared_ptr all the things? If so, they may as well write in Java.


Even ignoring the shared_ptr abuse, at least for allocating and freeing many small objects Java would most probably be (a lot?) faster than non-optimized allocations and frees in native binaries. But in my case it was legacy code running on an embedded device.


It seems like the caller should ensure that the buffer is written before giving away ownership. Also, what happens if there is an error writing during finalization/destruction/etc? Seems like you'd want to find out about such errors earlier if at all possible.


I've seen variations on this trick multiple times. Using threads, using a message sent to self, using a list and a timer to do the work "later", using a list and waiting for idle time...

They all have one thing in common: papering over a bad design.

In the particular example given, the sub-vectors probably come from a common source. One could keep a big buffer (a single allocation) and an array of internal pointers. For an example of such a design holding a large array of text strings, see this blog entry and its associated github repo:

    https://www.spiria.com/en/blog/desktop-software/optimizing-shared-data/
    https://github.com/pierrebai/FastTextContainer
Roughly it is this:

    struct TextHolder
    {
        const char* common_buffer;
        std::vector<const char*> internal_pointers;
    };

This is of course addressing the example, but the underlying message is generally applicable: change your flawed design, don't hide your flaws.


Yes. There are also a number of pool/arena allocators in Rust which could be used here instead, to drop all entries at once.
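
For example, a rough sketch using the third-party typed_arena crate (the exact API and version are from memory, so treat it as illustrative): all entries live in a few large chunks, so dropping the arena is a handful of frees rather than one per entry:

    use typed_arena::Arena; // Cargo.toml: typed-arena = "2" (version illustrative)

    struct Record {
        id: u64,
        score: f64,
    }

    fn main() {
        let arena: Arena<Record> = Arena::new();
        let mut refs = Vec::with_capacity(1_000_000);
        for i in 0..1_000_000u64 {
            // Each value is placed inside one of the arena's large chunks.
            refs.push(arena.alloc(Record { id: i, score: 0.0 }));
        }
        println!("allocated {} records", refs.len());
        // Dropping `arena` here releases a few big chunks instead of doing a
        // million individual frees (Record has no Drop impl of its own).
    }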


Why would you ever write a get_size function that drops the object you call it on? Surely in an actual, non-contrived usecase spawning another thread and letting the drop occur there would just be plain worse?


I think the contrived use case is just for illustrative purposes? If I'm understanding correctly, the combination of cleanup code and deallocation can sometimes consume enough time that it's worth dispatching it on another thread. That's hardly specific to Rust though.

As you note that will certainly add some overhead, although that could be minimized by not spawning a fresh thread each time. It could easily reduce latency for a thread the UI is waiting on in many cases.


It would be helpful to see an example from a real application, too.


A very large Vec<String> (say a few million non-empty strings) would do I'd guess, Rust would drop the Vec which would recursively drop each String.
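
If you want to see that shape for yourself, a quick sketch (timings will vary with allocator and machine):

    use std::time::Instant;

    fn main() {
        // A couple of million heap-allocated strings...
        let v: Vec<String> = (0..2_000_000).map(|i| i.to_string()).collect();

        // ...and the recursive drop: one free per String plus one for the
        // Vec's own buffer.
        let t = Instant::now();
        drop(v);
        println!("dropped in {:?}", t.elapsed());
    }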


I start thinking about c++ 'placement new' and COBOL copy record structures.

And acquire/release objects from resource pools that manage themselves. And message queue handlers. And pass-by-reference, and object pools..


I believe this is contrived to prove a point.

And this isn't just a help in these contrived examples. I believe process cleanup (an extreme case of cleaning up objects) is one of cases where garbage collection performs better because it doesn't have to unwind the stack, call cleanup functions that are not in the cache, and make a lot of `free` calls to the allocator.

I vaguely remember reading about Google killing processes rather than having them clean up correctly, relying on the OS to properly clean up any resources of significance.

Now this doesn't mean you should do this in all cases. Profile first, see if you can avoid the large objects, and then look into deferred de-allocations ... if the timing of resource cleanup meets your application's guarantees.


Killing a process without freeing all allocations is, as far as I can tell, routine in C. Especially for memory it makes no sense "freeing" allocations; the whole memory space is getting scrapped anyway. Of course, once you add RAII the compiler can't reason about which destructors it can skip on program exit, and if programmers are negligent of this you get programs that are slow to close.


> Killing a process without freeing all allocations is, as far as I can tell, routine in C.

Many times by accident :)

> if programmers are negligent of this you get programs that are slow to close.

I wouldn't call that negligence, just not fully optimized.


exit(2) will only call destructors of static objects. Quick_exit not even those.


> killing processes rather than having them clean up correctly, relying on the OS

I recall Firefox preventing cleanup code from running when you quit a few years ago. Prior to that, quitting with a lot of pages open (ie hundreds) could cause it to lock up for quite some time.


Not at all, Herb Sutter has a CppCon talk about this kind of optimisation.

It is also the approach taken by C++/WinRT, COM and UWP components get moved into a background cleaning thread, to avoid application pauses on complex data structures reaching zero count.


I took this to be a contrived example to illustrate the point. I could imagine a process that creates a big data structure (e.g. parse an xml file), pulls some data out, and then drops the data structure. If you want to use that data sooner, you can push the cleanup off your thread.


It’s a contrived example to demonstrate the technique.


Right, there is no reason to pass ownership to a function like this.


For those wanting a real world example where this can be useful:

I am writing a static site generator. When run in "watch" mode, it deletes everything and starts over (I'd like to reduce these with partial updates but can't always do it). Moving that cleanup to a thread would make "watch" more responsive.


That's not really the same issue that is mentioned in the article though, is it?

The issue from the article would be solved by just passing a reference to the variable.

In your case, cleanup is an action that needs to be done before writing new files. So you have to wait for cleanup anyway, don't you ?


That's not true.

Typically any server with a watch functionality will have a mutable reference to the data that's being watched. When you change that data out you're both changing the mutable reference, and also deallocating any memory that was previously used. One could separate these two steps, moving the watched data to another variable that's dropped in another thread, if you wanted.
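
A minimal sketch of that separation, with a made-up SiteData type standing in for whatever the watcher holds: swap the fresh data into place, then let another thread pay for tearing down the old generation:

    use std::mem;
    use std::thread;

    struct SiteData {
        pages: Vec<String>,
    }

    fn swap_in(current: &mut SiteData, fresh: SiteData) {
        // Replace the watched data in place...
        let old = mem::replace(current, fresh);
        // ...and deallocate the previous generation off the watch thread.
        thread::spawn(move || drop(old));
    }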


Why can't it cleanup right after the work?


Or no cleanup at all. A CLI command that runs for a very short time can allocate memory to perform its job, print the result and exit. Then the OS releases all the memory of the process. No idea if Rust can work like this.


"Watch mode" for static site gens would mean you leave the process running and let it rebuild the site whenever a file changes, probably 10s to 100s of times in a typical run


std::mem::forget, which doesn't run destructors:

https://doc.rust-lang.org/std/mem/fn.forget.html


There is nothing unique to Rust about this; it is a very old technique. It is usually much inferior to the "arena allocator" method, where all the discarded allocations are coalesced and released in a single, cheap operation that could as well be done without another thread. That method is practical in many languages, Rust possibly included. C++ supports it in the Standard Library, for all the standard containers.

If important work must be done in the destructors, it is still better to farm the work out to a thread pool, rather than starting another thread. Again, C++ supports this in its Standard Library, as I think Rust does too.

One could suggest that the only reason to present the idea in Rust is the cynical one that Rust articles get free upvotes on HN.


> C++ supports it in the Standard Library, for all the standard containers.

I don't know what the situation is today, but in the past, the GCC standard library containers had non-trivial destructors when running in debug mode. Ensuring their proper invocation was required to avoid dangling pointers in their book keeping. Non-obvious and painful to debug.


Speedup numbers should be given when optimizing constant factors -- e.g. "I made this operation 5X faster using SIMD" or "By employing readahead, I sped up this file copy by 10X".

The points raised in this article are really different:

* don't do slow stuff in your latency-critical path

* threads are a nice way to unload slow stuff that you don't need done right away (especially if you have spare cores)

* dropping can be slow

The first and second points are good, but not really related to rust, deallocations, or the number 10000.

The last point is worth discussing, but still not really related to the number 10000 and barely related to rust. Rust encourages an eager deallocation strategy (kind of like C), whereas many other languages would use a more deferred strategy (like many GCs).

It seems like deferred (e.g. GC) would be better here, because after the main object is dropped, the GC doesn't bother to traverse all of the tiny allocations because they are all dead (unreachable by the root), and it just discards them. But that's not the full story either.

It's not terribly common to build up zillions of allocations and then immediately free them. What's more common is to keep the structure (and its zillions of allocations) around for a while, perhaps making small random modifications, and then eventually freeing them all at once. If using a GC, while the large structure is alive, the GC needs to scan all of those objects, causing a pause each time, which is not great. The eager strategy is also not great: it only needs to traverse the structure once (at deallocation time), but it needs to individually deallocate.

The answer here is to recognize that all of the objects in the structure will be deallocated together. Use a separate region/arena/heap for the entire structure, and wipe out that region/arena/heap when the structure gets dropped. You don't need to traverse anything while the structure is alive, or when it gets dropped.

In rust, probably the most common way to approximate this is by using slices into a larger buffer rather than separate allocations. I wish there was a little better way of doing this, though. It would be awesome if you could make new heaps specific to an object (like a hash table), then allocate the keys/values on that heap. When you drop the structure, the memory disappears without traversal.
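
A minimal sketch of the "slices into one buffer" idea (names made up): one big String plus (offset, len) pairs, so dropping the whole table is two deallocations no matter how many entries it holds:

    struct StringTable {
        buffer: String,
        spans: Vec<(usize, usize)>, // (offset, len) into `buffer`
    }

    impl StringTable {
        fn new() -> Self {
            StringTable { buffer: String::new(), spans: Vec::new() }
        }

        fn push(&mut self, s: &str) {
            let offset = self.buffer.len();
            self.buffer.push_str(s);
            self.spans.push((offset, s.len()));
        }

        fn get(&self, i: usize) -> &str {
            let (offset, len) = self.spans[i];
            &self.buffer[offset..offset + len]
        }
    }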


If I seriously wanted to move object destruction off-thread, I would use at least a dedicated thread with a channel, so I could make sure the dropper is done at some point (before the program terminates, at the latest). It also avoids starting and stopping threads constantly.

Something like this: https://play.rust-lang.org/?version=stable&mode=debug&editio...

You could have an even more advanced version spawning tasks into something like rayon's thread pool, I assume.
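
For the curious, a rough approximation of what such a helper might look like (my names, and the linked playground surely differs in details): a worker thread fed by a channel, joined when the handle is dropped so no deferred drops outlive main:

    use std::any::Any;
    use std::sync::mpsc;
    use std::thread;

    // Owns a worker thread; values handed to `defer` are dropped over there.
    struct Dropper {
        tx: Option<mpsc::Sender<Box<dyn Any + Send>>>,
        worker: Option<thread::JoinHandle<()>>,
    }

    impl Dropper {
        fn new() -> Self {
            let (tx, rx) = mpsc::channel::<Box<dyn Any + Send>>();
            let worker = thread::spawn(move || {
                // Each received box is dropped here, running its destructor.
                for _garbage in rx {}
            });
            Dropper { tx: Some(tx), worker: Some(worker) }
        }

        fn defer<T: Send + 'static>(&self, value: T) {
            // If the worker is somehow gone, the value just drops on this thread.
            let _ = self.tx.as_ref().unwrap().send(Box::new(value));
        }
    }

    impl Drop for Dropper {
        fn drop(&mut self) {
            drop(self.tx.take()); // close the channel so the worker's loop ends
            let _ = self.worker.take().unwrap().join(); // wait for pending drops
        }
    }

    fn main() {
        let dropper = Dropper::new();
        let heavy: Vec<Vec<u64>> = vec![vec![0; 1_000]; 10_000];
        dropper.defer(heavy); // returns immediately; cleanup happens on the worker
        // `dropper` is dropped at the end of main, which joins the worker first.
    }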


Someone is working on this as a direct response to this blog:

https://www.reddit.com/r/rust/comments/go4xcp/new_crate_defe...

And yes, spawning a thread for every drop is horrible. It's just to prove the concept. The defer_drop crate uses a global worker thread.


I'm not very familiar with Rust, but I don't understand why you wouldn't just use a reference-to-HeavyThing as the function argument, so that the object isn't moved and then dropped in the `get_size` function?


If you never drop it, you have a memory leak. If the caller drops it, it's still the same as the `get_size` dropping it in terms of performance impact.

Generally you'd only pass ownership when that's needed for some reason. So this toy example might not be realistic but it does demonstrate the performance impact.


For these contrived cases, yes, you would just pass a reference to the function but I think the point is to simplify the case down to demonstrate a point.


In the olden days, it was just out.flush(); out.close();


So the caller of the function still needs to free HeavyThing in the same thread.


You’re spot on: this is simply a bad example that you would never see in a real application.


One thing I just noticed is that the example doesn't make sure to actually run the new thread to completion before the main thread exits.

This means that if you do a "drop in other thread" and then main exits, the drop might never run. Which is often fine, as the exit of main causes process termination and as such will free the memory normally anyway.

But it would be a problem on some systems where memory cleanup on process exit is less reliable. Though such systems are more rare by now, I think.


It would have to be a non-desktop system.

I'm pretty sure Linux will always free process-private memory, and threads, and file descriptors when a process exits.

The only things that can leak in typical cases are some kinds of shared memory and maybe child processes?


It'd be interesting to implement this on a type that would defer all of these drop threads (or one big drop thread built off a bunch of futures) until the end of some major action, like sending the http response on an actix-web thread. Could be a great way to get the fastest possible response time, since then the client has their response before any delay on cleanup.


There is no such thing as a free lunch here, so it would reduce the unloaded response time but should have no effect (or a negative impact) on a highly loaded server's response time. I'm finding this out when benchmarking a message passing/queue management system. Anything I do to defer work onto a separate threadpool improves latency up to a point, then reduces throughput.


If you're bottlenecked, then certainly. There's no free lunch, but for us, problems that can be solved by simply scaling up the resources on the host are relatively cheap compared to expensive developer time. When we're purely focused on sales and not anywhere close to hitting a full mem/cpu bottleneck, this would be good.

This situation you describe sounds a lot like dealing with garbage-collection cycles, so you give a good recommendation on something to watch out for, as rust performing at the level of a GC'd language removes a big reason for choosing rust.


Looks like Evan Wallace ran into the same issue in practice in esbuild

https://news.ycombinator.com/item?id=22336284

I actually originally wrote esbuild in Rust and Go, and Go was the clear winner.

The parser written in Go was both faster to compile and faster to execute than the parser in Rust. The Go version compiled something like 100x faster than Rust and ran at something around 10% faster (I forget the exact numbers, sorry). Based on a profile, it looked like the Go version was faster because GC happened on another thread while Rust had to run destructors on the same thread.

ESBuild is a really impressive performance-oriented project:

https://github.com/evanw/esbuild

The Rust version also had other problems. Many places in my code had switch statements that branched over all AST nodes and in Rust that compiles to code which uses stack space proportional to the total stack space used by all branches instead of just the maximum stack space used by any one branch: https://github.com/rust-lang/rust/issues/34283

(copy of lobste.rs comment)


I used to do this sometimes with C++ when I realized that clearing out a vector with lots of objects was slow. Is Rust basically based on unique_ptr? One problem with this approach was that you still had to wait for these threads when the application would shut down.


> Is Rust basically based on unique_ptr?

Rust is based on ownership and statically checked move semantics (the default, though types can opt out via Copy). So each item has a single owner (which is why Rust deals very badly with graphs, and more generally any situation where ownership is unclear) and the compiler will prevent you from using a moved object (unlike C++).

Separately it has a smart pointer which is the counterpart of unique_ptr (Box), with the guarantee noted above:

    let b = Box::new(1);
    drop(b);
    println!("{}", b);
will not compile because the second line moves the box, after which it can't be used because it's been removed entirely from this scope.


> which is why Rust deals very badly with graphs, and more generally any situation where ownership is unclear

To be fair, so do 90+% of programmers. Much of rust's benefit in safe code is training programmers to avoid code like that where possible, and spreading design patterns that avoid it.


Rust basically gives the compiler understanding of unique_ptr and prevents you from using it after you’ve moved it.


Would you have to keep track of these threads in Rust? I have done a lot of desktop development where you have to be aware of what happens during shutdown. Seems a lot of server guys write their code under the assumption that it will never shut down.


You would need to add `thread.join()` at the end of main, or have some RAII guard that does it for you.

In practice that's probably optional, because the heap and all resources are usually torn down with the process anyway. Important things, like saving data or committing transactions, shouldn't be done in destructors.


The answer to this question is the same for any language without a heavy runtime. You can choose to join the worker thread, detach it, or kill it.


> One problem with this approach was that you still had to wait for these threads when the application would shut down.

If you know that an object will live for the rest of the program and not need any finalization logic, Rust allows you to "leak" it and save that overhead on shutdown.
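
A tiny sketch of that: Box::leak hands back a 'static reference and never frees the allocation, and std::mem::forget similarly skips the destructor entirely:

    fn main() {
        let cache: Vec<String> = vec!["expensive".to_string(); 1_000];

        // Intentionally leaked: no destructor runs, and the data stays
        // usable for the rest of the program.
        let leaked: &'static mut Vec<String> = Box::leak(Box::new(cache));
        println!("{} entries, never freed", leaked.len());
    }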


You could have just one thread and kill it at exit. Do not start new threads for each closure that drops an object; send closures into one special thread instead.


Out of curiosity, how did you do that in C++?


It depends. Either iterate over the vector and delete the objects or just call clear(). Obviously you have to be sure that nobody else is accessing it at the same time.


There's a worse case in deallocation. Tracing through a data structure being released for a long-running program can cause page faults, unused data having been swapped out. This is part of why some programs take far too long to exit.


Completely different dynamic (because no Rust GC), but this reminds me of how Twitch made their server, written in Go, a lot faster by allocating a bunch of dummy memory at the beginning so the garbage collector doesn't trigger nearly as often:

https://news.ycombinator.com/item?id=21670110


The Java equivalent to the Go case would simply be adjusting the -Xms flag. The Go approach is needlessly convoluted because the runtime doesn't offer any tuning knobs.

As for the rust case, if you squint then it's similar to a concurrent collector.


Contrived examples like this are ridiculous. Creating such a heavy thing is likely even more expensive than tearing it down. So unless you create it on a separate thread, you probably shouldn't be freeing it on a separate one. It's not going to solve your interactivity problem. If you are creating the object on a separate thread then it's already going to be natural to free it on a separate one too.


Something is better than nothing.


Maybe some sort of “collector” could come by a clean up “garbage” memory periodically to improve performance…


I wonder if it might be possible for OS's to provide a fast, asynchronous way of deallocating memory.


Does anyone know how this would work in Go?



I have no idea, but my guess is that it doesn't matter, as the deallocation is being done by the garbage collection.


Minor typo, `froget` instead of `forget`


Is this applicable for Go as well ?


The obvious solution would be to borrow the HeavyThing instead of having it dropped inside the function.


I guess choosing when or how a program deallocates is important in a language that's close to the metal.

Rust tries to be zero cost while providing abstractions that make it seemingly a high level language but ultimately things like this show that it's not exactly zero cost because abstractions can incur hidden penalties. There needs to be some internal syntax that allows a rust user to explicitly control deallocation when needed.

If I started reading code where people would randomly move a value into another thread and essentially do nothing I would be extremely confused. Any language that begins to rely on "trick" or "hacks" as standard patterns exposes a design flaw.

Maybe if rust provided special syntax that a function can be decorated with so that it does deallocation in another thread automatically? Or maybe an internal function called drop_async...? This would make this pattern an explicit part of the language rather than a strange hack/trick.


Pass by reference?


If you pass by reference the heavy object won't be dropped. If your goal is to drop a heavy object, this is a cool way to do it.


If freeing the data structure in question takes this long, how much time are you wasting duplicating the data structure?


This code doesn't duplicate it. In Rust, when a variable is sent as an argument to a function, its "ownership" moves to the scope of that function.

https://doc.rust-lang.org/book/ch04-01-what-is-ownership.htm...


You're missing my point. Unless the only thing you want to do with your giant data structure is measure its size, you're not going to be passing ownership of your only copy of it into the get_size function. You're going to be passing in a copy -- hence the cost of duplicating everything.


Rust is stricter about aliasing than C++ is.

Vectors are the size of 3 pointers (data, size, capacity), so I guess 24 bytes on x64.

Even if the move requires a memcpy, it's only copying that 24 bytes - The heap allocation is not copied, because there are never two owners of the vector at once.


It is just an example. You can think of "measuring size" here as getting the result of a long computation that involves a lot of allocations. After you get the result, you no longer care about the intermediate stuff – i.e. all the allocations. You certainly don't want to duplicate them, you just want to get rid of them, and the author tells that you might not want to deallocate (drop) them in the UI thread.

If it helps you, you might want to imagine the contents of the get_size function as being the end part of a longer calc_foo function. What's really missing the point is focusing so hard on the part that the example even contains a call to size() of a collection.


In the different contrived case where it gets copied, you'd instead change this code to take an immutable reference to it, and compute the size of that. Or you'd call .size() instead of calling this function!


I’m actually very curious why it takes this long; is Rust memseting the buffer when dropping it?

Edit: it seems like turning on optimizations seems to improve the situation quite a bit. Not sure why they were profiling the debug build.


> I’m actually very curious why it takes this long; is Rust memseting the buffer when dropping it?

Regardless of memset and optimizations, consider a particularly complicated object which lives on the heap and contains hundreds of other nested objects (which themselves contain nested objects, etc). Now imagine that a significant fraction of them make use of RAII. That cleanup code can't be elided.

That being said, it's a pretty bad example if they were actually profiling the debug build ...


> it seems like turning on optimizations seems to improve the situation quite a bit.

I'm not seeing that on my local machine? Were you comparing on the Playground which would be quite variable in its results?

    > cargo build
       Compiling foo v0.1.0 (/private/tmp/foo)
        Finished dev [unoptimized + debuginfo] target(s) in 0.42s
    > ./target/debug/foo
    drop in another thread 52.121µs
    drop in this thread 514.687233ms
    >
    >
    > cargo build --release
       Compiling foo v0.1.0 (/private/tmp/foo)
        Finished release [optimized] target(s) in 0.47s
    > ./target/release/foo
    drop in another thread 48.418µs
    drop in this thread 548.005373ms


I saw an increase of about 2x on my computer, though I didn't take too much effort to control for noise.


> Edit: it seems like turning on optimizations seems to improve the situation quite a bit. Not sure why they were profiling the debug build.

This is the most important point in the thread, since it invalidates the results for most purposes.


Not completely, it's still 2-3 orders of magnitude slower.


You're right, I expected it would make a bigger difference


The thing is, it's not slow because Rust is doing anything wrong or unoptimized; it's slow because cleaning up insane amounts of memory allocations is slow.

Also if you run this:

    fn main() {
        ::std::thread::spawn(move || { println!("end") });
        println!("Hello, world!");
    }

You might notice that "end" might not be printed, because the main thread exits before it prints and terminates the process. This means that the dropping might actually not happen if it's at the end of the program, and nothing is faster than not doing the work.

Also, it's a not uncommon pattern in small user-facing CLIs to leak (memory) resources, as they (should) be cleaned up with process termination.


hmm my first thought is, having to do that is a lot like C and cleaning up my own allocations. This feels like something rust should automatically do for me?


Rust will automatically clean up data that has left scope, but you can also accomplish this manually with the "drop" function, which is only necessary if you want to clean up explicitly, such as in a different thread.

Interestingly, the drop function is actually user-creatable. It's just an empty function with a very permissive non-reference argument. The semantics of ownership in Rust make that sufficient to trigger memory cleanup.
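
For reference, the standard library's drop really is just an empty function that takes its argument by value; moving ownership into the call is what triggers the cleanup when it returns:

    pub fn drop<T>(_x: T) {}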


In C, if you forget to clean up, you have a memory leak which is hard to track down. In Rust, if you don't do this, you're not sacrificing memory leaks, only performance. A profiler can tell you when you should drop asynchronously.


>A profiler can tell you when you should drop asynchronously

Is there any profiler that does this today?

What are the drawbacks with asynchronous drops?


> Is there any profiler that does this today?

It requires interpretation, but yes.

> What are the drawbacks with asynchronous drops?

You pay some overhead for enqueuing work for later. Taken too far, lock contention / false sharing can make performance way worse.

The allocators and destructors must be thread safe if offloading to a worker thread.

The core running your thread is likely to have some of this data in L1/L2/L3 cache, which might not be true for whatever core would dequeue the work for asynchronous dropping.

It can be harder to attribute the costs of dropping to the right code if it all gets mixed up into a single work queue cleaned up by opaque worker threads when profiling.

If you don't use some kind of backpressure mechanism, allocations can potentially outpace deallocations and run you out of memory.
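
To make that last point concrete, a minimal sketch (not from the parent) using a bounded std channel, so the allocating thread stalls instead of racing ahead of cleanup and exhausting memory:

    use std::any::Any;
    use std::sync::mpsc;
    use std::thread;

    fn main() {
        // At most 1024 pending drops; beyond that, send() blocks the producer.
        let (tx, rx) = mpsc::sync_channel::<Box<dyn Any + Send>>(1024);
        let worker = thread::spawn(move || for _garbage in rx {});

        for i in 0..10_000u64 {
            let heavy: Vec<Vec<u64>> = vec![vec![i; 64]; 64];
            tx.send(Box::new(heavy)).unwrap(); // backpressure happens here
        }

        drop(tx);               // close the channel so the worker's loop ends
        worker.join().unwrap(); // make sure all deferred drops actually ran
    }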

----

So, concrete example: Using Telemetry - a flamegraph style realtime profiler requiring invasive annotations - I was able to track down the cause of a framerate hitch in a game I was working on, to the sudden release of several graphics resources in a game. During events which would significantly restyle the look of some of the terrain, we'd eat several 10s/100s of milliseconds of overhead freeing things - more than enough to cause us to miss vsync. Would've stuck out like a sore thumb in any profiler capable of giving you a rough idea of the stack(s) involved in a 100ms timeframe that you can correlate to a vsync miss / missed frames.

D3D9 isn't thread safe (although freeing resources might've been?), but I didn't need to offload the work onto another thread just to amortize the cost over a few frames. Instead, a simple work queue did the trick. Problem solved! New problem: level transitions took significantly longer when doing mass frees of the same resources - more than doubling the cost of deallocation IIRC, for reasons I never did fully understand. Cache thrashing of some sort? We were still maxing out the core running the main thread with mostly cleanup logic...

Final code we shipped with used a hybrid solution that would choose between synchronous (high-throughput) and asynchronous (non-stalling) cleanup logic depending on what was happening in-game. Worked like a charm. Of course, this logic was hideously project specific and unable to be automatically chosen correctly for you by the programming language...



As I understand it, rust is automatically cleaning up, and that can cause glitchy timing. The clever hack is that rust lets you shunt that cleanup process off to another thread when you're the sole owner of that object. You can do the same thing in C, but unlike rust, the cost of cleanup is not hidden by the syntax.


On further reflection... I'm curious about how allocators would handle this -- if you return from this context only to make another heavyweight object, it seems like you'd be trading glitchy timing with allocator contention.


Because it's impossible to do this automatically in the general case.

In particular, types may not be sendable to other threads, or may have side effects on dropping, and in those cases you would need to rearchitect the code before you can apply this technique.

Also this technique adds overhead, so it should never be used (including not doing it conditionally) if you don't care about latency or if the objects are always small, and the compiler cannot know whether that is the case.


Oh my good god.

I'm hoping this is down to developer naivety rather than being a feature of rust.


It's not a feature of Rust; it's a "feature" of the way we design operating systems and processors. This is the same in C.


The same could happen in C++, I think. Destructors are supposed to be called recursively.


1) he should pass by reference to avoid the extra copy. So in his example yes it’s dev naivety

2) but somewhere somehow this object will deallocate, so his trick of putting it on another thread would work if the deallocation takes a while. Same for C++ if you have a massive object in a unique_ptr. So it's not a Rust issue


Where's the extra copy? I don't see one. He's moving the struct into the function, getting size and then dropping it.


> avoid the extra copy

there is no copy happening here


In other words, Rust's automagical memory deallocation is NOT a zero-cost abstraction:

  fn get_len1(things: HeavyThings) -> usize {
      things.len()
  }

  fn get_len2(things: HeavyThings) -> usize {
      let len = things.len();
      thread::spawn(move || drop(things));
      len
  }
The OP shows an example in which a function like get_len2 is 10000x faster than a function like get_len1 for a hashmap with 1M keys.

See also this comment by chowells: https://news.ycombinator.com/item?id=23362925


No, the zero-cost refers to the abstraction (and its runtime cost), which still is zero-cost. Deallocating is part of the normal workload, not the abstraction.

Also this isn't Rust specific. Most (all?) RAII languages are affected, and many GC approaches have this effect too. Some do add additional abstraction to magically always or sometimes put the deallocation on another thread.

But deallocating in another thread is not generally good or bad. There are a lot of use-cases where doing so is rather bad or can't be done (e.g. when TLS is involved). Rust and other similar RAII languages at least let you decide what you want to do.

Now it's (I think) generally known that certain kinds (not all) of GC do make some things simpler for GUI-like usage, though they also tend to offer less control.

Note that it's a common pattern for small user-facing CLI tools (which are not GC'ed) to leak resources instead of cleaning them up properly. You can do so in Rust too if you want, but it's a potential problem for longer-running applications.

Also, here is a `get_len` that is faster than both and more idiomatic Rust than both:

    fn get_len1(things: &HeavyThings) -> usize {
        things.len()
    }

If you have a certain thread (e.g. UI thread) in which you never want to do any cleanup work you can consider using a container like:

    struct DropElsewhere<T: Send + 'static>(pub Option<T>);

    impl<T: Send + 'static> Drop for DropElsewhere<T> {
        fn drop(&mut self) {
            if let Some(value) = self.0.take() {
                std::thread::spawn(move || drop(value));
            }
        }
    }

You can optimize this with `ManuallyDrop` to have close to zero runtime overhead (it removes the `take` and `if let` part).
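
A rough sketch of that variant (my interpretation of the comment above): no Option and no branch on the normal path, at the cost of a small unsafe block in Drop:

    use std::mem::ManuallyDrop;
    use std::thread;

    struct DropElsewhere<T: Send + 'static>(ManuallyDrop<T>);

    impl<T: Send + 'static> DropElsewhere<T> {
        fn new(value: T) -> Self {
            DropElsewhere(ManuallyDrop::new(value))
        }
    }

    impl<T: Send + 'static> Drop for DropElsewhere<T> {
        fn drop(&mut self) {
            // Safety: self.0 is never touched again after this take.
            let value = unsafe { ManuallyDrop::take(&mut self.0) };
            thread::spawn(move || drop(value));
        }
    }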


> No the zero-cost refers to the abstraction (and runtime cost), which still is zero-cost. Deallocating is part of the normal work load not the abstraction.

Yeah, you're right. In hindsight my comment was poorly thought-out and poorly written.


Nothing about how Rust handles deallocation is magical in any way.

It's also definitely a zero-cost abstraction as far as I can see, because the manual solution that's equivalent to get_len1() would be to essentially call free() on things. That would ultimately suffer from the same problem.


Yeah, you're right. In hindsight this was a poorly thought-out and poorly written post on my part.


Why should I ever need to drop a heavy object just to get a size? Not in C++ and also not in Rust; the different-thread idea is just creative stupidity, sorry.


It seems that this would be a great reason to not pass the entire heavy object through your function, and to instead pass it as a reference. When passing an object (rather than a reference to an object) there's a lot more work going on both in function setup, and in object dropping. I'm not a rust guru, so I don't know the precise wording, but it's simple enough to realize that if this function, as claimed, must drop all the sub-objects within the `HeavyObject` type, then those objects must have been copied from the original object.

If you instead define the function to take in a reference (by adding just two `&` characters into your program), the single-threaded case is now almost 100x faster than the multithreaded case.

Here's a link to a Rust Playground with just those two characters changed: https://play.rust-lang.org/?version=stable&mode=debug&editio...

Note that the code that drops the data in a separate thread is not timing the amount of time your CPU is spinning, dropping the data. So while this does decrease the latency of the original thread, the best solution is to avoid copying and then freeing large, complex objects as much as possible. While it is of course necessary to do this sometimes, this particular example is just not one of them. :)

As an aside, I'm somewhat surprised that the Rust compiler isn't inlining and eliminating all the copying and dropping; this would seem to be a classic case where compiler analysis should be able to determine that `a.size()` should be computable without copying `a`, and it should be able to eliminate the function call cost as well. Manually doing this gives the exact same timing as my gist above, so I assume that this is happening when passing a reference, but not happening when passing the object itself.


As already mentioned, Rust wasn't copying anything; the `HashMap` is not a `Copy`-able type, so it was just moved around (it's also not very large: all its items are behind a pointer to the heap).

All you did was move the drop from the `fn_that_drops_heavy_things` to the end of `main`, where it is outside the timing function.


> but it's simple enough to realize that if this function, as claimed, must drop all the sub-objects within the `HeavyObject` type, then those objects must have been copied from the original object.

Untrue. Rust uses move semantics (or shallow copies for types that implement the Copy trait, via e.g. memcpy - no, you can't customize this!). Deep copies require explicitly calling methods like ".clone()". So HashMap's pointers and sizes do get memcpy'd... 56 bytes on the 64-bit playground currently.

This is similar to how std::move(...)ing a std::unordered_map in C++ nulls out the old object and just copies the pointers of the container - not a deep copy of the subobjects - which in similar C++ code would turn the main thread's destructor into a noop.

The main difference from C++ is: Rust handles this at the language level instead, and doesn't call the dtor at all on the main thread at all if the value was moved. No need for manual movement logic - it is the default, for everything. Also unlike C++, it also prohibits you from using the old moved-from object at compile time, preventing bugs.


Rust isn't copying anything; everything in the original code would be a move.


If your function takes a reference to the object, something still needs to free it.



