Rust: Dropping heavy things in another thread can make your code 10000x faster (abramov.io)
481 points by timooo on May 30, 2020 | 273 comments



Some important things I think people should note before blindly commenting:

* The example code is obviously contrived. The real gist is that massive deallocations in the UI thread cause lag, which the example code proves. That very thing can easily happen in the real world.

* I didn't see any difference on my machine between a debug build and a release build.

* The example is performing 1 _million_ deallocations. That's why it's so pathological. It's not just a "large" vector. It's a vector of 1 million vectors. While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon, and which would likely suffer the same performance penalty.

* Rust is not copying anything, nor duplicating the structures here. In the example code the structures would be moved, not copied, which costs nothing. The deallocation is taking up 99% of the time.

* As an aside, compilers have used the trick of not free-ing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical, we do deallocation trickery in the real world.
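For anyone curious what that exit-path trick looks like in Rust, here's a minimal sketch (the data structure and function names are made up purely for illustration):

    use std::mem;

    fn do_work(_data: &[Vec<String>]) {
        // stand-in for the compiler's actual work
    }

    fn main() {
        // Build a compiler-sized pile of tiny allocations.
        let huge: Vec<Vec<String>> = (0..1_000_000)
            .map(|i| vec![i.to_string()])
            .collect();

        do_work(&huge);

        // Skip the ~1 million destructor calls on exit; the OS reclaims
        // the whole address space when the process terminates anyway.
        mem::forget(huge);
    }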


>As an aside, compilers have used the trick of not free-ing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical, we do deallocation trickery in the real world.

This reminds me of the exploding ultimate GC technique [1]:

> on-board software for a missile...chief software engineer said "Of course it leaks". ... They added this much additional memory to the hardware to "support" the leaks. Since the missile will explode when it hits its target or at the end of its flight, the ultimate in garbage collection is performed without programmer intervention.

[1]: https://devblogs.microsoft.com/oldnewthing/20180228-00/?p=98...


Yes, Java also has a garbage collector that does nothing, called the Epsilon GC, intended for short-lived programs and as a reference for garbage collector benchmarks.[0]

[0]: https://blogs.oracle.com/javamagazine/epsilon-the-jdks-do-no...


The bumper sticker is “Ada programmers prove the algorithm terminates.”


I guess the bigger question is why they use dynamic memory allocation in the first place.


>As an aside, compilers have used the trick of not free-ing data structures before, because it provides a significant performance boost. Instead of calling free on all those billions of tiny data structures a compiler would generate during its lifetime, they just let them leak. Since a compiler is short-lived it's not a problem, they get a free lunch (pun unintended), and the OS takes care of cleaning up after all is said and done. My point is that this post isn't theoretical, we do deallocation trickery in the real world.

And then someone tries to use your compiler as a service (code analysis, change-triggered compilation) and it's a dead end.


Well, then that’s not the original use case anymore, and it’ll have to be re-engineered. In the meantime it may have been used for years and the perf difference may have saved many developer-years collectively across its user base. Surely you’re not suggesting that the compiler developers should be prematurely optimizing for future use cases that they may not even have envisioned.


Avoiding leaks is not an optimisation, it's a matter of correctness. Not freeing memory is an optimisation based on a very shortsighted assumption, one that is not practical for any new language (modern languages are expected to come with language server support).


We used precisely this optimization in [sorbet](https://sorbet.org), a brand-new type checker for Ruby, which also contains a high-performance LSP server.

We wrote the entire thing (and tested, using ASAN and fuzzers and other techniques) to avoid leaking memory, and then strategically inserted [the equivalent of a rust `mem::forget`](https://github.com/sorbet/sorbet/blob/0aae56e73c7680ec6053b3...) into the end of the `main` driver during standalone mode, to avoid calling those destructors when we're about to exit anyways.

This optimization is definitely still relevant for new systems today.


Correctness means adherence to the spec, not some contrived absolute truth.


But the spec usually has some implicit assumptions. Usually it's "app doesn't leak memory" in the same way nobody explicitly specifies "result of an addition of natural numbers should match ...".

We don't go around saying "oh, you didn't want modulo 5 arithmetic? You should've put that in the spec, not rely on some contrived absolute truth".


Okay, but we're talking about an application, a piece of software whose primary purpose is to run for a short period of time, parse a text file and transform it. It will be run many, many times a day by developers whose time is expensive. Language servers are a very new concept in terms of being part of the day-to-day toolchains for most developers. Trading garbage collection for compile time was absolutely a winning strategy for compiler writers. It made them money.


Language servers go back all the way to Xerox PARC workstations and Lisp machines.

C++ had its very first language servers via Lucid's Energize C++ and Visual Age for C++ v4.

Here is the 1993 video and related paper from Lucid.

https://www.youtube.com/watch?v=pQQTScuApWk

http://www.dreamsongs.com/Cadillac.html

And some information regarding VA; unfortunately most of it is missing from the online world.

https://www.ecomstation.it/pido2/home/esterni/vac40os2.pdf (codestore)

http://www.edm2.com/index.php/VisualAge_C%2B%2B_4.0_Review

It is also why Delphi, VB and C++ Builder were already such a pleasure to use versus the Makefile and vi world of UNIX.


"We made dodgy choices to be competitive" is a valid explanation. They don't start being dodgy when new use cases come around and technical debt bites you in the butt.

I see not freeing as a clever bug/workaround with nice positive side effects. Not as a clever solution.


But the memory doesn't leak, it gets freed when the program exits.

The user doesn't care that each allocation is paired with another call to free it up. They just care if the program runs quickly and doesn't use too much memory overall.


Unless the system doesn't have enough memory to allocate for large compilation units and starts swapping allocated/dead memory and slowing down massively. Unless the user runs a CI where as many things should be compiling at the same time as possible and slowly freeing memory in 3 concurrent processes is better than running only 1 at a time. Unless they have spare cores, but no spare memory. Unless ...

> They just care ...

It all depends on who "they" are and what they want to achieve.


It’s usually the opposite! Correctness work usually has the implicit assumption of infinite memory.


> nobody explicitly specifies "result of an addition of natural numbers should match ..."

Yes they do. In most programming languages, by default integer arithmetic is modulo 2^64 (at best). If you want arbitrary precision arithmetic, you have to explicitly specify that.


Why do you say that? Even if you call free immediately after a piece of memory is no longer needed, malloc won’t release that immediately anyway.

If this is incorrect, then every modern malloc implementation is incorrect.


You have not provided any refutation of the OP's argument.


I am suggesting they apply good practices. I'd never imagine that compilers were actually doing what was stated -- sounds awful.

I understand it's tradeoffs and we all have real-world limitations to contend with -- but again, of all the corners that could be cut that's exactly the one I didn't imagine they would.

Nasty.


Of the three compilers I've worked on in-depth, only one of them had a "normal" memory management scheme.

One of them was unburdened by any thought of freeing stuff, and relied entirely on the application exiting for cleanup. This was very convenient to work with, and never ended up posing an issue.

Another used a series of allocation arenas, where certain arenas would be cleared at certain points in the compiler pipeline. This made for both speedy alloc/freeing and avoided leaks, since you weren't at risk of "forgetting" a data structure. It was also a major headache to keep track of exactly what the longest lifetime of a long-lived data structure might be, and to pick an arena that won't be cleared in the meantime. Unfortunately the programs compiled with this compiler were large enough that we certainly couldn't have gotten away with just leaking memory; we sometimes OOMed as-is!

The third used standard C++ memory management. This compiler was quite simple, and the vast majority of its data used stack-based lifetimes. For a more complex compiler this would've become a headache.

I think that all of these compilers chose the correct allocation strategy for what they were doing. "Good practices" aren't as universal as we might like to believe, they depend entirely on the context in which a tool is designed to operate. And yes, we can guard to some extent against that context changing, but for the most part that's why we keep getting paid.


Judging by the downvotes I am getting (with zero explanation as to why), I'd say that many misunderstood the "good practices" part and took offence, as if I said there's only one good practice.

I was taught -- including at the start of my career when I used exclusively C/C++ (about 18.5y ago) -- to take care of all resources I was using and not rely on runtimes.

I understand and appreciate different usages but to me doing a proper cleanup was the sane default for most programmers. And that's all I was saying.

Obviously, as one digs deeper in a specialised area where more and more efficiency is demanded then they have to reach for tools that most of us wouldn't normally. That's quite normal and was always interesting for me to read about.


I didn't downvote, but asides like "sounds awful" and "nasty" definitely read as judgemental. Your other comments on this thread make you seem much more open minded than this one did.


Can you articulate why it's a bad practice? If it works better than alternatives and it's documented, not really sure what the issue is.

I don't think it's even that uncommon. I believe some HFT firms run Java with a huge amount of RAM and GC disabled, and get around it by just rebooting the software occasionally.

To me writing software like that is fair game, I don't see the point in being dogmatic about "how things should be done".


This thread has a lot of branches, so I am just picking one for this general reply:

I recall reading somewhere, years ago, that some OSes couldn't be relied upon to release unfreed memory when a process terminated. In those contexts, fastidious freeing would be important even in short-lived processes.

In my own C code I tend to free everything so that I get a clean trace from Valgrind and don't risk masking legitimate bugs, but I typically write long-running daemons.


It is bad practice if your code cannot turn on deallocation for debugging purposes or library-like usage.


Mostly because I look at it from the angle of one-off / general purpose / CLI programs. If one such has to run for 10-30 seconds and its memory just keeps growing and growing with the idea of throwing it all away at the end and letting the OS handle it, it might become disruptive for other programs on the machine.

For specialised apps and servers it's of course a perfectly good practice.


Deallocation at the end of a program's execution can substantially add to its runtime, and it's entirely wasted work. Skipping it is a much more common strategy than you might think.


You are right, I indeed didn't know it was that common.

But still, in a world where languages and runtimes are also judged by their ability to run in lambda/serverless setups, I'd think this practice will start being obsolete, wouldn't you think?

(What I mean is that I imagine that any serverless function that runs in severely constrained and measured environments like the AWS Lambda would gain a significant edge over the competition if it did an eager cleanup. Should allow more of them to work in parallel?)


I don't understand; a lambda function executes and closes, and any leak gets handled by the lambda runtime (by freeing the whole lambda function) -- exactly the same as running a CLI tool. It seems to me that the serverless context is actually exactly where you'd employ this strategy: the programs by definition cannot be long-lived, so unless they generate so much garbage as to OOM in that timeline, there is no need to deallocate anything yourself.

The leak to watch out for is reduced to the persisted data (eg in S3) between executions


Yeah, this is what I was saying -- that it could produce too much garbage if no manual free-ing of memory is done.

I suppose, judging by the downvotes, that some find that perspective naive so I'll just cut it here because no productive discussion is happening. :(


the purpose/pattern is to ignore deallocation when the program finishes -- not during runtime. If you have 30MB of space, and your program will only ever use 20MB, why bother?

You only need to care when you're at risk of OOM'ing during runtime. If you're generating too much garbage, you either need to start cleaning up after yourself, or you need to generate less of it -- but it doesn't matter which choice you take from the perspective of the OS or the server-less function.

The only reason I can think of for serverless to impact the validity of this pattern is its lower memory capacity -- but otherwise it's the same constraints. And anyways, it's probably not nice to generate a few GB of garbage even if the user has the space.


Most compilers are not designed to run in daemon mode; specifically, it's a non-issue since their startup is normally fast. And the compiler and the runtime are different things.


I realise that, but nowadays language servers are a pretty normal practice in no small amount of areas.


You mean like LSP? Why would a compiler and a language server be integrated in the first place? Clearly an LSP server wouldn't use a no-freeing strategy, but there's no reason why it would cause issues with a compiler. A compiler must terminate, assuming parsing isn't Turing complete (like Perl) and that the source code is finite. This is why it's okay for it to leak memory.


> Why would a compiler and a language server be integrated in the first place?

Because a language server needs to do a lot of the same work as a compiler.

This is an eventual hope for rustc. For now, the latest language server and it share a bunch of libraries, but language servers are effectively compiler frontends.


Yep, I guess I conflated compilers and LSP. I imagined compilers would have daemon modes. Seems like I was wrong.


> I'd never imagine that compilers were actually doing what was stated -- sounds awful.

And how many millions of iterations have been done successfully in that "awful" system?

The very fact that you never imagined it I think says a lot.


Well, I have been taught to take good manual care of all used resources. It is kind of a cognitive shock when you see all your training hand-waved away with "let the OS handle my mess of allocated objects that I'll never call `free` on". It's kind of disappointing on some level. :)

As I acknowledged in other comments of mine downthread, I understand that different situations require different tradeoffs. It's just that forgoing memory deallocation wasn't one of them in my head.


In a world where processes can fork-and-exec, nothing about "as a service" changes that. The compiler would just be reinvoked as needed. Converting it into a persistent process breaks a lot more than just allocation optimizations.


But you want to share state and only update state incrementally on edit to get any reasonable level of performance for stuff like language server code analysis.


I'm not sure what the performance overhead would be, but that state can easily be stored on disk.


I'd bet it's a lot harder to rework the compiler to update its state in place than it is to rework it to free memory.


As distasteful as leaky code is, is it that bad to run it in a separate process? You get a bit more robustness against crashes as well.


You want to be able to incrementally update state to get performance out of incremental code analysis (e.g. a language server implementation for an IDE).


There are more ways to go than that: AST fragment caching, intermediate representation fragment caching and so on. Incremental updates fit some languages better than others.


One of my “favorite” snags in perf analysis is that periodicity in allocations can misattribute the cost of allocations to the wrong function.

If I allocate just enough memory, but not too much, then pauses for defragmentation of free space may be costed to the code that calls me.

A solution to this that I’ve seen in soft real time systems is to amortize cleanups across all allocations. Every allocation performs n steps of a cleanup process prior to receiving a block of memory. In which case most of the bad actors have to pay part of the cost of memory overhead.

Might be good for Rust to try something in that general realm; the cleanup side may be easier to tack on. On free, set a ceiling for operations and queue what is left. That would at least shave the peaks.
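A rough sketch of what that amortized scheme could look like in Rust (a hypothetical helper, not an existing allocator API): each allocation "tick" retires a bounded number of queued frees, so no single call pays the whole cleanup bill.

    use std::any::Any;
    use std::collections::VecDeque;

    /// Hypothetical: values handed to `defer` are dropped a few at a time,
    /// paid for incrementally by later calls to `tick`.
    struct AmortizedDropper {
        pending: VecDeque<Box<dyn Any>>,
        budget: usize, // max number of drops performed per tick
    }

    impl AmortizedDropper {
        fn new(budget: usize) -> Self {
            AmortizedDropper { pending: VecDeque::new(), budget }
        }

        fn defer(&mut self, value: Box<dyn Any>) {
            self.pending.push_back(value);
        }

        /// Call this on every allocation: each pop drops one deferred value.
        fn tick(&mut self) {
            for _ in 0..self.budget {
                if self.pending.pop_front().is_none() {
                    break;
                }
            }
        }
    }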


Doing unrelated cleanup sounds like flushing CPU cache per every allocation.


This post is already about unrelated cleanup, so I'm not sure what other escape hatch you imagine people taking. Do you have a suggestion?

You can tell that it's unrelated cleanup because if it were related, then the cost of freeing wouldn't be noteworthy. It would be cache hot because you would be visiting it for the second time. In which case we'd be talking about why you are scanning a giant object on an event loop in the first place. That's not what's at issue. What's at issue is that you've been handed this great bomb of uncached data from someone else and now you're stuck doing the janitorial work.

Freeing an object of arbitrary size is effectively an unbounded operation. Cache invalidation has a very high cost, sure, but it's still bounded.

Putting a limit on the amount of work you do, you could stop before purging the entire cache. You could use a smaller limit on doing work for someone else and control invalidation there, too.


This deallocation trick is neat but in C and C++ you could use a memory pool to do this.

In theory, you could also use a memory pool in Rust but I think the standard library uses malloc without some way of overriding this behaviour.


You can also use the typed-arena crate[0] or roll your own if you're feeling like cracking open unsafe.

[0] https://crates.io/crates/typed-arena
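A tiny usage sketch, in case it helps:

    use typed_arena::Arena;

    fn main() {
        let arena: Arena<String> = Arena::new();

        // Allocations are just pointer bumps within the arena's chunks.
        let a = arena.alloc(String::from("hello"));
        let b = arena.alloc(String::from("world"));
        println!("{} {}", a, b);

        // Everything is dropped together when `arena` goes out of scope here.
    }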


Typed-arena is a bit limited, because all the allocated values must be of the same type. bumpalo[0] removes that restriction, but it also doesn't run destructors, which can lead to memory and other resource leaks. Another worry I have with bumpalo is that I don't think it's been reviewed thoroughly for unsoundness.

Still, bumpalo has been used to great effect in dodrio[1], a React-like library for Rust with really good performance.

[0]: https://crates.io/crates/bumpalo

[1]: https://github.com/fitzgen/dodrio
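For comparison, a minimal bumpalo sketch; note the caveat above, destructors of arena-allocated values are not run:

    use bumpalo::Bump;

    fn main() {
        let bump = Bump::new();

        // Values of different types can share the same bump arena.
        let n = bump.alloc(42u64);
        let msg = bump.alloc_str("hello from the arena");
        println!("{} {}", n, msg);

        // Dropping `bump` releases the backing chunks in one shot,
        // but Drop impls of the allocated values are never invoked.
    }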


You can change the global allocator in any Rust project. You can write your own easily enough, or use one like jemalloc.


Sure; but when you’re using arenas and things like that, you usually want different objects allocated into different pools (or with different lifetime properties). Rust only lets you pick one allocator for the entire process, so you can’t specify “all the children of this data structure go in arena A, and this other allocation goes into a traditional heap”.

It’s more awkward, but I much prefer Zig’s approach here where everything that allocates takes an allocator as a parameter. Usually the allocator is specified at compile time - in which case zig can generate identical code to the rust compiler. But when you want the flexibility, it’s there.

Aside from compilers, this is heavily used in video games where there’s often a lot of objects that get allocated per frame, and can be discarded all together. And in that case rust’s lifetime tracking would be a huge asset. The dovecot email server (C) also makes superb use of a mix of memory containers for different tasks. Given how messy email parsing is, dovecot is an absolute pleasure to read.


Not in an abstract way, but you can do this. The compiler has (had? It’s been a while) two different major arenas, in my understanding. There’s just no non-global allocator trait yet, so it’s not easy to abstract over.


>everything that allocates takes the allocator as a parameter

This is also the C++ approach


As someone that is interested in the topic and in Zig's "provide your own allocator" approach I have a question: would it be possible to make an allocator wrapper that moves values to be deallocated to a different thread?

As far as I know it would require both Rust's borrowing semantics and Zig's architectural choice.

From purely my own fan-boy perspective, the Zig approach is something that I would have really liked for Rust to adopt (I have no idea which one came first chronologically).


It is not an allocator wrapper, but https://docs.rs/defer-drop/1.0.1/defer_drop/index.html
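That crate wraps the idea up nicely; the underlying mechanism is roughly a long-lived thread that receives values over a channel and lets them drop there. A hand-rolled sketch of that idea (not the defer-drop API):

    use std::any::Any;
    use std::sync::mpsc::{channel, Sender};
    use std::thread;

    // Hypothetical helper: anything sent down this channel has its
    // destructor run on the background thread, not on the sender's.
    fn spawn_dropper() -> Sender<Box<dyn Any + Send>> {
        let (tx, rx) = channel::<Box<dyn Any + Send>>();
        thread::spawn(move || {
            for garbage in rx {
                drop(garbage); // the expensive part happens here
            }
        });
        tx
    }

    fn main() {
        let dropper = spawn_dropper();

        let huge: Vec<Vec<String>> = (0..1_000_000)
            .map(|i| vec![i.to_string()])
            .collect();

        // Ship the value off instead of paying for a million frees here.
        dropper.send(Box::new(huge)).unwrap();
    }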


I don't get this. Usually I either don't care about how stuff is allocated, or when I care I want to be specific about what allocator should be used where. In the second situation just setting a global allocator wouldn't cut it most of the time, I would want support for customization when I instantiate a data structure.


I think with LD_PRELOAD it is always possible to override the allocator?


That wouldn’t help. The point is to use local allocators which have extra information about the data structures and usage of the memory.

This is routinely done in medium to large C++ programs for different reasons (performance, debuggability...).


Only if it is dynamically linked


In a toy raytracer I once wrote in C++, switching from malloc to custom memory pools for small fixed-size objects was a big performance boost. Making free() a noop was another big performance boost, both for deallocation and allocation. Turns out sequentially handing out memory from a big chunk of memory is much easier than keeping track of and reusing empty slots, and it keeps sequentially allocated objects in sequential memory locations.


jemalloc is the answer to the question you did not ask


In Rust you could just call `mem::forget` on whatever heavy thing you're no longer using before it would get dropped, but then the programmer is effectively responsible for that memory leak not becoming a problematic leak during refactors.

Edit: this will also break any code that relies on Drop being called for clean up, but that is already a "suspect"/incorrect pattern because there are no assurances that it will ever run.


> There are no assurances that it will ever run.

Yes and no. Whenever control leaves a code block, Rust automatically calls the drop() method of all values still owned by that block. There is no guarantee that control will exit every block (cf. Turing), but a moderately exceptional circumstance needs to occur for this not to happen, like an infinite loop.


There's also no guarantee the object won't be moved elsewhere, including into a context which never gets freed (for example, you can construct mem::forget using entirely safe code by forming a cycle of reference-counted boxes). That said, Rust has support for such a concept through the Pin trait (which essentially guarantees an object will not get moved until it is 'Unpinned').


Isn't that a problem for the RAII approach?


Yeah, I like Apple’s (Next’s) approach of pool allocation for each run through the event loop. Defer dealloc, drop pool at the end.


Apple's pools are for helping manage reference counts of returned objects (via autorelease) but aren't doing pool allocation (as any of those objects could escape, so you can't do the fast thing of just deallocating the pool). The normal memory allocator is used while in the scope of an autorelease pool.


True, I was thinking the same after I posted. Maybe I’m thinking more of Apache’s approach and how it gets some of the benefits of a fork-per-request server without the overhead.


right, so if you do

    @autoreleasepool {
        [[[SomeObject alloc] init] autorelease];
    } // SomeObject is released after exiting this scope

    // inside main runloop, not inside @autoreleasepool {}
    [[[AnotherObject alloc] init] autorelease];
    // AnotherObject will be released at the end of the main runloop

    [[NotAutoreleased alloc] init]; // will leak past the runloop iteration

is that understanding correct?


Unless your destructors do more than deallocation, in which case you will leak whatever other resource you're managing.


A pool can invoke destructors when it is cleared. Might take a bit of overhead (if the pool is to support arbitrary classes), but you could retain the fast pointer-bump allocation.


Not Rust, C++, but in case anybody wants an example: https://github.com/eclipse/omr/commit/fd99ca42fdbed76cc00e68...

TR::Region is the slab allocator used by the JIT in OpenJ9/OMR. The linked commit adds functionality for calling destructors of arbitrary types allocated in the Region.


Interesting, thanks. Seems to require use of a common base class though, i.e. to use a class with this pool, you have to inherit from Destructable, or else create a subclass that does. Seems ugly. Perhaps that's the only way it can be done in C++ though.


It does create an object with that base class for the allocator but returns an object of the type you want from the allocator, using a template (Region::Instance<T>). So the API user only sees the types it's interested in.


Even just keeping a free list and deallocating its elements at an idle time is probably cheaper and faster than spawning a thread.


You can easily use a custom global allocator in Rust:

    #[global_allocator]
    static GLOBAL: MyAllocator = MyAllocator;
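A minimal self-contained version of that, just forwarding to the system allocator as a stand-in for whatever pooling logic you actually want:

    use std::alloc::{GlobalAlloc, Layout, System};

    struct MyAllocator;

    // Forward everything to the system allocator; a real custom allocator
    // would manage its own pools in these two methods.
    unsafe impl GlobalAlloc for MyAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            unsafe { System.alloc(layout) }
        }
        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            unsafe { System.dealloc(ptr, layout) }
        }
    }

    #[global_allocator]
    static GLOBAL: MyAllocator = MyAllocator;

    fn main() {
        // Every heap allocation in the program now goes through MyAllocator.
        let v = vec![1, 2, 3];
        println!("{:?}", v);
    }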


In C++/WinRT the same approach is taken, because you cannot just use a memory pool for COM.


Isn’t a Rust “move” implemented as a bitwise copy (i.e. a memcpy call)? I see people claiming a move has no cost but I’m not sure that is true.


EDIT: See kevincox's reply. Rust will bitwise copy the containing type, which is typically very cheap. For example, if you move a String, it will copy the String struct, which contains a couple of pointers and a length (or something along those lines). Importantly, it will not copy the underlying char array.

I was thinking of the following code, where I believe the assignment to y is actually free. Though apparently this isn't called a "move".

    let x = <<large owned type like [char; 1000]>>;
    let y = x;
More info: https://doc.rust-lang.org/rust-by-example/scope/move.html


Semantically it is a bitwise copy.

In the example from the article it is probably actually a copy because the value was originally on the parent thread's stack, which will be reused after the function returns, so the value will need to be copied to the new thread's stack.

However, it is important to note that it isn't a deep/recursive bitwise copy. It just needs to copy the HashMap itself (which is probably a handful of words).

So yes, it is doing a bitwise copy, but this is also very cheap. It will be much, much cheaper than spawning the thread.


That's not exactly true. In C++ terms, the example code is moving a value:

  std::map<...> foo;
  someFunction(std::move(foo));
And not moving a pointer like:

  std::unique_ptr<std::map<...>> foo = ...;
  someFunction(std::move(foo));
So it copies sizeof(std::map<...>), not a pointer.


Not necessarily, std::map's move constructor is free to move just the internal pointers and bookkeeping data; only dumb implementations would move the memory blocks where the map data is actually stored, as per ISO C++ requirements.


The comment originally said it was the cost of a pointer copy, and I was pointing out it's not a pointer. E.g., sizeof(std::map<std::string,std::string>) is 48, which is the "pointers and bookkeeping" you mention.


Ah, sorry for the misinterpretation. Should have paid more attention.


What is bit-wise copied is the pointer to the memory.

I.e. a `HashMap` struct, or `Vec` struct don't directly contain the data.

For example the `Vec` is defined internally as something similar to:

`struct Vec<T> { data: *mut [T], capacity: usize, len: usize, marker: PhantomData<T> }`

(Slightly simplified, not actual Vec type).

So a move of a Vec copies at most 3 usizes (24 bytes on 64-bit systems); similar things apply for a HashMap.

Additionally the copy can often be elided through compiler optimizations.

As an interesting side note, a new empty Vec/HashMap will not actually allocate any memory; only once elements get added will it start doing so. This is why the example creates vecs of vecs of length 1; otherwise it wouldn't need to do "number of elements" free calls.


So the Vec type is not storing data on the stack, like C++'s std::vector can?


Exactly, neither Rust's Vec nor Rust's String (which is actually a Vec in disguise) store data within the struct itself, it's always stored in an external allocation. This is intentional, to avoid extra branches on whether the data is inline or not. If you have a lot of small vectors, and using less memory is more important for your use case than avoiding the extra branches, there are alternatives like https://crates.io/crates/smallvec that you can use.
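A quick sketch of the smallvec alternative, for anyone who hasn't seen it:

    use smallvec::{smallvec, SmallVec};

    fn main() {
        // Up to 4 elements are stored inline, with no heap allocation at all.
        let mut v: SmallVec<[u32; 4]> = smallvec![1, 2, 3];
        v.push(4); // still inline
        v.push(5); // now it spills to the heap, like a normal Vec
        println!("{:?}, spilled: {}", v, v.spilled());
    }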


Semantically, it is a bit wise copy, yes.

However, these copies can often be elided by optimizations.


> As an aside, compilers have used the trick of not free-ing data structures before

In Clang, this flag is `-Xclang -disable-free`. Not from a Jedi...


I don't know if Rust can do it (unsafe?) but in C and C++, I sometimes end up writing a custom allocator. It is often one of the most significant optimizations.

For example, I had the “million strings” problem once, literally millions. The solution was to put every string into a single large buffer and the pointers in another buffer. Not only could I deallocate everything at once, but I also saved a bit of RAM by not aligning (alignment is not needed for strings).
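You can express the same layout in safe Rust without touching the allocator: one big buffer plus an (offset, len) index, so teardown is just two frees. A sketch (the names are made up):

    /// Hypothetical string pool: all the bytes live in one String,
    /// and lookups go through a small (offset, len) index.
    struct StringPool {
        data: String,
        index: Vec<(usize, usize)>,
    }

    impl StringPool {
        fn new() -> Self {
            StringPool { data: String::new(), index: Vec::new() }
        }

        /// Appends a string and returns its id.
        fn push(&mut self, s: &str) -> usize {
            let offset = self.data.len();
            self.data.push_str(s);
            self.index.push((offset, s.len()));
            self.index.len() - 1
        }

        fn get(&self, id: usize) -> &str {
            let (offset, len) = self.index[id];
            &self.data[offset..offset + len]
        }
    }

    fn main() {
        let mut pool = StringPool::new();
        let id = pool.push("one of a million strings");
        assert_eq!(pool.get(id), "one of a million strings");
        // Dropping `pool` is two deallocations, not millions.
    }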


> That very thing can easily happen in the real world.

Only if badly designed. That is why it is contrived!

> While that may seem contrived, consider a vector of 1 million strings, something that's not too uncommon

A program dealing with a million elements of any kind should not be performing naive allocations to begin with.

> we do deallocation trickery in the real world

Skipping deallocations is an optimization, not a design pattern.

In other words, the code needs to keep the ability to perform the deallocation for debugging, testing, usage as a library, etc.


This is the standard problem with tracing data structures to free them. You frequently run into it with systems based on malloc/free or reference counting. The underlying problem is that freeing the structure takes time proportional to the number of pointers in the structure it has to chase.

Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant.

There's actually a lot to be said for Rust in that the ownership system lets you transfer freeing responsibility off-thread safely and cheaply so it doesn't block the critical path.

But overall, there's nothing really unexpected here, if you're familiar with memory management.


One of my favorite papers by Bacon et al expands on this intuition that garbage collection and reference counting are opposite tradeoffs in many respects, and gives a formal theory for it. My views on gc/rc haven't been the same since.

http://researcher.watson.ibm.com/researcher/files/us-bacon/B...


That's a great paper, but one important thing to point out is that some production-grade tracing GCs are on the sophisticated end of that paper, while almost all reference counting GCs are on the rather primitive end. Given the same amount of effort, it's easier to get a reasonable result with reference-counting, but there are industrial-strength tracing GCs out there that have had a lot of effort put into them.


"Generational/compacting GC has the opposite problem. Garbage collection takes time proportional to the live set, and the amount of memory collected is unimportant."

Takes time proportional to the live set times the number of GC runs that happen while the objects are alive. In other words, the longer the objects live, the more GC runs have to scan that object (assuming there is enough activity to trigger the GC), and the worse GC looks.


This is most decidedly not true for generational GCs, and for concurrent GCs, the tracing work happens asynchronously and in parallel, on other cores, not taking time on the main thread.


You still have to scan the oldest generation at some point; otherwise (assuming some activity) you'll be leaking memory there over time.

Moving the work to another core doesn't really change my statement.

Aside: I wonder if it's feasible to switch to reference counting for the oldest generation? That would move the problem back to deallocation. I'm not sure if that's a good trade-off, but it would be interesting.


It's still true for generational GC, but the costs are more complicated. Even objects in the oldest generation are periodically scanned in most implementations (though there are some, like Clozure CL's GC, that never scan the old generation automatically).

But generational GCs do improve massively on this problem.


> the ownership system lets you transfer freeing responsibility off-thread safely and cheaply in order to not have it block the critical path

This can also trivially be done in other languages. Atomically append your pointer to a queue of "large things that need to be freed" and move on as though you had actually called free.

Within a particularly time sensitive loop you can even opt to place pointers into a preallocated array locally. Then once per loop iteration swap that array with the thread handling the deallocations for you. It eats up a bit of CPU time but can significantly reduce latency.


OP said safely; what you're describing isn't safe in, say, C++ in the same sense that it is in Rust.


Isn't that essentially a tautology? Manual memory management and threading in such languages lacks safety guarantees to begin with.


A lot of C++ code depends on deallocation order for correctness. Like a destructor may want to say bye-bye to a pointed-to-object, and if you reverse order of deallocation, that pointer may be dangling.

Consider this code

    {
        Window a;
        ClickHandler* b = new ClickHandler(&a);
        delete b;
    }
Let's say b tries to deregister itself when it's deleted. This code will work as written. But if you defer the deletion of b, then stack allocated Window a may already be gone.


I never meant to imply that you could use such tricks without considering the lifetimes of references. Manual memory management already requires being mindful of those though so it's nothing unique to the described situation. Yes Rust offers an advantage here, but it's the same one it offers everywhere else so it doesn't seem relevant to me.

That being said, it seems like most cases where such an approach is worthwhile involve large nested data structures that involve lots of pointer chasing to traverse. All cases I've encountered so far were bulk stores that didn't involve actively interacting with external objects.


Replying to myself: I guess you could run the destructors in order, but keep the memory around a bit longer.


I've not worked with any language thus far without automatic garbage collection, so this was definitely a neat read for me. It sounds rather elegant.


It's worth popping the hood and getting your fingers dirty. C was written in an era where memory was a scarce and precious resource to be grudgingly used only if absolutely necessary.


Please note that C is quite a few years older than the first GC languages. LISP1.5, ALGOL-68 and APL all had garbage collectors before C even existed.

Not to say that it's not worth it to learn manual memory management as well, but it is important not to think that GCs are a fancy modern tool, and that greybeards would never touch one. There were greybeards using punch cards and programming in a GC language, with output printed out on paper.


> This is the standard problem with tracing data structures to free them. You frequently run into it with systems based on malloc/free or reference counting. The underlying problem is that freeing the structure takes time proportional to the number of pointers in the structure it has to chase.

That doesn't seem to make intuitive sense. A GC has the same problem.

A garbage collector has to traverse the data structure in a similar way to determine whether it (and its embedded keys and values) is part of the live set or not, and to invoke finalizers. You're beginning your comparison after the mark step, which isn't a fair assessment since what Rust is doing is akin to both the mark and sweep phases.

The only way to drop an extensively nested structure like this any faster than traversing it would be an arena allocator, and forgetting about the entire arena.

The difference between a GC and this kind of memory management is that the GC does the traversal later, at some point, non-deterministically. Rust allows you to decide between deallocating it in place, immediately, or deferring it to a different thread.


I said generational/compacting collector. You're talking about a mark and sweep collector.

A generational/compacting collector traverses pointers from the live roots, and copies everything it finds to the start of its memory space, and then declares the rest unused. If there is 1GB of unused memory, it's irrelevant. Only the things that can be reached are even examined.

As I said, this has the opposite problem. When the live set becomes huge, this can drag performance. When the live set is small, it doesn't matter how much garbage it produces, performance is fast.


How are finalizers invoked if the structure isn't traversed? Would it just be optimized away if none of the objects have finalizers? Hence my suggestion about arena allocators being a better point of comparison.


Java is an example of a language with a generational copy collector by default. Most objects in Java don't have a finalizer, since after all the main point of, for example, destructors in C++ is to make sure you don't leak memory, which the GC solves. But when the `finalize` method is used, it causes significant overhead.

> Objects with finalizers (those that have a non-trivial finalize() method) have significant overhead compared to objects without finalizers, and should be used sparingly. Finalizeable objects are both slower to allocate and slower to collect. At allocation time, the JVM must register any finalizeable objects with the garbage collector, and (at least in the HotSpot JVM implementation) finalizeable objects must follow a slower allocation path than most other objects. Similarly, finalizeable objects are slower to collect, too. It takes at least two garbage collection cycles (in the best case) before a finalizeable object can be reclaimed, and the garbage collector has to do extra work to invoke the finalizer. [1]

Sure, you're technically correct that if the objects all had finalizers that did the same thing as C++ destructors, it would be equivalent, but because of the existence of a GC we don't have to do any work for most objects. A GC is equivalent to an arena allocator in this sense.

Another point is the C++/Rust pattern of each object recursively freeing the objects it owns presumably leads to slower deallocation, because in the general case it involves pointer following and non-local access.

[1] https://www.ibm.com/developerworks/java/library/j-jtp01274/i...


Which is why finalizers in Java have been officially deprecated [1] and might be removed altogether in a future release.

[1]: https://docs.oracle.com/en/java/javase/14/docs/api/java.base...


Unfortunately, the alternative they recommend (Cleaner) only exists since Java 9, so it cannot be used by libraries that wish to remain compatible with Java 8 (which, in my experience, is most or all of them; so far, I haven't seen any Java library which has dropped compatibility with Java 8, and many only recently dropped compatibility with Java 6 or 7).


The actual alternative is PhantomReference, which exists since ancient times. Cleaner simply provides safe way to use PhantomReferences, which are rather tricky to use (by Java standards).


Destructors in C++ aren't just for making sure you don't leak memory. They are used for many lifetime-controlled things such as: 1. general resource cleanup (file handle, database connection, etc.) using RAII (Resource Acquisition Is Initialization); 2. tracing function entry/exit.


Apparently, that doesn't work in Rust: https://news.ycombinator.com/item?id=23363647


It does work in Rust; you just cannot rely on Drop for memory safety. If you mem::forget a struct that holds onto some other resource, then all that means is that you're committing that resource to the lifetime of the process. We usually call that a leak, but it can be intentional.


By that argument, it doesn't work in C++ either because std::unique_ptr::release() exists.


In the JVM the equivalent to RAII is implemented with try-with-resource/Autocloseable instead.


In my experience, try-with-resources is inferior to RAII, since it's very easy to forget (that is, it's very easy to do "Foo x = bar();" instead of "try (Foo x = bar()) { ... }"), leading to resource leaks. More than once I have added a finalizer to an AutoCloseable class just to warn loudly if close() hasn't been called; unfortunately, there's no way that I know of to suppress the finalizer once close() is called, so the GC still has to do all the work to call the finalizer even when the class was used correctly.


> after all the main point of, for example, destructors in C++ is to make sure you don't leak memory, which the GC solves

No, C++ destructors are used for finalizers, not memory management.

Memory deallocation is a particular use case for finalizers which is avoided when performance is a concern.

> Another point is the C++/Rust pattern of each object recursively freeing the objects it owns presumably leads to slower deallocation, because in the general case it involves pointer following and non-local access.

No, a program that does pointer chasing and has to deallocate many small allocations is badly designed. If you are going to do that, using a GC language would be much better.


> No, C++ destructors are used for finalizers, not memory management.

In the original comment I may have overstated this. I was ignoring the other uses of destructors because the context of the discussion was memory management. But memory management is a huge portion of what destructors do in C++. Consider a vector of strings (`vector<string>`). The destructor deallocates the memory for all the strings, then deallocates the memory for the vector.

> No, a program that does pointer chasing and has to deallocate many small allocations is badly designed. If you are going to do that, using a GC language would be much better.

How do you deallocate all of the nodes in a binary tree? As far as I can tell, you either have to pointer chase or use a custom allocation strategy. At some point with the second option, you're basically creating an ad-hoc garbage collector.

But I think we may be in vigorous agreement here, since my comment was about the general tradeoff between garbage collection and tracing data structures. I was trying to defend the original assertion that with tracing, "freeing the structure takes time proportional to the number of pointers in the structure it has to chase" while "garbage collection [or at least copy collection] takes time proportional to the live set".


> Consider a vector of strings (`vector<string>`). The destructor deallocates the memory for all the strings, then deallocates the memory for the vector.

A vector (or binary tree as you mentioned later) of heap allocated elements in a performance critical path is a likely candidate to be redesigned if possible.

> As far as I can tell, you either have to pointer chase or use a custom allocation strategy. At some point with the second option, you're basically creating an ad-hoc garbage collector.

A custom, ad-hoc strategy is the way to go if you need them dynamically allocated and performance at the same time, yeah.

I wouldn't call that a GC, though. A GC usually refers to a global solution.

> But I think we may be in vigorous agreement here

Sounds like it!


> No, a program that does pointer chasing and has to deallocate many small allocations is badly designed. If you are going to do that, using a GC language would be much better.

It is debatable whether using it is good or bad design, but at least the C++ std lib does offer data structures which require exactly this kind of deallocation: std::list and std::forward_list. And, given the cache characteristics of array vs linked list implementations, I would guess that most uses of std::list occur for huge lists that get written to often, as that seems to be the only case where the big-O advantage of list insert/delete would actually materialize into any performance benefit over a std::vector.


Maintain a collection of objects with finalizers associated with the generation; when an object is moved out of the generation, move the finalizer record with it. Then, just before the generation is discarded, run the finalizers.

If the finalizers do something stupid like resurrect the object, have the runtime system notify someone with the authority to go beat the programmer with a stick.


Interestingly, for the Rust GC I'm working on, both Drop and finalizers will be supported.

Do you happen to offer beat the programmer with a stick as a service?


Unfortunately, no, but you are going to need it.

I was fooling with a cycle-collected reference counter for Rust for a while, but eventually gave up because of the difficulty in handling finalizers.


Many languages allow a class to avoid declaring a finalizer, for just this reason.

The JVM is notable in this regard. And thus, most classes that compile to Java bytecode offer no-finalizer semantics.


My understanding is that finalizers are special cased for a generational GC, have a tendency to introduce significant overhead, and (as with any GC scheme) generally run at unpredictable times. My impression is that RAII idioms are strongly discouraged in conjunction with most GC ecosystems.


Actually, RAII idioms are generally encouraged, but finalizers are not. For the rare few classes that hold resources in GC languages, the Resource is usually Acquired at Initialization as well, and released in a special method (IDisposable.Dispose() in C#, AutoCloseable.close() in Java). There is also usually some special syntax for automatically calling this method when the object exits some scope, though that scope usually has to be declared by the user (using(), try-with-resources, with() in Python etc.).

The biggest difference I know of is that the 'holds-resources' property does not propagate automatically like it does in C++. It's not that hard to always remember to call using(file = new File()) [...]. However, it's much easier to forget that you have a File field in your class which you initialize in the constructor, and so your class must itself be declared IDisposable/AutoCloseable etc.


This is a really good point. I had completely overlooked various deterministic resource cleanup idioms as being equivalent to RAII in all but name.


Even mark-and-sweep collectors only mark starting from GC roots, they don't mark through unreachable objects.


> A garbage collector has to traverse the data structure in a similar way to determine whether it (and it's embedded keys and values) are part of the live set or not

Yes, but in practice tracing in a tracing GC is done concurrently and with the help of GC barriers that don't require synchronization and so are generally cheaper than the common mechanisms for reference-counting GC.

> and to invoke finalizers

As others have said, finalizers are very uncommon and, in fact, have been deprecated in Java.


No, they are actually fundamentally wrong. GCs never scan garbage - they only scan objects that are referenced from a GC root.

Note that the problem appears in a different place: if your large structure is actually not garbage, then every GC pass will have to scan it to see what other objects it is keeping alive.


Not every GC pass. Some modern GCs (like OpenJDK's default GC, G1) track mutation and maintain "remembered sets." They only need to scan objects that have changed since the last scan (they're actually even more efficient than that, but this is a simplification). In addition, when this scanning is required, it is performed concurrently to the application, so the technique of moving the work to a separate thread is done automatically.


> The only way to drop an extensively nested structure like this any faster than traversing it would be an arena allocator, and forgetting about the entire arena.

Isn't that incompatible with RAII though?


You can handle this in Rust pretty neatly with lifetimes. There's a bunch of crates that do this. [1]

[1] https://crates.io/crates/typed-arena


That's a neat library but as far as I can tell it doesn't avoid any traversal or cleanup code. It appears to delay the cleanup so it all happens at once. That's certainly useful, but if you have RAII the traversal still has to happen at some point.


It avoids it if your type has no-op drop implementation. So if you use typed-arena for objects which don't own resources, they all get dropped in one massive deallocation and don't have to be traversed.

EDIT: and then I noticed that you mentioned RAII... Right, if the object own some sort of resources that doesn't apply.


No worries. And to clarify, in context the point is that traversal fundamentally can't be avoided in the case of RAII. This defeats (what I see as) the primary use case of an arena allocator - deallocating an arbitrarily large chunk of contiguous memory in O(1) time regardless of object count.

Of course this is all somewhat tangential to the original topic of generational GCs, where the RAII idiom also has significant negative impacts. The performance characteristics would otherwise be O(n) based on the live set and thus similar to an arena allocator in terms of the ability to dispose of an arbitrarily large number of objects efficiently.


I'm not sure it's fair to say it defeats that use case. It only defeats that use case if you need RAII. Furthermore, in my experience, an arena is most useful when you allocate lots of small objects, which is the case least likely to need RAII.


I think we're violently agreeing?

RAII is the only use case I was speaking of there. Way back up thread the original post I responded to was talking about arena allocators in comparison to a generational GC and finalizers. The point is that a generational GC doesn't handle finalizers well, but neither does an arena allocator!


Yes. We are violently agreeing


Yes. Some libraries, like typed-arena (and Rust's built-in Vec), will traverse the structure and call the drop implementations, and others, like bumpalo, will just leak the resources.


> That doesn't seem to make intuitive sense. A GC has the same problem.

> A garbage collector has to traverse the data structure in a similar way to determine whether it (and it's embedded keys and values) are part of the live set or not, and to invoke finalizers.

All garbage collectors start from live objects and only scan those. Then, whatever objects they have not scanned get collected. In copying collectors (like most generational ones), this means that garbage is never touched.

In the mark-and-sweep algorithms, the mark phase still never touches the unreachable objects. However, the sweep phase does need to return those objects to the free list, so it will have to walk them. It will still not do it the same way as malloc/free, as it can walk the heap in order and free unmarked objects as it encounters them, no need to follow pointers, so it may still have better cache performance.

Finalizers introduce extra difficulty, but still the behavior is fundamentally different. What usually happens is that objects which have finalizers are usually remembered in a special list which acts as a GC root itself. When they are only reachable from that list, they get marked so that the finalize will run (usually on a special Finalizer thread). When the Finalizer is finished, and assuming the object was not resurrected, they get removed from the Finalizer list, and now they are not reachable from anywhere at all, so the next GC will finally clean them up. Usually, there is also some API for user code to mark a Finalizable object as 'finalized', which essentially removed it from the Finalizer list early and allows it to be collected as normal, without going through the above process.

And yes, having a large number of finalizable objects in your memory is usually considered a very bad idea. Generally, they are only recommended as a fail-safe measure: you are supposed to do explicit cleaning, but as a fail-safe, to avoid your program crashing in production if a connection or file leak was missed, you also have the Finalizer to throw buckets of water out of your boat (but you should really notice that it is happening and plug that leak, rather then relying on the bucketeer).


Finalizers/destructors do not work well in garbage collected languages, for that very reason.


Usually in a background thread ;)


Indeed. The difference is this is happening deterministically, in place (with an optional deferral). A garbage collector has all sorts of different trade-offs.


The title is slightly wrong: it's not going to make your code faster, it's going to reduce latency on the given thread.

It may be a net win if this is the UI thread of a desktop app, but overall it will come at a performance cost, because modern allocators have thread-local memory pools and now you're moving away from them. And if you're running your code on a NUMA system (most servers nowadays), when moving from one thread to another you can end up freeing non-local memory instead of local memory. Also, you won't have any backpressure on your allocations, and you are susceptible to running out of memory (especially because your deallocations now occur more slowly than they should).

Main takeaway: if you use it blindly it's an anti-pattern, but it can be a good idea in its niche: the UI thread of a GUI.


Yes it’ll reduce latency, but doesn’t it also increase parallelism? A single-threaded program ought to improve overall, unless the extra overhead you mentioned dominates. A parallel program might improve or not.

I think if you wanted to do deferred destruction right, ideally you’d mod an allocator to have functions like (alloc_local, alloc_global, free_now, free_deferred) to avoid exhausting memory. Traits could make this ergonomic.

Also I admit I don’t understand why “you won’t have any backpressure on your allocations,” shouldn’t deferred destruction give you more backpressure if anything? I am probably confused.


> Also I admit I don’t understand why “you won’t have any backpressure on your allocations,” shouldn’t deferred destruction give you more backpressure if anything? I am probably confused.

I think the point is that, if the same thread is doing both allocation and de-allocation, the thread is naturally prevented from allocating too much by the work it must do to de-allocate. If you move the de-allocation to another thread, your first thread may now be allocating like crazy, and the de-allocation thread may not be able to keep up.

In a real GC system, this is not that much of a problem, as the allocator and de-allocator can work with each other (if the allocator can't allocate any more memory, it will generally pause until the de-allocator can provide more memory before failing). But in this naive implementation, the allocator thread can exhaust all available memory and fail, even though there are a lot of objects waiting in the de-allocation queue.


Ah, I see what you mean, thanks!


Yes, that's a great summary!


this could make your code faster by providing more consistent control flow: the main thread is always doing Work and your gc threads are always cleaning up dead objects. this provides fewer, better-predicted branches and code that's more likely to stay in the icache.

most gc based environments use dedicated threads for gc and finalizers, this is one reason to do so

edit: to be more specific:

your normal flow is to alloc at the top of your function, and at the bottom you dealloc. so in basically every case you are paying the cost of deallocs, but if the alloc is conditional the dealloc is now also conditional which is more branches to predict. the dealloc is probably also handled by functions so you have jumps/calls eating up branch prediction table space

in the gc/offloaded dealloc scenario, your deallocs on the work thread are no longer conditional because you're just handing addresses off to the gc. if your gc is STW you've added 'if (stop_requested) stop()' branches throughout your workload, but those are effectively 0-cost because stop_requested is always false (when it's true, the cost of the mispredict has no significance because your thread is about to suspend). the gc thread is always doing the same thing or waiting, and again when it's about to wait a branch mispredict cost has no significance.


Just be careful, because moving heavy things to be dropped to another thread can change the semantics of the program. For instance, consider what happens if within that heavy thing you had a BufWriter: unless its buffer is empty, dropping it writes the buffer, so now your file is being written and closed in a random moment in the future, instead of being guaranteed to have been sent to the kernel and closed when the function returns.

And it can even be worse if it's holding a limited resource, like a file descriptor or a database connection. That is, I wouldn't recommend using this trick unless you're sure that the only thing the "heavy thing" is holding is memory (and even then, keep in mind that memory can also be a limited resource).
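
To make the BufWriter point concrete, here's a minimal sketch (the file name and function are made up) of doing the flush explicitly, so the final write and any error it produces happen right here rather than inside a deferred drop at some unpredictable later time:

    use std::fs::File;
    use std::io::{BufWriter, Write};

    fn write_report(lines: &[String]) -> std::io::Result<()> {
        let mut out = BufWriter::new(File::create("report.txt")?);
        for line in lines {
            writeln!(out, "{}", line)?;
        }
        // Flush explicitly instead of relying on Drop for the final write.
        out.flush()?;
        Ok(())
    }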


I only know a very little rust, but since it's generally a good practice to never defer writing (or other side effects) to an ambiguous future point in time - with memory allocations as the only plausible exception - is there any way in rust to make sure one doesn't accidentally move complex objects with drop side-effects into other threads?

Granted, the way the type system works you usually know the type of a variable quite well, but could this happen with opaque types?

I'm very much out of my depth, but it felt like one of those things that could really bite you if you are unaware, as happened with finalizers in Java decades ago.


> I only know a very little rust, but since it's generally a good practice to never defer writing (or other side effects) to an ambiguous future point in time - with memory allocations as the only plausible exception - is there any way in rust to make sure one doesn't accidentally move complex objects with drop side-effects into other threads?

If you're the one creating the structure, you could opt it out of Send, that'd make it… not sendable. So it wouldn't be able to cross thread-boundaries. For instance Rc is !Send, you simply can not send it across a thread-boundary (because it's a non-threadsafe reference-counting handle).

If you don't control the type, then you'd have to wrap it (newtype pattern) or remember to manually mem::drop it. The latter would obviously have no safety whatsoever, the former you might be able to lint for I guess, though even that is limited or complicated (because of type inference the problematic type might never get explicitly mentioned).
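
For illustration, a minimal sketch of the wrapper idea (names are made up): embedding a PhantomData of a !Send type such as a raw pointer makes the wrapper itself !Send on stable Rust, so it can't be handed to a drop-offloading thread by accident:

    use std::marker::PhantomData;

    struct MainThreadOnly<T> {
        inner: T,
        // Raw pointers are !Send, so this zero-sized marker makes the whole
        // wrapper !Send without changing the runtime representation.
        _not_send: PhantomData<*const ()>,
    }

    impl<T> MainThreadOnly<T> {
        fn new(inner: T) -> Self {
            MainThreadOnly { inner, _not_send: PhantomData }
        }
        fn get(&self) -> &T {
            &self.inner
        }
    }

    fn main() {
        let v = MainThreadOnly::new(vec![1u8, 2, 3]);
        println!("{}", v.get().len());
        // This line would fail to compile, because MainThreadOnly<_> is not Send:
        // std::thread::spawn(move || drop(v));
    }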


Considering that writing files can also block the process you probably don't want to have that in your latency-sensitive parts either, so you'll have to optimize that one way or another anyway.

For the more general problem you can also dedicate more threads to the task or apply backpressure.


A while ago I stumbled over a proposal to move a shared pointer (this was C++ code) to a thread in order to trigger the freeing of a legacy data structure there (the multi-thousand delete calls caused the watchdog of the main thread to fail). However, keeping the shared pointer reference in the main thread for too long meant the "clean-up" thread could run while the main thread still held the shared pointer, resulting in a chance of the "clean-up" thread doing nothing and the main thread still locking up. People here got taught to use shared pointers to prevent memory management errors, but they can really cause a lot of unexpected non-determinism when used blindly.


shared_ptr all the things? If so, they may as well write in Java.


Even ignoring the shared_ptr abuse, at least for allocating and freeing many small objects Java would most probably be (a lot?) faster than non-optimized allocations and frees in native binaries. But in my case it was legacy code running on an embedded device.


It seems like the caller should ensure that the buffer is written before giving away ownership. Also, what happens if there is an error writing during finalization/destruction/etc? Seems like you'd want to find out about such errors earlier if at all possible.


I've seen variations on this trick multiple times. Using threads, using a message sent to self, using a list and a timer to do the work "later", using a list and waiting for idle time...

They all have one thing in common: papering over a bad design.

In the particular example given, the sub-vectors probably come from a common source. One could keep a big buffer (a single allocation) and an array of internal pointers. For an example of such a design holding a large array of text strings, see this blog entry and its associated github repo:

    https://www.spiria.com/en/blog/desktop-software/optimizing-shared-data/
    https://github.com/pierrebai/FastTextContainer
Roughly it is this:

    struct TextHolder
    {
        const char* common_buffer;
        std::vector<const char*> internal_pointers;
    };

This is of course addressing the example, but the underlying message is generally applicable: change your flawed design, don't hide your flaws.


Yes. There are also a number of pool/arena allocators in Rust which could be used here instead, to drop all entries at once.
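
For example, a rough sketch using the third-party typed_arena crate (the exact API and version are from memory, so treat it as illustrative): all entries live in a few large chunks, so dropping the arena is a handful of frees rather than one per entry:

    use typed_arena::Arena; // Cargo.toml: typed-arena = "2" (version illustrative)

    struct Record {
        id: u64,
        score: f64,
    }

    fn main() {
        let arena: Arena<Record> = Arena::new();
        let mut refs = Vec::with_capacity(1_000_000);
        for i in 0..1_000_000u64 {
            // Each value is placed inside one of the arena's large chunks.
            refs.push(arena.alloc(Record { id: i, score: 0.0 }));
        }
        println!("allocated {} records", refs.len());
        // Dropping `arena` here releases a few big chunks instead of doing a
        // million individual frees (Record has no Drop impl of its own).
    }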


Why would you ever write a get_size function that drops the object you call it on? Surely in an actual, non-contrived usecase spawning another thread and letting the drop occur there would just be plain worse?


I think the contrived use case is just for illustrative purposes? If I'm understanding correctly, the combination of cleanup code and deallocation can sometimes consume enough time that it's worth dispatching it on another thread. That's hardly specific to Rust though.

As you note that will certainly add some overhead, although that could be minimized by not spawning a fresh thread each time. It could easily reduce latency for a thread the UI is waiting on in many cases.


It would be helpful to see an example from a real application, too.


A very large Vec<String> (say a few million non-empty strings) would do I'd guess, Rust would drop the Vec which would recursively drop each String.
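
If you want to see that shape for yourself, a quick sketch (timings will vary with allocator and machine):

    use std::time::Instant;

    fn main() {
        // A couple of million heap-allocated strings...
        let v: Vec<String> = (0..2_000_000).map(|i| i.to_string()).collect();

        // ...and the recursive drop: one free per String plus one for the
        // Vec's own buffer.
        let t = Instant::now();
        drop(v);
        println!("dropped in {:?}", t.elapsed());
    }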


I start thinking about c++ 'placement new' and COBOL copy record structures.

And acquire/release objects from resource pools that manage themselves. And message queue handlers. And pass-by-reference, and object pools..


I believe this is contrived to prove a point.

And this isn't just a help in these contrived examples. I believe process cleanup (an extreme case of cleaning up objects) is one of cases where garbage collection performs better because it doesn't have to unwind the stack, call cleanup functions that are not in the cache, and make a lot of `free` calls to the allocator.

I vaguely remember reading about Google killing processes rather than having them clean up correctly, relying on the OS to properly clean up any resources of significance.

Now this doesn't mean you should do this in all cases. Profile first, see if you can avoid the large objects, and then look into deferred de-allocations ... if the timing of resource cleanup meets your application's guarantees.


Killing a process without freeing all allocations is, as far as I can tell, routine in C. Especially for memory it makes no sense "freeing" allocations; the whole memory space is getting scrapped anyway. Of course, once you add RAII the compiler can't reason about which destructors it can skip on program exit, and if programmers are negligent of this you get programs that are slow to close.


> Killing a process without freeing all allocations is, as far as I can tell, routine in C.

Many times by accident :)

> if programmers are negligent of this you get programs that are slow to close.

I wouldn't call that negligence, just not fully optimized.


exit(2) will only call destructors of static objects. Quick_exit not even those.


> killing processes rather than having them clean up correctly, relying on the OS

I recall Firefox preventing cleanup code from running when you quit a few years ago. Prior to that, quitting with a lot of pages open (ie hundreds) could cause it to lock up for quite some time.


Not at all, Herb Sutter has a CppCon talk about this kind of optimisation.

It is also the approach taken by C++/WinRT, COM and UWP components get moved into a background cleaning thread, to avoid application pauses on complex data structures reaching zero count.


I took this to be a contrived example to illustrate the point. I could imagine a process that creates a big data structure (e.g. parse an xml file), pulls some data out, and then drops the data structure. If you want to use that data sooner, you can push the cleanup off your thread.


It’s a contrived example to demonstrate the technique.


Right, there is no reason to pass ownership to a function like this.


For those wanting a real world example where this can be useful:

I am writing a static site generator. When run in "watch" mode, it deletes everything and starts over (I'd like to reduce these with partial updates but can't always do it). Moving that cleanup to a thread would make "watch" more responsive.


That's not really the same issue that is mentioned in the article though, is it?

The issue from the article would be solved by just passing a reference to the variable.

In your case, cleanup is an action that needs to be done before writing new files. So you have to wait for cleanup anyway, don't you ?


That's not true.

Typically any server with a watch functionality will have a mutable reference to the data that's being watched. When you change that data out you're both changing the mutable reference, and also deallocating any memory that was previously used. One could separate these two steps, moving the watched data to another variable that's dropped in another thread, if you wanted.
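
A minimal sketch of that separation, with a made-up SiteData type standing in for whatever the watcher holds: swap the fresh data into place, then let another thread pay for tearing down the old generation:

    use std::mem;
    use std::thread;

    struct SiteData {
        pages: Vec<String>,
    }

    fn swap_in(current: &mut SiteData, fresh: SiteData) {
        // Replace the watched data in place...
        let old = mem::replace(current, fresh);
        // ...and deallocate the previous generation off the watch thread.
        thread::spawn(move || drop(old));
    }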


Why can't it cleanup right after the work?


Or no cleanup at all. A CLI command that runs for a very short time can allocate memory to perform its job, print the result and exit. Then the OS releases all the memory of the process. No idea if Rust can work like this.


"Watch mode" for static site gens would mean you leave the process running and let it rebuild the site whenever a file changes, probably 10s to 100s of times in a typical run


std::mem::forget, which doesn't run destructors:

https://doc.rust-lang.org/std/mem/fn.forget.html


There is nothing unique to Rust about this; it is a very old technique. It is usually much inferior to the "arena allocator" method, where all the discarded allocations are coalesced and released in a single, cheap operation that could as well be done without another thread. That method is practical in many languages, Rust possibly included. C++ supports it in the Standard Library, for all the standard containers.

If important work must be done in the destructors, it is still better to farm the work out to a thread pool, rather than starting another thread. Again, C++ supports this in its Standard Library, as I think Rust does too.

One could suggest that the only reason to present the idea in Rust is the cynical one that Rust articles get free upvotes on HN.


> C++ supports it in the Standard Library, for all the standard containers.

I don't know what the situation is today, but in the past, the GCC standard library containers had non-trivial destructors when running in debug mode. Ensuring their proper invocation was required to avoid dangling pointers in their book keeping. Non-obvious and painful to debug.


Speedup numbers should be given when optimizing constant factors -- e.g. "I made this operation 5X faster using SIMD" or "By employing readahead, I sped up this file copy by 10X".

The points raised in this article are really different:

* don't do slow stuff in your latency-critical path

* threads are a nice way to unload slow stuff that you don't need done right away (especially if you have spare cores)

* dropping can be slow

The first and second points are good, but not really related to rust, deallocations, or the number 10000.

The last point is worth discussing, but still not really related to the number 10000 and barely related to rust. Rust encourages an eager deallocation strategy (kind of like C), whereas many other languages would use a more deferred strategy (like many GCs).

It seems like deferred (e.g. GC) would be better here, because after the main object is dropped, the GC doesn't bother to traverse all of the tiny allocations because they are all dead (unreachable by the root), and it just discards them. But that's not the full story either.

It's not terribly common to build up zillions of allocations and then immediately free them. What's more common is to keep the structure (and its zillions of allocations) around for a while, perhaps making small random modifications, and then eventually freeing them all at once. If using a GC, while the large structure is alive, the GC needs to scan all of those objects, causing a pause each time, which is not great. The eager strategy is also not great: it only needs to traverse the structure once (at deallocation time), but it needs to individually deallocate.

The answer here is to recognize that all of the objects in the structure will be deallocated together. Use a separate region/arena/heap for the entire structure, and wipe out that region/arena/heap when the structure gets dropped. You don't need to traverse anything while the structure is alive, or when it gets dropped.

In rust, probably the most common way to approximate this is by using slices into a larger buffer rather than separate allocations. I wish there was a little better way of doing this, though. It would be awesome if you could make new heaps specific to an object (like a hash table), then allocate the keys/values on that heap. When you drop the structure, the memory disappears without traversal.
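
A minimal sketch of the "slices into one buffer" idea (names made up): one big String plus (offset, len) pairs, so dropping the whole table is two deallocations no matter how many entries it holds:

    struct StringTable {
        buffer: String,
        spans: Vec<(usize, usize)>, // (offset, len) into `buffer`
    }

    impl StringTable {
        fn new() -> Self {
            StringTable { buffer: String::new(), spans: Vec::new() }
        }

        fn push(&mut self, s: &str) {
            let offset = self.buffer.len();
            self.buffer.push_str(s);
            self.spans.push((offset, s.len()));
        }

        fn get(&self, i: usize) -> &str {
            let (offset, len) = self.spans[i];
            &self.buffer[offset..offset + len]
        }
    }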


If I seriously wanted to move object destruction off-thread, I would use at least a dedicated thread with a channel, so I could make sure the dropper is done at some point (before the program terminates, at the latest). It also avoids starting and stopping threads constantly.

Something like this: https://play.rust-lang.org/?version=stable&mode=debug&editio...

You could have an even more advanced version spawning tasks into something like rayon's thread pool, I assume.
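
For the curious, a rough approximation of what such a helper might look like (my names, and the linked playground surely differs in details): a worker thread fed by a channel, joined when the handle is dropped so no deferred drops outlive main:

    use std::any::Any;
    use std::sync::mpsc;
    use std::thread;

    // Owns a worker thread; values handed to `defer` are dropped over there.
    struct Dropper {
        tx: Option<mpsc::Sender<Box<dyn Any + Send>>>,
        worker: Option<thread::JoinHandle<()>>,
    }

    impl Dropper {
        fn new() -> Self {
            let (tx, rx) = mpsc::channel::<Box<dyn Any + Send>>();
            let worker = thread::spawn(move || {
                // Each received box is dropped here, running its destructor.
                for _garbage in rx {}
            });
            Dropper { tx: Some(tx), worker: Some(worker) }
        }

        fn defer<T: Send + 'static>(&self, value: T) {
            // If the worker is somehow gone, the value just drops on this thread.
            let _ = self.tx.as_ref().unwrap().send(Box::new(value));
        }
    }

    impl Drop for Dropper {
        fn drop(&mut self) {
            drop(self.tx.take()); // close the channel so the worker's loop ends
            let _ = self.worker.take().unwrap().join(); // wait for pending drops
        }
    }

    fn main() {
        let dropper = Dropper::new();
        let heavy: Vec<Vec<u64>> = vec![vec![0; 1_000]; 10_000];
        dropper.defer(heavy); // returns immediately; cleanup happens on the worker
        // `dropper` is dropped at the end of main, which joins the worker first.
    }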


Someone is working on this as a direct response to this blog:

https://www.reddit.com/r/rust/comments/go4xcp/new_crate_defe...

And yes, spawning a thread for every drop is horrible. It's just to prove the concept. The defer_drop crate uses a global worker thread.


I'm not very familiar with Rust, but I don't understand why you wouldn't just use a reference-to-HeavyThing as the function argument, so that the object isn't moved and then dropped in the `get_size` function?


If you never drop it, you have a memory leak. If the caller drops it, it's still the same as the `get_size` dropping it in terms of performance impact.

Generally you'd only pass ownership when that's needed for some reason. So this toy example might not be realistic but it does demonstrate the performance impact.


For these contrived cases, yes, you would just pass a reference to the function but I think the point is to simplify the case down to demonstrate a point.


In the olden days, it was just out.flush(); out.close();


So the caller of the function still needs to free HeavyThing in the same thread.


You’re spot on: this is simply a bad example that you would never see in a real application.


One thing I just noticed is that the example doesn't make sure to actually run the new thread to completion before the main thread exits.

This means that if you do a "drop in other thread" and then main exits, the drop might never run. Which is often fine, as the exit of main causes process termination and as such will free the memory normally anyway.

But it would be a problem on some systems where memory cleanup on process exit is less reliable. Though such systems are more rare by now, I think.


It would have to be a non-desktop system.

I'm pretty sure Linux will always free process-private memory, and threads, and file descriptors when a process exits.

The only things that can leak in typical cases are some kinds of shared memory and maybe child processes?


It'd be interesting to implement this on a type that would defer all of these drop threads (or one big drop thread built off a bunch of futures) until the end of some major action, like sending the http response on an actix-web thread. Could be a great way to get the fastest possible response time, since then the client has their response before any delay on cleanup.


There is no such thing as a free lunch here, so it would reduce the unloaded response time but should have no effect (or a negative impact) on a highly loaded server's response time. I'm finding this out when benchmarking a message passing/queue management system. Anything I do to defer work onto a separate threadpool improves latency up to a point, then reduces throughput.


If you're bottlenecked, then certainly. There's no free lunch, but for us, problems that can be solved by simply scaling up the resources on the host are relatively cheap compared to expensive developer time. When we're purely focused on sales and not anywhere close to hitting a full mem/cpu bottleneck, this would be good.

This situation you describe sounds a lot like dealing with garbage-collection cycles, so you give a good recommendation on something to watch out for, as rust performing at the level of a GC'd language removes a big reason for choosing rust.


Looks like Evan Wallace ran into the same issue in practice in esbuild

https://news.ycombinator.com/item?id=22336284

I actually originally wrote esbuild in Rust and Go, and Go was the clear winner.

The parser written in Go was both faster to compile and faster to execute than the parser in Rust. The Go version compiled something like 100x faster than Rust and ran at something around 10% faster (I forget the exact numbers, sorry). Based on a profile, it looked like the Go version was faster because GC happened on another thread while Rust had to run destructors on the same thread.

ESBuild is a really impressive performance-oriented project:

https://github.com/evanw/esbuild

The Rust version also had other problems. Many places in my code had switch statements that branched over all AST nodes and in Rust that compiles to code which uses stack space proportional to the total stack space used by all branches instead of just the maximum stack space used by any one branch: https://github.com/rust-lang/rust/issues/34283

(copy of lobste.rs comment)


I used to do this sometimes with C++ when I realized that clearing out a vector with lots of objects was slow. Is Rust basically based on unique_ptr? One problem with this approach was that you still had to wait for these threads when the application would shut down.


> Is Rust basically based on unique_ptr?

Rust is based on ownership and statically checked move semantics (the default, though types can opt out via Copy). So each item has a single owner (which is why Rust deals very badly with graphs, and more generally any situation where ownership is unclear) and the compiler will prevent you from using a moved object (unlike C++).

Separately it has a smart pointer which is the counterpart of unique_ptr (Box), with the guarantee noted above:

    let b = Box::new(1);
    drop(b);
    println!("{}", b);
will not compile because the second line moves the box, after which it can't be used because it's been removed entirely from this scope.


> which is why Rust deals very badly with graphs, and more generally any situation where ownership is unclear

To be fair, so do 90+% of programmers. Much of rust's benefit in safe code is training programmers to avoid code like that where possible, and spreading design patterns that avoid it.


Rust basically gives the compiler understanding of unique_ptr and prevents you from using it after you’ve moved it.


Would you have to keep track of these threads in Rust? I have done a lot of desktop development where you have to be aware of what happens during shutdown. Seems a lot of server guys write their code under the assumption that it will never shut down.


You would need to add `thread.join()` at the end of main, or have some RAII guard that does it for you.

In practice that's probably optional, because the heap and all resources are usually torn down with the process anyway. Important things, like saving data or committing transactions, shouldn't be done in destructors.


The answer to this question is the same for any language without a heavy runtime. You can choose to join the worker thread, detach it, or kill it.


> One problem with this approach was that you still had to wait for these threads when the application would shut down.

If you know that an object will live for the rest of the program and not need any finalization logic, Rust allows you to "leak" it and save that overhead on shutdown.
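
A tiny sketch of that: Box::leak hands back a 'static reference and never frees the allocation, and std::mem::forget similarly skips the destructor entirely:

    fn main() {
        let cache: Vec<String> = vec!["expensive".to_string(); 1_000];

        // Intentionally leaked: no destructor runs, and the data stays
        // usable for the rest of the program.
        let leaked: &'static mut Vec<String> = Box::leak(Box::new(cache));
        println!("{} entries, never freed", leaked.len());
    }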


You could have just one thread and kill it at exit. Do not start new threads for each closure that drops an object; send closures into one special thread instead.


Out of curiosity, how did you do that in C++?


It depends. Either iterate over the vector and delete the objects or just call clear(). Obviously you have to be sure that nobody else is accessing it at the same time.


There's a worse case in deallocation. Tracing through a data structure being released for a long-running program can cause page faults, unused data having been swapped out. This is part of why some programs take far too long to exit.


Completely different dynamic (because no Rust GC), but this reminds me of how Twitch made their server, written in Go, a lot faster by allocating a bunch of dummy memory at the beginning so the garbage collector doesn't trigger nearly as often:

https://news.ycombinator.com/item?id=21670110


The Java equivalent to the Go case would simply be adjusting the -Xms flag. The Go approach is needlessly convoluted because the runtime doesn't offer any tuning knobs.

As for the rust case, if you squint then it's similar to a concurrent collector.


Contrived examples like this are ridiculous. Creating such a heavy thing is likely even more expensive than tearing it down. So unless you create it on a separate thread, you probably shouldn't be freeing it on a separate one. It's not going to solve your interactivity problem. If you are creating the object on a separate thread then it's already going to be natural to free it on a separate one too.


Something is better than nothing.


Maybe some sort of “collector” could come by a clean up “garbage” memory periodically to improve performance…


I wonder if it might be possible for OS's to provide a fast, asynchronous way of deallocating memory.


Does anyone know how this would work in Go?



I have no idea, but my guess is that it doesn't matter, as the deallocation is being done by the garbage collection.


Minor typo, `froget` instead of `forget`


Is this applicable for Go as well ?


The obvious solution would be to borrow the HeavyThing instead of having it dropped inside the function.


I guess choosing when or how a program deallocates is important in a language that's close to the metal.

Rust tries to be zero cost while providing abstractions that make it seemingly a high level language but ultimately things like this show that it's not exactly zero cost because abstractions can incur hidden penalties. There needs to be some internal syntax that allows a rust user to explicitly control deallocation when needed.

If I started reading code where people would randomly move a value into another thread and essentially do nothing I would be extremely confused. Any language that begins to rely on "trick" or "hacks" as standard patterns exposes a design flaw.

Maybe if rust provided special syntax that a function can be decorated with so that it does deallocation in another thread automatically? Or maybe an internal function called drop_async...? This would make this pattern an explicit part of the language rather than a strange hack/trick.


Pass by reference?


If you pass by reference the heavy object won't be dropped. If your goal is to drop a heavy object, this is a cool way to do it.


If freeing the data structure in question takes this long, how much time are you wasting duplicating the data structure?


This code doesn't duplicate it. In Rust, when a variable is sent as an argument to a function, its "ownership" moves to the scope of that function.

https://doc.rust-lang.org/book/ch04-01-what-is-ownership.htm...


You're missing my point. Unless the only thing you want to do with your giant data structure is measure its size, you're not going to be passing ownership of your only copy of it into the get_size function. You're going to be passing in a copy -- hence the cost of duplicating everything.


Rust is stricter about aliasing than C++ is.

Vectors are the size of 3 pointers (data, size, capacity), so I guess 24 bytes on x64.

Even if the move requires a memcpy, it's only copying that 24 bytes - The heap allocation is not copied, because there are never two owners of the vector at once.


It is just an example. You can think of "measuring size" here as getting the result of a long computation that involves a lot of allocations. After you get the result, you no longer care about the intermediate stuff – i.e. all the allocations. You certainly don't want to duplicate them, you just want to get rid of them, and the author tells that you might not want to deallocate (drop) them in the UI thread.

If it helps you, you might want to imagine the contents of the get_size function as being the end part of a longer calc_foo function. What's really missing the point is focusing so hard on the part that the example even contains a call to size() of a collection.


In the different contrived case where it gets copied, you'd instead change this code to take an immutable reference to it, and compute the size of that. Or you'd call .size() instead of calling this function!


I’m actually very curious why it takes this long; is Rust memseting the buffer when dropping it?

Edit: it seems like turning on optimizations seems to improve the situation quite a bit. Not sure why they were profiling the debug build.


> I’m actually very curious why it takes this long; is Rust memseting the buffer when dropping it?

Regardless of memset and optimizations, consider a particularly complicated object which lives on the heap and contains hundreds of other nested objects (which themselves contain nested objects, etc). Now imagine that a significant fraction of them make use of RAII. That cleanup code can't be elided.

That being said, it's a pretty bad example if they were actually profiling the debug build ...


> it seems like turning on optimizations seems to improve the situation quite a bit.

I'm not seeing that on my local machine? Were you comparing on the Playground which would be quite variable in its results?

    > cargo build
       Compiling foo v0.1.0 (/private/tmp/foo)
        Finished dev [unoptimized + debuginfo] target(s) in 0.42s
    > ./target/debug/foo
    drop in another thread 52.121µs
    drop in this thread 514.687233ms
    >
    >
    > cargo build --release
       Compiling foo v0.1.0 (/private/tmp/foo)
        Finished release [optimized] target(s) in 0.47s
    > ./target/release/foo
    drop in another thread 48.418µs
    drop in this thread 548.005373ms


I saw an increase of about 2x on my computer, though I didn't take too much effort to control for noise.


> Edit: it seems like turning on optimizations seems to improve the situation quite a bit. Not sure why they were profiling the debug build.

This is the most important point in the thread, since it invalidates the results for most purposes.


Not completely, it's still 2-3 orders of magnitude slower.


You're right, I expected it would make a bigger difference


The thing is, it's not slow because Rust is doing anything wrong or unoptimized; it's slow because cleaning up insane amounts of memory allocations is slow.

Also if you run this:

    fn main() {
        ::std::thread::spawn(move || { println!("end") });
        println!("Hello, world!");
    }

You might notice that "end" might not be printed, because the main thread exits before it prints and terminates the process. This means that the dropping might actually not happen if it's at the end of the program, and nothing is faster than not doing the work.

Also, it's a not uncommon pattern in small user-facing CLIs to leak (memory) resources, as they (should) be cleaned up with process termination.


hmm my first thought is, having to do that is a lot like C and cleaning up my own allocations. This feels like something rust should automatically do for me?


Rust will automatically clean up data that has left scope, but you can also accomplish this manually with the "drop" function, which is only necessary if you want to clean up explicitly, such as in a different thread.

Interestingly, the drop function is actually user-creatable. It's just an empty function with a very permissive non-reference argument. The semantics of ownership in Rust make that sufficient to trigger memory cleanup.
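
For reference, the standard library's drop really is just an empty function that takes its argument by value; moving ownership into the call is what triggers the cleanup when it returns:

    pub fn drop<T>(_x: T) {}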


In C, if you forget to clean up, you have a memory leak which is hard to track down. In Rust, if you don't do this, you're not sacrificing memory leaks, only performance. A profiler can tell you when you should drop asynchronously.


>A profiler can tell you when you should drop asynchronously

Is there any profiler that does this today?

What are the drawbacks with asynchronous drops?


> Is there any profiler that does this today?

It requires interpretation, but yes.

> What are the drawbacks with asynchronous drops?

You pay some overhead for enqueuing work for later. Taken too far, lock contention / false sharing can make performance way worse.

The allocators and destructors must be thread safe if offloading to a worker thread.

The core running your thread is likely to have some of this data in L1/L2/L3 cache, which might not be true for whatever core would dequeue the work for asynchronous dropping.

It can be harder to attribute the costs of dropping to the right code if it all gets mixed up into a single work queue cleaned up by opaque worker threads when profiling.

If you don't use some kind of backpressure mechanism, allocations can potentially outpace deallocations and run you out of memory.
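
To make that last point concrete, a minimal sketch (not from the parent) using a bounded std channel, so the allocating thread stalls instead of racing ahead of cleanup and exhausting memory:

    use std::any::Any;
    use std::sync::mpsc;
    use std::thread;

    fn main() {
        // At most 1024 pending drops; beyond that, send() blocks the producer.
        let (tx, rx) = mpsc::sync_channel::<Box<dyn Any + Send>>(1024);
        let worker = thread::spawn(move || for _garbage in rx {});

        for i in 0..10_000u64 {
            let heavy: Vec<Vec<u64>> = vec![vec![i; 64]; 64];
            tx.send(Box::new(heavy)).unwrap(); // backpressure happens here
        }

        drop(tx);               // close the channel so the worker's loop ends
        worker.join().unwrap(); // make sure all deferred drops actually ran
    }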

----

So, concrete example: Using Telemetry - a flamegraph style realtime profiler requiring invasive annotations - I was able to track down the cause of a framerate hitch in a game I was working on, to the sudden release of several graphics resources in a game. During events which would significantly restyle the look of some of the terrain, we'd eat several 10s/100s of milliseconds of overhead freeing things - more than enough to cause us to miss vsync. Would've stuck out like a sore thumb in any profiler capable of giving you a rough idea of the stack(s) involved in a 100ms timeframe that you can correlate to a vsync miss / missed frames.

D3D9 isn't thread safe (although freeing resources might've been?), but I didn't need to offload the work onto another thread just to amortize the cost over a few frames. Instead, a simple work queue did the trick. Problem solved! New problem: level transitions took significantly longer when doing mass frees of the same resources - more than doubling the cost of deallocation IIRC, for reasons I never did fully understand. Cache thrashing of some sort? We were still maxing out the core running the main thread with mostly cleanup logic...

Final code we shipped with used a hybrid solution that would choose between synchronous (high-throughput) and asynchronous (non-stalling) cleanup logic depending on what was happening in-game. Worked like a charm. Of course, this logic was hideously project specific and unable to be automatically chosen correctly for you by the programming language...



As I understand it, rust is automatically cleaning up, and that can cause glitchy timing. The clever hack is that rust lets you shunt that cleanup process off to another thread when you're the sole owner of that object. You can do the same thing in C, but unlike rust, the cost of cleanup is not hidden by the syntax.


On further reflection... I'm curious about how allocators would handle this -- if you return from this context only to make another heavyweight object, it seems like you'd be trading glitchy timing with allocator contention.


Because it's impossible to do this automatically in the general case.

In particular, types may not be sendable to other threads, or may have side effects on dropping, and in those cases you would need to rearchitect the code before you can apply this technique.

Also this technique adds overhead, so it should never be used (including not doing it conditionally) if you don't care about latency or if the objects are always small, and the compiler cannot know whether that is the case.


Oh my good god.

I'm hoping this is down to developer naivety rather than being a feature of rust.


It's not a feature of Rust; it's a "feature" of the way we design operating systems and processors. This is the same in C.


The same could happen in C++, I think. Destructors are supposed to be called recursively.


1) he should pass by reference to avoid the extra copy. So in his example yes it’s dev naivety

2) but somewhere somehow this object will deallocate, so his trick of putting it on another thread would work if the deallocation takes a while. Same for C++ if you have a massive object in a unique_ptr. So it's not a Rust issue


Where's the extra copy? I don't see one. He's moving the struct into the function, getting size and then dropping it.


> avoid the extra copy

there is no copy happening here


In other words, Rust's automagical memory deallocation is NOT a zero-cost abstraction:

  fn get_len1(things: HeavyThings) -> usize {
      things.len()
  }

  fn get_len2(things: HeavyThings) -> usize {
      let len = things.len();
      thread::spawn(move || drop(things));
      len
  }
The OP shows an example in which a function like get_len2 is 10000x faster than a function like get_len1 for a hashmap with 1M keys.

See also this comment by chowells: https://news.ycombinator.com/item?id=23362925


No, the zero-cost refers to the abstraction (and its runtime cost), which still is zero-cost. Deallocating is part of the normal workload, not the abstraction.

Also this isn't Rust specific. Most (all?) RAII languages are affected, and many GC approaches have this effect too. Some do add additional abstraction to magically always or sometimes put the deallocation on another thread.

But deallocating in another thread is not generally good or bad. There are a lot of use-cases where doing so is rather bad or can't be done (e.g. when TLS is involved). Rust and other similar RAII languages at least let you decide what you want to do.

Now it's (I think) generally known that certain kinds (not all) of GC do make some things simpler for GUI-like usage, though they also tend to offer less control.

Note that it's a common pattern for small user-facing CLI tools (which are not GC'ed) to leak resources instead of cleaning them up properly. You can do so in Rust too if you want, but it's a potential problem for longer-running applications.

Also, here is a `get_len` that is faster than both and more idiomatic Rust than both:

    fn get_len1(things: &HeavyThings) -> usize {
        things.len()
    }

If you have a certain thread (e.g. UI thread) in which you never want to do any cleanup work you can consider using a container like:

    struct DropElsewhere<T: Send + 'static>(pub Option<T>);

    impl<T: Send + 'static> Drop for DropElsewhere<T> {
        fn drop(&mut self) {
            if let Some(value) = self.0.take() {
                std::thread::spawn(move || drop(value));
            }
        }
    }

You can optimize this with `ManuallyDrop` to have close to zero runtime overhead (it removes the `take` and `if let` part).
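
A rough sketch of that variant (my interpretation of the comment above): no Option and no branch on the normal path, at the cost of a small unsafe block in Drop:

    use std::mem::ManuallyDrop;
    use std::thread;

    struct DropElsewhere<T: Send + 'static>(ManuallyDrop<T>);

    impl<T: Send + 'static> DropElsewhere<T> {
        fn new(value: T) -> Self {
            DropElsewhere(ManuallyDrop::new(value))
        }
    }

    impl<T: Send + 'static> Drop for DropElsewhere<T> {
        fn drop(&mut self) {
            // Safety: self.0 is never touched again after this take.
            let value = unsafe { ManuallyDrop::take(&mut self.0) };
            thread::spawn(move || drop(value));
        }
    }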


> No the zero-cost refers to the abstraction (and runtime cost), which still is zero-cost. Deallocating is part of the normal work load not the abstraction.

Yeah, you're right. In hindsight my comment was poorly thought-out and poorly written.


Nothing about how Rust handles deallocation is magical in any way.

It's also definitely a zero-cost abstraction as far as I can see, because the manual solution that's equivalent to get_len1() would be to essentially call free() on things. That would ultimately suffer from the same problem.


Yeah, you're right. In hindsight this was a poorly thought-out and poorly written post on my part.


Why should I ever need to drop a heavy object just to get a size? Not in C++ and also not in Rust; the different-thread idea is just creative stupidity, sorry.


It seems that this would be a great reason to not pass the entire heavy object through your function, and to instead pass it as a reference. When passing an object (rather than a reference to an object) there's a lot more work going on both in function setup, and in object dropping. I'm not a rust guru, so I don't know the precise wording, but it's simple enough to realize that if this function, as claimed, must drop all the sub-objects within the `HeavyObject` type, then those objects must have been copied from the original object.

If you instead define the function to take in a reference (by adding just two `&` characters into your program), the single-threaded case is now almost 100x faster than the multithreaded case.

Here's a link to a Rust Playground with just those two characters changed: https://play.rust-lang.org/?version=stable&mode=debug&editio...

Note that the code that drops the data in a separate thread is not timing the amount of time your CPU is spinning, dropping the data. So while this does decrease the latency of the original thread, the best solution is to avoid copying and then freeing large, complex objects as much as possible. While it is of course necessary to do this sometimes, this particular example is just not one of them. :)

As an aside, I'm somewhat surprised that the Rust compiler isn't inlining and eliminating all the copying and dropping; this would seem to be a classic case where compiler analysis should be able to determine that `a.size()` should be computable without copying `a`, and it should be able to eliminate the function call cost as well. Manually doing this gives the exact same timing as my gist above, so I assume that this is happening when passing a reference, but not happening when passing the object itself.


As already mentioned, Rust wasn't copying anything; the `HashMap` is not a `Copy`-able type, so it was just moved around (it's also not very large: all its items are behind a pointer to the heap).

All you did was move the drop from the `fn_that_drops_heavy_things` to the end of `main`, where it is outside the timing function.


> but it's simple enough to realize that if this function, as claimed, must drop all the sub-objects within the `HeavyObject` type, then those objects must have been copied from the original object.

Untrue. Rust uses move semantics (or shallow copies for types that implement the Copy trait, via e.g. memcpy - no, you can't customize this!). Deep copies require explicitly calling methods like ".clone()". So HashMap's pointers and sizes do get memcpy'd... 56 bytes on the 64-bit playground currently.

This is similar to how std::move(...)ing a std::unordered_map in C++ nulls out the old object and just copies the pointers of the container - not a deep copy of the subobjects - which in similar C++ code would turn the main thread's destructor into a noop.

The main difference from C++ is: Rust handles this at the language level instead, and doesn't call the dtor at all on the main thread at all if the value was moved. No need for manual movement logic - it is the default, for everything. Also unlike C++, it also prohibits you from using the old moved-from object at compile time, preventing bugs.


Rust isn't copying anything; everything in the original code would be a move.


If your function takes a reference to the object, something still needs to free it.



