
Reference counting is usually very expensive, because even reading a variable updates the reference count, and ending a scope involves testing the reference count of every variable defined inside the scope and conditionally deallocating the referent.
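To make that concrete, here's a minimal C sketch of what a naive refcounting compiler has to arrange for; the obj type and the rc_retain/rc_release helpers are hypothetical, not any particular runtime's API:

    #include <stdlib.h>

    /* Hypothetical refcounted object; `value` stands in for the payload. */
    typedef struct obj { long rc; int value; } obj;

    static void rc_retain(obj *o) { o->rc++; }

    static void rc_release(obj *o) {
        if (--o->rc == 0)   /* decrement and test... */
            free(o);        /* ...conditionally deallocate the referent */
    }

    void example(obj *a) {
        obj *b = a;         /* merely reading a into b means... */
        rc_retain(b);       /* ...the count has to be bumped */
        /* ... use b ... */
        rc_release(b);      /* scope end: decrement, test, maybe free */
    }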

Without reference counting, here's the end of a hairy function scope that deallocates 15 local variables and restores two callee-saved registers:

        11a6:       48 83 c4 78             add    $0x78,%rsp
        11aa:       5b                      pop    %rbx
        11ab:       41 5e                   pop    %r14
        11ad:       c3                      retq
Now imagine looping over 15 local variables to decrement each reference count, test it, and do a conditional jump based on the result; if that's open-coded, your epilogue is humongous, and if it's in a reference-count-decrementing subroutine, it's going to have mispredicted branches all over the place, costing you maybe 15 cycles each, a total of about 100 cycles for this function. We're talking about adding an order of magnitude of cost to a subroutine call, or more. (I think this function doesn't really need 15 local variables; that's the compiler's fault.)
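Here's what that reference-count-decrementing subroutine might look like, reusing the hypothetical obj from the sketch above; each --rc test is a data-dependent branch, which is exactly what branch predictors are bad at:

    /* Hypothetical scope-exit helper: one decrement/test/branch per local. */
    static void scope_exit(obj **locals, int n) {
        for (int i = 0; i < n; i++)
            if (--locals[i]->rc == 0)   /* decrement and test... */
                free(locals[i]);        /* ...conditional deallocation */
    }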

This gets worse with multithreading, because writing to the same reference count from different cores would, even in the best case, require one core to steal the cache line from another in order to modify it, which may stall the core; and often an atomic reference-count increment is even more expensive than that, because it involves a memory barrier.
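For scale, here's the thread-safe version of the count with C11 atomics, a minimal sketch; atomic_fetch_add defaults to sequentially-consistent ordering, which on x86 compiles to a lock-prefixed instruction that takes exclusive ownership of the cache line and acts as a full barrier:

    #include <stdatomic.h>

    atomic_long rc;   /* a shared count; placement is hypothetical */

    void retain(void) {
        atomic_fetch_add(&rc, 1);   /* lock add: cache-line steal + barrier */
    }

    int release(void) {
        /* nonzero if this call dropped the count to zero */
        return atomic_fetch_sub(&rc, 1) == 1;
    }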

Reference counting can become reasonably cheap if it's done at large granularity (filesystem files or COM objects, not conses); if you can elide almost all of the reference-count updates, as in Rust; or if your language runtime is just so dog-slow at everything that the extra cost of reference counting isn't that important, like CPython.

30 years ago or more, before generational GC had gone mainstream, reference counting was a more reasonable choice, because GC was going to be very slow in any case, and ref counting at least used less memory—especially important on machines without cache or virtual memory.

(Purely immutable (applicative) languages like Haskell and Miranda are usually lazy, since that's the payoff for completely abjuring side effects. But lazy evaluation is implemented at the machine level by mutation.)
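To make that last point concrete, here's a minimal thunk sketch in C, with hypothetical names; forcing it computes the value once and then overwrites the cell in place, which is the mutation hiding under the laziness:

    /* Hypothetical thunk cell. */
    typedef struct thunk {
        int forced;              /* 0 = still a suspended computation */
        int value;               /* valid only once forced */
        int (*compute)(void);    /* the suspended computation */
    } thunk;

    int force(thunk *t) {
        if (!t->forced) {
            t->value = t->compute();   /* evaluate at most once... */
            t->forced = 1;             /* ...then mutate in place */
        }
        return t->value;
    }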


WRT the multithreading issue, https://lwn.net/SubscriberLink/872869/0e62bba2db51ec7a/ mentions that using atomic memory operations in CPython for Py_INCREF and Py_DECREF slows down the entire CPython interpreter by 60%.


> Reference counting is usually very expensive, because even reading a variable updates the reference count

There are many papers out there on how to elide most ref count operations on locals, and runtimes that use ref counting for automatic memory management typically defer ref count updates in various clever ways (like Nim IIRC). You give up some latency for significant throughput gains.
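The simplest of those elisions is a borrowed-reference convention: a callee that doesn't store its argument never touches the count at all, because the caller's reference keeps the object alive for the whole call. A minimal sketch, with hypothetical names:

    typedef struct obj { long rc; int value; } obj;   /* hypothetical */

    int get_value(const obj *o) {   /* borrows o: no retain/release */
        return o->value;
    }

    void caller(obj *o) {           /* caller owns one reference */
        int v = get_value(o);       /* zero refcount traffic on this path */
        (void)v;
    }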


Aye.


> Reference counting is usually very expensive, because even reading a variable updates the reference count, and ending a scope involves testing the reference count of every variable defined inside the scope and conditionally deallocating the referent.

This is true, but if you can prove things about lifetimes, then you can use that to eliminate RC operations.

Also, GC is expensive for large processes because (1) peak memory is typically higher, (2) you have to scan all of memory for pointers, and (3) that memory may have been swapped out. Of course this can be optimized too.


> This is true, but if you can prove things about lifetimes, then you can use that to eliminate RC operations.

Yes, this is true, and in fact with Rust's lifetime analysis you can get excellent performance with reference counting in many places. There is more information about this in https://news.ycombinator.com/item?id=28879401 and https://news.ycombinator.com/item?id=28884775.

> Also, GC is expensive for large processes because (1) peak memory is typically higher, (2) you have to scan all of memory for pointers, and (3) that memory may have been swapped out. Of course this can be optimized too.

It's true that a tracing GC usually requires significantly more memory than manual allocation or reference counting. Occasionally it doesn't, though: to implement compaction you basically have to implement a tracing GC, so a GC can sometimes give you lower fragmentation.

It's not true that you have to scan all of memory for pointers, for two reasons:

1. You may have a region of memory that contains no pointers, like Erlang's binary storage.

2. Generational GCs only have to look for pointers in the nursery, the stack, and things marked by the write barrier (see the card-marking sketch after this list). If they're GCing a generation that isn't the nursery, at that point they have to scan that generation too, but that's so much less frequent that it doesn't make GC expensive. You say, "Of course this can be optimized too," but actually almost all modern GCs have been generational since the late 01990s. Except...

3. Things like Erlang partition the heap into separate per-process heaps for a large number of lightweight processes. Each of these heaps can be GCed independently because you can't have pointers from one heap to another. "What about unreferenced processes?", you might ask. Well, Erlang doesn't GC processes; you have to explicitly create and destroy them, like a larger-granularity malloc/free, typically using a Rust-like ownership hierarchy to avoid process leaks. This works better than it sounds like it should.
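Here's the card-marking sketch promised in point 2, a hypothetical write barrier rather than any particular collector's: every pointer store marks the 512-byte "card" containing the updated slot, so a nursery collection scans only marked old-generation cards instead of the whole heap.

    #define CARD_SHIFT 9                       /* 512-byte cards */
    static unsigned char card_table[1 << 20];  /* covers a 512 MiB heap */
    static char *heap_base;                    /* set when the heap is created */

    static void write_barrier(void **slot, void *new_value) {
        *slot = new_value;
        card_table[((char *)slot - heap_base) >> CARD_SHIFT] = 1;
    }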


> Generational GCs only have to look for pointers in the nursery, the stack, and things marked by the write barrier. If they're GCing a generation that isn't the nursery, at that point they have to scan that generation too, but that's so much less frequent that it doesn't make GC expensive. You say, "Of course this can be optimized too," but actually almost all modern GCs have been generational since the late 01990s.

I was going to come back and edit this to say "…when you violate an assumption of generational GC", but you somehow managed to instantly reply to my 3-day-late post!

Though I'm not too familiar with how GC runtimes are really implemented; I thought generations were more or less based on allocation age, but I don't know whether the GC actually does analysis to separate allocations that can't reference each other.


I was just lucky! I didn't mean to play gotcha!

(I should write a HN user-agent to watch old threads, though.)

I should preface this by saying I've never actually written a GC, or maintained one someone else wrote, so I might be off-base here; this is just my understanding from reading books and papers:

The awesome thing about generational GC is that (normally) the generations are physically separate in memory, so the GC can avoid even looking at objects in generations older than the one it's emptying out. (Unless, that is, they might contain pointers to objects in that generation, which is what the write barrier detects.)

This combines nicely with the potential killer advantage of copying collectors: they don't need to even look at garbage. So if your nursery (or Garden of Eden, if you prefer) is 8MiB and contains only 128 live objects averaging 64 bytes each, a non-concurrent nursery GC consists of walking the stack and any marked older objects, copying those 8192 bytes into the next generation while adjusting the relevant pointers, then resetting the allocation pointer to the beginning of the nursery. The other, say, 130,944 objects that were allocated in the nursery aren't even looked at by the GC, because nothing points to them. (Unless they have finalizers (destructors).) So the GC examines and copies on average 0.06 bytes per object allocated.
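The allocation side is what makes this pay off: in a copying nursery, allocating is a bounds check plus a pointer bump, and "freeing" all 130,944 dead objects is a single store. A minimal sketch, with hypothetical names:

    #include <stddef.h>

    static char nursery[8 << 20];   /* 8 MiB, as in the numbers above */
    static char *alloc_ptr = nursery;

    void *nursery_alloc(size_t n) {
        if (alloc_ptr + n > nursery + sizeof nursery)
            return NULL;            /* nursery full: time to collect */
        void *p = alloc_ptr;
        alloc_ptr += n;
        return p;
    }

    void reset_after_evacuation(void) {  /* survivors already copied out */
        alloc_ptr = nursery;             /* every dead object vanishes here */
    }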

Initializing the objects in the first place, as you must do whether they're allocated by decrementing a stack pointer or anything else, is far more expensive.

Of course, most programs retain a larger fraction of their nursery objects than that, stack allocation makes better use of memory caches than a multi-megabyte nursery, and if your program keeps retaining nursery objects long enough, it will eventually need to GC one of the other generations. So GC still isn't free even in the throughput sense. But it's competitive with manual dynamic allocation.
