LLVM’s garbage collection facilities and SBCL’s generational GC

pcwalton · on Feb 14, 2018

> I hope I explained why “one size fits all” does not really do it in garbage collection.

Great to see people acknowledge this. Garbage collection is full of tradeoffs; be skeptical of any claims to the contrary.

I also like the way the author emphasizes the importance of inline bump allocation in the nursery. In the fast path, allocation of memory doesn't need to be any more than 5 or 6 instructions. This speed advantage is huge, and GC schemes that throw it away need to have a very good reason for doing so.

maximilianburke · on Feb 14, 2018

> In the fast path, allocation of memory doesn't need to be any more than 5 or 6 instructions. This speed advantage is huge, and GC schemes that throw it away need to have a very good reason for doing so.

I have some experience writing garbage collectors and with the situations I was targeting the handful of instructions quickly fades when multiple threads come into the equation.

1. Using atomics for bumping the pointer had some luck but contended atomics on the platforms I was targeting meant that the low-instruction-count allocation was still slow.

2. Using locks (futex-style) was slow, as expected.

3. The best results I found for my use case (precise garbage collector for a runtime targeting video game consoles) resulted in per-thread nursery-type allocation arenas, selected by TLS, with no locks in the fast-path. This was slower than the ideal single-threaded fast path because of the TLS overhead.

pcwalton · on Feb 14, 2018

Yeah, the canonical solution for multithreaded GC is the third option (TLABs). The TLS overhead is annoying, but on some architectures you can get away with burning a register to save the TLS load. It might well be worth it on AArch64, with its 32 GPRs...

(TLABs are the recommended solution for multithreaded malloc implementations like jemalloc and tcmalloc as well.)

cwzwarich · on Feb 14, 2018

AArch64 has a dedicated register (TPIDR_EL0) for TLS.

pcwalton · on Feb 14, 2018

Didn't know that, thanks!

sitkack · on Feb 15, 2018

https://www.semanticscholar.org/paper/Hierarchical-PLABs%2C-...

Tarean · on Feb 14, 2018

What is the TLS overhead? On x64 you presumably could spare two registers, one for the current heap pointer and one for the end of the heap. How could you implement this more efficiently with a single threaded version?

Not sure if LLVM has an equivalent of global register variables, though.

MHordecki · on Feb 14, 2018

#3 is a thread-local allocation buffer, which is what HotSpot is using. https://shipilev.net/jvm-anatomy-park/4-tlab-allocation/

MaxBarraclough · on Feb 14, 2018

I must be missing something. From the article:

> the allocation rate goes down at least 5x, and time to execute goes up 10x! And this is not even starting to touch what a collector has to do when multiple threads are asking for memory (probably contended atomics)

...

> For Epsilon, the allocation path in GC is a single compare-and-set - because it issues the memory blocks by pointer-bumps itself.

So both the TLABs and the Epsilon GC are doing pointer-bumps, right? Why does TLAB outperform no-TLAB even in the single-thread case? Just because the pointer-bump instructions are inlined?

aidenn0 · on Feb 14, 2018

Aside from the contention overhead, per-thread nurseries also has a fringe benefit for SMP platforms with exclusive L1 caches; short lived data is unlikely to be shared, so having threads allocate from different cache lines prevents thrashing the lines across cores.

CyberDildonics · on Feb 14, 2018

Why would any allocator split a cache line to multiple threads, and why would a program use dynamic memory allocation for something so small it would fit in the L1 cache?

aidenn0 · on Feb 14, 2018

If your dynamic allocator is a pointer-increment (as it is in most copying generational GCs), then if you have a single global nursery, then two threads allocating after each other would get adjacent locations in memory when allocating.

On SBCL, it's pretty normal to dynamically allocate a cons-cell, which is the size of two pointers. The garbage collector pays zero cost for any garbage in the nursery when it is run, so the performance overhead versus stack allocation is approximately zero.

I've seen language implementations where there is no stack at all (it's one of the most straightforward ways to implement scheme). Chicken scheme works this way, for example (it cleverly uses the C stack as the nursery, since functions never return in chicken and since a copying nursery allocation is just a pointer-increment, which is what a C stack allocation is).

chrisseaton · on Feb 14, 2018

What was the TLS overhead in practice? On most architectures it's just indirecting through one register isn't it?

maximilianburke · on Feb 14, 2018

I can't remember the exact numbers, I last worked on the project in 2011.

If I recall though, because our compiler generated DLLs, we weren't able to use the TLS fast path options, we needed instead to use the TLS API (ie: TlsAlloc/TlsGetValue on Win32) which slowed things down a bit.

pjmlp · on Feb 14, 2018

Yep, given that it also applies to manual memory management, it could be hardly otherwise on automatic memory management.

Turbo Pascal on MS-DOS already provided a pluggable API to customize the allocator.

And I remember companies living off selling optimized versions of malloc(), for different kinds of scenarios.

darksaints · on Feb 14, 2018

Honestly I think this is something that will become less and less important as time goes on. The JDK, for example, has 4 separate garbage collectors that can be selected via configuration. But with runtime information (whether via JIT or PGO), couldn't they be better selected automatically for a given runtime profile? I think as compilers continue to build better optimization capabilities, that the gap between GC and manual memory management will shrink to the same significance as the gap between C and assembly today.

eschew · on Feb 14, 2018

At least two of the article's statements about LLVM are false. In particular:

1) LLVM doesn't place any restrictions on how a language runtime allocates memory.

2) LLVM doesn't "expect" a stack map -- it provides infrastructure to compute them if the front end wants to, but the front end is completely free to ignore that infrastructure.

fao_ · on Feb 14, 2018

Are those corrections to the incorrect statements, or the incorrect statements themselves? It's not very clear, I'm sorry.

sinistersnare · on Feb 14, 2018

So I wouldnt say that LLVM places restrictions, but I will say a little of my experience doing LLVM + BoehmGC

BoehmGC uses the stack for its root set (where it starts to find live memory). With LLVM, you dont ever need to explicitly use the stack, you can use registers for everything and make LLVM figure out if it should go on the stack or not. If you want to use Boehm with LLVM, you are forced to explicitly allocate everything on the stack, and not just on a register, so that Boehm is guaranteed to find the memory.

So I wouldnt say restriction, but definitely you need to think about how LLVM operates with the GC and other runtime components of your language.

eschew · on Feb 14, 2018

Ah, they are corrections to the article's incorrect statements! I apologize for the ambiguity.

Groxx · on Feb 14, 2018

Yeah, it was fuzzy to me too. Found this in the article though:

> First, a description of how SBCL’s generational GC works:

> ...

> Being conservative with regards to the stack. That creates a major clash with LLVM which expects a stack map. A stack map declares which data type can be found where on the stack for a function (as in every invocation of that function needs to comply to it), whereas SBCL’s compiler can and does generate code that just places anything anywhere. This makes the stack a lot more compact in some call patterns.

So those were intended as corrections.

twoodfin · on Feb 14, 2018

I love the idea of a “liberal” GC that occasionally throws away bits of memory still in use in the name of raw performance for restartable tasks.

johncolanduoni · on Feb 14, 2018

How do you know when to restart the task? Or that your output isn’t the product of an out of range memory access?

eslaught · on Feb 14, 2018

You unmap the memory when you free it so that it causes a segfault if you access it, and then if a segfault occurs you know something went wrong.

pebers · on Feb 14, 2018

How do you know that you haven't re-allocated the same memory to something else? You won't get a segfault in that case.

stochastic_monk · on Feb 14, 2018

This sounds to me like re-discovering RAII and its failure cases in new garb.

littlestymaar · on Feb 14, 2018

> so that it causes a segfault if you access it

No, this is UB. It can cause a segfault, but it can also allow bad things™ to happen.

barrkel · on Feb 14, 2018

Let's be clear on the difference between C and C++ undefined behaviour, and machine behaviour that GCs and runtimes can use for implementation.

It is not unusual to rely on triggering a hardware exception in runtimes and GCs, up to and including segfaults. For example, a check for safepoint might be implemented by attempting to read a particular address. When the GC needs to stop the world, it could change the protection bits on the page that contains that address. This technique minimizes the amount of code in the safepoint test and doesn't require any branching logic.

See e.g. https://stackoverflow.com/questions/46394575/safepoints-in-j...

Another technique: a runtime can support dynamic growth of the stack by keeping an unmapped or protected page at the limit, and extending it when hit. This is how stacks work on Windows, and it relies on help from codegen: compilers targeting Windows need to generate code that performs a loop touching every page in its stack frame allocation one at a time in allocation order, if the stack frame is larger than the page size.

See e.g. https://geidav.wordpress.com/tag/stack-probing/

lmm · on Feb 14, 2018

In C that would be true. LLVM might be able to provide stronger guarantees, particularly on specific architectures.

eptcyka · on Feb 15, 2018

Life is horrible enough as it is.

masklinn · on Feb 14, 2018

> Another example where you want to keep for-GC bookkeeping overhead in mainline code low is if you can “GC-by-fork”, which is basically throwing away processes as they would need to GC

That's one of the common options in Erlang/Elixir: spawn a worker process for each task with a `min_heap_size` high enough that most workloads would not trigger a collection (the default is fairly low), let it die after it's handled the request. More complex/memory intensive tasks will fall through to normal GC-ing behaviour once they breach the `min_heap_size` limit.

Tarean · on Feb 14, 2018

> it is copying, however other passes that just punch holes into existing cards (to wipe pointers that are not pointed to anymore without moving a lot of memory) have and will be added

I have a basic understanding about card tables and promotion but couldn't find anything about hole punching. Pretty sure I have heard the term before and was just as confused, could someone point me into the right direction for this?

From context I'd guess that it means the gc doesn't copy unless x% of the block is unused?

cracauer · on Feb 15, 2018

I use the "punch holes" phrase in the following situation: - GC is copying/compacting - GC is at least slightly conservative - allocation is fast/inline/increment-only - that leaves you in a situation where you cannot move/compact some part of the heap

You cannot move the possibly (conservatively) pointed to thing because you cannot adjust the pointer to it (because it might be a non-pointer thing such as an integer.

Now you have some GC unit worth of space occupied by one unmovable object, otherwise it's empty space backed by physical pages. What do you do with the rest of the space? In a C/malloc scheme you are aware of such holes and fill them from new allocations. When you have a fast allocation scheme not involving complex code to find holes you will keep these "hole" as long as the conservative-pointer looking thing exists. You do wipe all the other pointerish things in that GC area, though, so that they don't hold down additional space. Still, now you "waste" a whole GC card worth of physical RAM on a single object, the tradeoff being that you do not want to move to an allocation scheme that spends time thinking about fragmentation.

You could use the empty space in those GC cards as a target for the next moving GC, however that has drawbacks as you know continue to have to special-treat such regular objects co-located with possibly conservatively pointed to objects.

If there is a better term than "punching holes" for this I would be interested.

ETA: now that I think about it, you could give back all physical pages in that GC card that do not contain the held down object. This assumes that GC card size is more than one VM page.

cracauer · on Feb 15, 2018

(author here) Just wanted to say that I have seen the comments and will address them when I have a chance. My post turned out to be a lot more popular than I anticipated and I was busy yesterday and today. I wrote most of this in summer 2017, so given the popularity I will also provide a refresh with today's state of LLVM.

Please keep corrections to my post coming, the time to determine and influence GC design restrictions in LLVM is now. Before a popular GCed language comes along and then tramples whatever its current GC happens to be into the status quo.