Dealing with malloc/free is trivial and cheap - just give every allocated object a couple of reference counts.
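A rough sketch of what that could look like, assuming the counts live in a header in front of each allocation. The names (`rc_header`, `rc_malloc`, `rc_retain`, `rc_release`, ...) are hypothetical, and reading "a couple" as one strong and one weak count is my assumption, not something fixed by the above. Single-threaded only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical header placed directly in front of every allocation.
 * The union with max_align_t keeps the user pointer suitably aligned. */
typedef union rc_header {
    struct {
        size_t strong; /* assumed: references that keep the object usable */
        size_t weak;   /* assumed: references that only keep the block around */
    } n;
    max_align_t align_;
} rc_header;

static rc_header *rc_hdr(void *p) { return (rc_header *)p - 1; }

/* malloc replacement: the caller starts out holding one strong reference. */
void *rc_malloc(size_t size) {
    if (size > SIZE_MAX - sizeof(rc_header)) return NULL;
    rc_header *h = malloc(sizeof *h + size);
    if (h == NULL) return NULL;
    h->n.strong = 1;
    h->n.weak = 0;
    return h + 1;
}

void rc_retain(void *p)      { rc_hdr(p)->n.strong++; }
void rc_weak_retain(void *p) { rc_hdr(p)->n.weak++; }

/* Free the block once both counts are zero; returns 1 if it was freed. */
static int rc_maybe_free(rc_header *h) {
    if (h->n.strong == 0 && h->n.weak == 0) {
        free(h);
        return 1;
    }
    return 0;
}

/* free replacement: drops one strong reference. */
int rc_release(void *p) {
    rc_header *h = rc_hdr(p);
    assert(h->n.strong > 0);
    h->n.strong--;
    return rc_maybe_free(h);
}

int rc_weak_release(void *p) {
    rc_header *h = rc_hdr(p);
    assert(h->n.weak > 0);
    h->n.weak--;
    return rc_maybe_free(h);
}
```

The allocator side really is this small; everything difficult is in deciding when `rc_retain` and `rc_release` have to be called.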
The hard part is figuring out which words of memory should be treated as pointers, so that you know when to alter the reference counts.
Most C programs don't rely on all the weird guarantees that C mandates (relying on asm, which causes the same trouble, is probably more common), but for the ones that do, this is a real obstacle.
If we're trying to minimize annotation while maximizing C compatibility, it will be necessary to heap-allocate stack frames. This cost can be mitigated with annotations, once again. In this case, a global "forbid leaks even if unused" flag would cover it.
Static allocations only need full heap compatibility if `dlclose` isn't a nop.
And TLS is the forgotten step-child, but at the lowest level it's ultimately just implemented on normal allocations.
Multi-threaded refcounts aren't actually that hard?
There's overhead (how much depends on how much you're willing to annotate and how much you can infer), but the only "hard" thing is the race between accessing a field and changing the refcount of the object it points to, and [even ignoring alternative CAS approaches] that's easy enough if you control the allocator (do not return memory to the OS until all running threads have checked in).
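One way to sketch the "checked in" rule is with a global epoch counter, using C11 atomics. Everything here is an assumption for illustration: the names (`thread_checkin`, `defer_free`, `reclaim`), the fixed-size tables, and the reading of "checked in" as "this thread currently holds no pointers it read before the check-in". `free` stands in for handing memory back to the OS, and the retired list is assumed to be guarded by the allocator's own lock, which is omitted.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_THREADS 8   /* hypothetical fixed limits, for brevity */
#define MAX_RETIRED 64

static atomic_ulong global_epoch = 1;
static atomic_ulong thread_epoch[MAX_THREADS]; /* 0 means "slot not running" */

typedef struct { void *ptr; unsigned long epoch; } retired_block;
static retired_block retired[MAX_RETIRED];
static size_t retired_count;

/* Called by thread `id` at a point where it holds no stale pointers. */
void thread_checkin(size_t id) {
    assert(id < MAX_THREADS);
    atomic_store(&thread_epoch[id], atomic_load(&global_epoch));
}

void thread_register(size_t id) { thread_checkin(id); }

void thread_unregister(size_t id) {
    assert(id < MAX_THREADS);
    atomic_store(&thread_epoch[id], 0);
}

/* Queue an already-unlinked block; it is released only by reclaim().
 * Returns false if the (fixed-size) queue is full. */
bool defer_free(void *p) {
    if (retired_count == MAX_RETIRED) return false;
    unsigned long e = atomic_fetch_add(&global_epoch, 1) + 1;
    retired[retired_count++] = (retired_block){ p, e };
    return true;
}

/* Release every block that all running threads have checked in after.
 * Returns the number of blocks released. */
size_t reclaim(void) {
    unsigned long oldest = ULONG_MAX;
    for (size_t i = 0; i < MAX_THREADS; i++) {
        unsigned long e = atomic_load(&thread_epoch[i]);
        if (e != 0 && e < oldest) oldest = e;
    }
    size_t kept = 0, freed = 0;
    for (size_t i = 0; i < retired_count; i++) {
        if (retired[i].epoch <= oldest) {
            free(retired[i].ptr);
            freed++;
        } else {
            retired[kept++] = retired[i];
        }
    }
    retired_count = kept;
    return freed;
}
```

With two registered threads, a block passed to `defer_free` survives `reclaim()` until both have called `thread_checkin` again, so a thread that raced on the refcount is still touching valid memory.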
Note that, in contrast to the common refcount approach, it's probably better to introduce a "this is in use; crash on free" flag to significantly reduce the overhead.
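A minimal sketch of such a flag, assuming it sits in a per-allocation header and that only one borrower marks the object at a time. The `pin_*` names are made up for illustration, and a multi-threaded version would need an atomic flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical header in front of each allocation; the union keeps the
 * user pointer aligned for any type. */
typedef union pin_header {
    struct { bool in_use; } state;
    max_align_t align_;
} pin_header;

static pin_header *pin_hdr(void *p) { return (pin_header *)p - 1; }

void *pin_malloc(size_t size) {
    pin_header *h = malloc(sizeof *h + size);
    if (h == NULL) return NULL;
    h->state.in_use = false;
    return h + 1;
}

/* Mark the object as borrowed instead of bumping a reference count. */
void pin_begin(void *p) { pin_hdr(p)->state.in_use = true; }

void pin_end(void *p) {
    assert(pin_hdr(p)->state.in_use); /* catches unbalanced pin_end in debug builds */
    pin_hdr(p)->state.in_use = false;
}

bool pin_is_in_use(void *p) { return pin_hdr(p)->state.in_use; }

/* Freeing a borrowed object is treated as a bug: crash rather than keep it alive. */
void pin_free(void *p) {
    if (p == NULL) return;
    if (pin_hdr(p)->state.in_use) {
        fputs("pin_free: object is still in use\n", stderr);
        abort();
    }
    free(pin_hdr(p));
}
```

The trade-off is that a temporary borrow costs a plain store on each side instead of an increment/decrement pair, at the price of turning "freed while borrowed" into a crash instead of a silently extended lifetime.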