I'm not 100% sure how Zig allocators work but it looks like the arena memory is ...

eknkc · 2024-06-15T15:13:05

I guess you could place a zeroing allocator wrapper in between the arena and its underlying allocator. That would write zeros to anything that's getting freed. Arena deinit frees everything it allocated from the underlying allocator, so upon completion of each request, used memory would be zeroed before being returned to the main allocator.

And that handler signature would still be the same. Which is the whole point of this article so, yay.
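
A rough C analogue of that wrapper idea, as a sketch only: the `Allocator` struct, `zero_on_free`, and the fixed-buffer backing store are names made up for illustration, not Zig's std API. The free call receives the size (as Zig's does via the slice length), so the wrapper knows how much to clear.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical allocator interface; free() is told the block size. */
typedef struct Allocator {
    void *ctx;
    void *(*alloc)(void *ctx, size_t size);
    void (*free)(void *ctx, void *ptr, size_t size);
} Allocator;

/* Toy backing store: hands out bytes from a fixed buffer and never reuses
   them. It ignores alignment, so it is only suitable for byte buffers. */
typedef struct FixedBuffer {
    unsigned char bytes[256];
    size_t used;
} FixedBuffer;

void *fixed_alloc(void *ctx, size_t size) {
    FixedBuffer *fb = ctx;
    if (size > sizeof fb->bytes - fb->used) return NULL;
    void *p = fb->bytes + fb->used;
    fb->used += size;
    return p;
}

void fixed_free(void *ctx, void *ptr, size_t size) {
    (void)ctx; (void)ptr; (void)size; /* nothing to release in this toy */
}

/* Wrapper: allocation is forwarded unchanged to the parent. */
void *zeroing_alloc(void *ctx, size_t size) {
    Allocator *parent = ctx;
    return parent->alloc(parent->ctx, size);
}

/* Wrapper: clear the block, then hand it back to the parent.
   A hardened version would use a clear the optimiser cannot drop
   (for example explicit_bzero where available). */
void zeroing_free(void *ctx, void *ptr, size_t size) {
    Allocator *parent = ctx;
    assert(ptr != NULL);
    memset(ptr, 0, size);
    parent->free(parent->ctx, ptr, size);
}

/* Compose: the returned allocator sits in front of `parent`. */
Allocator zero_on_free(Allocator *parent) {
    Allocator a = { parent, zeroing_alloc, zeroing_free };
    return a;
}
```

Code that takes an `Allocator` does not change; only the value passed in does, which is the same property the handler signature has in the article.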

samatman · 2024-06-15T16:47:26

I once spent an utterly baffling afternoon trying to figure out why my benchmark for a reverse iteration across a rope data structure in Julia was finishing way too fast. I was perf tuning it, and while it would have been lovely if my implementation was actually 50 times faster than reverse iterating a native String type, I didn't buy it.

Finally figured it out: I flipped a sign in the reverse iterator, so it was allocating a bunch of memory and immediately hitting the margin of the Vector, and returning it with most of the bytes undefined. Why didn't I catch it sooner? Well, I kept running the benchmark, which allocated a reverse buffer for the String version, which GC released, then I ran the buggy code... and the GC picked up the recently freed correct data and handed it back to me! Oops.

Of course, if you want to avoid that risk in Zig, you just write a ZeroOnFreeAllocator, which zeros out your memory when you free it. It's a drop in replacement for anything which needs an allocator, job done.

hansvm · 2024-06-15T21:57:57

In my Zig servers I'm using a similar arena-based (with resetting) strategy. It's not as bad as you'd imagine:

The current alloc implementation memsets under the hood. There are ongoing discussions about the right way to remove that performance overhead, but safety comes first.

Any sane implementation has an arena per request and per connection anyway, not shared between processes. You don't have bonkers aliasing bugs because the OS would have panicked before handing out that memory.

Zig has a lot of small features designed to make memory corruption an unhappy code path. I've had one corruption bug out of a lot of Zig code the last few years. It was from a misunderstanding of async (a classic stack pointer leak disguised by a confusion over async syntax). It's not an issue since async is gone from the language, and that sort of thing is normally turned into a compiler error anyway as soon as somebody reports it.

KerrAvon · 2024-06-15T15:46:58

That’s not specific to Zig — local heap allocators generally don’t zero deallocated memory — that’s a significant, unnecessary performance hit.

If you need data to be isolated when memory is corrupt, you need it to be isolated always.

10000truths · 2024-06-15T18:06:09

memset is the golden example of an easily pipelined, parallelized, predictable CPU operation - any semi-modern CPU couldn't ask for easier work to do. Zeroing 8 KB of memory is very cheap.

If we use a modern Xeon chip as an example, an AVX2 store has a throughput of 2 instructions / cycle. Doing that 256 times for 8 KB totals 128 cycles, plus a few extra cycles to account for the latency of issuing the first instruction and the last store to the L1 cache. With a 2 GHz clock frequency, it still takes less than 70 nanoseconds. For comparison, an integer divide has a worst-case latency of 90ish cycles, or 45ish nanoseconds.
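
The arithmetic above, written out as a small C sketch. The function and struct names are invented here, and the model deliberately ignores cache misses and issue latency, so treat the result as a lower bound on paper rather than a measurement.

```c
#include <assert.h>

/* Back-of-envelope estimate for zeroing a buffer with wide stores. */
typedef struct ZeroEstimate {
    unsigned long stores;   /* number of store instructions */
    unsigned long cycles;   /* cycles at the given store throughput */
    double nanoseconds;     /* wall time at the given clock */
} ZeroEstimate;

ZeroEstimate estimate_zeroing(unsigned long bytes,
                              unsigned long bytes_per_store,
                              unsigned long stores_per_cycle,
                              double clock_ghz) {
    assert(bytes_per_store > 0 && stores_per_cycle > 0 && clock_ghz > 0.0);
    ZeroEstimate e;
    /* Round up so a partial store or cycle still counts as a whole one. */
    e.stores = (bytes + bytes_per_store - 1) / bytes_per_store;
    e.cycles = (e.stores + stores_per_cycle - 1) / stores_per_cycle;
    e.nanoseconds = (double)e.cycles / clock_ghz; /* 1 GHz = 1 cycle per ns */
    return e;
}
```

With the figures from the comment, `estimate_zeroing(8192, 32, 2, 2.0)` gives 256 stores, 128 cycles, and 64 ns, which is where the "less than 70 nanoseconds" comes from.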

toast0 · 2024-06-15T21:25:51

Zeroing memory is very cheap, but not zeroing it is even cheaper.

Zeroing memory on deallocation can be important for sensitive data. Otherwise, it makes more sense to zero on allocation if you know that it's needed because the allocated structure will be used without initialization and the memory isn't zero by guarantee (most OSes guarantee newly allocated memory will be zero, and have a process to zero pages in the background when possible).
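
For reference, zero-on-allocation in plain C looks like the sketch below. The two helper names are made up; `calloc` itself is standard. Whether `calloc` can skip the clear for pages that came fresh from the OS depends on the allocator implementation, so that part is an assumption, not a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Explicit version: allocate, then clear. Always touches every byte. */
void *alloc_then_zero(size_t size) {
    assert(size > 0);
    void *p = malloc(size);
    if (p != NULL) memset(p, 0, size);
    return p;
}

/* calloc version: the result is zero-initialised by contract, and the
   allocator is free to avoid the clear when it already knows the memory
   is zero (implementation-dependent). */
void *alloc_zeroed(size_t size) {
    assert(size > 0);
    return calloc(1, size);
}
```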

10000truths · 2024-06-15T22:22:12

Sure, but in most practical applications where an HTTP server is involved, zeroing the request/response buffer memory is very unlikely to ever be your bottleneck. Even at 10K RPS per core, your per-request CPU time budget is 100 microseconds. Zeroing memory will only account for a fraction of a percentage of that.

If you're exposing an HTTP API to clients, it's likely that any response's contents will contain sensitive client-specific data. If memory corruption bugs are more likely than bottlenecking on zeroing out your request/response buffer, then zeroing the request/response buffer is a good idea, until proven otherwise by benchmarks or profiling.

adgjlsfhk1 · 2024-06-16T00:26:07

Zeroing on allocation is much more sensible though, because that way you preload the memory into your caches, as opposed to on deallocation, where you bring memory into cache that you know you no longer care about. Also, if you zero on allocation, the compiler can delete it if it can prove that you write to the memory before reading from it.

celrod · 2024-06-15T19:00:47

This memory is now the least recently used in the L1 cache, despite being freed by the allocator, meaning it probably isn't being used again.

If it was freed after already being removed from the L1 cache, then you also need to evict other L1 cache contents and wait for it to be read into L1 so you can write to it.

128 cycles is a generous estimate, and ignores the costs to the rest of the program.

astrange · 2024-06-15T20:46:50

You can use non-temporal writes to avoid this, and some CPUs have an instruction that zeroes a cache line. It's not expensive to do this.

celrod · 2024-06-15T23:51:25

Nontemporal writes are substantially slower, e.g. with avx512 you can do 1 64 byte nontemporal write every 5 or so clock cycles. That puts you at >= 640 cycles for 8 KiB. https://uops.info/html-instr/VMOVNTPS_M512_ZMM.html

astrange · 2024-06-16T02:10:47

Well, the point of a non-temporal write kind of is that you don't care how fast it is. (Since if it was being read again anytime soon, you'd want it in the cache.)

But yes, it can be an over-optimization.

10000truths · 2024-06-15T20:11:47

The worker is already reading/writing to the buffer memory to service each incoming HTTP request, whether the memory is zeroed or not. The side effects on the CPU cache are insubstantial.

alexchamberlain · 2024-06-15T19:09:55

This might be a stupid question, but why isn't zeroing 8KB of memory a single instruction? It must be common enough to be worth having all the layers of memory (and indirection) understand it.

astrange · 2024-06-15T20:47:40

If the memory is above the size of a page, you can tell the VM to drop the page and give you a new zero filled one instead.
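
One way to do this on Linux is `madvise` with `MADV_DONTNEED` on a private anonymous mapping; the comment does not name a specific call, so that choice is an assumption, and other systems behave differently. A minimal sketch with invented helper names:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS and madvise on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* System page size in bytes. */
size_t page_size(void) {
    return (size_t)sysconf(_SC_PAGESIZE);
}

/* Map `len` bytes of private anonymous memory; NULL on failure. */
void *map_pages(size_t len) {
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Tell the kernel the contents are no longer needed. On Linux, the next
   access to a private anonymous mapping then sees zero-filled pages.
   `addr` must be page-aligned. Returns 0 on success. */
int drop_pages(void *addr, size_t len) {
    assert(addr != NULL);
    return madvise(addr, len, MADV_DONTNEED);
}
```

This costs a system call now and a page fault on the next touch, so it is a trade against simply writing the zeros.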

josephg · 2024-06-15T21:45:31

For 8kb? Syscalling into the kernel, updating the process’s memory map and then later faulting is probably slower by an order of magnitude or more compared to just setting those bytes to zero.

Memcpy, bzero and friends are insanely fast. Practically free when those bytes are in the cpu’s cache already.

astrange · 2024-06-15T22:34:58

So don't syscall. Darwin has a system similar to io_uring for this.

(But it also has a 16KB page size.)

josephg · 2024-06-16T06:52:43

Probably still cause a page fault when the memory is re-accessed though. I suspect even using io_uring will still be a lot slower than bzero if you're just zeroing out 2 pages of memory. Zeroing memory is really fast.

pcwalton · 2024-06-15T23:38:15

128-bit or 256-bit memsets via SIMD instructions are sufficient to saturate RAM bandwidth, so there wouldn't be much of a gain from having a dedicated instruction.

(By the way, x86 does have a dedicated instruction--rep stosb--but compilers differ as to how often they use it, for the reason cited above.)

anonymoushn · 2024-06-16T10:47:48

Supposedly rep movsb is faster than SIMD stores on very recent chips, for cases where you aren't actually hitting RAM with all your writes.

tubs · 2024-06-16T04:49:14

The gain is in power efficiency.

Arm64 provides `dc zva` for this.

saagarjha · 2024-06-15T20:16:06

Zeroing something that large is not typical. That said, some architectures have optimized zeroing instructions, such as dc zva on ARM.

secondcoming · 2024-06-15T21:34:40

compilers are probably going to remove that memset.

olliej · 2024-06-16T02:28:56

Compilers can remove the memset if they can show it is overwritten prior to use (though C and C++ UB could technically make it possible to skip padding, they don’t), or it isn’t used (in which case we go back to non-zero’d memory again, which in this scenario we’re trying to avoid).

There are various _s variants of memset, etc that require the compiler to perform the operations even if it “proves” the data cannot be read.

And finally modern hardware has mechanisms to say “this is now zero” and not actually zero the memory and instead just tell the MMU that the region is now zero (which removes the cpu time and cache impact of accessing the memory directly).

On macOS and iOS I believe all memory is now zero’d on free and I think malloc ostensibly therefore guarantees zero’d memory (the problem I think is whether calloc tries to rely on that behavior, because then calloc can produce non-zero memory courtesy of a buffer overrun/UaF after free has ostensibly zero’d memory)

josephg · 2024-06-15T21:46:28

In C, you can use explicit_bzero to make sure the instructions aren’t removed by the optimiser:

https://man7.org/linux/man-pages/man3/bzero.3.html
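
Where `explicit_bzero` is not available, a common fallback is to write through a `volatile` pointer. This is a sketch under the assumption that the compiler treats volatile stores as observable and keeps them, which mainstream compilers do in practice; `secure_zero` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/* Clear `len` bytes through a volatile pointer so the stores are treated
   as side effects instead of dead writes the optimiser can remove. */
void secure_zero(void *buf, size_t len) {
    assert(buf != NULL || len == 0);
    volatile unsigned char *p = buf;
    while (len--) {
        *p++ = 0;
    }
}
```

It is byte-at-a-time, so it will be slower than a tuned memset; the point is that it survives optimisation, not that it is fast.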

pjmlp · 2024-06-16T20:40:56

> Marked as LEGACY in POSIX.1-2001. Removed in POSIX.1-2008.

In Linux you mean.

atiedebee · 2024-06-16T23:04:34

The only standard explicit memset is in C23

nurpax · 2024-06-15T15:30:55

The same can happen with C malloc/free too.

jedisct1 · 2024-06-15T16:05:49

Zig allocators can be composed, so adding zeroization would be trivial.

keybored · 2024-06-15T15:55:13

Deinit in O(1) seems to be a big attraction of arenas.

foota · 2024-06-15T17:01:07

O(1) is nice, but I feel like avoiding walking a bunch of data structures is maybe most important.

elvircrn · 2024-06-15T17:22:56

Any papers/blogs/SO answers covering this?

foota · 2024-06-15T21:03:22

I don't have anything for you, but if you have some normally allocated hierarchical data structures, in order to free them you'll have to go through their members, chase pointers, etc., to figure out the addresses to free, then call free on them in sequence. That's all going to be a lot more expensive than just memsetting a bunch of data to zero, which you can do at whatever the speed of your core's memory bandwidth is.
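
To make the pointer chasing concrete, here is a small sketch of tearing down a malloc'd binary tree; the `Node` type and function names are made up for illustration. Every node costs a pointer dereference and its own `free` call, whereas an arena would release the same nodes by resetting one offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct Node {
    struct Node *left;
    struct Node *right;
} Node;

/* Build a full binary tree with `depth` levels, one malloc per node.
   Returns NULL for depth 0 or if an allocation fails. */
Node *build_tree(unsigned depth) {
    assert(depth < 24); /* keep the demo from exploding exponentially */
    if (depth == 0) return NULL;
    Node *n = malloc(sizeof *n);
    if (n == NULL) return NULL;
    n->left = build_tree(depth - 1);
    n->right = build_tree(depth - 1);
    return n;
}

/* Free the tree node by node; returns the number of free() calls made. */
size_t free_tree(Node *n) {
    if (n == NULL) return 0;
    size_t calls = free_tree(n->left) + free_tree(n->right);
    free(n);
    return calls + 1;
}
```

A tree with 3 levels already needs 7 separate `free` calls, and the count doubles with each extra level.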

josephg · 2024-06-15T21:56:40

Yep. And you often don’t even need to zero the data.

Generally, no paper or SO answer will tell you where your program spends its time. Learn to use profiling tools, and experiment with stuff like this. Try out arenas. Benchmark before and after and see what kind of real performance difference it makes in your own program.

saagarjha · 2024-06-15T20:10:02

What are you looking for? Bump allocators are quite simple, compared to typical allocators at least.
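
For a sense of how simple, here is a minimal bump allocator sketch in C. The `Bump` type and function names are invented for this example; it has no growth, no per-block free, and no thread safety.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Bump {
    unsigned char *base; /* start of the caller-provided buffer */
    size_t capacity;     /* total bytes available */
    size_t offset;       /* bytes handed out so far */
} Bump;

void bump_init(Bump *b, void *buffer, size_t capacity) {
    assert(b != NULL && buffer != NULL);
    b->base = buffer;
    b->capacity = capacity;
    b->offset = 0;
}

/* Allocate `size` bytes aligned to `align` (a power of two).
   Returns NULL when the buffer is exhausted. */
void *bump_alloc(Bump *b, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t start = (uintptr_t)b->base + b->offset;
    /* Bytes needed to round `start` up to the next multiple of `align`. */
    size_t padding = (align - (size_t)(start & (align - 1))) & (align - 1);
    size_t remaining = b->capacity - b->offset;
    if (padding > remaining || size > remaining - padding) return NULL;
    void *p = b->base + b->offset + padding;
    b->offset += padding + size;
    return p;
}

/* Release everything at once: O(1), no per-block bookkeeping. */
void bump_reset(Bump *b) {
    b->offset = 0;
}
```

Note that `bump_reset` leaves the old bytes in place, which is exactly the behaviour the zeroing discussion in this thread is about.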

tapirl · 2024-06-15T18:49:10

If needed, you should zero memory when allocation succeeds, instead of zeroing it after it is freed.

alexchamberlain · 2024-06-15T19:08:02

Generally, you 0 on free in secure environments to avoid leaking secrets from 1 section of knowledge to the next, i.e. a request may contain a password, which the next request should not have access to.

tapirl · 2024-06-16T12:33:15

Good reason. But I think it is not the responsibility of memory allocators to do the zero work. It is what the application code should do.

alexchamberlain · 2024-06-16T13:52:05

Depends where you draw the line. An arena allocator per request needs to be managed at least by an app framework, if not the application. It's all layers of abstraction, and one of those layers needs to 0 memory.

tapirl · 2024-06-18T07:13:50

The arena allocator implementation for general uses absolutely should not do the zero work. This is a specific use case, which can be implemented in an app-specific custom allocator.

alexchamberlain · 2024-06-18T08:20:32

That's not what I said. My point was that an arena allocator has to be managed at a relatively high level. Similarly, an allocator responsible for 0 on free would be managed at a similar level. They are orthogonal concepts as you say, but there's no reason 0 on free can't be managed by an allocator.

saagarjha · 2024-06-15T20:08:52

Guard pages are not enough to prevent memory corruption across requests.