
That’s not specific to Zig; local heap allocators generally don’t zero deallocated memory, since doing so would be a significant, unnecessary performance hit.

If you need data to be isolated when memory is corrupt, you need it to be isolated always.




memset is the golden example of an easily pipelined, parallelized, predictable CPU operation; a semi-modern CPU couldn't ask for easier work to do. Zeroing 8 KB of memory is very cheap.

If we use a modern Xeon chip as an example, an AVX2 store writes 32 bytes with a throughput of 2 stores per cycle. The 256 stores needed to cover 8 KB total 128 cycles, plus a few extra cycles to account for the latency of issuing the first instruction and retiring the last store to the L1 cache. Even at a 2 GHz clock frequency, that's less than 70 nanoseconds. For comparison, an integer divide has a worst-case latency of 90-ish cycles, or 45-ish nanoseconds.
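
A quick sanity check on those numbers is a micro-benchmark along these lines (a rough sketch; the buffer size, iteration count, and compiler-barrier trick are my own choices, and results vary by CPU and compiler flags):

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    int main(void) {
        static char buf[8192];
        enum { ITERS = 1000000 };
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            memset(buf, 0, sizeof buf);
            /* empty asm keeps the optimizer from deleting the "dead" stores */
            __asm__ volatile("" : : "r"(buf) : "memory");
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per 8 KB memset\n", ns / ITERS);
    }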


Zeroing memory is very cheap, but not zeroing it is even cheaper.

Zeroing memory on deallocation can be important for sensitive data. Otherwise, it makes more sense to zero on allocation, and only when you know it's needed: when the allocated structure will be used without initialization and the memory isn't guaranteed to be zero already (most OSes guarantee newly mapped memory will be zero, and have a process to zero pages in the background when possible).
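
In C that usually just means preferring calloc over malloc + memset; for large requests the allocator can often hand back pages the OS has already zeroed and skip the memset entirely (a sketch; the function names are mine, and the behavior depends on the allocator):

    #include <stdlib.h>
    #include <string.h>

    void *alloc_zeroed(size_t n) {
        /* calloc can satisfy large requests with fresh OS pages that are
           already zero, skipping an explicit memset. */
        return calloc(1, n);
    }

    void *alloc_then_zero(size_t n) {
        /* Same result, but the allocator may recycle a dirty chunk,
           forcing the full n-byte memset every time. */
        void *p = malloc(n);
        if (p) memset(p, 0, n);
        return p;
    }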


Sure, but in most practical applications where an HTTP server is involved, zeroing the request/response buffer memory is very unlikely to ever be your bottleneck. Even at 10K RPS per core, your per-request CPU time budget is 100 microseconds. Zeroing memory will only account for a fraction of a percent of that.

If you're exposing an HTTP API to clients, it's likely that any response's contents will contain sensitive, client-specific data. If a memory-corruption bug is more likely than the zeroing becoming your bottleneck, then zeroing the request/response buffer is a good idea until benchmarks or profiling prove otherwise.


Zeroing on allocation is much more sensible, though, because that way you preload the memory into your caches, as opposed to zeroing on deallocation, where you bring memory into cache that you know you no longer care about. Also, if you zero on allocation, the compiler can delete the zeroing if it can prove that you write to the memory before reading from it.
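
Something like this illustrates that dead-store elimination (a sketch; whether the memset is actually deleted depends on the compiler and optimization level):

    #include <string.h>

    void fill(unsigned char *buf, size_t n) {
        memset(buf, 0, n);              /* candidate for dead-store elimination */
        for (size_t i = 0; i < n; i++)
            buf[i] = (unsigned char)i;  /* every byte written before any read */
        /* A compiler that proves the loop overwrites all n bytes before they
           are read may legally delete the memset above. */
    }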


This memory is now the most recently used in the L1 cache, despite having just been freed by the allocator, meaning it probably isn't going to be touched again soon.

If it was freed after already being evicted from the L1 cache, then you also need to evict other L1 cache contents and wait for it to be read back into L1 before you can write to it.

128 cycles is an optimistic estimate, and it ignores the costs to the rest of the program.


You can use non-temporal writes to avoid this, and some CPUs have an instruction that zeroes a cache line. It's not expensive to do this.


Non-temporal writes are substantially slower, e.g. with AVX-512 you can do one 64-byte non-temporal write every 5 or so clock cycles. That puts you at >= 640 cycles for 8 KiB. https://uops.info/html-instr/VMOVNTPS_M512_ZMM.html
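
For reference, a non-temporal zeroing loop with AVX2 intrinsics (the 256-bit analogue of the AVX-512 store above) looks roughly like this (a sketch; the function name is mine, and it assumes a 32-byte-aligned buffer whose size is a multiple of 32):

    #include <immintrin.h>
    #include <stddef.h>

    void zero_nt(void *buf, size_t n) {
        __m256i z = _mm256_setzero_si256();
        char *p = buf;
        /* Streaming stores bypass the cache, so the dead buffer doesn't
           evict live data, at the cost of lower store throughput. */
        for (size_t i = 0; i < n; i += 32)
            _mm256_stream_si256((__m256i *)(p + i), z);
        _mm_sfence();   /* make the streaming stores visible to other cores */
    }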


Well, the point of a non-temporal write is kind of that you don’t care how fast it is. (If it were going to be read again anytime soon, you’d want it in the cache.)

But yes, it can be an over-optimization.


The worker is already reading/writing to the buffer memory to service each incoming HTTP request, whether the memory is zeroed or not. The side effects on the CPU cache are insubstantial.


This might be a stupid question, but why isn't zeroing 8 KB of memory a single instruction? It must be common enough to be worth teaching all the layers of memory (and indirection) to understand it.


If the region is at least a page in size, you can tell the virtual memory system to drop the page and hand you a new zero-filled one instead.
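
On Linux that's madvise with MADV_DONTNEED on an anonymous mapping: the next access to the range faults in fresh zero pages (a sketch; the function name is mine):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Drop the pages backing [buf, buf+len); the next read or write faults
       in zero-filled pages. Only valid for anonymous (not file-backed)
       mappings, with buf page-aligned and len a multiple of the page size. */
    int drop_to_zero(void *buf, size_t len) {
        return madvise(buf, len, MADV_DONTNEED);
    }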


For 8 KB? Syscalling into the kernel, updating the process’s memory map, and then faulting later is probably slower by an order of magnitude or more than just setting those bytes to zero.

memcpy, bzero, and friends are insanely fast. Practically free when those bytes are already in the CPU’s cache.


So don't syscall. Darwin has a system similar to io_uring for this.

(But it also has a 16KB page size.)


It would probably still cause a page fault when the memory is re-accessed, though. I suspect even using io_uring would still be a lot slower than bzero if you're just zeroing out 2 pages of memory. Zeroing memory is really fast.


128-bit or 256-bit memsets via SIMD instructions are sufficient to saturate RAM bandwidth, so there wouldn't be much of a gain from having a dedicated instruction.

(By the way, x86 does have a dedicated instruction, rep stosb, but compilers differ as to how often they use it, for the reason cited above.)
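
On x86-64 that one instruction looks like this in GNU inline assembly (a sketch; the function name is mine, and whether it beats SIMD stores depends on the CPU's ERMS/FSRM support):

    #include <stddef.h>

    /* rep stosb: store AL into [RDI], RCX times, i.e. memset(p, 0, n)
       as a single micro-coded instruction. */
    void zero_rep_stosb(void *p, size_t n) {
        __asm__ volatile("rep stosb"
                         : "+D"(p), "+c"(n)
                         : "a"(0)
                         : "memory");
    }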


Supposedly rep movsb is faster than SIMD stores on very recent chips, for cases where you aren't actually hitting RAM with all your writes.


The gain is in power efficiency.

Arm64 provides `dc zva` for this.
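
Roughly like this (a sketch; the function name is mine, and it assumes buf is aligned to the zeroing block size and that DC ZVA isn't prohibited, i.e. the DZP bit of DCZID_EL0 is clear):

    #include <stddef.h>
    #include <stdint.h>

    void zero_dc_zva(void *buf, size_t len) {
        uint64_t dczid;
        __asm__("mrs %0, dczid_el0" : "=r"(dczid));
        /* low 4 bits give log2 of the block size in 4-byte words */
        size_t block = 4u << (dczid & 0xf);
        for (char *p = buf; p < (char *)buf + len; p += block)
            __asm__ volatile("dc zva, %0" : : "r"(p) : "memory");
    }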


Zeroing something that large is not typical. That said, some architectures have optimized zeroing instructions, such as dc zva on ARM.


Compilers are probably going to remove that memset.


Compilers can remove the memset if they can show the memory is overwritten before use (though C and C++ UB technically lets them skip padding bytes that aren’t overwritten), or if it is never used (in which case we’re back to non-zeroed memory, which is exactly what this scenario is trying to avoid).

There are various _s variants of memset, etc., that require the compiler to perform the operations even if it “proves” the data cannot be read.
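
C11's optional Annex K is one example (a sketch; the wrapper name is mine, and few libcs actually ship Annex K, so availability is the main catch):

    #define __STDC_WANT_LIB_EXT1__ 1
    #include <string.h>

    void wipe_key(unsigned char *key, size_t len) {
        /* memset_s must not be optimized away, even if key is never
           read again. */
        memset_s(key, len, 0, len);
    }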

And finally, modern hardware has mechanisms to say “this is now zero” without actually zeroing the memory, instead just telling the MMU that the region is now zero (which removes the CPU time and cache impact of touching the memory directly).

On macOS and iOS I believe all memory is now zeroed on free, so malloc ostensibly guarantees zeroed memory. (The problem, I think, is whether calloc tries to rely on that behavior, because then calloc can produce non-zero memory courtesy of a buffer overrun or use-after-free that lands after free has ostensibly zeroed the memory.)


In C, you can use explicit_bzero to make sure the instructions aren’t removed by the optimiser:

https://man7.org/linux/man-pages/man3/bzero.3.html
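
For example (the wrapper name is mine; assumes glibc >= 2.25 or a BSD libc, with _DEFAULT_SOURCE exposing the declaration on glibc):

    #define _DEFAULT_SOURCE
    #include <string.h>

    void forget_password(char *pw, size_t len) {
        /* Guaranteed to perform the stores even though pw is dead after
           this call; a plain memset here could be optimized away. */
        explicit_bzero(pw, len);
    }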


> Marked as LEGACY in POSIX.1-2001. Removed in POSIX.1-2008.

In Linux you mean.


The only standard explicit memset is memset_explicit, added in C23.
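
Which looks like this (C23 <string.h>; the wrapper name is mine, and libc support is still rolling out):

    #include <string.h>   /* C23 */

    void wipe(void *secret, size_t len) {
        /* memset_explicit is the portable "un-removable" memset:
           the implementation must perform the stores. */
        memset_explicit(secret, 0, len);
    }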



