The author notices that Bolt doesn't use mmap for writes. The reason is surprisingly simple once you know how it works. Say you want to overwrite a page at some location that isn't present in memory. You'd write to it and think that's that. But when this happens the CPU triggers a page fault, the OS steps in and reads the underlying page into memory, and only then hands control back to the application, which carries on overwriting that page.
So for each write that isn't mapped into memory you'll trigger a read. Bad.
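A minimal sketch of what that looks like from the application's side (made-up file name and offsets, not Bolt's code): the store is just a memory write, but if the page isn't resident the kernel reads the whole 4 kB page off disk before it can complete.

    /* Sketch: overwriting a few bytes through a file-backed mmap.
     * If the target page isn't in memory, the store below page-faults and
     * the kernel reads the full page from disk before the write proceeds. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.db", O_RDWR);   /* hypothetical file, >= 1 MiB */
        if (fd < 0) return 1;

        size_t len = 1 << 20;
        uint8_t *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) return 1;

        /* Store into a page that may not be resident: fault -> disk read -> write. */
        memset(map + 512 * 1024, 0xAB, 64);

        munmap(map, len);
        close(fd);
        return 0;
    }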
Early versions of Varnish Cache struggled with this, and it was the reason they made a malloc-based backend instead. mmaps are great for reads, but you really shouldn't write through them.
There's an even better reason for databases not to write to memory-mapped pages: pages get synced out to disk at the kernel's leisure. That can be OK for a cache, but it's definitely not what you want for a database!
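If you do write through a mapping anyway, the only handle you get on durability is an explicit msync(). Something like this hypothetical helper, which blocks until the dirty pages in the range have been written back:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hypothetical helper: force a mapped range to disk before acknowledging a
     * commit. Without this, dirty mmap pages hit disk whenever the kernel likes. */
    static int flush_mapped_range(void *addr, size_t len) {
        return msync(addr, len, MS_SYNC);   /* blocks until the pages are written back */
    }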
Yes, for almost all databases, although there was a cool paper from the University of Wisconsin Madison a few years ago that showed how to design something that could work without write barriers, and under the assumption that disks don't always fsync correctly:
"the No-Order File System (NoFS), a simple, lightweight file system that employs a novel technique called backpointer based consistency to provide crash consistency without ordering writes as they go to disk"
Does that generalize to databases? My understanding is that file systems are a restricted case of databases that don't necessarily support all operations (e.g. transactions are smaller, you can't do arbitrary queries within a transaction, etc.).
You can do write/sync/write/sync in order to achieve that. It would be nicer to have FUA support in system calls (or you can open the same file to two descriptors, one with O_SYNC and one without).
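Roughly what I mean by the two-descriptor variant (untested sketch, file name made up): ordinary writes go through the plain descriptor, and the writes that must be on disk before you continue go through the O_SYNC one.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        /* Same file opened twice: one fd for buffered writes, one that syncs. */
        int fd_fast = open("journal.log", O_WRONLY | O_CREAT, 0644);
        int fd_sync = open("journal.log", O_WRONLY | O_SYNC);
        if (fd_fast < 0 || fd_sync < 0) return 1;

        const char body[]   = "record body";
        const char commit[] = "commit mark";

        pwrite(fd_fast, body, sizeof body - 1, 0);       /* buffered, returns quickly     */
        pwrite(fd_sync, commit, sizeof commit - 1, 64);  /* doesn't return until on disk  */

        close(fd_fast);
        close(fd_sync);
        return 0;
    }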
It won't trigger a read if your write is a full 4 kB page in both size and alignment, but it does for anything else. And then, as you said, it stalls your write.
I debugged a Ceph issue caused by this kind of behaviour, except it wasn't actually mmap but pwritev(): the call would stall and take a long time to return, even though it was expected to be async and just land in the page cache.
It was caused by "unaligned" writes, i.e. writes that aren't 4 kB in size and alignment, which happen from Windows guests (512 b alignment, at least by default) but not Linux guests (which use 4 kB alignment, even if the disk is 4 kB aligned).
It was made worse by Ceph behaviour that would write to that page, incur the penalty, and then immediately madvise the kernel that it DONTNEEDed the page anymore as an optimisation (it's a write-only replica, so no need to read it again!). The page got dropped, and a millisecond later Ceph wrote to it again and had to read it back in, because that optimisation didn't consider that unaligned writes need the page read in first.
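The alignment effect is easy to see with plain buffered writes (sketch, not Ceph code; file name and offsets made up): a write that covers a whole aligned 4 kB page can simply replace the page-cache page, while a 512-byte write forces the kernel to read that page in first.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("osd-object.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0) return 1;

        char *page = malloc(4096);
        if (!page) return 1;
        memset(page, 0, 4096);

        pwrite(fd, page, 4096, 8192);       /* page-sized and page-aligned: no read needed  */
        pwrite(fd, page, 512, 4096 + 512);  /* partial page: kernel reads the page in first */

        free(page);
        close(fd);
        return 0;
    }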
Yes, this can definitely be a problem. I worked on a transaction processing system that was entirely based on an in-house memory-mapped database. All reads and writes went through mmap. At startup, it read through all X gigabytes of data to "ensure" everything was hot in memory, and also built the in-memory indexes.
This actually worked fine in production, since the systems were properly sized and dedicated to this. On dev systems with low memory and often running into swap, you'd run into cases with crazy delays... sometimes a second or two for something that would normally be a few milliseconds.
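That warm-up pass amounts to something like this (rough sketch, not the actual system): touch one byte per page so everything is faulted in before serving traffic. On a properly sized box that works; on a small dev box it just fights with swap.

    #include <stddef.h>
    #include <stdint.h>

    static volatile uint8_t sink;   /* volatile so the loads aren't optimized away */

    /* Touch one byte in every 4 kB page of the mapping to fault it all in. */
    static void warm_mapping(const uint8_t *map, size_t len) {
        for (size_t off = 0; off < len; off += 4096)
            sink = map[off];
    }

(madvise with MADV_WILLNEED would be the less hacky way to ask the kernel for the same thing.)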
Isn't there a way around this? When writing to GPU-mapped memory in graphics code, people usually take pains to turn off compiler optimizations that might XOR memory against itself to zero it out, or AND it against 0, and thereby cause a read, and other things like that.
> Even the following C++ code can read from memory and trigger the performance penalty because the code can expand to the following x86 assembly code.
C++ code:
    *((int*)MappedResource.pData) = 0;
x86 assembly code:
    AND DWORD PTR [EAX],0
> Use the appropriate optimization settings and language constructs to help avoid this performance penalty. For example, you can avoid the xor optimization by using a volatile pointer or by optimizing for code speed instead of code size.
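In other words, something like this hypothetical snippet is what the docs are recommending: writing through a volatile pointer makes the compiler emit a plain store instead of a read-modify-write like the AND above.

    /* pData stands in for the pointer into the mapped (write-combined) buffer. */
    void clear_first_word(void *pData) {
        *(volatile int *)pData = 0;   /* plain store, never reads the mapped memory */
    }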
I guess mmapped files may still need a read to know whether to do copy-on-write, whereas the GPU-mapped memory in that case is specifically marked as upload-only and flagged so it gets written out regardless of whether anything changed. But maybe mmap has something similar?
The answer is that it works pretty similarly, but GPUs usually do this in specialized hardware whereas mmap'ing of files for DMA-style access is implemented mostly in software.
You'd need a way to indicate when you start and end overwriting the page. You need to avoid the page being swapped out mid-overwrite and not read back in. You'd also pay a penalty for zeroing it when it gets mapped pre-overwrite. The map primitives are just not meant for this.
I think on Linux there's the madvise syscall with the MADV_REMOVE flag, which you can issue on memory pages you intend to completely overwrite. I have no idea about performance or other practical issues.
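Something like this, I guess (untested sketch): MADV_REMOVE drops the pages and their backing store, so the later overwrite faults in zeroes instead of re-reading old data, but it only works on mappings whose filesystem supports hole punching.

    #include <stddef.h>
    #include <sys/mman.h>

    /* Drop a page-aligned range you're about to rewrite completely, so the
     * rewrite doesn't pull the old contents back in from disk. */
    static int drop_before_overwrite(void *addr, size_t len) {
        return madvise(addr, len, MADV_REMOVE);
    }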
Oracle's JVM allocates your maximum heap size at startup, but these pages aren't actually assigned to either swap space or RAM pages until the first time they're written to (or read, but unless there's a bug in the JVM, it's not reading uninitialized memory), which triggers a page fault.
If the heap usage was high, and drops enough (maybe also needs to stay low for some time period), then Oracle's JVM will release some of the pages back to the OS using madvise, so they go back to using neither RAM nor swap space. On the one hand, the JVM should avoid repeatedly releasing pages back to the OS and then page faulting them back in moments later, but on the other, it shouldn't hold on to pages forever just because it needed them for a short time.
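The reserve-then-release pattern looks roughly like this at the mmap level (sketch, not HotSpot's actual code; sizes made up): the reservation costs nothing until pages are first touched, and MADV_DONTNEED hands idle pages back so they stop using RAM or swap.

    #include <stddef.h>
    #include <sys/mman.h>

    int main(void) {
        size_t max_heap = (size_t)4 << 30;   /* illustrative 4 GiB -Xmx */
        void *mem = mmap(NULL, max_heap, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (mem == MAP_FAILED) return 1;
        char *heap = mem;

        heap[0] = 1;   /* first touch: page fault, this page is now backed by RAM */

        /* Heap usage dropped and stayed low: give a chunk back to the OS. */
        madvise(heap, (size_t)1 << 30, MADV_DONTNEED);
        return 0;
    }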
I think the main problem with mmap'd writes is that they're blocking and synchronous.
I presume most database record writes are smaller than a page. In that case, other methods (unless you're using O_DIRECT, which adds its own difficulties) still have the kernel read a whole page into the page cache before writing the selected bytes. So unless you're using O_DIRECT for your writes, you're still triggering the exact same read-modify-write; it's just that with the file APIs you can use async I/O or select/poll/epoll/kqueue, etc., to keep those unavoidable reads from blocking your writer thread.
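To make that concrete, here's roughly what the O_DIRECT route looks like (sketch with made-up file name and offsets), along with the difficulty it adds: buffer, offset, and length all have to be aligned, so a sub-page record means doing the read-modify-write yourself in userspace.

    #define _GNU_SOURCE   /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("records.db", O_RDWR | O_DIRECT);
        if (fd < 0) return 1;

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;   /* aligned buffer required */

        pread(fd, buf, 4096, 8192);                 /* read the page yourself            */
        memcpy((char *)buf + 100, "new-bytes", 9);  /* patch the record inside it        */
        pwrite(fd, buf, 4096, 8192);                /* write the whole aligned page back */

        free(buf);
        close(fd);
        return 0;
    }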
Is the trade-off in Varnish worth it? Workloads for a cache should be pretty read-heavy; writes should be infrequent unless it's being filled for the first time.
For the general case? No. Varnish retained its mmap backend so you can still choose to use it if you have a load that requires it.
For the general case, where writes are somewhat frequent and the dataset is pretty small, the malloc backend was a lot more performant, once we figured out that the default implementation of malloc in glibc was pretty shit (wrt overhead).