The author notices that Bolt doesn't use mmap for writes. The reason is surprisingly simple once you know how it works. Say you want to overwrite a page at some location that isn't present in memory. You'd write to it and think that's that. But when this happens the CPU triggers a page fault, the OS steps in and reads the underlying page into memory, and only then hands control back to the application, which carries on overwriting that page.
So for each write that isn't mapped into memory you'll trigger a read. Bad.
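A minimal sketch of what that looks like from the application's side (made-up file name and offsets, not Bolt's code): the store is just a memory write, but if the page isn't resident the kernel reads the whole 4 kB page off disk before it can complete.

    /* Sketch: overwriting a few bytes through a file-backed mmap.
     * If the target page isn't in memory, the store below page-faults and
     * the kernel reads the full page from disk before the write proceeds. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.db", O_RDWR);   /* hypothetical file, >= 1 MiB */
        if (fd < 0) return 1;

        size_t len = 1 << 20;
        uint8_t *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) return 1;

        /* Store into a page that may not be resident: fault -> disk read -> write. */
        memset(map + 512 * 1024, 0xAB, 64);

        munmap(map, len);
        close(fd);
        return 0;
    }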
Early versions of Varnish Cache struggled with this, and it was the reason they made a malloc-based backend instead. mmaps are great for reads, but you really shouldn't write through them.
There's an even better reason for databases not to write to memory-mapped pages: pages get synced out to disk at the kernel's leisure. That can be OK for a cache, but it's definitely not what you want for a database!
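If you do write through a mapping anyway, the only handle you get on durability is an explicit msync(). Something like this hypothetical helper, which blocks until the dirty pages in the range have been written back:

    #include <stddef.h>
    #include <sys/mman.h>

    /* Hypothetical helper: force a mapped range to disk before acknowledging a
     * commit. Without this, dirty mmap pages hit disk whenever the kernel likes. */
    static int flush_mapped_range(void *addr, size_t len) {
        return msync(addr, len, MS_SYNC);   /* blocks until the pages are written back */
    }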
Yes, for almost all databases, although there was a cool paper from the University of Wisconsin Madison a few years ago that showed how to design something that could work without write barriers, and under the assumption that disks don't always fsync correctly:
"the No-Order File System (NoFS), a simple, lightweight file system that employs a novel technique called backpointer based consistency to provide crash consistency without ordering writes as they go to disk"
Does that generalize to databases? My understanding is that file systems are a restricted case of databases that don't necessarily support all operations (e.g. transactions are smaller, you can't do arbitrary queries within a transaction, etc.).
You can do write/sync/write/sync in order to achieve that. It would be nicer to have FUA support in system calls (or you can open the same file to two descriptors, one with O_SYNC and one without).
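Roughly what I mean by the two-descriptor variant (untested sketch, file name made up): ordinary writes go through the plain descriptor, and the writes that must be on disk before you continue go through the O_SYNC one.

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        /* Same file opened twice: one fd for buffered writes, one that syncs. */
        int fd_fast = open("journal.log", O_WRONLY | O_CREAT, 0644);
        int fd_sync = open("journal.log", O_WRONLY | O_SYNC);
        if (fd_fast < 0 || fd_sync < 0) return 1;

        const char body[]   = "record body";
        const char commit[] = "commit mark";

        pwrite(fd_fast, body, sizeof body - 1, 0);       /* buffered, returns quickly     */
        pwrite(fd_sync, commit, sizeof commit - 1, 64);  /* doesn't return until on disk  */

        close(fd_fast);
        close(fd_sync);
        return 0;
    }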
It won't trigger a read if your write is a full 4 kB page in both size and alignment, but it does for anything else. And then, as you said, it stalls your write.
I debugged a Ceph issue caused by this kind of behaviour, except it wasn't actually mmap but pwritev(): the call would stall and take a long time to return, even though it was expected to be async and just land in the page cache.
It was caused by "unaligned" writes, i.e. writes that aren't 4 kB in size and alignment, which happen from Windows guests (512 b alignment, at least by default) but not Linux guests (which use 4 kB alignment, even if the disk is 4 kB aligned).
It was made worse by Ceph behaviour that would write to that page, incur the penalty, and then immediately madvise the kernel that it DONTNEEDed the page anymore as an optimisation (it's a write-only replica, so no need to read it again!). The page got dropped, and a millisecond later Ceph wrote to it again and had to read it back in, because that optimisation didn't consider that unaligned writes need the page read in first.
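The alignment effect is easy to see with plain buffered writes (sketch, not Ceph code; file name and offsets made up): a write that covers a whole aligned 4 kB page can simply replace the page-cache page, while a 512-byte write forces the kernel to read that page in first.

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("osd-object.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0) return 1;

        char *page = malloc(4096);
        if (!page) return 1;
        memset(page, 0, 4096);

        pwrite(fd, page, 4096, 8192);       /* page-sized and page-aligned: no read needed  */
        pwrite(fd, page, 512, 4096 + 512);  /* partial page: kernel reads the page in first */

        free(page);
        close(fd);
        return 0;
    }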
Yes, this can definitely be a problem. I worked on a transaction processing system that was entirely based on an in-house memory-mapped database. All reads and writes went through mmap. At startup, it read through all X gigabytes of data to "ensure" everything was hot in memory, and also built the in-memory indexes.
This actually worked fine in production, since the systems were properly sized and dedicated to this. On dev systems with low memory and often running into swap, you'd run into cases with crazy delays... sometimes a second or two for something that would normally be a few milliseconds.
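That warm-up pass amounts to something like this (rough sketch, not the actual system): touch one byte per page so everything is faulted in before serving traffic. On a properly sized box that works; on a small dev box it just fights with swap.

    #include <stddef.h>
    #include <stdint.h>

    static volatile uint8_t sink;   /* volatile so the loads aren't optimized away */

    /* Touch one byte in every 4 kB page of the mapping to fault it all in. */
    static void warm_mapping(const uint8_t *map, size_t len) {
        for (size_t off = 0; off < len; off += 4096)
            sink = map[off];
    }

(madvise with MADV_WILLNEED would be the less hacky way to ask the kernel for the same thing.)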
Isn't there a way around this? When writing to GPU-mapped memory in graphics code, people usually take pains to turn off compiler optimizations that might XOR memory against itself to zero it out, or AND it against 0, and thereby cause a read, and other things like that.
> Even the following C++ code can read from memory and trigger the performance penalty because the code can expand to the following x86 assembly code.
C++ code:
    *((int*)MappedResource.pData) = 0;
x86 assembly code:
    AND DWORD PTR [EAX],0
> Use the appropriate optimization settings and language constructs to help avoid this performance penalty. For example, you can avoid the xor optimization by using a volatile pointer or by optimizing for code speed instead of code size.
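In other words, something like this hypothetical snippet is what the docs are recommending: writing through a volatile pointer makes the compiler emit a plain store instead of a read-modify-write like the AND above.

    /* pData stands in for the pointer into the mapped (write-combined) buffer. */
    void clear_first_word(void *pData) {
        *(volatile int *)pData = 0;   /* plain store, never reads the mapped memory */
    }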
I guess mmapped files may still need a read to know whether to do copy-on-write, whereas the GPU-mapped memory in that case is specifically marked as upload-only and flagged so it gets written out regardless of whether anything changed. But maybe mmap has something similar?
The answer is that it works pretty similarly, but GPUs usually do this in specialized hardware whereas mmap'ing of files for DMA-style access is implemented mostly in software.
You'd need a way to indicate when you start and end overwriting the page. You need to avoid the page being swapped out mid-overwrite and not read back in. You'd also pay a penalty for zeroing it when it gets mapped pre-overwrite. The map primitives are just not meant for this.
I think on Linux there's the madvise syscall with the MADV_REMOVE flag, which you can issue on memory pages you intend to completely overwrite. I have no idea about performance or other practical issues.
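Something like this, I guess (untested sketch): MADV_REMOVE drops the pages and their backing store, so the later overwrite faults in zeroes instead of re-reading old data, but it only works on mappings whose filesystem supports hole punching.

    #include <stddef.h>
    #include <sys/mman.h>

    /* Drop a page-aligned range you're about to rewrite completely, so the
     * rewrite doesn't pull the old contents back in from disk. */
    static int drop_before_overwrite(void *addr, size_t len) {
        return madvise(addr, len, MADV_REMOVE);
    }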
Oracle's JVM allocates your maximum heap size at startup, but these pages aren't actually assigned to either swap space or RAM pages until the first time they're written to (or read, but unless there's a bug in the JVM, it's not reading uninitialized memory), which triggers a page fault.
If the heap usage was high, and drops enough (maybe also needs to stay low for some time period), then Oracle's JVM will release some of the pages back to the OS using madvise, so they go back to using neither RAM nor swap space. On the one hand, the JVM should avoid repeatedly releasing pages back to the OS and then page faulting them back in moments later, but on the other, it shouldn't hold on to pages forever just because it needed them for a short time.
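The reserve-then-release pattern looks roughly like this at the mmap level (sketch, not HotSpot's actual code; sizes made up): the reservation costs nothing until pages are first touched, and MADV_DONTNEED hands idle pages back so they stop using RAM or swap.

    #include <stddef.h>
    #include <sys/mman.h>

    int main(void) {
        size_t max_heap = (size_t)4 << 30;   /* illustrative 4 GiB -Xmx */
        void *mem = mmap(NULL, max_heap, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (mem == MAP_FAILED) return 1;
        char *heap = mem;

        heap[0] = 1;   /* first touch: page fault, this page is now backed by RAM */

        /* Heap usage dropped and stayed low: give a chunk back to the OS. */
        madvise(heap, (size_t)1 << 30, MADV_DONTNEED);
        return 0;
    }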
I think the main problem with mmap'd writes is that they're blocking and synchronous.
I presume most database record writes are smaller than a page. In that case, other methods (unless you're using O_DIRECT, which adds its own difficulties) still have the kernel read a whole page into the page cache before writing the selected bytes. So unless you're using O_DIRECT for your writes, you're still triggering the exact same read-modify-write; it's just that with the file APIs you can use async I/O or select/poll/epoll/kqueue, etc., to keep those unavoidable reads from blocking your writer thread.
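To make that concrete, here's roughly what the O_DIRECT route looks like (sketch with made-up file name and offsets), along with the difficulty it adds: buffer, offset, and length all have to be aligned, so a sub-page record means doing the read-modify-write yourself in userspace.

    #define _GNU_SOURCE   /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("records.db", O_RDWR | O_DIRECT);
        if (fd < 0) return 1;

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) return 1;   /* aligned buffer required */

        pread(fd, buf, 4096, 8192);                 /* read the page yourself            */
        memcpy((char *)buf + 100, "new-bytes", 9);  /* patch the record inside it        */
        pwrite(fd, buf, 4096, 8192);                /* write the whole aligned page back */

        free(buf);
        close(fd);
        return 0;
    }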
Is the trade-off in Varnish worth it? Workloads for a cache should be pretty read-heavy; writes should be infrequent unless it's being filled for the first time.
For the general case? No. Varnish retained its mmap backend so you can still choose to use it if you have a load that requires it.
For the general case, where writes are somewhat frequent and the dataset is pretty small, the malloc backend was a lot more performant, once we figured out that the default implementation of malloc in glibc was pretty shit (wrt overhead).