But how, exactly, do databases use mmap?

perbu · on Jan 23, 2021

The author notices that Bolt doesn't use mmap for writes. The reason is surprisingly simple, once you know how it works. Say you want to overwrite a page at some locations that isn't present in memory. You'd write to it and you'd think that is that. But when this happens the CPU triggers a page fault, the OS steps in and reads the underlying page into memory. It then relinquishes control back to the application. The application then continues to overwrite that page.

So for each write that isn't mapped into memory you'll trigger a read. Bad.

Early versions of Varnish Cache struggled with this and this was the reason they made a malloc-based backend instead. mmaps are great for reads, but you really shouldn't write through them.

cperciva · on Jan 23, 2021

There's an even better reason for databases to not write to memory mapped pages: Pages get synched out to disk at the kernel's leisure. This can be ok for a cache but it's definitely not what you want for a database!

eqvinox · on Jan 23, 2021

That's what msync() is for.

cperciva · on Jan 23, 2021

If you're tracking what needs to be flushed to disk when, you might as well just be making explicit pwrite syscalls.

monocasa · on Jan 23, 2021

Right, but it can sync arbitrary ranges sooner, which is also awful for consistency.

reader_mode · on Jan 23, 2021

Shouldn't your write strategy be resilient to that kind of stuff (eg. shutdown during a partial update) ?

gmueckl · on Jan 23, 2021

Don't you need exact guarantees on write ordering to achieve that?

jorangreef · on Jan 23, 2021

Yes, for almost all databases, although there was a cool paper from the University of Wisconsin Madison a few years ago that showed how to design something that could work without write barriers, and under the assumption that disks don't always fsync correctly:

"the No-Order File System (NoFS), a simple, lightweight file system that employs a novel technique called backpointer based consistency to provide crash consistency without ordering writes as they go to disk"

http://pages.cs.wisc.edu/~vijayc/nofs.htm

vlovich123 · on Jan 23, 2021

Does that generalize to databases? My understanding is that file systems are a restricted case of databases that don’t necessarily support all operations (eg transactions are smaller, can’t do arbitrary queries within a transaction, etc etc).

jorangreef · on Jan 24, 2021

I think it would, because if your underlying file system can provide crash consistency then that can be leveraged as a primitive by the database.

bonzini · on Jan 23, 2021

You can do write/sync/write/sync in order to achieve that. It would be nicer to have FUA support in system calls (or you can open the same file to two descriptors, one with O_SYNC and one without).

dooglius · on Jan 23, 2021

I think you mean mlock

lathiat · on Jan 24, 2021

It won't trigger a read if you write a full 4kB page in both size and alignment but does for anything else. And then as you said, it stalls your write.

I debugged a Ceph issue being caused by this kind of behaviour except it wasn't actually mmap it was pwritev() but the call would stall and take a long time to return in that call that was expected to be async and go into the page cache.

It was caused by "unaligned" writes, e.g. non 4kB-sized-and-aligned writes which happens from Windows guests (using 512b alignment, at least by default) but not Linux guests (which use 4kB alignment, even if the disk is 4kB aligned)

It was made worse by Ceph behaviour that would write to that page, incur the penalty, immediately use madvise to tell the kernel I DONTNEED that anymore as an optimisation (because its a replica that is write-only, I don't need to read it again!), it got dropped, and a millisecond later wrote to it again and had to read it again (because that optimisation did not consider this case of unaligned writes needing to read)

Full story here: https://www.youtube.com/watch?v=_vfGcsvnn6U "In-depth technical story: Fixing I/O performance for Windows guests in OpenStack Ceph clouds"

It was at this point I understood memory alignment and why it might matter at a CPU level.

icedchai · on Jan 23, 2021

Yes, this can definitely be a problem. I worked on a transaction processing system that was entirely based on a in-house memory mapped database. All reads and writes went through mmap. At startup, it read through all X gigabytes of data to "ensure" everything was hot, in memory, and also built the in memory indexes.

This actually worked fine in production, since the systems were properly sized and dedicated to this. On dev systems with low memory and often running into swap, you'd run into cases with crazy delays... sometimes a second or two for something that would normally be a few milliseconds.

cma · on Jan 23, 2021

Isn't there a way around this? When coding for graphics stuff writing to GPU mapped memory people usually take pains to turn off compiler optimizations that might XOR memory against itself to zero it out or AND it against 0 and cause a read, and other things like that.

https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-...

> Even the following C++ code can read from memory and trigger the performance penalty because the code can expand to the following x86 assembly code. C++ code:

    Copy *((int*)MappedResource.pData) = 0;

x86 assembly code:

    Copy AND DWORD PTR [EAX],0

> Use the appropriate optimization settings and language constructs to help avoid this performance penalty. For example, you can avoid the xor optimization by using a volatile pointer or by optimizing for code speed instead of code size.

I guess mmapped files still may need a read to know whether to do copy on write, where mapped memory for the CPU in that case is specifically marked for upload only and gets something flagged that writes it regardless of if there is a change, but mmap maybe has something similar?

(edit: this seems to say nothing similar is possible with mmap on x86 https://stackoverflow.com/questions/31014515/write-only-mapp...

but how does it work for GPUs? Something to do with fixed pci-e support on the cpu (base address register https://en.wikipedia.org/wiki/PCI_configuration_space)?

alaties · on Jan 23, 2021

The answer is that it works pretty similarly, but GPUs usually do this in specialized hardware whereas mmap'ing of files for DMA-style access is implemented mostly in software.

https://insujang.github.io/2017-04-27/gpu-architecture-overv... has a pretty good visual of what's doing what for GPU DMA. You can imagine much of what happens here is almost pure software for mmap'd files.

remram · on Jan 23, 2021

You'd need a way to indicate when you start and end overwriting the page. You need to avoid the page being swapped out mid-overwrite and not read back in. You'd also pay a penalty for zeroing it when it gets mapped pre-overwrite. The map primitives are just not meant for this.

rini17 · on Jan 23, 2021

I think on Linux there's madvise syscall with "remove" flag, which you can issue on memory pages you intend to completely overwrite. I have no idea on performance or other practical issues.

KMag · on Jan 24, 2021

Oracle's JVM allocates your maximum heap size at startup, but these pages aren't actually assigned to either swap space or RAM pages until the first time they're written to (or read, but unless there's a bug in the JVM, it's not reading uninitialized memory), which triggers a page fault.

If the heap usage was high, and drops enough (maybe also needs to stay low for some time period), then Oracle's JVM will release some of the pages back to the OS using madvise, so they go back to using neither RAM nor swap space. On the one hand, the JVM should avoid repeatedly releasing pages back to the OS and then page faulting them back in moments later, but on the other, it shouldn't hold on to pages forever just because it needed them for a short time.

rini17 · on Jan 24, 2021

This has no relevance to parent's issue: how to avoid reads that cause expensive page faults when writing to a file-backed page.

KMag · on Jan 25, 2021

Yea, I misread "issues" as "uses". Sorry.

monocasa · on Jan 23, 2021

As other have said, you need hardware support to do this similarly to how GPUs do it.

That being said, that hardware support exists with NVDIMMs.

ww520 · on Jan 23, 2021

I believe GPU solves this by having read only and write only buffers in the rendering pipeline.

KMag · on Jan 23, 2021

I think the main problem with mmap'd writes is that they're blocking and synchronous.

I presume most database record writes are smaller than a page. In that case, other methods (unless you're using O_DIRECT, which ads its own difficulties) still have the kernel read a whole page of memory into the page cache before writing the selected bytes. So, unless you're using O_DIRECT for your writes, you're still triggering the exact same read-modify-write, it's just that with the file APIs you can use async I/O or use select/poll/epoll/kqueue, etc. to avoid these necessary reads from blocking your writer thread.

tayo42 · on Jan 23, 2021

Is the trade off in varnish worth it? Workloads for a cache should be pretty read heavy, writes should be infrequent unless it's being filled for the first time

perbu · on Jan 24, 2021

For the general case? No. Varnish retained it's mmap backend so you could still chose to use it if you have a load that required it.

For the general case, where writes are somewhat frequent and the dataset is pretty small the malloc backend was a lot more performant, once we figured out the the default implementation of malloc in Glibc was pretty shit (wrt overhead).

bonzini · on Jan 23, 2021

The right answer is that they shouldn't. A database has much more information than the operating system about what, how and when to cache information. Therefore the database should handle its own I/O caching using O_DIRECT on Linux or the equivalent on Windows or other Unixes.

The article at https://www.scylladb.com/2017/10/05/io-access-methods-scylla... is a bit old (2017) but it explains the trade-offs

masklinn · on Jan 23, 2021

> Therefore the database should handle its own I/O caching using O_DIRECT on Linux or the equivalent on Windows or other Unixes.

That's not wrong, but at the same time it adds complexity and requires effort which can't be spent elsewhere unless you've got someone who really only wants to DIO and wouldn't work on anything else anyway.

Postgres has never used DIO, and while there have been rumbling about moving to DIO (especially following the fsync mess) as Andres Freund noted:

> efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to do be able to do the same on windows (solaris, hpux, aix, ...) etc.

jandrewrogers · on Jan 23, 2021

PostgreSQL has two main challenges with direct I/O. The basic one is that it adversely impacts portability, as mentioned, and is complicated in implementation because file system behavior under direct I/O is not always consistent.

The bigger challenge is that PostgreSQL is not architected like a database engine designed to use direct I/O effectively. Adding even the most rudimentary support will be a massive code change and implementation effort, and the end result won't be comparable to what you would expect from a modern database kernel designed to use direct I/O. This raises questions about return on investment.

jorangreef · on Jan 23, 2021

I have found that planning for DIO from the start makes for a better, simpler design when designing storage systems, because it keeps the focus on logical/physical sector alignment, latent sector error handling, and caching from the beginning. And even better to design data layouts to work with block devices.

Retrofitting DIO onto a non-DIO design and doing this cross-platform is going to be more work, but I don't think that's the fault of DIO (when you're already building a database that is).

antpls · on Jan 24, 2021

Is there a known library with a cross platform abstraction that could help?

jorangreef · on Jan 24, 2021

I wrote this for Node.js, which is a native binding in C, exposing cross platform functionality: https://github.com/ronomon/direct-io

Although if it's a new project and you're used to C, I would recommend also taking a good look at Zig (https://ziglang.org/), because it's so explicit about alignment compared to C, and makes alignment a first-class part of the type system, see this other comment of mine that goes into more detail: https://news.ycombinator.com/item?id=25801542

Something that will also help, is setting your minimum IO unit to 4096 bytes, the Advanced Format sector size, because then your Direct IO system will just work, regardless of whether sysadmins swap disks of different sector sizes from underneath you. For example, a minimum sector size of 4096 bytes will work not only for newer AF disks but also for any 512 byte sector disks.

Lastly, Direct IO is actually more a property of the file system, not necessarily the OS (e.g. Linux), so you will find some file systems on Linux that return EINVAL when you try to open a file descriptor with O_DIRECT, simply because they don't support O_DIRECT (e.g. a macOS volume accessed from within a Linux VM) so that should be your way of testing for support, not only the OS.

jorangreef · on Jan 23, 2021

Yes, and it's not only about performance, but also safety because O_DIRECT is the only safe way to recover from the journal after fsync failure (when the page cache can no longer be trusted by the database to be coherent with the disk): https://www.usenix.org/system/files/atc20-rebello.pdf

From a safety perspective, O_DIRECT is now table stakes. There's simply no control over the granularity of read/write EIO errors when your syscalls only touch memory and where you have no visibility into background flush errors.

formerly_proven · on Jan 23, 2021

Around four years ago I was working on a transactional data store and ran into these issues that virtually no one tells you how durable I/O is supposed to work. There were very few articles on the internet that went beyond some of the basic stuff (e.g. create file => fsync directory) and perhaps one article explaining what needs to be considered when using sync_file_range. Docs and POSIX were useless. I noticed that there seemed to be inherent problems with I/O error handling when using the page cache, i.e. whenever something that wasn't the app itself caused write I/O you really didn't know any more if all the data got there.

Some two years later fsyncgate happened and since then I/O error handling on Linux has finally gotten at least some attention and people seemed to have woken up to the fact that this is a genuinely hard thing to do.

jorangreef · on Jan 24, 2021

What was the data store you were working on? Is it open source?

My experience was the same as you.

What helped me was discovering all the fantastic storage and file system papers coming out of the University of Wisconsin Madison, supervised by Remzi and Andrea Arpaci-Dusseau.

Their teams have studied and documented almost all aspects of what is required to write reliable storage systems, even diving into interactions between local storage failures and global consensus protocols, how a single disk block failure can destroy Raft and Zookeeper. Most safety testing of these systems tends to focus on the network fault model. I think in a few years time we'll all look back and see how today we had almost no concept of a storage fault model. It's kind of exciting to think that there's going to be a new breed of replicated databases that are far more reliable than today's systems. On the another hand, perhaps the future is already here, just not very evenly distributed.

http://pages.cs.wisc.edu/~remzi/

Their OSTEP book (Operating Systems in Three Easy Pieces) is also a great fun read: http://pages.cs.wisc.edu/~remzi/OSTEP/

jlokier · on Jan 24, 2021

> From a safety perspective, O_DIRECT is now table stakes

Except for the awkward problem where O_DIRECT writes don't send a write-barrier to the drives, so the written data can still disappear.

jorangreef · on Jan 24, 2021

That's a common misunderstanding of the purpose of O_DIRECT.

For a write barrier, you would still use fsync() or O_DSYNC, along with O_DIRECT.

The man page for open(2) is clear on this: https://man7.org/linux/man-pages/man2/open.2.html

jlokier · on Jan 25, 2021

I guess I know, as this is what I found when I googled just now :-) https://linux-scsi.vger.kernel.narkive.com/yNnBRBPn/o-direct...

I was trying to address this aspect of the parent comment:

> O_DIRECT is the only safe way to recover from the journal after fsync failure (when the page cache can no longer be trusted by the database to be coherent with the disk)

O_DIRECT is not a safe way to recover from the journal if you have decided you cannot trust fsync to do its job, because you need fsync to make O_DIRECT write-cache durable.

(By the way, O_SYNC/O_DSYNC are equivalent to calling fsync/fdatasync after each write, therefore subject to some of the same issues.)

But even in normal situations with fsync working fine, it is not clear if you can rely on fsync to do a drive write-cache flush when there isn't any metadata or page cache data for the file because you've only been using O_DIRECT.

Neither open(2) or fsync(2) man pages address this durability issue. You can't use O_DSYNC or O_SYNC for good throughout with O_DIRECT because your database does not want the overhead of a write-cache flush on every write. You only want it for barriers. And you can't rely on fdatasync because there's no data to flush in the page cache, no block I/O to do, so fdatasync could meet expectations by doing nothing.

My solution in the past has been to toggle the LSB in st_mtime before async just to force a drive write-cache flush when I'm not sure that anything else will force one. It's not pretty.

jorangreef · on Jan 26, 2021

> O_DIRECT is not a safe way to recover from the journal if you have decided you cannot trust fsync to do its job, because you need fsync to make O_DIRECT write-cache durable.

I was specifically referring not to an fsync in your sense (where the disk or fs does not respect fsync at all so that fsync is a no-op, or where the fs has a bug with O_DIRECT not flushing if it sees nothing dirty in the page cache - by the way I think this is no longer an issue, otherwise it's a kernel bug you can report)

...but to handling an fsync error in the context of the paper from WISC that I linked to in that parent comment, where the kernel's page cache has gone out of sync with the disk after an fsync EIO error ("Fsyncgate"):

"when the page cache can no longer be trusted by the database to be coherent with the disk: https://www.usenix.org/system/files/atc20-rebello.pdf"

The details are all in the paper. Sure, some disks may not respect fsync, but O_DIRECT is still the only way to safely read and recover from the journal when the kernel's page cache is out of sync with the disk (again, details in the paper). It's another fantastic paper out of WISC.

natmaka · on Jan 23, 2021

> A database has much more information than the operating system about what, how and when to cache information

Yes, on a dedicated server. However many DB engines instances run on non-dedicated servers, for example along a web server flanked with various processes sometimes reading the local filesystem or using RAM (Varnish, memcached...), and often-run tasks (tempfiles purge, log aggregation, monitoring probes, MTA...). In such a case letting the DB engine use too much RAM, and therefore reducing its global efficiency while limiting buffercache size, may (all other things being equal) imply more 'read' operations, reducing overall performance.

sradman · on Jan 23, 2021

Great point. Selecting the RDBMS page cache size is a key performance parameter that is near impossible to get right on a mixed-use host, both non-dedicated servers and client desktop/laptop. SQL Anywhere, which emphasizes zero-admin, has long supported Dynamic Cache Sizing [1] specifically for this mixed-use case which is/was its bread-and-butter. I don't know if any other RDBMSes do the same (MS SQL?).

As a side note, Apache Arrow's main use case is similar, a column oriented data store shared by one-or-more client processes (Python, R, Julia, Matlab, etc.) on the same general purpose host. This is also now a key distinction between the Apple M1 and its big.LITTLE ARM SoC vs. Amazon Graviton built for server-side virtualized/containerized instances. We should not conflate the two use-cases and understand that the best solution for one use case may not be the best for the other.

[1] http://dcx.sybase.com/1200/en/dbusage/perform-bridgehead-405...

natmaka · on Jan 24, 2021

Enabling some coopetition among caches by periodically adjusting their sizes, giving more RAM to the most efficient one (higher hit/miss ratio) while shrinking the least efficient one, may let the whole system 'walk' toward the soft point where all caches stay nearby their peak efficiency, or at least alleviate avoidable thrashing.

A first model may lay on memory ballooning and a PID loop. I could not find any pertinent software.

> We should not conflate the two use-cases

Exactly the point! Most hints/principles aren't generic but are stated with insufficient description of their pertinent contexts, especially on online boards and chats channels. It impacts most 'best practices' and code snippets.

sradman · on Jan 23, 2021

O_DIRECT prevents file double buffering by the OS and DBMS page cache. MMAP removes the need for the DBMS page cache and relies on the OS’s paging algorithm. The gain is zero memory copy and the ability for multiple processes to access the same data efficiently.

Apache Arrow takes advantage of mmap to share data across different language processes and enables fast startup for short lived processes that re-access the same OS cached data.

geofft · on Jan 23, 2021

Yes, but the claim is that the buffer you should remove is the OS's one, not the DBMS's one, because for the DBMS use case (one very large file with deep internal structure, generally accessed by one long-running process), the DBMS has information the OS doesn't.

Arrow is a different use case, for which mmap makes sense. For something like a short-lived process that stores config or caches in SQLite, it probably is actually closer to Arrow than to (e.g.) Postgres, so mmap likely also makes sense for that. (Conversely, if you're not relying on Arrow's sharing properties and you have a big Python notebook that's doing some math on an extremely large data file on disk in a single process, you might actually get better results from O_DIRECT than mmap.)

In particular, "zero memory copy" only applies if you are accessing the same data from multiple processes (either at once or sequentially). If you have a single long-running database server, you have to copy the data from disk to RAM anyway. O_DIRECT means there's one copy, from disk to a userspace buffer; mmap means there's one copy, from disk to a kernel buffer. If you can arrange for a long-lived userspace buffer, there's no performance advantage to using the kernel buffer.

sradman · on Jan 23, 2021

> but the claim is that the buffer you should remove is the OS's one

I was not trying to minimize O_DIRECT, I was trying to emphasize the key advantage succinctly and also explain the Apache Arrow use case of mmap which the article does not discuss.

jnwatson · on Jan 23, 2021

In theory that's true. In practice, utilizing the highly-optimized already-in-kernel-mode page cache can produce tremendous performance. LMDB, for example, is screaming fast, and doesn't use DIO.

jandrewrogers · on Jan 23, 2021

What is your metric for "screaming fast"? A user-mode cache with direct I/O can outperform any kernel-mode design several-fold. That's empirical fact, and the reason all high-performance database engines do it. I've designed systems both ways and it isn't particularly close; the technical reasons why are well-understood. Typical direct I/O designs enable macro-optimizations that are either not practical or not possible with mmap().

The main advantage of mmap() is portability.

quotemstr · on Jan 23, 2021

Yep. Every mature, high performing, non-embedded database evolves towards getting the underlying operating system out of the way as much as possible.

api · on Jan 23, 2021

You can also mount a file system in synchronous mode on most OSes, which may make sense for a DB storage volume (but not other parts of the system).

nullsense · on Jan 23, 2021

I think of the major database vendors only postgres uses mmap and everyone else does their own I/O caching management.

the8472 · on Jan 23, 2021

There was a patch set (introducing the RWF_UNCACHED flag) to get buffered IO with most of the benefits of O_DIRECT and without its drawbacks, but it looks like it hasn't landed.

There also are new options to give the kernel better page cache hints via the new MADV_COLD or MADV_PAGEOUT flags. These ones did land.

shoo · on Jan 23, 2021

See also: sublime HQ blog about complexities of shipping a desktop application using mmap [1] and corresponding 200+ comment HN thread [2]:

> When we implemented the git portion of Sublime Merge, we chose to use mmap for reading git object files. This turned out to be considerably more difficult than we had first thought. Using mmap in desktop applications has some serious caveats [...]

> you can rewrite your code to not use memory mapping. Instead of passing around a long lived pointer into a memory mapped file all around the codebase, you can use functions such as pread to copy only the portions of the file that you require into memory. This is less elegant initially than using mmap, but it avoids all the problems you're otherwise going to have.

> Through some quick benchmarks for the way Sublime Merge reads git object files, pread was around ⅔ as fast as mmap on linux. In hindsight it's difficult to justify using mmap over pread, but now the beast has been tamed and there's little reason to change any more.

[1] https://www.sublimetext.com/blog/articles/use-mmap-with-care [2] https://news.ycombinator.com/item?id=19805675

PaulHoule · on Jan 23, 2021

I like mmap and I don't.

It is incompatible with non-blocking I/O since your process will be stopped if it tries to access part of the file that is not mapped -- this isnt a syscall blocking (which you might work around) but rather any attempt to access mapped memory.

I like mmap for tasks like seeking into ZIP files, where you can look at the back 1% of the file, then locate and extract one of the subfiles; the trouble there is that the really fun case is to do this over the network with http (say to solve Python dependencies, to extract the metadata from wheel files) in which case this method doesnt work.

Sesse__ · on Jan 23, 2021

mmap is great for rapid prototyping. For anything I/O-heavy, it's a mess. You have zero control over how large your I/Os are (you're very much at the mercy of heuristics that are optimized for loading executables), readahead is spotty at best (practical madvise implementation is a mess), async I/O doesn't exist, you can't interleave compression in the page cache, there's no way of handling errors (I/O error = SIGBUS/SIGSEGV), and write ordering is largely inaccessible. Also, you get issues such as page table overhead for very large files, and address space limitations for 32-bit systems.

In short, it's a solution that looks so enticing at first, but rapidly costs much more than it's worth. As systems grow more complex, they almost inevitably have to throw out mmap.

amelius · on Jan 23, 2021

> It is incompatible with non-blocking I/O since your process will be stopped if it tries to access part of the file that is not mapped

Yeah, but the same problem occurs in normal memory when the OS has swapped out the page.

So perhaps non-blocking I/O (and cooperative multitasking) is the problem here.

loeg · on Jan 23, 2021

> Yeah, but the same problem occurs in normal memory when the OS has swapped out the page.

I'd argue that swapping is an orthogonal problem which can be solved in a number of ways: disable swap at the OS level, mlock() in the application, maybe others.

mmap is really a bad API for IO — it hides synchronous IO and doesn't produce useful error statuses at access.

> So perhaps non-blocking I/O (and cooperative multitasking) is the problem here.

I'm not sure how non-blocking IO is "the problem." It's something Windows has had forever, and unix-y platforms have wanted for quite a long time. (Long history of poll, epoll, kqueue, aio, and now io_uring.)

amelius · on Jan 23, 2021

> it hides synchronous IO and doesn't produce useful error statuses at access.

You can trap IO errors if necessary. E.g. you can raise signals just like segfaults generate signals.

> I'm not sure how non-blocking IO is "the problem."

The point is that non-blocking IO wants to abstract away the hardware, but the abstraction is leaky. Most programs which use non-blocking IO actualy want to implement multitasking without relying threads. But that turns out to be the wrong approach.

loeg · on Jan 23, 2021

> The point is that non-blocking IO wants to abstract away the hardware, but the abstraction is leaky.

Why do you say it doesn't match hardware? Basically all hardware is asynchronous — submit a request, get a completion interrupt, completion context has some success or failure status. Non-blocking IO is fundamentally a good fit for hardware. It's blocking IO that is a poor abstraction for hardware.

> Most programs which use non-blocking IO actualy want to implement multitasking without relying threads. But that turns out to be the wrong approach.

Why is that the wrong approach? Approximately every high-performance httpd for the last decade or two has used a multitasking, non-blocking network IO model rather than thread-per-request. The overhead of threads is just very high. They would like to use the same model for non-network IO, but Unix and unix-alikes have historically not exposed non-blocking disk IO to applications. io_uring is a step towards a unified non-blocking IO interface for applications, and also very similar to how the operating system interacts with most high-performance devices (i.e., a bunch of queues).

amelius · on Jan 23, 2021

> Why do you say it doesn't match hardware?

Because the CPU itself can block. In this case on memory access. Most (all?) async software assumes the CPU can't block. A modern CPU has a pipelining mechanism, where parts can simply block, waiting for e.g. memory to return. If you want to handle this all nicely, you have to respect the api of this process which happens to go through the OS. So for example, while waiting for your memory page to be loaded, the OS can run another thread (which it can't in the async case because there isn't any other thread).

loeg · on Jan 24, 2021

A CPU stall on L3 miss (100ns?) is orders of magnitude shorter than the kinds of blocking IO we don't want to wait on (10s-100s of µs even for empty-queue NVMe; slower for everything else).

The OS can't run another thread while fulfilling an mmap page fault because it has to actually do the IO to fill the page while taking that trap. And in the async scenario, CPUs and high speed devices can do clever things like snoop DMAs directly into L3 cache, avoiding your L3 miss scenario as well.

The comparison between L3 miss and mmap faults is apples and oranges.

codetrotter · on Jan 23, 2021

> the trouble there is that the really fun case is to do this over the network with http (say to solve Python dependencies, to extract the metadata from wheel files) in which case this method doesnt work

If the web server can tell you the total size of the file by responding to a HEAD request, and it support range requests then it will be possible.

https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requ...

Or am I missing something?

remram · on Jan 23, 2021

You can't do this with mmap though, you can't instruct the OS to grab pages via HTTP range requests.

Sesse__ · on Jan 24, 2021

With userfaultfd(), you can. Not necessarily a good idea, though...

kccqzy · on Jan 23, 2021

Write a fuse layer.

gpderetta · on Jan 24, 2021

Or a signal handler (but yes, it is overkill).

johndough · on Jan 23, 2021

You are correct, this works. There even is a file system built around this idea: https://github.com/fangfufu/httpdirfs

quotemstr · on Jan 23, 2021

You use mmap whether you want to or not: the system executes your program by mmaping your executable and jumping into it! You can always take a hard fault at any time because the kernel is allowed to evict your code pages on demand even if you studiously avoid mmap for your data files. And it can do this eviction even if you have swap turned off.

If you want to guarantee that your program doesn't block, you need to use mlockall.

loeg · on Jan 23, 2021

You're not wrong. Applications and libraries that want to be non-blocking should mlock their pages and avoid mmap for further data access. ntpd does this, for example.

After application startup, you can avoid additional mmap.

geofft · on Jan 23, 2021

This is technically true, but the use case we're talking about is programs that are much smaller than their data. Postgres, for instance, is under 50 MB, but is often used to handles databases in the gigabytes or terabytes range. You can mlockall() the binary if you want, but you probably can't actually fit the entire database into RAM even if you wanted to.

Also, when processing a large data file (say you're walking a B-tree or even just doing a search on an unindexed field), the code you're running tends to be a small loop, within the same few pages, so it might not even leave the CPU's cache, let alone get swapped out of RAM, but you need to access a very large amount of data, so it's much more likely the data you want could be swapped out. If you know some things about the data structure (e.g., there's an index or lookup table somewhere you care about, but you're traversing each node once), you can use that to optimize which things are flushed from your cache and which aren't.

quotemstr · on Jan 23, 2021

Indeed. It's a question of scale: I write programs that can't afford to get blocked behind IO, ever, and that level, I need to pay attention to things like code paging, and even more esoteric things like synchronous reclaim.

If you're just optimizing stuff generally instead of trying to guarantee invariants, sure, ignore code paging and use direct IO for your own data.

jorangreef · on Jan 23, 2021

But that's a different order of magnitude problem: control plane vs data plane.

At some point, we could also say that the line fill buffer blocks our programs (more often than we realize).

All of this is accurate, but at different scales.

PaulHoule · on Jan 23, 2021

Also many systems in 2021 have a lot of RAM and hardly ever swap.

rapsey · on Jan 23, 2021

Process will be stopped or thread?

ithkuil · on Jan 23, 2021

Thread

waynesonfire · on Jan 23, 2021

Thanks for diving into this DB! I find it interesting that many databases share such similar architectural principles. NIH. It's super fun to build a database so why not.

Also, don't beat yourself over how deep you'll be diving into the design. Why apologize for this? Those that want a deep expository would quickly move on.

amelius · on Jan 23, 2021

This is one area where Rust, a modern systems language, has disappointed me. You can't allocate data structures inside mmap'ed areas, and expect them to work when you load them again (i.e., the mmap'ed area's base address might have changed). I hope that future languages take this usecase into account.

simias · on Jan 23, 2021

I'm not sure I see the issue. This approach (putting raw binary data into files) is filled with footguns. What if you add, remove or reorder fields? What if your file was externally modified and now doesn't match the expected layout? What if the data contains things like file descriptors or pointers that can't meaningfully be mapped that way? Even changing the compilation flags can produce binary incompatibilities.

I'm not saying that it's not sometimes very useful but it's tricky and low level enough that some unsafe low level plumbing is, I think, warranted. You have to know what you're doing if you decide to go down that route, otherwise you're much better off using something like Serde to explicitly handle serialization. There's some overhead of course, but 99% of the time it's the right thing to do.

geofft · on Jan 23, 2021

I had a use case recently for serializing C data structures in Rust (i.e., being compatible with an existing protocol defined as "compile this C header, and send the structs down a UNIX socket"), and I was a little surprised that the straightforward way to do it is to unsafely cast a #[repr(C)] structure to a byte-slice, and there isn't a Serde serializer for C layouts. (Which would even let you serialize C layouts for a different platform!)

I think you could also do something Serde-ish that handles the original use case where you can derive something on a structure as long as it contains only plain data types (no pointers) or nested such structures. Then it would be safe to "serialize" and "deserialize" the structure by just translating it into memory (via either mmap or direct reads/writes), without going through a copy step.

The other complication here is multiple readers - you might want your accessor functions to be atomic operations, and you might want to figure out some way for multiple processes accessing the same file to coordinate ordering updates.

I kind of wonder what Rust's capnproto and Arrow bindings do, now....

burntsushi · on Jan 23, 2021

It's likely that the "safe transmute" working group[1] will help facilitate this sort of thing. They have an RFC[2]. See also the bytemuck[3] and zerocopy[4] crates which predate the RFC, where at least the latter has 'derive' functionality.

[1] - https://github.com/rust-lang/project-safe-transmute

[2] - https://github.com/jswrenn/project-safe-transmute/blob/rfc/r...

[3] - https://docs.rs/bytemuck/1.5.0/bytemuck/

[4] - https://docs.rs/zerocopy/0.3.0/zerocopy/index.html

amelius · on Jan 23, 2021

The footguns can be solved in part by the type-system (preventing certain types from being stored), and (if necessary) by cooperation with the OS (e.g. to guarantee that a file is not modified between runs).

How else would you lazy-load a database of (say) 32GB into memory, almost instantly?

And why require everybody to write serialization code when just allocating the data inside a mmap'ed file is so much easier? We should be focusing on new problems rather than reinventing the wheel all the time. Persistence has been an issue in computing since the start, and it's about time we put it behind us.

simias · on Jan 23, 2021

>How else would you lazy-load a database of (say) 32GB into memory, almost instantly?

By using an existing database engine that will do it for me. If you need to deal with that amount of data and performance is really important you have a lot more to worry about than having to use unsafe blocks to map your data structures.

Maybe we just have different experiences and work on different types of projects but I feel like being able to seamlessly dump and restore binary data transparently is both very difficult to implement reliably and quite niche.

Note in particular that machine representation is not necessarily the most optimal way to store data. For instance any kind of Vec or String in rust will use 3 usize to store length, capacity and the data pointer which on 64 bit architectures is 24 bytes. If you store many small strings and vectors it adds up to a huge amount of waste. Enum variants are also 64 bits on 64 bit architectures if I recall correctly.

For instance I use bincode with serde to serialize data between instances of my application, bincode maps almost 1:1 the objects with their binary representation. I noticed that by implementing a trivial RLE encoding scheme on top of bincode for running zeroes I can divide the average message size by a factor 2 to 3. And bincode only encodes length, not capacity.

My point being that I'm not sure that 32GB of memory-mapped data would necessarily load faster than <16GB of lightly serialized data. Of course in some cases it might, but that's sort of my point, you really need to know what you're doing if you decide to do this.

burntsushi · on Jan 23, 2021

> How else would you lazy-load a database of (say) 32GB into memory, almost instantly?

That's what the fst crate[1] does. It's likely working at a lower level of abstraction than you intend. But the point is that it works, is portable and doesn't require any cooperation from the OS other than the ability to memory map files. My imdb-rename tool[2] uses this technique to build an on-disk database for instantaneous searching. And then there is the regex-automata crate[3] that permits deserializing a regex instantaneously from any kind of slice of bytes.[4]

I think you should maybe provide some examples of what you're suggesting to make it more concrete.

[1] - https://crates.io/crates/fst

[2] - https://github.com/BurntSushi/imdb-rename

[3] - https://crates.io/crates/regex-automata

[4] - https://docs.rs/regex-automata/0.1.9/regex_automata/#example...

quotemstr · on Jan 23, 2021

You can't do that in C++ or any language. You need to do your own relocations and remember enough information to do them. You can't count on any particular virtual address being available on a modern system, not if you want to take advantage of ASLR.

The trouble is that we have to mark relocated pages dirty because the kernel isn't smart enough to understand that it can demand fault and relocate on its own. Well, either that, or do the relocation anew on each access.

secondcoming · on Jan 23, 2021

It works with C++ if you use boost::interprocess. Its data structures use offset_ptr internally rather than assuming every pointer is on the heap.

amelius · on Jan 23, 2021

That introduces different data-types, rather than using the existing ones (instantiated with different pointer-types).

secondcoming · on Jan 23, 2021

Indeed. I don't know if there's a plan for the standard type to move to offset-ptr, or if there's even a std::offset_ptr, but it would be great if there was.

For us, some of the 'different data type' pain was alleviated with transparent comparators. YMMV.

Edit: It seems C++11 has added some form of support for it... 'fancy pointers'

https://en.cppreference.com/w/cpp/named_req/Allocator#Fancy_...

quotemstr · on Jan 23, 2021

Sure. But that counts as "doing your own relocations". Unsafe Rust could do the same, yes?

secondcoming · on Jan 23, 2021

I don't know enough about Rust to say. If it doesn't have the concept of a 'fancy pointer' then I assume no, you'd have to essentially reproduce what boost::interprocess does.

ekimekim · on Jan 24, 2021

I'm still learning Rust, but iiuc you could do this by creating an OffsetPtr type that implements the Deref trait (https://doc.rust-lang.org/std/ops/trait.Deref.html). This is exactly a "fancy pointer" as you describe.

whimsicalism · on Jan 23, 2021

What is being relocated?

ithkuil · on Jan 23, 2021

If you use offsets instead of pointers you're doing relocations "on the fly"

whimsicalism · on Jan 23, 2021

I don't see what the issue in doing this is in C++.

The only thing that'll break will be the pointers and references to things outside of the mmap'd area.

simias · on Jan 23, 2021

By that logic you can do it in unsafe Rust as well then. Obviously in safe Rust having potentially dangling "pointers and references to things outside of the mmap'd area" is a big no-no.

And note that even intra-area pointers would have to be offset if the base address changes. Unless you go through the trouble of only storing relative offsets to begin with, but the performance overhead might be significant.

Hello71 · on Jan 23, 2021

libsigsegv (slow) or userfaultfd (less slow) can be used for this purpose.

turminal · on Jan 23, 2021

This is impossible without significant performance impact. No language can change that.

Edit: except theoretically for data structures that have certain characteristics known in advance

amelius · on Jan 23, 2021

Well, one approach is to parameterize your data-types such that they are fast in the usual case, but become perhaps slightly slower (but still on par with hand-written code) in the more versatile case.

gpderetta · on Jan 24, 2021

Boost.interproces does exactly that for the STL.

the8472 · on Jan 23, 2021

Work on custom allocators is underway, some of the std data structures already support them on nightly.

https://github.com/rust-lang/wg-allocators/issues/7

comonoid · on Jan 23, 2021

Yes, you can.

You cannot with standard data structures, but you can with your custom ones.

That's all about trade-offs, anyway, there is no magic bullet.

remram · on Jan 23, 2021

What about Rust makes this more difficult than doing the same thing in C++?

jnwatson · on Jan 23, 2021

There's no placement new in Rust? That's disappointing.

steveklabnik · on Jan 23, 2021

Not in stable yet, no. It’s desired, but has taken a while to design, as there have been higher priority things for a while. We’ll get there!

boxfire · on Jan 23, 2021

Very strange to see few to no references to io_uring here. I guess it's still too new. As I've seen many times before so much complexity is replicated in userspace to reproduce kernel behavior out of mmap or DIO/AIO, in order to break the latency, caching, and prioritization into a micromanaged state tuned for a narrow set of applications... Then applied to database code used in a myriad of applications which violate those assumptions and have their own needs. io_uring can't take over fast enough.

jandrewrogers · on Jan 23, 2021

Your assumption is correct, io_uring is too new, it isn't available in most LTS kernels. Give it a few years.

Also, if you already have a competent io_submit/O_DIRECT implementation then there are few material performance benefits to io_uring for databases. It mostly just cleans up the API. This has value from a code design/maintenance standpoint, particularly since io_submit is lacking in the documentation department, but the lack of kernel support in most environments makes it a poor tradeoff at this time.

jlokier · on Jan 24, 2021

Is it not the case that io_submit can block in some filesystems to handle filesystem metadata (such as block allocation), even with O_DIRECT, whereas io_uring never blocks the submitting thread?

jandrewrogers · on Jan 24, 2021

Yes, in theory. In practice, the way io_submit() is actually used in most systems today would not have that issue, and it is designed that way for other practical reasons. You'd want to use io_uring in a similar way. Even if you ignore the blocking aspect, file system metadata modification at runtime is an edge case factory.

For database-y type software generally, it is increasingly uncommon to even install a file system. You work with the raw block devices directly, virtualized or otherwise.

minitoar · on Jan 23, 2021

Interana mmaps the heck out of stuff. I’ve found that relying on the file cache works great. Though our access patterns are admittedly pretty simple.

nopurpose · on Jan 23, 2021

mmap is not as free as people think. VM subsystem is full of inefficient locks. Here is a very good writeup on a problem BBC encountered with Varnish: https://www.bbc.co.uk/blogs/internet/entries/17d22fb8-cea2-4...

lrossi · on Jan 24, 2021

> huge pressure on the virtual memory (VM) subsystem due to extensive dirty page writeback and page steals. The VM subsystem is constantly modifying page table entries (PTEs). This PTE churn results in frequent translation lookaside buffer (TLB) flushes and many inter-processor interrupts (IPIs) to do so. These TLB flushes have a very negative performance hit.

Interesting. I was aware of various mmap limitations, but I didn’t think about the TLB changes/flushes, which obviously come with an important overhead.

rcgorton · on Jan 23, 2021

I found some of the 'sizing' snippets in the example came across as disingenuous: if you KNOW the size of the file, mmap it initially using that without the looping overhead. And you presumably know how much memory you have on a given system. The description (at least as how I read the article) implies bolt is a truly naive implementation of a key/value DB

ramoz · on Jan 23, 2021

Perhaps a part 2 would dive a bit deeper into os caching and hardware (SSDs, their interfaces etc)

jeffbee · on Jan 23, 2021

Apparently in a way that the author of the article, and probably the authors of bolt, do not really understand.

29athrowaway · on Jan 23, 2021

malloc is implemented using mmap.

You map memory manually when you need very low level control over memory.

jeffbee · on Jan 23, 2021

`malloc` is not one thing. Some mallocs use mmap and others use brk. Some implementations use both.

kevin_thibedeau · on Jan 23, 2021

Some use neither.