Memory-Mapped Files = access violations when a disk read fails. If you're not pr...

kentonv · on July 2, 2023

> Be prepared to deal with blocks not necessarily updating to disk in the order they were written to, and 10 seconds after the fact. This can make power failures cause inconsistencies.

This is not specific to mmap -- regular old write() calls have the same behavior. You need to fsync() (or, with mmap, msync()) to guarantee data is on disk.
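A minimal sketch of the write()-then-fsync() pattern described above, assuming a POSIX system. The helper names `write_all` and `durable_write` are hypothetical, made up for illustration. Success is reported only once fsync() has returned 0.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error (errno is set). */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: create or truncate `path`, write `len` bytes, and
 * report success only after fsync() has returned 0. */
int durable_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
        int saved = errno;   /* keep the original error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return close(fd);
}
```

For a newly created file, the directory entry may need its own fsync() on the containing directory; the sketch leaves that out.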

crabbone · on July 2, 2023

> This is not specific to mmap -- regular old write() calls have the same behavior.

This is not true. This depends on how the file was opened. You may request DIRECT | SYNC when opening and the writes are acknowledged when they are actually written. This is obviously a lot slower than writing to cache, but this is the way for "simple" user-space applications to implement their own cache.

In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs. Sometimes it's even configurable. This is why databases also offer various modes of synchronization -- you need to know how your appliance works and configure the database accordingly.

kentonv · on July 2, 2023

> This is not true. This depends on how the file was opened. You may request DIRECT | SYNC

Well sure, but 99.9% of people don't do that (and shouldn't, unless they really know what they are doing).

> In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs.

What network-attached storage actually uses O_SYNC behavior without being asked? I'd be quite surprised if any did this as it would make typical workloads incredibly slow in order to provide a guarantee they didn't ask for.

pclmulqdq · on July 3, 2023

100% of people writing a database know about filesystem options like DIRECT and SYNC, and that is the subject of this paper.

Also, most of the network-attached storage people use is in the form of things like EBS, which is very careful to imitate the behavior of a real disk, but with different performance and some different (albeit very rare) failure modes.

kentonv · on July 3, 2023

100% of people writing databases also know how fsync() and msync() work. I interpreted this thread as being targeted at a wider audience.

crabbone · on July 5, 2023

It's literally for people writing their own database. Why would you interpret it differently?

tsimionescu · on July 3, 2023

It's fun to remember that fsync() on Linux, on ext4 at least, offers no real guarantee that the data was successfully written to disk. This happens when write errors from background buffered writes are handled internally by the kernel, which then cleans up the error state (marks dirty pages clean, etc.). Since the kernel can't know if a later call to fsync() will ever happen, it can't just keep the error around. So, when the call does happen, it will not return any error code. I don't know for sure, but msync() may well have the same behavior.

Here is an LWN article discussing the whole problem as the Postgres team found out about it.

https://lwn.net/Articles/752063/
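If fsync() can lose an error as described above, then retrying after a failure is not a safe reaction, since a second call might report success without the data having reached the disk. A small sketch of one conservative policy, under that assumption: check the result once, report it, and never retry. The helper name `fsync_once` is hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: call fsync() exactly once. On failure, log the error
 * and return -1 without retrying; the caller should then treat the on-disk
 * state of the file as unknown. */
int fsync_once(int fd, const char *what)
{
    if (fsync(fd) == 0)
        return 0;

    int saved = errno;
    fprintf(stderr, "fsync(%s) failed: %s; on-disk state is unknown\n",
            what, strerror(saved));
    errno = saved;
    return -1;
}
```

This only covers errors that fsync() actually reports. It does nothing about an error the kernel has already dropped.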

afr0ck · on July 2, 2023

Linux throws a SIGBUS. A process should anticipate such I/O failures by implementing a SIGBUS handler, especially a database server.

For the second part of your comment, on Linux systems, there is the msync() system call that can be used to flush the page cache on demand.

crabbone · on July 2, 2023

> msync() system call that can be used to flush the page cache on demand.

for everyone, not just the file you mapped to memory. I.e. the guarantee is that your file will be written, but there's no way to do that w/o affecting others. This is not such a hot idea in an environment where multiple threads / processes are doing I/O.

afr0ck · on July 3, 2023

msync() affects only the pages that are part of the mapped range you pass in the arguments. From the man pages:

> int msync(void addr[.length], size_t length, int flags);

> msync() flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to the filesystem
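To make the range argument concrete, here is a small sketch, assuming Linux/POSIX and a page-aligned `base` as returned by mmap() with MAP_SHARED. The helper name `flush_range` is hypothetical. It passes msync() only the pages that cover the modified bytes.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: synchronously flush only the pages covering
 * [base + off, base + off + len) of a MAP_SHARED file mapping.
 * msync() needs a page-aligned start address, so round `off` down to a
 * page boundary and extend the length by the same amount. */
int flush_range(void *base, size_t off, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return -1;

    size_t page = (size_t)ps;
    size_t start = off & ~(page - 1);   /* page size is a power of two */
    size_t span = (off - start) + len;

    return msync((char *)base + start, span, MS_SYNC);
}
```

For example, after changing a few bytes at offset `off` in the mapping, `flush_range(map, off, n)` asks for just the touched page or pages to be written back rather than the whole mapping.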

crabbone · on July 5, 2023

No it doesn't. That's physically impossible. Read what you quoted -- it never says that it's going to do it only for the file in question.

If you don't know why it's not possible, here's a simplified version of it: hardware protocols (such as SCSI) must have fixed-size messages to fit them through the pipeline. I.e. you cannot have a message larger than the memory segment used for communication with the device, because that will cause fragmentation and will lead to a possibility of the message being corrupted (the "tail" being lost or arriving out of order).

On the other hand, to "flush" a file to persistent storage you'd have to specify all blocks associated with the file that need to be written. If you try to do this, it will create a message of arbitrary size, possibly larger than the memory you can store it in. So, the only way to "flush" all blocks associated with a file is to "flush" everything on a particular disk / disks used by the filesystem. And this is what happens in reality when you do any of the sync family commands. The difference is only in what portion of the not-yet synced data the OS will send to the disk before requesting a sync, but the sync itself is for the entire disk, there aren't any other syncs.

afr0ck · on July 9, 2023

I don't know what you're talking about, but msync() flushes only the pages in that range. The pages are in the page cache (on Linux, it's a per-file xarray [1] of pages). Once all the dirty pages in the range are located, they go through the filesystem to be mapped to block numbers and then submitted to the block layer to be written to the storage device. Only the disk blocks mapped to the pages in that range will be written.

Source: I'm a Linux kernel developer.

[1] https://docs.kernel.org/core-api/xarray.html

wmf · on July 2, 2023

I wonder how many apps don't handle errors from read() anyway.

sidewndr46 · on July 2, 2023

does that get delivered as SIGSEGV to the process or something else?

afr0ck · on July 2, 2023

On Linux, it's a SIGBUS.