Are You Sure You Want to Use MMAP in Your Database Management System? (2022) (cmu.edu)
192 points by nethunters on July 2, 2023 | 177 comments



This is a pretty old argument and IMO it's far out of date/obsolete.

Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs. In the modern application/server environment, no user level process has accurate information about the total state of the machine, only the kernel (or hypervisor) does and it's an exercise in futility to try to manage paging etc at the user level.

As Dr. Michael Stonebraker put it: The Traditional RDBMS Wisdom is (Almost Certainly) All Wrong. https://slideshot.epfl.ch/play/suri_stonebraker (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.

Granted, even using mmap you still need to know wtf you're doing. MongoDB's original mmap backing store was a poster child for Doing It Wrong, getting all of the reliability problems and none of the performance benefits. LMDB is an example of doing it right: perfect crash-proof reliability, and perfect linear read scalability across arbitrarily many CPUs with zero-copy reads and no wasted effort, and a hot code path that fits into a CPU's 32KB L1 instruction cache.


Out of curiosity, how many databases have you written?

This is co-authored by Pavlo and Viktor Leis, with feedback from Neumann. I'm sorry, but if someone on the internet claims to know better than those 3, you're going to need some monumental evidence of your credibility.

Additionally, what you link here:

  > ... (See the slide at 21:25 into the video). Modern DBMSs spend 96% of their time managing buffers and locks, and only 4% doing actual useful work for the caller.
Is discussing "Main Memory" databases. These databases do no I/O outside of potential initial reads, because all of the data fits in memory!

These databases represent a small portion of contemporary DBMS usage when compared to traditional RDBMS.

All you have to do is look at the bandwidth and reads/sec from the paper when using O_DIRECT "pread()"s versus mmap'ed IO.


This is a classic appeal to authority. Let's play the argument, not the man.

(My understanding is that the GP wrote LMDB, works on OpenLDAP, and was a maintainer for BerkeleyDB for a number of years. But even if he'd only written 'hello, world!' I'm much more interested in the specific arguments).


Correct, and thank you. I wrote LMDB, wrote a lot of OpenLDAP, and worked on BerkeleyDB for many years. And actually Andy Pavlo invited me to CMU to give a lecture on LMDB a few years back. https://www.youtube.com/watch?v=tEa5sAh-kVk

Andy and I have had this debate going for a long time already.


Well, I eat my shorts.

Isn't LMDB closer to an embedded key-value store than an RDBMS, though? Also, there's a section in the paper that mentions it's single-writer.


Yes, LMDB is an embedded key/value store but it can be used as the backing store of any other DB model you care for. E.g. as a backend to MySQL, or SQLite, or OpenLDAP, or whatever.


I think the real argument is more nuanced. Where you see mmap() fail badly on Linux, even for read-only workloads, is under a few specific conditions: very large storage volumes, highly concurrent access, non-trivial access patterns (e.g. high-dimensionality access methods). Most people do not operate data models under these conditions, but if you do then you can achieve large integer factor gains in throughput by not using mmap().

Interestingly, most of the reason for these problems has to do with theoretical limitations of cache replacement algorithms as drivers of I/O scheduling. There are alternative approaches to scheduling I/O that work much better in these cases but mmap() can’t express them, so in those cases bypassing mmap() offers large gains.


GP wrote a key-value store called LMDB that is constrained to a single writer, and often used for small databases that fit entirely in memory but need to persist to disk. There's a whole different world for more scalable databases.


"fit entirely in memory" is not a requirement. LMDB is not a main-memory database, it is an on-disk database that uses memory mapping.


Can you explain "high-dimensionality access methods" to me? (Or if it's too big for an HN comment, maybe recommend a paper).


This guy talks a lot of crap. See his website for examples, and don't waste your time with him.

<<<There is one significant drawback that should not be understated. Algorithm design using topology manipulation can be enormously challenging to reason about. You are often taking a conceptually simple algorithm, like a nested loop or hash join, and replacing it with a much more efficient algorithm involving the non-trivial manipulation of complex high-dimensionality constraint spaces that effect the same result. Routinely reasoning about complex object relationships in greater than three dimensions, and constructing correct parallel algorithms that exploit them, becomes easier but never easy.>>>

http://www.jandrewrogers.com/2015/10/08/spacecurve/


I'd imagine the same kind of worst-case access would also be a problem doing I/O the "classical" way


The argument is that:

- Queries can trigger blocking page faults when accessing (transparently) evicted pages, causing unexpected I/O stalls

- mmap() complicates transactionality and error-handling

- Page table contention, single-threaded page eviction, and TLB shootdowns become bottlenecks


1 - for reading any uncached data, the I/O stalls are unavoidable. Whatever client requested that data is going to have to wait regardless.

2 - complexity? this is simply false. LMDB's ACID txns using MVCC are much simpler than any "traditional" approach.

3 - contention is a red herring since this approach is already single-writer, as is common for most embedded k/v stores these days. You lose more perf by trying to make the write path multi-threaded, in lock contention and cache thrashing.


> for reading any uncached data, the I/O stalls are unavoidable.

Excuse me for a silly question, but whilst an I/O stall may be unavoidable, wouldn't a thread stall be avoidable if you're not using mmap?

Assuming that you're not swapping, you'll generally know whether you've loaded something into memory or not, whilst mmap doesn't tell you whether the relevant page is cached. If the data isn't in memory, you can send the I/O request to a worker thread to retrieve it, and the initiating thread can then move on to the next connection. I suspect this isn't doable under mmap-based access?
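
Roughly, the non-mmap version of that looks like the sketch below (hypothetical names, assuming a pthread worker and pread(); a real engine would presumably use a thread pool or io_uring rather than a thread per miss, and error handling is omitted):

  #include <pthread.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <unistd.h>

  struct io_req {
      int    fd;
      off_t  offset;
      size_t len;
      void (*done)(void *buf, ssize_t n, void *ctx);  /* completion callback */
      void  *ctx;
  };

  static void *io_worker(void *arg)
  {
      struct io_req *req = arg;
      void *buf = malloc(req->len);
      ssize_t n = pread(req->fd, buf, req->len, req->offset);  /* blocks here, not in the caller */
      req->done(buf, n, req->ctx);  /* hand the page back to the initiating thread */
      free(req);
      return NULL;
  }

  /* On a cache miss, queue the read and go service the next connection. */
  static void read_page_async(struct io_req *req)
  {
      pthread_t t;
      pthread_create(&t, NULL, io_worker, req);
      pthread_detach(t);
  }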


It's kind of disingenuous to talk about how great your concurrency system is when you only allow a single writer. RCU (which I imagine your system is isomorphic to) is pretty simple compared to what many DB engines use to do ACID transactions that involve both reads and writes.


You don't need more than single-writer concurrency if your write txns are fast enough.

Our experience with OpenLDAP was that multi-writer concurrency cost too much overhead. Even though you may be writing primary records to independent regions of the DB, if you're indexing any of that data (which all real DBs do, for query perf) you wind up getting a lot of contention in the indices. That leads to row locking conflicts, txn rollbacks, and retries. With a single writer txn model, you never get conflicts, never need rollbacks.


> You don't need more than single-writer concurrency if your write txns are fast enough.

This only works on systems with sufficiently slow storage. If your server has a bunch of NVMe, which is a pretty normal database config these days, you will be hard-pressed to get anywhere close to the theoretical throughput of the storage with a single writer. That requires 10+ GB/s sustained. It is a piece of cake with multiple writers and a good architecture.

Writes through indexing can be sustained at this rate (assuming appropriate data structures); most of the technical challenge is driving the network at the necessary rate, in my experience.


That's all just false. Just because you're single-writer at the application level doesn't mean the OS isn't queueing enough writes to saturate storage at the device level. We've benchmarked plenty of high-speed NVMe devices, like Intel Optane SSDs, etc., showing this. http://www.lmdb.tech/bench/optanessd/


This guy is a fool. Ignore him. Or see his website.


That's probably because your OpenLDAP benchmarks used a tiny database. If you have multi-terabyte databases, you will start to see huge gains from a multi-writer setup because you will regularly be loading pages from disk, rather than keeping almost all of your btree/LSM tree in RAM.


Yeah, no. Not with a DB 50x larger than RAM, anyway.

http://www.lmdb.tech/bench/hyperdex/

RAM is relatively cheap too, there's no real reason to be running multi-TB databases at greater than a 50x ratio.


Sorry, but "50x larger than RAM" is a pretty small DB - that's an 800 GB database on a machine with 16 GB of RAM. I usually have seen machines with 500-1000x ratios of flash to RAM. "RAM is relatively cheap" is also false when you're storing truly huge amounts of data, which is how the systems you compare yourself to (LevelDB, etc) are usually deployed. Note that RAM is now the single greatest cost when buying servers.

> Now that the total database is 50 times larger than RAM, around half of the key lookups will require a disk I/O.

That is an insanely high cache hit rate, which should have probably set off your "unrepresentative benchmark" detector. I am also a little surprised at the lack of a random writes benchmark. I get that this is marketing material, though.


> I am also a little surprised at the lack of a random writes benchmark.

Eh? This was 20% random writes, 80% random reads. LMDB is for read-heavy workloads.

> That is an insanely high cache hit rate, which should have probably set off your "unrepresentative benchmark" detector.

No, that is normal for a B+tree; the root page and most of the branch pages will always be in cache. This is why you can get excellent efficiency and performance from a DB without tuning to a specific workload.


> Eh? This was 20% random writes, 80% random reads. LMDB is for read-heavy workloads.

The page says "updates," not "writes." Updates are a constrained form of write where you are writing to an existing key. Updates, importantly, do not affect your index structure, while writes do.

> No, that is normal for a B+tree; the root page and most of the branch pages will always be in cache. This is why you can get excellent efficiency and performance from a DB without tuning to a specific workload.

It is normal for a small B+tree relative to the memory size available on the machine. The "small" was the unrepresentative part of the benchmark, not the "B+tree."


> The page says "updates," not "writes." Updates are a constrained form of write where you are writing to an existing key. Updates, importantly, do not affect your index structure, while writes do.

OK, I see your point. It would only have made things even worse for LevelDB here to do an Add/Delete workload because its garbage compaction passes would have had to do a lot more work.

> It is normal for a small B+tree relative to the memory size available on the machine. The "small" was the unrepresentative part of the benchmark, not the "B+tree."

This was 100 million records, and a 5-level deep tree. To get to 6 levels deep it would be about 10 billion records. Most of the branch pages would still fit in RAM; most queries would require at most 1 more I/O than the 5-level case. The cost is still better than any other approach.


take a look at http://nms.csail.mit.edu/~stavros/pubs/OLTP_sigmod08.pdf - the overhead of coordinating multiple writers often makes multi-writer databases slower than single-writer databases. remember, everything has to be serialized when it goes to the write ahead log, so as long as you can do the database updates as fast as you can write to the log then concurrent writers are of no benefit.


This is another cool example of a toy database that is again very small:

> The database size for one warehouse is approximately 100 MB (we experiment with five warehouses for a total size of 500MB).

It is not surprising that when your database basically fits in RAM, serializing on one writer is worth doing, because it just plainly reduces contention. You basically gain nothing in a DB engine from multi-writer transactions when this is the case. A large part of a write (the vast majority of write latency) in many systems with a large database comes from reading the index up to the point where you plan to write. If that tree is in RAM, there is no work here, and you instead incur overhead on consistency of that tree by having multiple writers.

I'm not suggesting that these results are useless. They are useful for people whose databases are small because they are meaningfully better than RocksDB/LevelDB which implicitly assume that your database is a *lot* bigger than RAM.


> RocksDB/LevelDB which implicitly assume that your database is a lot bigger than RAM.

Where are you getting that assumption from? LevelDB was built to be used in Google Chrome, not for multi-TB DBs. RocksDB was optimized specifically for in-memory workloads.


I worked with the Bigtable folks at Google. LevelDB's design is ripped straight from BigTable, which was designed with that assumption in mind. I'm also pretty sure it was not designed specifically for Google Chrome's use case - it was written to be a general key-value storage engine based on BigTable, and Google Chrome was the first customer.

RocksDB is Facebook's offshoot of LevelDB, basically keeping the core architecture of the storage engine (but multithreading it), and is used internally at Facebook as the backing store for many of their database systems. I have never heard from anyone that RocksDB was optimized for in-memory workloads at all, and I think most benchmarks can conclusively say the opposite: both of those DB engines are pretty bad for workloads that fit in memory.


I think we've gone off on a tangent. At any rate, both LevelDB and RocksDB are still single-writer, so whatever the point was, it seems to have been lost along the way.


I've used RocksDB for an in-memory K/V store of ~600GB in size and it worked really well. Not saying it's the best choice out there but it did the job very well for us. And in particular because our dataset was always growing and we needed the option to fallback to disk if needed, RocksDB worked very well.

Was a PITA to optimise though; tons of options and little insight into which ones work.


I am using the same rough model, and I'm using that on a 1.5 TB db running on a Raspberry Pi very successfully.

Pretty much all storage libraries written in the past couple of decades are using single writer. Note that single writer doesn't mean single transaction. Merging transactions is easy and highly profitable, after all.


Yeah for workloads with any long running write transactions a single writer design is a pretty big limitation. Say some long running data load (or a big bulk deletion) running along with some faster high throughput key value writes - the big data load would block all the faster key-value writes when it runs.

No "mainstream" database I'm aware of has a global single writer design.


"Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers" is already an appeal to authority in itself.

We shouldn't apply a higher bar to the counterargument than we applied to the argument in the first place.


Out of curiosity, do you have anything actually useful to add, or are you just throwing appeals to authority because you don't?


Even though the data resides mostly in-memory, they still have to write transactions to disk to preserve them, don't they?


> your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs.

There's nothing special about kernel programmers. In fact, if I had to compare, I'd go with storage people being the more experienced / knowledgeable ones. They have a highly competitive environment, which requires a lot more understanding and inventiveness to succeed, whereas kernel programmers proper don't compete -- Linux won many years ago. Kernel programmers who deal with stuff like drivers or various "extensions" are, largely, in the same group as storage (oftentimes literally the same people).

As for "single process" argument... well, if you run a database inside an OS, then, obviously, that will never happen as OS has its own processes to run. But, if you ignore that -- no DBA worth their salt would put database in the environment where it has to share resources with applications. People who do that are, probably, Web developers who don't have high expectations from their database anyways and would have no idea how to configure / tune it for high performance, so, it doesn't matter how they run it, they aren't the target audience -- they are light years behind on what's possible to achieve with their resources.

This has nothing to do with mmap though. mmap shouldn't be used for storage applications for other reasons. mmap doesn't allow their users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or database's own data.


> There's nothing special about kernel programmers.

Yes, that was a shorthand generalization for "people who've studied computer architecture" - which most application developers never have.

> no DBA worth their salt would put a database in an environment where it has to share resources with applications.

Most applications today are running on smartphones/mobile devices. That means they're running with local embedded databases - it's all about "edge computing". There's far more DBs in use in the world than there are DBAs managing them.

> mmap shouldn't be used for storage applications for other reasons. mmap doesn't allow their users to precisely control the persistence aspect... which is kind of the central point of databases. So, it's a mostly worthless tool in that context. Maybe fine for some throw-away work, but definitely not for storing users' data or database's own data.

Well, you're half right. That's why by default LMDB uses a read-only mmap and uses regular (p)write syscalls for writes. But the central point of databases is to be able to persist data such that it can be retrieved again in the future, efficiently. And that's where the read characteristics of using mmap are superior.
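
Concretely, that split looks roughly like the sketch below. This is not LMDB's actual code, just the pattern: a read-only shared mapping for lookups, ordinary pwrite()/fsync() calls for commits (error handling omitted, names hypothetical):

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  void commit_one_page(const char *path, const void *new_page,
                       size_t page_size, off_t page_offset)
  {
      int fd = open(path, O_RDWR);
      struct stat st;
      fstat(fd, &st);

      /* Reads: zero-copy lookups through a read-only mapping; the OS
       * page cache acts as the buffer pool. */
      const char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
      (void)map;  /* a B+tree traversal would chase pointers inside 'map' */

      /* Writes: ordinary syscalls, so the application decides exactly
       * when a page reaches the file and when the commit is durable. */
      pwrite(fd, new_page, page_size, page_offset);
      fsync(fd);  /* durability point */
  }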


> "people who've studied computer architecture" - which most application developers never have

If you are developing a DBMS and haven't studied computer architecture, the best idea is probably to ask more experienced people to help out with your ideas.

From my limited knowledge, I don't think the article is old enough to be obsolete, just that there's a lot more to it.

Not to be gatekeeping or anything, but it is a pretty well-studied field with lots of very knowledgeable people around, who are probably more than keen to help. There aren't too many qualified jobs around, and you probably have a budget if you are developing a database commercially.


> mmap doesn't allow their users to precisely control the persistence aspect

It's been a while since I've dealt with mmap(), but isn't this what msync() does? You can synchronously or asynchronously force dirty pages to be flushed to disk without waiting until munmap().


msync lets you force a flush so you can control the latest possible moment for a writeout. But the OS can flush before that, and you have no way to detect or control that. So you can only control the late side of the timing, not the early side. And in databases, you usually need writes to be persisted in a specific order; early writes are just as harmful as late writes.
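
In code, the asymmetry is roughly this (a sketch; 'wal_map' and 'data_map' are hypothetical writable MAP_SHARED mappings):

  #include <stddef.h>
  #include <sys/mman.h>

  void commit(void *wal_map, size_t wal_len, void *data_map, size_t data_len)
  {
      /* Upper bound only: the WAL pages are on disk no later than this returns. */
      msync(wal_map, wal_len, MS_SYNC);

      /* No lower bound: the kernel may already have flushed dirty pages of
       * data_map before this line, so "data page reaches disk only after its
       * WAL record" cannot be enforced with a writable mapping alone. */
      msync(data_map, data_len, MS_SYNC);
  }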


I'd even take a memory ordering guarantee, something like, within each page, data is read out sequentially as atomic aligned 64-bit reads with acquire ordering. (Though this probably is what you get on AMD64.) As-is, there's not even a guarantee against an atomic aligned write being torn when written out.


That is absolutely not what you actually get from the hardware.

For fun, there is no guarantee about the order in which a page is written. SQLite documents that they assume (but cannot verify) that _sector_ writes are linear, but not atomic. https://www.sqlite.org/atomiccommit.html

> If a power failure occurs in the middle of a sector write it might be that part of the sector was modified and another part was left unchanged. The key assumption by SQLite is that if any part of the sector gets changed, then either the first or the last bytes will be changed. So the hardware will never start writing a sector in the middle and work towards the ends. We do not know if this assumption is always true but it seems reasonable.

You are talking several levels higher than that, at the page level (composed of multiple sectors).

Assume that they reside in _different_ physical locations, and are written at different times. That's fun.


Every HDD since the 1980s has guaranteed atomic sector writes:

> Currently all hard drive/SSD manufacturers guarantee that 512 byte sector writes are atomic. As such, failure to write the 106 byte header is not something we account for in current LMDB releases. Also, failures of this type should result in ECC errors in the disk sector - it should be impossible to successfully read a sector that was written incorrectly in the ways you describe.

Even in extreme cases, the probability of failure to write the leading 128 out of 512 bytes of a sector is nearly nil - even on very old hard drives, before 512-byte sector write guarantees. We would have to go back nearly 30 years to find such a device, e.g.

https://archive.org/details/bitsavers_quantumQuaroductManual...

Page 23, Section 2.1 "No damage or loss of data will occur if power is applied or removed during drive operation, except that data may be lost in the sector being written at the time of power loss."

  From the specs on page 15, the data transfer rate to/from the platters is
  1.25MB/sec, so the time to write one full sector is 0.4096ms; the time to
  write the leading 128 bytes of the sector is thus 1/4 of that: 0.10ms. You
  would have to be very very unlucky to have a power failure hit the drive
  within this .1ms window of time. Fast-forward to present day and it's simply
  not an issue.
^ above quoted from https://lists.openldap.org/hyperkitty/list/openldap-devel@op...


Doesn't help when you work with pages :-)

Assume 512-byte sectors (I know those are rare), but I don't think there are any guarantees that a 4KB page would be:

* Written atomically

* Written in a particular order


Even memory ordering guarantees within sector boundaries are sufficient, and something the kernel could provide on its own.


Also doesn't help when you are running on virtual / networked hardware. Nothing ensures that what you think is a sector write would actually align properly with the hardware.


The design and guarantees of the virtualized hardware provide that guarantee. I've worked on several such products. They all guarantee atomic sector writes (typically via copy-on-write).


> Most applications today are running on smartphones/mobile devices.

That's patently false. There are about 8 bn. people. Even if everyone has a smartphone or two, it's nothing compared to the total of all devices that can be called "computer". I think that "smart TV" alone will beat the number of smartphones. But even that is a drop in a bucket when it comes to the total of running programs on Earth / its orbit.

But, that's beside the point. Smartphones aren't designed to run database servers. Even if they indeed were the majority, they'd still be irrelevant for this conversation because they are the wrong platform for deploying databases. In other words, it doesn't matter how people deploy databases to smartphones -- they have no hope of achieving good performance, and whether they use mmap or not is of no consequence -- they've lost the race before they even qualified for it.

> LMDB

Are we talking about this? https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Databa... If so, this is irrelevant for databases in general.

> LMDB databases may have only one writer at a time

(Taken from the page above) -- this isn't a serious contender for database server space. It's a toy database. You shouldn't give general advice based on whatever this system does or doesn't.


> irrelevant for databases in general

It's one of the databases compared in the paper

OP is one of the authors of LMDB


Smart TVs are also all running SQLite


Can you comment on what the paper gets wrong? It says that scalability with mmap is poor due to page table contention and others. How does LMDB manage to scale well with mmap? Is page table contention just not an issue in practice?


Maybe someone should pull LMDB's mmap/paging system into a usable library. I'd love to use the k/v store part of course, but I keep hitting the default key size limitation and would prefer not to link statically.


It wouldn't be much use without the B+tree as well; it's the B+tree's cache friendliness that allows applications to run so efficiently without the OS knowing any specifics of the app's usage patterns.


Do you have benchmarks of lmdb when the working set is much larger than memory? I couldn't find any.

In my experience -- and in line with the article -- mmap works fine with small working sets. It seems that most benchmarks of lmdb have relatively small data sets.


> Do you have benchmarks of lmdb when the working set is much larger than memory? I couldn't find any.

Where did you look? This is a sample using DB 5x and 50x larger than RAM http://www.lmdb.tech/bench/hyperdex/

There are plenty of other larger-than-RAM benchmarks there.


Hm. That seems to be comparing against a 2013-era LevelDB, which at the time also used mmap. (It's since switched the default for performance reasons.)

It's also strange to me that there's no transition in performance when the data set size grows beyond cache.


> Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers and (b) your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs.

The article is about DBMS developers. For DBMS developers, "in practice" (a) and (b) are usually true I think.


Who is deploying databases in containers?


A disturbingly large number of deployments I’ve seen using Kubernetes or docker compose have databases deployed as such.


Given the ability to deploy pods to dedicated nodes based on label selectors, what is the actual performance impact of running a database in a container on a bare metal host with mounted volume versus running that same process with say systemd on that same node? Basically, shouldn’t the overhead of running a container be minimal?


The problem is kubelet likes to spike in memory / CPU / network usage. It's not a well-behaved program to put alongside a database. It's not written with an eye for resource utilization.

Also, it brings nothing of value to the table, but requires a lot of dance around it to keep it going. I.e. if you are a decent DBA, you don't have a problem setting up a node to run your database of choice, you would be probably opposed to using pre-packaged Docker images anyways.

Also, Kubernetes sucks at managing storage... basically, it doesn't offer anything that'd be useful to a DBA. Things that might be useful come as CSI... and, obviously, it's better / easier to not use a CSI, but to interface directly with the storage you want instead.

That's not to say that storage products don't offer these CSI... so, a legitimate question would be why would anyone do that? -- and the answer is -- not because it's useful, but because a lot of people think they need / want it. Instead of fighting stupidity, why not make an extra buck?


I run DB’s on K8s, not because I don’t know what I’m doing, but because most of the trade offs are worth it.

If I run a db workload in K8s, it’s a tiny fraction of the operational overhead, and not a massively noticeable performance loss.

I would absolutely love a way to deploy and manage db’s as easily as K8s with fewer of the quite significant issues that have mentioned, so if you know of something that is better behaved around singular workloads, but keeps the simple deploys, the resiliency, the ease of networking and config deployments, the ease of monitoring, etc, I am all ears.


If you think that deploying anything with Kubernetes is simple... well, I have bad news for you.

It's simple, until you hit a problem. And then it becomes a lot worse than if you had never touched it. You are now in the position of a person who'd never made backups and never had a failure that required them to restore from backups, and you are wondering why anyone would do it. Adverse events are rare, and you may go on like this for years, or perhaps the rest of your life... unfortunately, your experience will not translate into general advice.

But, again, you just might be in the camp where performance doesn't matter. Nor does uptime matter, nor does your data have very high value... and in that case it's OK to use tools that don't offer any of that, and save you some time. But, you cannot advise others based on that perspective. Or, at least, not w/o mentioning the downsides.


Everyone running databases in production knows how to take backups and restore from them. K8s or not, even using your cloud provider's database's built-in backups is hardly safe. One click of the "delete instance" button (or nowadays, an exciting fuck up in IaC code), and your backups are gone! Not to mention the usual cloud provider problems of "oops your credit card bounced" or "the algorithm decided we don't like your line of business". You have to have backups, they have to be "off site", and you have to try restoring them every few months. There is pretty much no platform that gives you that for free.

I am not sure what complexity Kubernetes adds in this situation. Anything Kubernetes can do to you, your cloud provider (or a poorly aimed fire extinguisher) can do to you. You have to be ready for a disaster no matter the platform.


If you run in the cloud, any of the major cloud providers can take that undifferentiated heavy lifting off your hands (Amazon RDS etc.).


If you care about perf you would pin the kubelet and all other overhead workload to one core, and mask that off for your workload.


> If you care about perf you would pin the kubelet

Wrong. I wouldn't use kubelet at all. Kubernetes and good performance are not compatible. The goal of Kubernetes is to make it easier to deploy Web sites. Web is a very popular technology, so Kubernetes was adopted in many places where it's irrelevant / harmful because Web developers are plentiful and will help to power through the nonsense of this program. It's there because it makes trivial things even easier for less qualified personnel. It's not meant as a way to make things go faster, or to use less memory, or to use less persistent storage, or less network etc... it's the wheelchair of ops, not highly-optimized professional-grade equipment.


IMO if you’re concerned about performance and yet are deploying databases this way — mmap should not even be on the radar.


How would containers even hurt performance? How does the database no longer having the ability to see other processes on the machine somehow make it slower?


There are many "holes" in these containers.

1. fsync. You cannot "divide" it between containers. Whoever does it, stalls I/O for everyone else.

2. Context switches. Unless you do a lot of configurations outside of container runtime, you cannot ensure exclusive access to the number of CPU cores you need.

3. Networking has the same problem. You would either have to dedicate a whole NIC or an SR-IOV-style virtual NIC to your database server. Otherwise just the amount of chatter that goes on through the control plane of something like Kubernetes will be a noticeable disadvantage. Again, containers don't help here; they only get in the way, as to get that kind of exclusive network access you need more configuration on the host, and possibly a CNI to deal with it.

4. kubelet is not optimized to get out of your way. It needs a lot of resources and may spike, hindering or outright stalling database process.

5. Kubernetes sucks at managing memory-intensive processes. It doesn't work (well or at all) with swap (which, again, cannot be properly divided between containers). It doesn't integrate well with OOM killer (it cannot replace it, so any configurations you make inside Kubernetes are kind of irrelevant, because system's OOM killer will do how it pleases, ignoring Kubernetes).

---

Bottom line... Kubernetes is lame from an infrastructure perspective. It's written for Web developers. To make things appear simpler for them, while sacrificing a lot of resources and hiding a lot of actual complexity... which is impossible to hide, and which, in the event of a failure, will come back to bite you. You don't want that kind of program near your database.


My background is more Borg than k8s, but…

Always allocate whole cores, just mask them off

Dedicate physical IO devices for sensitive workloads

You can have per cgroup swap if you want, but imo swap is not useful

I think all of this is possible in k8s


Whole core masking is not quite as easy as it should be, predominantly because the API is designed to hand wave away actual cores. The way you typically solve this is to go the other way and claim exclusive cores for the orchestrator and other overhead.


As these are obviously very real issues, and Kubernetes also isn’t going away imminently, how many of these can be fixed/improved with different design on the application front?

Would using direct-I/O APIs fix most of the fsync issues? If workloads pin their stuff to specific cores, can we sidestep some of the overhead here? (Assuming we’re only running a single dedicated workload + kubelet on the node.)

> You would either have to dedicate a whole NIC or an SR-IOV-style virtual NIC to your database server

Tbh I’ve no idea we could do this with commodity cloud servers, nor do I know how, but I’m terribly interested in knowing how, do you know if there’s like a “dummy’s guide to better networking”? Haha

> kubelet is not optimized to get out of your way...Kubernetes sucks at managing memory-intensive processes

Definitely agree on both these issues, I’ve blown up the kubelet by overallocating memory before, which basically borked the node until some watchdog process kicked in. Sounds like the better solution here is a kubelet rebuilt to operate more efficiently and more predictably? Is the solution a db-optimised kubelet/K8s?


This is extremely misinformed. No matter how you choose to manage workloads, ultimately you are responsible for tuning and optimization.

If you're not in control of the system, and thus kubelet, obviously your hands are tied. I'm not sure anyone is suggesting that for a serious workload.

Now to dispell your myths:

1. You can assign dedicated storage devices to your database. Outside of mount operations you're not going to see much alien fsync activity. This is paranoid.

2. You can pin kubelet CPU cores. You can ensure exclusive access to the remaining ones. There are a number of advanced techniques that are not at all necessary if you want to be a control freak, such as creating your own cgroups. This isn't "outside" of the runtime. Kubernetes is designed to conform to your managed cgroups. That's the whole point. RTFM.

3. The general theme of your complaint has nothing to do with kubernetes. There's no beating a dedicated NIC and even network fabric. Some cloud providers even allow you to multi-NIC out of the box so this is pretty solvable. Also, like, the dumbest QoS rules can drastically minimize this problem generally. Who cares.

4. Nah. RTFM. This is total FUD.

5.a. I don't understand. Are you sharing resources on the node or not? If you're not, then swap works fine. If you are, then this smells like cognitive dissonance and maybe listen to your own advice, but also swap is still very doable. It's just disk. swapon to your heart's content. But also swap is almost entirely dumb these days. Are you suggesting swapping to your primary IO device? Come on. More FUD.

5.b. OOM killer does what it wants. What's a better alternative that integrates "well" with the OOM killer? Do you even understand how resource limits work? The OOM killer is only ever a problem if you either do not configure your workload properly (true regardless of execution environment) or you run out of actual memory.

Bottom line: come down off your high horse and acknowledge that dedicated resources and kernel tuning is the secret to extreme high performance. I don't care how you're orchestrating your workloads, the best practices are essentially universal.

And to be clear, I'm not recommending using Kubernetes to run a high performance database but it's not really any worse (today) than alternatives.

> It's written for Web developers. To make things appear simpler for them, while sacrificing a lot of resources and hiding a lot of actual complexity... which is impossible to hide, and which, in an even of failure will come to bite you.

What planet are you currently on? This makes no sense. It's a set of abstractions and patterns, the intent isn't to hide the complexity but to make it manageable at scale. I'd argue it succeeds at that.

Seriously, what is the alternative runtime you'd prefer here? systemd? hand rolled bash scripts? puppet and ansible? All of the above??


> You can assign dedicated storage devices to your database. Outside of mount operations you're not going to see much alien fsync activity. This is paranoid.

This is word salad. Do you even know what fsync is for? I'm not even asking if you know how it works... What is "alien" fsync activity? Mount is perhaps the one system call that has nothing to do with fsync... so, I wouldn't expect any fsync activity when calling mount...

Finally, I didn't say that you cannot allocate a dedicated storage device -- what I said is that Kubernetes or Docker or Singularity or containerd or... well, none of container (management) runtimes that I've ever used know how to do it. You need external tools to do it. The point isn't that you cannot, the point is that a container runtime will only stand in your way when you try to do it.

> You can pin kubelet CPU cores. You can ensure exclusive access to the remaining ones.

No you cannot. Not through Kubernetes. You need to do this on the node that hosts kubelet.

And... I don't have the time or the patience necessary to answer to the rest of the nonsense. Bottom line: you don't understand what you are replying to, and arguing with something I either didn't say, or just stringing meaningless words together.


> Do you even know what fsync is for?

I do, though perhaps an ignorant life would be simpler. "Alien" is a word with a definition. Perhaps "foreign" is a better word. Forgive me for attempting to wield the English language.

No one well will use your fucking disk if you mount it exclusively in a pod. Does that make sense? You must be a joy to work with.

> The point isn't that you cannot, the point is that a container runtime will only stand in your way when you try to do it.

I have no idea what this means. How does kubernetes stand in your way?

> No you cannot. Not through Kubernetes. You need to do this on the node that hosts kubelet.

This is incorrect. You can absolutely configure the kubelet to reserve cores and offer exclusive cores to pods by setting a CPU management policy. I know because I was waiting for this for a very long time for all of the reason in the discussion here. It works fine.

You clearly have an axe to grind and it seems pretty obvious you're not willing to do the work to understand what you're complaining about. It might help to start by googling what a container runtime even is, but I'm not optimistic.


I’ll assume the worst case:

- lots of containers running on a single host

- containers are each isolated in a VM (aka virtualized)

- workloads are not homogenous and change often (your neighbor today may not be your neighbor tomorrow)

I believe these are fair assumptions if you’re running on generic infrastructure with kubernetes.

In this setup, my concerns are pretty much noisy neighbors + throttling. You may get latency spikes out of nowhere and the cause could be any of:

- your neighbor is hogging IO (disk or network)

- your database spawned too many threads and got throttled by CFS

- CFS scheduled your DB’s threads on a different CPU and you lost your cache lines

In short, the DB does not have stable, predictable performance, which are exactly the characteristics you want it to have. If you ran the DB on a dedicated host you avoid this whole suite of issues.

You can alleviate most of this if you make sure the DB’s container gets the entire host’s resources and doesn’t have neighbors.


> - containers are each isolated in a VM (aka virtualized)

Why are you assuming containers are virtualized? Is there some container runtime that does that as an added security measure? I thought they all use namespaces on Linux.


It’s becoming standard as a security measure. See: Kata containers, Firecracker VM


Not so; neither Kata containers nor Firecracker are in widespread public use today. (Source: I work for AWS and consult regularly with container services customers, who both use AWS and run on premise.)


Ah, good to know!


None of those are the fault of containers. You can do all of what you said without containers.


Nobody who matters.

Those who do that don't know what they are doing (even if they outnumber the other side hundred to one, they "don't count" because they aren't aiming for good performance anyways).

Well, maybe not quite... of course it's possible that someone would want to deploy a database in a container because of the convenience of assembling all dependencies in a single "package", however, they would never run database on the same node as applications -- that's insanity.

But, even the idea of deploying a database alongside something like kubelet service is cringe... This service is very "fat" and can spike in memory / CPU usage. I would be very strongly opposed to an idea of running a database on the same VM that runs Kubernetes or any container runtime that requires a service to run it.

Obviously, it says nothing about the number of processes that will run on the database node. At the minimum, you'd want to run some stuff for monitoring, that's beside all the system services... but I don't think GP meant "one process" literally. Neither that is realistic nor is it necessary.


>but I don't think GP meant "one process" literally. Neither that is realistic nor is it necessary.

The point was simply about other processes that could be competing for resources - CPU, memory, or I/O. It is expensive for a user-level process to perform accounting for all of these resources, and without such accounting you can't optimally allocate them.

If there are other apps that can suddenly spike memory usage then any careful buffer tuning you've done goes out the window. Likewise for any I/O scheduling you've done, etc.


I'm running prod databases in containers so the server infra team doesn't have to know anything about how that specific database works or how to upgrade it, they just need to know how to issue generic container start/stop commands if they want to do some maintenance.

(But just in containers, not in Kubernetes. I'm not crazy.)


My group and a bunch of my peer groups.

And we are running them at the scale that most people can’t even imagine.


Embedded DB


Another interesting limitation of mmap() is that real-world storage volumes can exceed the virtual address space a CPU can address. A 64-bit CPU may have 64-bit pointers but typically cannot address anywhere close to 64 bits of memory, virtually or physically. A normal buffer pool does not have this limitation. You can get EC2 instances on AWS with more direct-attached storage than addressable virtual address space on the local microarchitecture.


To put concrete numbers on it: x86-64 is limited to 48 bits for virtual addresses, which is "only" 256 TiB (281 TB).


All of that is true, but I don't think it's a realistic concern. You're going to be sharding your data across multiple nodes before it gets that large. Nobody wants to sit around backing up or restoring a monolithic 256 TiB database.


Technically you get quite a bit less than the 256 TB theoretical in practice.

It is a realistic concern, I’ve lived it for more than a decade across many orgs, though I shared your opinion at one point. Storage density is massively important for both workload scalability and economic efficiency. Low storage density means buying a ton of server hardware that sits idle under max load and vastly larger clusters than would otherwise be necessary, which have their own costs.

When your database is sufficiently large, backup and restore often isn’t even a technical possibility so that requirement is a red herring. The kinds of workloads that can be recovered from backup at that scale on a single server, and some can, benefit massively from the economics of running it on a single server. A solution that has 10x the AWS bill for the same workload performance doesn’t get chosen.

At scale, hardware footprint economics is one of the central business decision drivers. Data isn’t getting smaller. It is increasingly ordinary for innocuous organizations to have a single table with a trillion records in it.

For better or worse, the market increasingly drives my technical design decisions to optimize for hardware/cloud costs above all else, and dense storage is a huge win for that.


This entire comment section is a bit of a dumpster-fire. I'm convinced the word database has outlived its usefulness for any serious discussion. It has the informational density of saying : "I work in IT"


Starting with Ice Lake there’s support for 5-level paging, which increases this to 128 PiB. Can’t say that I’ve ever seen this used in the wild though.


Yeah, there mostly isn’t a use case for it in databases. If you have that much storage you’ll need to bypass the kernel cache and scheduler anyway for other reasons. That was true even at the 48-bit limit.


Intel has now extended page tables to 5 levels, making this number less valid. Granted, 5-level paging creates more TLB pressure and longer memory access times due to the extra level.


Not just databases - we ran into the same issues when we needed a high-performance caching HTTP reverse proxy for a research project. We were just going to drop in Varnish, which is mmap-based, but performance sucked and we had to write our own.

Note that Varnish dates to 2006, in the days of hard disk drives, SCSI, and 2-core server CPUs. Mmap might well have been as good or even better than I/O back then - a lot of the issues discussed in this paper (TLB shootdown overhead, single flush thread) get much worse as the core count increases.


Varnish's design wasn't very fast even for 2006-era hardware. It _was_ fast compared to Squid, though (which was the only real competitor at the time), and most importantly, much more flexible for the origin server case. But it came from a culture of “the FreeBSD kernel is so awesome that the best thing userspace can do is to offload as many decisions as humanly possible to the kernel”, which caused, well, suboptimal performance.

AFAIK the persistent backend was dropped pretty early on (eventually replaced with a more traditional read()/write()-based one as part of Varnish Plus), and the general recommendation became just to use malloc and hope you didn't swap.


Varnish has a file system backed cache that depends on the page cache to keep it fast.

What did you do differently in your custom one that was faster than Varnish?


Simple multithreaded read/write. On a 20-core 40-thread machine with a couple of fast NVMe drives it was way faster.


Old timers will recall when using mmap was a prominently promoted selling point for the “no sql” dbms.


seems like all databases are moving towards the middle. Postgres has JSON support, MongoDB has transactions and also a columnar extension for OLAP type data. NoSQL seems almost meaningless as a term now. Feels like a move towards a winner-takes-all multi-modal database that can work with most types of data fairly well. Postgres with all of its specialized extensions seems like it will be the most popular choice. The convenience of not having to manage multiple databases is hard to beat unless performance is exponentially better; Postgres with these extensions can probably be "good enough" for a lot of companies.

reminds me of how industries typically start out dominated by vertically integrated companies, move to specialized horizontal companies, then generally move back to vertical integration due to efficiency. Car industry started this way with Ford, went away from it, and now Tesla is doing it again. Lots of other examples in other industries


The pendulum swing is common in any system, and is a really effective mechanism for evaluation.

You almost always want somewhere in the middle, but it’s often much easier to move back after a large jump in one direction than to push towards the middle.


For documents it made access fast since there’s no joins, etc. that require paging from all over. The problem ended up being updates and compaction issues.


My memory is that the problem was ACID. The document stores didn’t promise to be reliable because apparently that didn’t scale.

And there was a very well known cartoon video discussion about it with “web scale” and “just write to dev null” and other classics that became memes :)


Did you ever read Pat Helland's article, "Life Beyond Distributed Transactions: An apostate’s opinion" https://dl.acm.org/doi/10.1145/3012426.3025012? "This article explores and names some of the practical approaches used in the implementation of large-scale mission-critical applications in a world that rejects distributed transactions."


No I haven’t. Thanks for the interesting link :)

Admittedly I live in a world where big distributed transactions are a given and work fine and sql speeds us up not slows us down. I’m guessing sql and acid scaled after all?


> I’m guessing sql and acid scaled after all?

Yes and no. Distributed transactions and two-phase commit have been superseded by things like Paxos and Raft, with a variety of consistency models, so the implementation is drastically different.


Document stores often are reliable and more fault tolerant. But yes they trade ACID.

There are some applications that require high throughput (usually write) but can be fine with read consistency.

Couple of examples:

- consumer-facing comment systems where other users are OK to miss your comment by 30 seconds

- timeseries logging where you are usually reading infrequently but writing very much in a denormalized format so joins aren't as critical

For general CRUD, ACID is important though.


Related:

Are You Sure You Want to Use MMAP in Your Database Management System? [pdf] - https://news.ycombinator.com/item?id=31504052 - May 2022 (43 comments)

Are you sure you want to use MMAP in your database management system? [pdf] - https://news.ycombinator.com/item?id=29936104 - Jan 2022 (127 comments)


Many general-purpose OS abstractions start leaking when you're working on systems-like software.

You notice it when web servers are doing kernel bypass for zero-copy, low-latency networking, or when database engines throw away the kernel's page cache to implement their own file buffer.


Yes. I think mmap() is misunderstood as being an advanced tool for systems hackers, but it's actually the opposite: it's a tool to make application code simpler by leaving the systems stuff to the kernel.

With mmap, you get to avoid thinking about how much data to buffer at once, caching data to speed up repeated access, or shedding that cache when memory pressure is high. The kernel does all that. It may not do it in the absolute ideal way for your program but the benefit is you don't have to think about these logistics.

But if you're already writing intense systems code then you can probably do a better job than the kernel by optimizing for your use case.


Web servers doing kernel bypass for zero-copy networking? Do you have a specific example in mind? I'm curious.


The most common example is DPDK [1]. It's a framework for building bespoke networking stacks that are usable from userspace, without involving the kernel.

You'll find DPDK mentioned a lot in the networking/HPC/data center literature. An example of a backend framework that uses DPDK is the seastar framework [2]. Also, I recently stumbled upon a paper for efficient RPC networks in data centers [3].

If you want to learn more, the p99 conference has tons of speakers talking about some interesting challenges in that space.

[1] https://www.dpdk.org/.

[2] https://github.com/scylladb/seastar

[3] https://github.com/erpc-io/eRPC


Interesting. I hear a lot more about sendfile(), kTLS and general kernel space tricks than I do about DPDK and userspace networking, but maybe it's just me.

I do wonder what trend is going to win: bypass the kernel or embrace the kernel for everything?

The way I see it, latency decreases either way (as long as you don't have to switch back and forth between kernel and user space), but userspace seems better from a security standpoint.

Then again, everyone is doing eBPF, so probably the "embrace the kernel" approach is going to win. Who knows.


The people who use DPDK and the like are a lot quieter about it. The nature of kernel development means that people tend to hear about what you're doing, while DPDK and userspace networking tends to happen in more proprietary settings.

That said, I'm not sure many people write webservers in DPDK, since the Kernel is pretty well suited to webservers (sendfile, etc.). Most applications that use kernel-bypass are more specialized.


The downside, of course, is that each program owns one instance of the hardware. Applications don't share the network card. This isn't a general purpose solution.

That may be acceptable for your purposes, or it may not.


Probably the most common example is sendfile() for writing file contents out to a socket without reading them into userspace:

https://man7.org/linux/man-pages/man2/sendfile.2.html
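
A minimal sketch of the Linux call for a static-file response ('client_fd' is assumed to be an already-accepted socket; error handling omitted):

  #include <fcntl.h>
  #include <sys/sendfile.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* Stream a whole file to a connected socket; the file data is copied
   * kernel-side and never passes through userspace buffers. */
  void serve_file(int client_fd, const char *path)
  {
      int fd = open(path, O_RDONLY);
      struct stat st;
      fstat(fd, &st);

      off_t offset = 0;
      while (offset < st.st_size) {
          if (sendfile(client_fd, fd, &offset, st.st_size - offset) <= 0)
              break;
      }
      close(fd);
  }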


Isn't that the opposite? That is, bypassing user space, not kernel space?


Oh, hmm, yeah, perhaps OP meant something more like using raw sockets to get packets directly into userspace without relying on the kernel to arrange them into streams?

I'm not very familiar with that though.


Yes, I knew about sendfile() but I wasn't aware of any web server using that (though I know Kafka uses it).

Then I found out Apache supports it via the EnableSendfile directive. Nice.

>This directive controls whether httpd may use the sendfile support from the kernel to transmit file contents to the client. By default, when the handling of a request requires no access to the data within a file -- for example, when delivering a static file -- Apache httpd uses sendfile to deliver the file contents without ever reading the file if the OS supports it.


Pretty much all modern Linux web servers support sendfile(). Examples:

* nginx: [1]
* Haskell webserver module: [2]
* caddy: [3]

[1]: https://nginx.org/en/docs/http/ngx_http_core_module.html#sen... [2]: https://hackage.haskell.org/package/warp-3.3.28/docs/Network... [3]: https://github.com/caddyserver/caddy/pull/5022


I'd expect most serious web servers support it. I've written one that does (workerd), it's not too hard.

That said, it's tricky to use if the server also does TLS termination... then you need kTLS, which is a much bigger can of worms.


Sendfile isn’t kernel bypass.


It sounds like a lot of the performance issues are TLB-related. Am I right in thinking huge-pages would help here? If so, it's a bit unfortunate they didn't test this in the paper.

Edit: Hm, it might not be possible to mmap files with huge-pages. This LWN article[1] from 5 years ago talks about the work that would be required, but I haven't seen any follow-ups.

[1]: https://lwn.net/Articles/718102/


Huge pages aren't pageable, so they wouldn't be particularly advantageous for a mmap DB anyway; you'd have to do traditional I/O & buffer management for everything.


No, huge pages wouldn't help. They would change when the TLB gets flushed, but the flushes would still be there.


Memory-mapped files = access violations when a disk read fails. If you're not prepared to handle those, don't use memory-mapped files. (Access violation exceptions are the same thing that happens when you attempt to read through a null pointer.)

Then there's the part with writes being delayed. Be prepared to deal with blocks not necessarily updating to disk in the order they were written to, and 10 seconds after the fact. This can make power failures cause inconsistencies.


> Be prepared to deal with blocks not necessarily updating to disk in the order they were written to, and 10 seconds after the fact. This can make power failures cause inconsistencies.

This is not specific to mmap -- regular old write() calls have the same behavior. You need to fsync() (or, with mmap, msync()) to guarantee data is on disk.


> This is not specific to mmap -- regular old write() calls have the same behavior.

This is not true. This depends on how the file was opened. You may request DIRECT | SYNC when opening, and then writes are acknowledged only when they have actually been written. This is obviously a lot slower than writing to cache, but it's the way for "simple" user-space applications to implement their own cache.

In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs. Sometimes it's even configurable. This is why databases also offer various modes of synchronization -- you need to know how your appliance works and configure the database accordingly.
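
A rough sketch of that DIRECT | SYNC path (the function name is mine; note that O_DIRECT also imposes alignment requirements on the buffer, offset, and length):

    #define _GNU_SOURCE        /* O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int write_durably(const char *path, const void *src, size_t len /* multiple of 4096 */)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
        if (fd < 0)
            return -1;

        void *buf;
        if (posix_memalign(&buf, 4096, len) != 0) {   /* O_DIRECT needs an aligned buffer */
            close(fd);
            return -1;
        }
        memcpy(buf, src, len);

        /* With O_SYNC, pwrite() returns only after the data is on stable storage. */
        ssize_t n = pwrite(fd, buf, len, 0);
        free(buf);
        close(fd);
        return n == (ssize_t)len ? 0 : -1;
    }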


> This is not true. This depends on how the file was opened. You may request DIRECT | SYNC

Well sure, but 99.9% of people don't do that (and shouldn't, unless they really know what they are doing).

> In the world of today, you are very rarely writing to something that's not network attached, and depending on your appliance, the meaning of acknowledgement from write() differs.

What network-attached storage actually uses O_SYNC behavior without being asked? I'd be quite surprised if any did this as it would make typical workloads incredibly slow in order to provide a guarantee they didn't ask for.


100% of people writing a database know about filesystem options like DIRECT and SYNC, and that is the subject of this paper.

Also, most of the network-attached storage people use is in the form of things like EBS, which is very careful to imitate the behavior of a real disk, but with different performance and some different (albeit very rare) failure modes.


100% of people writing databases also know how fsync() and msync() work. I interpreted this thread as being targeted at a wider audience.


It's literally for people writing their own database. Why would you interpret it differently?


It's fun to remember that fsync() on Linux, on ext4 at least, offers no real guarantee that the data was successfully written to disk. This happens when write errors from background buffered writes are handled internally by the kernel, which cleans up the error situation (marks dirty pages clean, etc.). Since the kernel can't know whether a later call to fsync() will ever happen, it can't just keep the error around. So when the call does happen, it will not return any error code. I don't know for sure, but msync() may well have the same behavior.

Here is an LWN article discussing the whole problem as the Postgres team found out about it.

https://lwn.net/Articles/752063/
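
For what it's worth, the conservative posture that came out of that discussion (as I understand it) is to treat the first fsync() failure as fatal for the affected data rather than retrying, since a retry can "succeed" after the kernel has already dropped the dirty pages. A minimal sketch of that stance:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    void checkpoint_or_die(int fd)
    {
        if (fsync(fd) != 0) {
            /* Retrying is not safe: the failed dirty pages may already have
             * been marked clean, so crash and recover from the log instead. */
            fprintf(stderr, "fsync failed: %s -- aborting\n", strerror(errno));
            abort();
        }
    }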


Linux throws a SIGBUS. A process, especially a database server, should anticipate such I/O failures by installing a SIGBUS handler.

For the second part of your comment, on Linux systems, there is the msync() system call that can be used to flush the page cache on demand.
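
A minimal sketch of what such a handler can look like, as a sigsetjmp/siglongjmp guard around reads of the mapping (names are mine; real code would need to be thread-aware and track which mapping faulted):

    #include <setjmp.h>
    #include <signal.h>
    #include <string.h>

    static sigjmp_buf read_env;

    static void on_sigbus(int sig)
    {
        (void)sig;
        siglongjmp(read_env, 1);      /* jump back out of the faulting access */
    }

    int safe_copy_from_map(void *dst, const void *map_src, size_t len)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_sigbus;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGBUS, &sa, NULL);

        if (sigsetjmp(read_env, 1) != 0)
            return -1;                /* the backing read failed */

        memcpy(dst, map_src, len);    /* may fault if a page can't be read in */
        return 0;
    }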


> msync() system call that can be used to flush the page cache on demand.

for everyone, not just the file you mapped to memory. I.e. the guarantee is that your file will be written, but there's no way to do that w/o affecting others. This is not such a hot idea in an environment where multiple threads / processes are doing I/O.


msync() affects only the pages that are part of the mmap'ed area you ask for in the arguments. From the man page:

> int msync(void addr[.length], size_t length, int flags);

> msync() flushes changes made to the in-core copy of a file that was mapped into memory using mmap(2) back to the filesystem


No it doesn't. That's physically impossible. Read what you quoted -- it never says that it's going to do it only for the file in question.

If you don't know why it's not possible, here's a simplified version: hardware protocols (such as SCSI) must have fixed-size messages to fit them through the pipeline. I.e. you cannot have a message larger than the memory segment used for communication with the device, because that would cause fragmentation and lead to the possibility of the message being corrupted (the "tail" being lost or arriving out of order).

On the other hand, to "flush" a file to persistent storage you'd have to specify all blocks associated with the file that need to be written. If you try to do this, it will create a message of arbitrary size, possibly larger than the memory you can store it in. So, the only way to "flush" all blocks associated with a file is to "flush" everything on a particular disk / disks used by the filesystem. And this is what happens in reality when you do any of the sync family commands. The difference is only in what portion of the not-yet synced data the OS will send to the disk before requesting a sync, but the sync itself is for the entire disk, there aren't any other syncs.


I don't know what you're talking about, but msync() flushes only the pages in that range. The pages are in the page cache (on Linux, it's a per-file xarray [1] of pages). Once all the dirty pages in the range are located, they go through the filesystem to be mapped to block numbers and then submitted to the block layer to be written to the storage device. Only the disk blocks mapped to the pages in that range will be written.

Source: I'm a Linux kernel developer.

[1] https://docs.kernel.org/core-api/xarray.html
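
To make the range semantics concrete, a small sketch of flushing just one region of a mapping (msync() wants a page-aligned start address, hence the rounding; the helper name is mine):

    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int flush_range(void *map_base, size_t offset, size_t len)
    {
        long page = sysconf(_SC_PAGESIZE);
        uintptr_t start = (uintptr_t)map_base + offset;
        uintptr_t aligned = start & ~((uintptr_t)page - 1);

        /* MS_SYNC blocks until the pages covering [start, start+len) are written. */
        return msync((void *)aligned, (start - aligned) + len, MS_SYNC);
    }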


I wonder how many apps don't handle errors from read() anyway.


does that get delivered as SIGSEGV to the process or something else?


On Linux, it's a SIGBUS.


Yes, I definitely would want to use mmap() in my storage system. And would love to see the limitations that make this tricky addressed.


The TLDR is that MMAP sorta does what you want, but DBMSes need more control over how/when data is paged in/out of memory. Without this extra control, there can be issues with transactional safety and performance.


For all of its usefulness in the good old days of rusty disks, I wonder if virtual memory is worth having for dedicated databases, caches, and storage heads. Avoiding TLB flushes entirely sounds like a huge win for massively multithreaded software, and memory management in a large shared flat address space doesn't sound impossibly hard.


This is the kind of debate that has been going on surrounding virtual memory forever[0][1]. If you can keep everything in memory, then you're golden. But eventually you won't, and you'll need to rely on secondary storage.

Is there a performance benefit to be had by managing the memory and paging yourself? Yes. But eventually you will also consider running processes next to your database, for logging, auditing, ingesting data, running backups, etc. Virtual memory across the whole system helps with that, especially if other people will be using your database in ways you can't predict. As for the efficiency of MMUs and the OS, seems like for almost all cases it's "satisfactory" enough[1].

[0] http://denninginstitute.com/pjd/PUBS/bvm.pdf

[1] From 1969! https://dl.acm.org/doi/pdf/10.1145/363626.363629


I guess things like mshare could be extended to the entire process address spaces and the kernel could avoid TLB invalidation on context switches between them. Core affinity could be used to keep other programs from scheduling on the cores intended for processes sharing the whole address space.


The jump in address sizes starts to get too unwieldy. 32 bit addresses were ok, 64 bit addresses start to get clunky, 128 bit would be exorbitant for CPU real estate. There's a reason AMD64 still only supported 40 physical address bits when it was introduced, and later only expanded to 48 bits.

The reality is there will always be a storage hierarchy, and paging will always be the best mechanism for dealing with it. Primary memory will always be the most expensive tier, no matter what technology it's based on, so there will always be something slower, cheaper, and denser used for secondary storage. Its capacity will exceed primary's, and it will always be most efficient to reference secondary storage in chunks - pages - rather than at individual byte addresses.


I don't really see what those two things have to do with each other. When you don't use mmap, you manage the disc<->ram storage virtualisation yourself. Hardware paging, then, is pure overhead. The parent doesn't argue against layering of storage media, nor against chunking in general. Only against MMUs as a mechanism for implementing it.


The mention of a large shared flat address space implied no paging, to me. Maybe I just read something into it that wasn't there.


The 'paging' is implemented in software, not in hardware. This is how databases that don't use mmap already work, so MMUs are already pure overhead for them.


I've become convinced that there are very few, if any, reasons to MMAP a file on disk. It seems to simplify things in the common case, but in the end it adds a massive amount of unnecessary complexity.


It's incredibly useful in read-only, memory-constrained scenarios. E.g., we used to mmap all of our animation data on many rendering engines I worked on, where having ~20-50 MB of animation data but only "paying" a couple of tens of KB based on usage patterns was very handy. It becomes even more powerful when you have multiple processes sharing that data and the kernel is able to re-use clean pages across processes.

From reading the paper, most of the concerns are around the write side. LMDB is the primary implementation I know of that leans heavily into mmap, but it also comes with a number of constraints there (single writer, long-lived readers can block page reclamation and cause unbounded database growth, etc.). As with any tech choice, it's about knowing the constraints/trade-offs and making appropriate choices for your domain.


Complexity? You mmap it in and then read the multi-terabyte file as if it were an array.

The alternative, with explicit file I/O, sucks in terms of complexity. I get that you can write bespoke code that performs better, but mmap is a one-liner to turn a file into an array.
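
Roughly like this sketch (error handling omitted):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    const unsigned char *map_file(const char *path, size_t *out_len)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);

        void *base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);                      /* the mapping keeps the file alive */

        *out_len = st.st_size;
        return base == MAP_FAILED ? NULL : base;
    }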


You need to handle the exceptions/signals every time a disk read fails. With classic I/O, you know when the read will happen. But with memory-mapped files, the exception can happen at any point while you are reading from the mapped range.

As for why disk reads fail: yes, that's a thing. It's less common on internal storage (bad sectors), but more common on removable USB devices or network drives (especially over wifi).


Multi-terabyte? Better hope you have lots of spare RAM for all those page structures the kernel has to keep.


"mmap" in the general case is incredibly useful.

There's so much you get "for free", and the UX/DX of reading/writing it is great, especially if you're primarily operating on structs instead of raw byte/string data.

(Example: mapping a file and reinterpret_cast<>'ing it from bytes to in-memory struct representations.)
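
In C the equivalent is a plain pointer cast. A hypothetical sketch, assuming you control the on-disk layout (the record layout here is made up, and packing, alignment, endianness, and strict aliasing are all on you):

    #include <stdint.h>

    /* Hypothetical on-disk record, viewed in place through the mapping. */
    struct record {
        uint32_t id;
        uint32_t flags;
        uint64_t payload_offset;
    } __attribute__((packed));

    static const struct record *record_at(const unsigned char *map_base, size_t index)
    {
        return (const struct record *)(map_base + index * sizeof(struct record));
    }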

It's just that for the _particular_ case of a DBMS that relies on optimal I/O and transactionality, the general-purpose kernel implementation of mmap falls short of what you can implement by hand.


I've been thinking for the past few years about how to make a scenario like 'git clone' of a large repo go fast. One thought is to memory-map the destination files being written by git and then copy/unzip the data there. You'd save a copy versus the staging buffer you'd currently be passing to write(). However, the overhead of managing the TLB shootdowns would likely be fatal except for the largest output files.
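
Something like this sketch is the idea (hypothetical, error handling abbreviated): pre-size the destination file, map it writable, and copy/decompress straight into the mapping instead of through a staging buffer:

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int write_via_mmap(const char *path, const void *data, size_t len)
    {
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || ftruncate(fd, len) < 0)
            return -1;

        void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        if (dst == MAP_FAILED)
            return -1;

        memcpy(dst, data, len);     /* in git's case this would be the inflate step */
        return munmap(dst, len);    /* writeback happens asynchronously */
    }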


If you truss a binary starting up, the OS normally mmaps the binary, at least in the tests I ran.


A well-written bespoke function can beat a generalized function at a specific task.

If you have the resources to write and maintain the bespoke method, great. The large database developers probably do. For everyone else, please don't take this link and go around claiming mmap is bad; that gets tiresome and is misguided. mmap is a shortcut for accessing large files in a non-linear fashion, and it's good at that too. Just not as good as a bespoke function.


This paper isn't aimed at random developers, and it's not a criticism of mmap in general.

This is an appeal to core database engineers to stop using the wrong tool for the job.


mmap can be handy but usually is not a good idea when you care about ACID properties. So it tends to be most useful outside databases.


Can you give some examples where mmap is useful?


I once improved a parser's performance a huge amount (iirc, something like 500x) when parsing large (>1GB) text files by mmap'ing the files instead of reading them into a byte array. It's not a magic bullet but it was alright for that application.

Another technique that can only be done with mmap is to map two contiguous regions of virtual memory to the same underlying buffer. This allows you to use a ring buffer but only read from/write to what looks like a contiguous region of memory.
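
For the curious, a sketch of that mirrored ring buffer trick on Linux (assuming memfd_create is available and the size is a multiple of the page size; error handling abbreviated):

    #define _GNU_SOURCE        /* memfd_create */
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static unsigned char *mirror_ring(size_t size /* multiple of page size */)
    {
        int fd = memfd_create("ring", 0);              /* anonymous backing file */
        if (fd < 0 || ftruncate(fd, size) < 0)
            return NULL;

        /* Reserve 2*size of contiguous address space. */
        unsigned char *base = mmap(NULL, 2 * size, PROT_NONE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return NULL;

        /* Map the same backing file into both halves. */
        mmap(base,        size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        mmap(base + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
        close(fd);

        /* A read or write that runs past base+size-1 now aliases the start,
         * so wrapped regions look contiguous to the caller. */
        return base;
    }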


If your data is likely to already be in the system cache, memory mapping can achieve zero copying of the data, whereas reading will perform at least one memcpy. So there can be a performance advantage depending on the usage pattern.

Also, I've never tested this, but I believe mapped files will get flushed as long as the system stays running. So if you only need resilience against abnormal termination rather than system crashes, it seems like a good option?


> Also, I've never tested this, but I believe mapped files will get flushed as long as the system stays running. So if you only need resilience against abnormal termination rather than system crashes, it seems like a good option?

Linux will not lose data written to a MAP_SHARED mapping when the process crashes.

But! Linux will synchronously update mtime when starting to write to a currently write protected mapping (e.g. one which was just written out). This means (a) POSIX is violated (IMO) and (b) what should be a minor fault to enable writes turns into an actual metadata write, which can cause actual synchronous IO.

I have an ancient patch set to fix this, but I never got it all the way into upstream Linux.

What you can do is mmap a file on a tmpfs as long as you trust yourself to have some other reliable process handle the data even if your application terminates abnormally. This is awkward with a container solution if you need to survive termination of the entire container.


It’s useful in a world of several processes sharing things. This is much less common today in a world of “single process” containers and VMs as well as monolithic processes using threads or async techniques.

However, Java can build a special library file of the core JRE classes that it can mmap into memory with the intent to speed up startup times, mostly for small Java programs.

Guile scheme will mmap files that have been compiled to byte code. You can visualize a contrived (especially today) scenario where Guile is used for CGI handlers, having the bulk of their code mapped, the overall memory impact of simultaneous handlers is much lower, as well as start up times.

The process model is less common today so the value of this goes down, but it can still have its place.


One place I've seen it used was a lib by a guy called DHoerl for reading images that are too big to fit in memory (this was years ago on iOS).

A very over-simplified and probably a bit incorrect description of what it did was to create a smaller version of the image - one that could fit in memory - by sub sampling every nth pixel, which was addressed via mmap.

It actually dealt with jpegs so I have no idea how that bit worked as they are not bitmaps.


glibc itself uses mmap under the covers when doing malloc in certain situations. Granted, it's anonymous and not file-backed, but it's still proven to be performant. See, e.g, mallopt(3).
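
For example, allocations at or above the mmap threshold are served by anonymous mmap() rather than the heap, and the cutoff is tunable (a tiny, glibc-specific illustration):

    #include <malloc.h>
    #include <stdlib.h>

    int main(void)
    {
        mallopt(M_MMAP_THRESHOLD, 64 * 1024);   /* tune the cutoff (glibc-specific) */
        void *big = malloc(1 * 1024 * 1024);    /* likely backed by an anonymous mmap */
        free(big);                              /* munmap'ed, not kept on the heap */
        return 0;
    }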


This reads more like "don't write your own DBMS" than "don't use mmap."


maybe a stupid question but what is wrong with coffee and spicy food?


For the majority of the world, nothing. But if your diet consists of fairly bland food then it can result in unpleasant trips to the toilet.


Acid reflux I thought


To put it crudely, I think the punchline is that spicy food hurts on the way out, and coffee makes that happen with greater velocity.


Just doesn’t taste good together I think



