
To me it's more sensible to consider writing a database from scratch for the unikernel. Or at least the storage engine portion of it.

There's plenty of recent DB research about running up against the wall of what the Linux VM subsystem can provide in terms of memory management. LeanStore [1], Umbra [2], and research [3] since then show that to crank the most out of buffer pool management, we're getting closer and closer to the TLB itself: fiddling with mmap & VM overcommit, pointer swizzling (or not), userfaultfd & custom page fault handling, even custom kernel drivers, etc.
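
For a flavor of the userfaultfd approach: below is a minimal sketch (mine, not from those papers) where a handler thread services missing-page faults on an anonymous buffer-pool region and installs page contents itself -- that's where you'd hook in reads from your own storage layer. Error handling is omitted, and on kernels >= 5.11 unprivileged use may additionally need UFFD_USER_MODE_ONLY or the vm.unprivileged_userfaultfd sysctl:

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define POOL_PAGES 1024
    static long page_size;

    /* Handler thread: the DB's own "page fault handler". A real engine
       would look up which DB page lives at the faulting address and
       read it from disk; we just fill a pattern. */
    static void *fault_handler(void *arg) {
        int uffd = (int)(long)arg;
        char *buf = aligned_alloc(page_size, page_size);
        for (;;) {
            struct uffd_msg msg;
            if (read(uffd, &msg, sizeof msg) != (ssize_t)sizeof msg) break;
            if (msg.event != UFFD_EVENT_PAGEFAULT) continue;
            memset(buf, 0xAB, page_size);          /* stand-in for a disk read */
            struct uffdio_copy copy = {
                .dst = msg.arg.pagefault.address & ~(page_size - 1),
                .src = (unsigned long)buf,
                .len = page_size,
            };
            ioctl(uffd, UFFDIO_COPY, &copy);       /* install page, wake faulter */
        }
        return NULL;
    }

    int main(void) {
        page_size = sysconf(_SC_PAGESIZE);

        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        /* The buffer pool: plain anonymous memory, no physical pages yet. */
        char *pool = mmap(NULL, POOL_PAGES * page_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Route missing-page faults in this range to us instead of the
           kernel's default zero-fill path. */
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)pool,
                       .len = POOL_PAGES * page_size },
            .mode = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        pthread_t thr;
        pthread_create(&thr, NULL, fault_handler, (void *)(long)uffd);

        /* First touch of any pool page now traps to fault_handler. */
        printf("byte = 0x%02x\n", (unsigned char)pool[42]);
        return 0;
    }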

To really crank performance on in-memory and hybrid in-memory/disk systems -- why even bother with Linux then? Let's run directly on the hypervisor! On the whole, DBs already manage their own persistent storage, so they don't strictly need a filesystem (especially in the New Cloud World where pages often go up into S3 buckets etc.); they manage their own memory, often their own user accounts, and often their own concurrency too. They're really an OS within an OS in many respects.

Virtualization already abstracts things enough that the OS's drivers for network, disk, etc. aren't as big a benefit. Security and monitoring can be handled at the per-VM level. We're no longer held back by the requirement to have a pile of drivers for different hardware configurations. At least in broad strokes.

But I wouldn't start with Postgres as a base, that's for sure. If you're building enough of libc and a POSIX/Unix ABI that you can run stock programs, there's likely little benefit at all.

I doubt you'd get much win for analytical workloads, but for very high throughput transactional workloads ... what fun!

Let's go all the way baby!

[1] https://db.in.tum.de/~leis/papers/leanstore.pdf

[2] https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf

[3] https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/_my_direct_up...




Doesn't Oracle (and, reaching waaaaay back in the memory banks, I think DB2) do something similar, in the sense that it's packaged up with its own VM/storage subsystems that bypass the native OS's facilities and are hyper-optimized for the DB use case?


All serious databases do their own low-level buffer pool management for relations & tuples, bypassing e.g. malloc, for a bunch of reasons. There are various techniques for this; the papers I link to in my comment cover some recent ones.

My point is that in the end what a DB needs to do at this level is translate tuple or btree node references into physical memory references: either by a fault that pulls them from disk, by a fault that maps a virtual address onto physical memory, or by a live reference, etc. And it feels to me that a lot of the logic there mirrors what already happens in the OS's VM subsystem. (For various reasons the "automagic" facilities that do things like this in the OS -- file-backed mmap -- are a terrible way to implement a DB, BTW: https://db.cs.cmu.edu/mmap-cidr2022/)
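
To make the swizzling half concrete: a toy, single-threaded sketch in the spirit of LeanStore's "swips" [1]. buffer_pool_fix is a hypothetical buffer manager call, and the tagging leans on user-space pointers on x86-64 never having the top bit set:

    #include <stdint.h>

    /* One 64-bit word per child reference: the top bit tags whether it
       currently holds a hot in-memory pointer or a cold on-disk page id. */
    #define COLD_TAG (1ULL << 63)

    typedef uint64_t swip_t;

    /* Hypothetical: read the page into a frame, return its address. */
    extern void *buffer_pool_fix(uint64_t page_id);

    static inline void *resolve(swip_t *ref) {
        if (!(*ref & COLD_TAG))
            return (void *)(uintptr_t)*ref;              /* hot: one branch + deref */
        void *frame = buffer_pool_fix(*ref & ~COLD_TAG); /* cold: fault it in */
        *ref = (uint64_t)(uintptr_t)frame;               /* swizzle for next time */
        return frame;
    }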

That, and of course at the persistent storage layer the data structures are tuned around physical device performance characteristics. Back in the day that meant dealing with the mechanics of heads & cylinders; these days it's the interface and quirks of flash storage, write amplification, etc. Hence log-structured merge trees, B-epsilon trees, and so on. The persistent storage layer inside the DB looks very much like a filesystem. I mean, it kind of is one, but with a different concept of namespace (relations & tuples & versions etc. instead of files and directories).
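
The core move underneath the LSM family fits in a few lines: never update in place, always append, and let an in-memory index (assumed to live elsewhere, along with compaction) point at the newest version. A toy sketch:

    #include <stdint.h>
    #include <unistd.h>

    /* Append a length-prefixed record at the log's tail; random tuple
       updates become purely sequential device writes. */
    static int64_t log_append(int log_fd, const void *rec, uint32_t len) {
        int64_t off = lseek(log_fd, 0, SEEK_END);
        if (off < 0) return -1;
        if (write(log_fd, &len, sizeof len) != (ssize_t)sizeof len) return -1;
        if (write(log_fd, rec, len) != (ssize_t)len) return -1;
        return off;   /* caller stores (key -> off) in its index */
    }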

So, I dunno, I'd maybe rather drink straight from the spring than get a bottle of bubbly water at the restaurant. Or something something weird analogy.


As I recall, Oracle on Solaris was able to go direct to the disks and saw a lot of perf improvement over similar HP-UX or Linux machines.


This is my recollection as well. I was at a big Solaris shop when I cared enough about DBs to dig in this deep, but there were many, many discussions with the ops folks about whether this bypass was a good thing, and in the end the performance metrics were judged 'better', for values of better I don't accurately recall.


The database itself assumes the existence of a filesystem. The whole toolset from Oracle also includes a filesystem that you can install and use with the database, but the database itself cannot just work with a bare block device.


I think that's exactly what I said.


Oracle did its own storage on block-level devices (raw disk partitions) as far back as 2000. I don’t know if it still does.


I have never used it but I've heard people talk about that as an essentially failed path. Admins apparently hated it, and the performance likely wasn't all that.

However, we are in a different era now. Virtualization and cloud infrastructure changes this whole situation quite a bit.


> On the whole, DBs already manage their own persistent storage

Nope. No, they don't. They use the filesystem. I think only MSSQL can be configured to use a bare block device, and that configuration is rarely used in practice.

Oracle, for example, implements its own filesystem and will use it if you configure it to, but the database itself expects there to be files, directories, etc.


You're missing the point, or misreading me. Yes, DBs on the whole use files in the filesystem, but inside those files are data structures (btrees, etc.) that are really self-managed storage.

Your filesystem uses tree data structures to map files and dirs to physical locations in block storage, caches chunks of them, etc. A DB's storage layer does the same for relations, tuples, and indexes, including explicit optimization for various page sizes, the perf characteristics of the underlying block device, and so on.
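
Concretely, something like this lives inside those files -- field choices here are illustrative (loosely PostgreSQL-flavored), not any particular engine's real layout:

    #include <stdint.h>

    #define PAGE_SIZE 8192

    /* The filesystem sees one big file; everything below is the
       engine's own bookkeeping within each fixed-size page. */
    typedef struct {
        uint64_t lsn;          /* WAL position for recovery */
        uint16_t slot_count;   /* tuple slots in use */
        uint16_t free_start;   /* slot array grows up from here */
        uint16_t free_end;     /* tuple data grows down from here */
    } page_header_t;

    typedef struct {
        uint16_t offset;       /* tuple's start within the page */
        uint16_t length;
    } slot_t;

    /* Layout: [header][slot 0][slot 1] ... free ... [tuple 1][tuple 0] */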

Yes, it's doing that, in turn, inside files, but that's quite different from how a "regular application" uses files. It's using very little of the FS's value-add beyond it being a way to share tenancy with other things on the machine.

If e.g. a DB used a file for every tuple ("row"), that would suck. If it relied completely on the OS's default sync and recovery facilities, that would also not be ideal.


Sorry, no, you are missing the point. Most popular databases cannot exist without filesystems. A filesystem isn't just files and directories; it's the guarantees you get on reads/writes/copy/delete/create, the cache behavior, the ownership, snapshots, deduplication, redundancy, compression, checksums, encryption... (obviously, not all filesystems do all of this, and some databases can already take on some filesystem functions).

Databases don't implement those features themselves, but would have to if they decided to work with bare block devices.



