
To me it's more sensible to consider writing a database from scratch for the unikernel. Or at least the storage engine portion of it.

There's plenty of recent DB research about running up against the wall of what the Linux VM subsystem can provide in terms of memory management. LeanStore [1], Umbra [2], and research [3] since then show that to crank the most out of buffer pool management, we're getting closer and closer to the TLB itself: fiddling with mmap & VM overcommit, pointer swizzling (or not), userfaultfd & custom page fault handling, even custom kernel drivers, etc.
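
For a flavor of the userfaultfd approach: below is a minimal sketch (mine, not from those papers) where a handler thread services missing-page faults on an anonymous buffer-pool region and installs page contents itself -- that's where you'd hook in reads from your own storage layer. Error handling is omitted, and on kernels >= 5.11 unprivileged use may additionally need UFFD_USER_MODE_ONLY or the vm.unprivileged_userfaultfd sysctl:

    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define POOL_PAGES 1024
    static long page_size;

    /* Handler thread: the DB's own "page fault handler". A real engine
       would look up which DB page lives at the faulting address and
       read it from disk; we just fill a pattern. */
    static void *fault_handler(void *arg) {
        int uffd = (int)(long)arg;
        char *buf = aligned_alloc(page_size, page_size);
        for (;;) {
            struct uffd_msg msg;
            if (read(uffd, &msg, sizeof msg) != (ssize_t)sizeof msg) break;
            if (msg.event != UFFD_EVENT_PAGEFAULT) continue;
            memset(buf, 0xAB, page_size);          /* stand-in for a disk read */
            struct uffdio_copy copy = {
                .dst = msg.arg.pagefault.address & ~(page_size - 1),
                .src = (unsigned long)buf,
                .len = page_size,
            };
            ioctl(uffd, UFFDIO_COPY, &copy);       /* install page, wake faulter */
        }
        return NULL;
    }

    int main(void) {
        page_size = sysconf(_SC_PAGESIZE);

        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        /* The buffer pool: plain anonymous memory, no physical pages yet. */
        char *pool = mmap(NULL, POOL_PAGES * page_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Route missing-page faults in this range to us instead of the
           kernel's default zero-fill path. */
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)pool,
                       .len = POOL_PAGES * page_size },
            .mode = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        pthread_t thr;
        pthread_create(&thr, NULL, fault_handler, (void *)(long)uffd);

        /* First touch of any pool page now traps to fault_handler. */
        printf("byte = 0x%02x\n", (unsigned char)pool[42]);
        return 0;
    }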

To really crank performance on in-memory and hybrid in-memory/disk systems -- why even bother with Linux then? Let's run directly on the hypervisor! On the whole, DBs already manage their own persistent storage, so they don't strictly need a filesystem (especially in the New Cloud World where pages often go up into S3 buckets etc.); they manage their own memory, often their own user accounts, and often their own concurrency too. They're really an OS within an OS in many respects.

Virtualization already abstracts things enough that the OS's drivers for network, disk, etc. aren't as big a benefit. Security and monitoring can be handled at the per-VM level. We're no longer held back by the requirement to have a pile of drivers for different hardware configurations. At least in broad strokes.

But I wouldn't start with Postgres as a base, that's for sure. If you're building enough of libc and a POSIX/Unix ABI that you can run stock programs, there's likely little benefit at all.

I doubt you'd get much win for analytical workloads, but for very high throughput transactional workloads ... what fun!

Let's go all the way baby!

[1] https://db.in.tum.de/~leis/papers/leanstore.pdf

[2] https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf

[3] https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/_my_direct_up...




Doesn't Oracle (and, reaching waaaaay back in the memory banks, I think DB2) do something similar, in the sense that it's packaged up with its own VM/storage subsystems that bypass the native OS's facilities and are hyper-optimized for the DB use case?


All serious databases do their own low-level buffer pool management for relations & tuples, bypassing e.g. malloc, for a bunch of reasons. There are various techniques for this; the papers I link to in my comment cover some recent ones.

My point is that in the end what a DB needs to do at this level is translate tuple or btree node references into physical memory references: either by a fault that pulls them from disk, by a fault that maps a virtual address onto physical memory, or by a live reference, etc. And it feels to me that a lot of the logic there mirrors what already happens in the OS's VM subsystem. (For various reasons the "automagic" facilities that do things like this in the OS -- file-backed mmap -- are a terrible way to implement a DB, BTW: https://db.cs.cmu.edu/mmap-cidr2022/)
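
To make the swizzling half concrete: a toy, single-threaded sketch in the spirit of LeanStore's "swips" [1]. buffer_pool_fix is a hypothetical buffer manager call, and the tagging leans on user-space pointers on x86-64 never having the top bit set:

    #include <stdint.h>

    /* One 64-bit word per child reference: the top bit tags whether it
       currently holds a hot in-memory pointer or a cold on-disk page id. */
    #define COLD_TAG (1ULL << 63)

    typedef uint64_t swip_t;

    /* Hypothetical: read the page into a frame, return its address. */
    extern void *buffer_pool_fix(uint64_t page_id);

    static inline void *resolve(swip_t *ref) {
        if (!(*ref & COLD_TAG))
            return (void *)(uintptr_t)*ref;              /* hot: one branch + deref */
        void *frame = buffer_pool_fix(*ref & ~COLD_TAG); /* cold: fault it in */
        *ref = (uint64_t)(uintptr_t)frame;               /* swizzle for next time */
        return frame;
    }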

That, and of course at the persistent storage layer the data structures are tuned around physical device performance characteristics. Back in the day that meant dealing with the mechanics of heads & cylinders; these days it's the interface and quirks of flash storage, write amplification, etc. Hence log-structured merge trees, B-epsilon trees, and so on. The persistent storage layer inside the DB looks very much like a filesystem. I mean, it kind of is one, but with a different concept of namespace (relations & tuples & versions etc. instead of files and directories).
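
The core move underneath the LSM family fits in a few lines: never update in place, always append, and let an in-memory index (assumed to live elsewhere, along with compaction) point at the newest version. A toy sketch:

    #include <stdint.h>
    #include <unistd.h>

    /* Append a length-prefixed record at the log's tail; random tuple
       updates become purely sequential device writes. */
    static int64_t log_append(int log_fd, const void *rec, uint32_t len) {
        int64_t off = lseek(log_fd, 0, SEEK_END);
        if (off < 0) return -1;
        if (write(log_fd, &len, sizeof len) != (ssize_t)sizeof len) return -1;
        if (write(log_fd, rec, len) != (ssize_t)len) return -1;
        return off;   /* caller stores (key -> off) in its index */
    }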

So, I dunno, I'd maybe rather drink straight from the spring than get a bottle of bubbly water at the restaurant. Or something something weird analogy.


As I recall, Oracle on Solaris was able to go direct to the disks and saw a lot of perf improvement over similar HP-UX or Linux machines.


This is my recollection as well. I was at a big Solaris shop when I cared enough about DBs to dig in this deep, but there were many, many discussions with the ops folks about whether this bypass was a good thing, and in the end the performance metrics were judged 'better', for values of better I don't accurately recall.


The database itself assumes the existence of a filesystem. The whole toolset from Oracle also includes a filesystem that you can install and use with the database, but the database itself cannot just work with a bare block device.


I think that's exactly what I said.


Oracle did its own storage on block-level devices (raw disk partitions) as far back as 2000. I don’t know if it still does.


I have never used it but I've heard people talk about that as an essentially failed path. Admins apparently hated it, and the performance likely wasn't all that.

However, we are in a different era now. Virtualization and cloud infrastructure changes this whole situation quite a bit.


> On the whole, DBs already manage their own persistent storage

Nope. No, they don't. They use the filesystem. I think only MSSQL can be configured to use a bare block device, and that configuration is rarely used in practice.

Oracle, for example, implements its own filesystem and will use it if you configure it to, but the database itself expects there to be files, directories, etc.


You're missing the point, or misreading me. Yes, DBs on the whole use files in the filesystem, but inside those files are data structures (btrees, etc.) that are really self-managed storage.

Your filesystem uses tree data structures to map files and dirs to physical locations in block storage, caches chunks of them, etc. A DB's storage layer does the same for relations, tuples, and indexes, including explicit optimization for various page sizes, the perf characteristics of the underlying block device, and so on.
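
Concretely, something like this lives inside those files -- field choices here are illustrative (loosely PostgreSQL-flavored), not any particular engine's real layout:

    #include <stdint.h>

    #define PAGE_SIZE 8192

    /* The filesystem sees one big file; everything below is the
       engine's own bookkeeping within each fixed-size page. */
    typedef struct {
        uint64_t lsn;          /* WAL position for recovery */
        uint16_t slot_count;   /* tuple slots in use */
        uint16_t free_start;   /* slot array grows up from here */
        uint16_t free_end;     /* tuple data grows down from here */
    } page_header_t;

    typedef struct {
        uint16_t offset;       /* tuple's start within the page */
        uint16_t length;
    } slot_t;

    /* Layout: [header][slot 0][slot 1] ... free ... [tuple 1][tuple 0] */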

Yes, it's doing that, in turn, inside files, but that's quite different from how a "regular application" uses files. It's using very little of the FS's value-add beyond it being a way to share tenancy with other things on the machine.

If e.g. a DB used a file for every tuple ("row"), that would suck. If it relied completely on the OS's default sync and recovery facilities, that would also not be ideal.


Sorry, no, you are missing the point. Most popular databases cannot exist without filesystems. A filesystem isn't just files and directories; it's the guarantees you get on reads/writes/copy/delete/create, the cache behavior, the ownership, snapshots, deduplication, redundancy, compression, checksums, encryption... (obviously, not all filesystems do all of this, and some databases can already take on some filesystem functions).

Databases don't implement those features themselves, but would have to if they decided to work with bare block devices.



