Hacker News

Why is this "hacky"? I've long wondered why databases need to be on filesystems, other than simply convenience. If you think about it, a filesystem is, itself, basically a database (not a relational one of course). So it seems like you could improve performance a lot by eliminating the filesystem layer and going straight to direct disk access.

Also, how this would be done would probably change depending on whether you're using a spinning-rust HD or an SSD, or a RAID array.




>I've long wondered why databases need to be on filesystems, other than simply convenience. [...]So it seems like you could improve performance a lot by eliminating the filesystem layer and going straight to direct disk access.

Both Oracle RDBMS and older versions of Microsoft SQL Server had the option to use "raw disk devices" / "raw partitions" instead of the filesystem. It had some obvious justifications, such as avoiding the "wasteful double buffering" of a file system cache that is redundant with the database's own cache, and avoiding "unnecessary" extra I/O abstraction layers.

Microsoft later removed that option because the small performance gain wasn't worth losing the easier management of the NTFS file system.


SQLite essentially achieves this as well by keeping the entire database in one file. It’s even faster than the filesystem in some cases: https://www.sqlite.org/fasterthanfs.html
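The linked page benchmarks reading and writing many small blobs. A rough sketch of the idea (not the actual benchmark) is below: storing a thousand small blobs as rows in one SQLite file means one transaction and one fsync, instead of a thousand file creations.

```python
import os
import sqlite3
import tempfile

# Illustrative sketch, not the benchmark from the linked page: many small
# blobs stored as rows in a single SQLite file instead of one file each.
db_path = os.path.join(tempfile.mkdtemp(), "blobs.db")
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE blobs (name TEXT PRIMARY KEY, data BLOB)")
with con:  # one transaction -> one commit/fsync for all 1000 writes
    con.executemany(
        "INSERT INTO blobs VALUES (?, ?)",
        ((f"blob{i}", os.urandom(100)) for i in range(1000)),
    )
row = con.execute("SELECT data FROM blobs WHERE name = 'blob42'").fetchone()
print(len(row[0]))  # 100
```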


Oracle also had (has?) the capability to split up database volumes on raw disks based on I/O load.


I'd like to go the other way, and have a filesystem that leverages a proper database for its metadata.

I can search millions of records in a split second in SQL, but searching files on my hard disk takes enormously more time. The solutions out there for speeding it up (like the Windows Indexing service) rely on asynchronous indexing that's patched on top of the filesystem instead of integrated in real time within it, and are consequently brittle (in my experience), slow down your computer at inconvenient times, and too often aren't instantly up to date.
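The gap is easy to demonstrate: an indexed database query over file metadata is effectively instant. The schema and paths below are invented for illustration, using an in-memory SQLite table standing in for the kind of index a filesystem could maintain natively.

```python
import sqlite3

# Hypothetical sketch: the kind of metadata index a filesystem could keep
# natively. Schema and paths here are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE files (
    path TEXT PRIMARY KEY, ext TEXT, size INTEGER, mtime REAL)""")
con.execute("CREATE INDEX idx_ext ON files(ext)")
with con:
    con.executemany(
        "INSERT INTO files VALUES (?, ?, ?, ?)",
        ((f"/data/file{i}.{'jpg' if i % 10 == 0 else 'txt'}",
          "jpg" if i % 10 == 0 else "txt", i, 0.0)
         for i in range(100_000)),
    )
# An indexed lookup like this is what makes "find all the photos" instant.
n = con.execute("SELECT COUNT(*) FROM files WHERE ext = 'jpg'").fetchone()[0]
print(n)  # 10000
```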

Being able to have stored procedures fire when changes occur, like database triggers, would be another interesting capability that could take the place of event filters and file watchers (especially if critical portions could be configured to run sequentially and atomically, akin to IRQ handlers).
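SQLite triggers give a feel for what this would look like: the logic fires synchronously, inside the same transaction as the change, rather than being notified after the fact. The "files"/"audit" tables here are invented stand-ins for filesystem metadata.

```python
import sqlite3

# Sketch of the idea using SQLite triggers: an update to the (invented)
# "files" table synchronously fires logic inside the same transaction,
# the way an integrated filesystem trigger might.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE files (path TEXT PRIMARY KEY, size INTEGER);
CREATE TABLE audit (path TEXT, event TEXT);
CREATE TRIGGER on_change AFTER UPDATE ON files BEGIN
    INSERT INTO audit VALUES (NEW.path, 'modified');
END;
""")
con.execute("INSERT INTO files VALUES ('/tmp/a', 1)")
con.execute("UPDATE files SET size = 2 WHERE path = '/tmp/a'")
print(con.execute("SELECT event FROM audit").fetchone()[0])  # modified
```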

I feel like databases have gotten lots of love over the decades and are super optimized (and by this point are pretty dang reliable). Filesystems feel like they lost their time in the spotlight and have floundered with less innovation.

Hardware that calculates CRCs of blocks in real time would also be awesome (imagine what it would do for sync). The ironic thing is that your HDD already does this; there's just no way to tap into that information at the software / filesystem-driver level.
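A software sketch of what that buys you for sync: keep a 4-byte CRC per block (as checksumming filesystems like ZFS do in software), and changed blocks can be found by comparing checksums instead of re-reading full block contents. Block size and data here are arbitrary.

```python
import os
import zlib

# Block-level checksumming sketch. The comment above is that the drive
# already computes something like this in hardware; here it's zlib.crc32.
BLOCK = 4096
data = os.urandom(BLOCK * 4)
blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
crcs = [zlib.crc32(b) for b in blocks]

# Sync/scrub can then compare cheap 4-byte CRCs instead of full blocks.
def changed_blocks(new_data, old_crcs):
    return [i for i, c in enumerate(old_crcs)
            if zlib.crc32(new_data[i * BLOCK:(i + 1) * BLOCK]) != c]

# Flip one bit in block 1; CRC32 detects all single-bit changes.
mutated = (blocks[0]
           + blocks[1][:-1] + bytes([blocks[1][-1] ^ 1])
           + blocks[2] + blocks[3])
print(changed_blocks(mutated, crcs))  # [1]
```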


>I'd like to go the other way, and have a filesystem that leverages a proper database for its metadata.

Microsoft made an attempt back in 2003 at creating a "smarter" file system with rich metadata based on a real database engine (MS SQL Server) ... but eventually abandoned the project: https://en.wikipedia.org/wiki/WinFS

WinFS wasn't necessarily going to completely replace NTFS but Microsoft did have ambitious plans for its integration into everything.

It's interesting that the Unix/Linux/Apple platforms that designed "next generation" file systems like ZFS/ext4/APFS after the failed WinFS project had a chance to add their own WinFS-style smart metadata features, but they didn't.

The later fragmentation of user data across the cloud (AWS objects, Apple iCloud, Backblaze backups, etc.) instead of every file being saved on a home computer makes a "universal rich metadata index stored in a database engine" adopted as a cross-platform standard by all operating systems even less realistic today than back in 2003. So the core OS file systems remain a "dumb blob of bytes," with a separate metadata index db layered on top in vendor-specific schemes like Apple's Spotlight db, Windows Search's ".edb" file, etc.


When WinFS was canceled, I started my own pet project to build a new kind of file system that totally changed the way file metadata was stored and managed. I built an object based system that could easily handle hundreds of millions of files within a single container (what I called a pod) and do searches much like how you would search a large database table.

The 'file table' contains a 64-byte record for each object and has some unique characteristics that make searches incredibly fast (e.g. it can find all the photos in under a second, even if there are 20 million of them among 200 million files). You can also attach various metadata 'tags' to each object and find matching objects based on them.
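Didgets' actual record layout isn't public in this thread, but a hypothetical illustration of why fixed-width 64-byte records scan fast is easy to write down: record offsets are computed arithmetically, so a type filter is one linear pass with no pointer chasing. The field layout below is entirely invented.

```python
import struct

# Hypothetical illustration of a fixed-width "file table". The layout is
# invented: 8-byte id, 4-byte type code, 52 bytes reserved = 64 bytes.
REC = struct.Struct("<QI52x")
assert REC.size == 64
PHOTO = 2

table = bytearray()
for i in range(100_000):
    table += REC.pack(i, PHOTO if i % 10 == 0 else 1)

# One sequential pass over computed offsets; no per-record indirection.
photos = [rec_id
          for rec_id, kind in (REC.unpack_from(table, off)
                               for off in range(0, len(table), REC.size))
          if kind == PHOTO]
print(len(photos))  # 10000
```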

The project is in open beta and anyone can download it and try it out on their own computer.

https://www.didgets.com


Yeah they totally missed the mark with WinFS, in my opinion. It was too exotic/foreign with no clear adoption path.

All I want is dumb files with something like SQLite replacing or augmenting the allocation table. Then sprinkle on some new capabilities incrementally.


You'd probably enjoy reading about the Be File System: https://en.wikipedia.org/wiki/Be_File_System

It has relational database-like semantics for metadata.


The reason indexing services don't work synchronously is that it would make file I/O far slower than apps expect, which can break user interfaces and cause unexpected slowdowns. There's also no real reason for it to be synchronous in most cases.


Lots of applications - maybe most - need to essentially run their own mini database to store user data. I obviously don't want a partition per application, or per Word document.

Abstracting away your hard disk is literally the job of your filesystem.


I don't mean that some little SQLite DB for your web browser's storage and settings needs its own partition; that's obviously going too far. I mean for really huge databases, where an entire server is dedicated to running that DB and it has high performance needs.


Fair enough. But I am including all of those little databases.

I agree with you - you could hack around this for a big database by putting it on a dedicated partition (even though you shouldn't need to). But I'm also thinking about applications with little databases that need to live inside the filesystem. They shouldn't need to embed sqlite in order to survive a crash.

A different syscall API would allow you to write a small, fast, efficient database engine (like Redis or MySQL) without needing all of the complex, performance-gobbling tricks that these databases currently have.
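One concrete example of the "tricks" a durable store has to do on today's POSIX filesystems: to update a file crash-safely you write a temp file, fsync it, rename it over the old one, then fsync the directory so the rename itself is durable. A minimal sketch:

```python
import os
import tempfile

# The classic crash-safe update dance on a POSIX filesystem: temp file,
# fsync, atomic rename, then fsync the containing directory.
def atomic_write(path, data):
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        os.write(fd, data)
        os.fsync(fd)            # file contents reach the disk
    finally:
        os.close(fd)
    os.replace(tmp, path)       # atomic rename on POSIX
    dfd = os.open(d, os.O_RDONLY)
    try:
        os.fsync(dfd)           # directory entry reaches the disk
    finally:
        os.close(dfd)

target = os.path.join(tempfile.mkdtemp(), "state.bin")
atomic_write(target, b"v1")
atomic_write(target, b"v2")     # readers see v1 or v2, never a torn mix
with open(target, "rb") as f:
    print(f.read())  # b'v2'
```

A syscall API with first-class atomic, durable updates would make all of this a single call.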

A better filesystem API would help databases everywhere.


Last time I looked at this (and it was a long time ago), the gain was around 10% and workload dependent.

If you made direct I/O and locking more robust, and allowed an application to turn off the kernel cache, you would only really need to interact with the filesystem after open() in order to change your extents.
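The closest portable knob today for "turn off the kernel cache" on a per-file basis is posix_fadvise: write, fsync, then tell the kernel to drop the cached pages. (O_DIRECT is the stronger tool, but it needs aligned buffers and filesystem support, so this sketch uses fadvise instead.)

```python
import os
import tempfile

# Sketch: avoid double caching by dropping the page cache for a file
# the database caches itself. posix_fadvise is advisory, not a guarantee.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
try:
    os.write(fd, b"x" * 65536)
    os.fsync(fd)  # pages must be clean before the kernel will drop them
    if hasattr(os, "posix_fadvise"):  # available on Linux and most Unixes
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)
print(os.path.getsize(path))  # 65536
```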

+1. this is clearly a place where we can have nice things and performance


For server applications, the trends of virtualization and microservices mean that having a "machine" dedicated to running a single application is a very common scenario. And for server VMs running a bunch of apps, it's not obvious that a partition per application is a problem: it would be trivial for the VM initialization scripts to automagically configure a dedicated partition for each DB.

Having an OS+filesystem running a virtual OS+filesystem running a database involves all kinds of redundant caching, and we could systematically skip a layer here.



