Hacker News
A database built like an operating system: the architecture of FaunaDB (fauna.com)
143 points by evanweaver on March 31, 2017 | 49 comments



There is an old engineering aphorism that proper database engines are complete operating systems implemented in user-space. This is true both literally and figuratively; most sophisticated database kernels treat the underlying operating system as a glorified device driver.

It is why they are both so complex to implement and so much fun to build.


When I was first introduced to Oracle years ago, I was surprised at how much it acted like an operating system inside. Once connected to the shell you could find processes running, examine memory through data dictionary queries and so forth. Tuning it involved looking at disk I/O patterns, memory usage and making your loops tighter. In the SQL world that meant indexing & reviewing plans for joins carefully.

MySQL & Postgres carry some of those same attributes. Even though they are user processes and the OS controls the lower levels, you would typically deploy them on their own servers, and give them all the memory & disk I/O you could, because they were their own operating systems essentially.

It's not clear to me how building a database "like an operating system" is something new. What I did see new in there was a consistency checking tool. That is super important. Because in a distributed environment you're bound to have drift.

Also I'd like to see how they managed to get full ACID compliance. Fast network interfaces are great, but there is still some latency, which means hiccups when multiple nodes try to update the same row.


I agree with you that it's not an original analogy; in fact I think you can safely say that most larger, performance-oriented systems end up looking more like an OS than not. There's such a large overlap between the problems that need to be solved and the resources available to distributed systems that I'd be more skeptical of an implementation that didn't lean heavily on the OS, or at least borrow the general patterns.

To answer your question about consistency, we have implemented a transaction resolution mechanism inspired by Calvin. The whitepaper contains a lot more detail, but the short answer is that we use a consistent, distributed transaction log to provide a global order of all read-write transactions, which are executed deterministically on the responsible data partitions. This allows FaunaDB to provide a guarantee of strict-serializability for read-write transactions. Read-only transactions are by default a bit more relaxed and provide a hard guarantee of serializability in order to avoid global coordination overhead.
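For readers unfamiliar with the Calvin idea, the key property is easy to demonstrate: every replica receives the same totally ordered log of read-write transactions and applies them deterministically, so replicas converge without coordinating per transaction. A toy sketch (illustrative only, not FaunaDB's actual implementation):

```python
# Toy sketch of Calvin-style deterministic execution: all replicas apply
# the same totally ordered transaction log, so they converge without
# per-transaction coordination. Names here are illustrative.

def apply_log(log, store):
    """Apply read-write transactions in global log order."""
    for txn in log:  # the total order is the only coordination point
        # Each txn is a pure function of the store's current state, so
        # every replica that applies the same log reaches the same state.
        for key, value in txn(store).items():
            store[key] = value
    return store

# Two "replicas" start from the same state and apply the same log.
log = [
    lambda s: {"x": s.get("x", 0) + 1},
    lambda s: {"y": s.get("x", 0) * 10},
]
replica_a = apply_log(log, {})
replica_b = apply_log(log, {})
assert replica_a == replica_b == {"x": 1, "y": 10}
```

The catch, as noted downthread, is that transactions must be submitted as whole units rather than held open interactively.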


The key difference from conventional RDBMS systems is that transactions are deterministic. When interacting with a traditional RDBMS, a client opens a transaction with a BEGIN statement, then does reads and writes interactively at its own pace. In contrast, in Calvin-inspired systems clients send their transactions as one unit written in a severely restricted query language (disclaimer - not a FaunaDB engineer).


Just to flip the analogy, most OS file systems are essentially databases for arbitrary sized binary values, with a hierarchical key structure.


Yes, big databases can disable the OS scheduler (CPU pinning) and file caching (O_DIRECT) so they can perform workload-aware optimizations for better performance than general-purpose OS mechanisms.
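For the curious, both knobs are reachable from user space; a rough Linux-only sketch in Python (the file name is made up, and O_DIRECT is unsupported on some filesystems):

```python
import os
import tempfile

# Pin the current process to CPU 0, taking placement decisions away from
# the OS scheduler (databases often dedicate one worker per core).
os.sched_setaffinity(0, {0})
assert os.sched_getaffinity(0) == {0}

# Open a file with O_DIRECT to bypass the kernel page cache; the database
# then runs its own workload-aware buffer pool. O_DIRECT requires
# block-aligned I/O buffers and fails on some filesystems (e.g. tmpfs).
path = os.path.join(tempfile.gettempdir(), "data.db")
try:
    fd = os.open(path, os.O_CREAT | os.O_RDWR | os.O_DIRECT, 0o644)
    os.close(fd)
except OSError:
    pass  # filesystem without O_DIRECT support
```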


Hey, they re-invented PICK!

https://en.wikipedia.org/wiki/Pick_operating_system

(actually, I don't know much about this new system to be commenting on it - but whenever I see something with keywords like "database operating system" - my mind instantly drifts to PICK, because that was the first system I used when I started my software development career)


You could do worse than calling it distributed PICK for the cloud. Everything old is new again.


I thought the same - started out using PICK as well. Wow, I thought I was one of the few on HN. Nice to know you!

I felt the same way when "NoSQL", document dbs, etc. came out. Ok, yeah, now we are distributed and massive scale with better tooling... but.. same/similar concepts.


Me too. What is this 'Cocoon'?


You were using NOSQL keystores before it was cool! So ahead of your time.


Similarly I'd love to see an operating system that operated like a database.

Hate the arbitrary ascii-file nonsense. If the state of the operating system were in a proper ACID database, it would be so nice. Tooling could make it just as easy as ascii files, but under the hood it would be better.


> Similarly I'd love to see an operating system that operated like a database.

Microsoft has really wanted something like that too. For example, WinFS.

https://en.wikipedia.org/wiki/WinFS


While we are at it, only let the system compiler make executables. When it does, log all the tables and columns it touches, so you can always find out everything that touches something.

OS/400 was cool.


Yes, AS/400 (iSeries) has integrated SQL :)


Aren't filesystems databases? The "arbitrary ascii-file nonsense" is just like storing text fields in your database.


Except fields in a database are not arbitrary.

I'd much rather configure Apache by inserting fields in a database (and tooling can be developed to help) than editing ascii files in its custom configuration language.

You could configure the entire system with a bunch of queries. You could view the state of the system with queries so easily in a universal manner.

   "select server_name from nginx.sites where enabled = true;"
This is fundamentally different from the model that we have today. Right now there's no good, safe way of querying for the true state of things. You have to parse a bunch of crap, and even then you won't know for sure whether the value in the file is actually the present value, unless the specific technology allows for that to be queried.

Tons of different methods of including and overriding and so on. Knowing what's the "effective state" is difficult.

   "select pid, owner, memory_usage from os.processes order by memory_usage desc limit 10;"
Those are just some basic examples, and I know it's not that easy to pull off, but I think it's certainly worth thinking about and challenging the popular Unix-isms of arbitrary information stuffed in arbitrary formats in arbitrary places, held together by a thin string of conventions.

It adds so much cost and room for bugs and errors because of the countless boundaries it creates between each little component (constant parse, serialize, write, read); meaning, context, and structure are lost at each step.

An operating system with a structured state backend could bring harmony to all the components.
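As a rough illustration of the idea (not any existing system), you can already project Linux's /proc into an in-memory SQL table and run queries like the ones above; a live "database OS" would expose real tables instead of a snapshot:

```python
import os
import sqlite3

# Sketch: load a snapshot of /proc into SQLite so OS process state
# becomes queryable. A real "database OS" would expose live tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processes (pid INTEGER, owner INTEGER, memory_kb INTEGER)")

for entry in os.listdir("/proc"):
    if not entry.isdigit():
        continue
    try:
        with open(f"/proc/{entry}/status") as f:
            fields = dict(line.split(":", 1) for line in f if ":" in line)
        rss = int(fields.get("VmRSS", "0 kB").split()[0])  # kernel threads lack VmRSS
        uid = int(fields["Uid"].split()[0])
        db.execute("INSERT INTO processes VALUES (?, ?, ?)", (int(entry), uid, rss))
    except (OSError, KeyError):
        pass  # process exited while we were reading it

top = db.execute(
    "SELECT pid, owner, memory_kb FROM processes ORDER BY memory_kb DESC LIMIT 10"
).fetchall()
```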


In a way, yes, definitely.

But one does not get transactions. I would love a filesystem with transactional semantics. Surprisingly often, applications use some form of relational database not because their data is such a good fit for that model, but because they want/need transactions. In theory, one could roll their own transaction layer, after all, that's what RDBMSs do. But I have a hunch that getting transactions to work correctly without totally killing performance is not exactly trivial.
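The usual workaround is the POSIX atomic-rename trick, which gets you crash-safe single-file updates but nothing resembling multi-file transactions; a minimal sketch:

```python
import os
import tempfile

def atomic_write(path, data):
    """Crash-safe single-file update via the POSIX rename trick: readers
    see either the old contents or the new, never a torn write. This is
    about the closest thing to a 'transaction' a plain Unix filesystem
    offers; atomicity across multiple files is where it falls apart."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # data durable before the rename "commits"
        os.replace(tmp, path)      # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise

# Demo: two updates; a reader never observes a partially written file.
target = os.path.join(tempfile.mkdtemp(), "config.txt")
atomic_write(target, b"v1")
atomic_write(target, b"v2")
with open(target, "rb") as f:
    assert f.read() == b"v2"
```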


NTFS definitely has transactions. I'm not sure about other file systems.


I had totally forgotten about that, you are right, NTFS does have a transaction API.

I think I remember the API being marked as deprecated, though. Or is my memory playing tricks on me again?


Yes, they deprecated it, claiming that developers didn't use it: https://msdn.microsoft.com/en-us/library/windows/desktop/hh8...


IMO, part of why developers didn't use it was the API design made using it too painful. They took most of the existing file APIs and added new *Transacted APIs alongside them, e.g. CreateFile begot CreateFileTransacted, etc. So to add filesystem transactions to your existing code, you had to change all the file API calls.

I think, if they had made transactions work with the existing file API (such as by storing the hTransaction in thread-local storage, with an API to get/set the transaction association of the current thread, and having the existing non-transactional APIs behave transactionally in that case), it might have seen more adoption by developers.
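The ambient-transaction design described here can be sketched in a few lines (in Python rather than the Win32 C API; every name below is made up for illustration):

```python
import contextvars
from contextlib import contextmanager

# Sketch of an "ambient" transaction: the current transaction lives in
# context-local storage, so an existing file API can pick it up without
# every call site being rewritten to a *Transacted variant.
_current_txn = contextvars.ContextVar("txn", default=None)

@contextmanager
def transaction():
    txn = {"writes": [], "committed": False}
    token = _current_txn.set(txn)
    try:
        yield txn
        txn["committed"] = True  # commit point: block exited cleanly
    finally:
        _current_txn.reset(token)

def write_file(name, data, files):
    """Unchanged call signature; buffers the write if a transaction is
    ambient, writes through otherwise."""
    txn = _current_txn.get()
    if txn is None:
        files[name] = data
    else:
        txn["writes"].append((name, data))

files = {}
with transaction() as txn:
    write_file("a.txt", b"hello", files)
if txn["committed"]:
    files.update(txn["writes"])  # apply buffered writes on commit
assert files == {"a.txt": b"hello"}
```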


...and alternate file streams.


Hey everybody. Direct link to the PDF is: https://fauna.com/pdf/FaunaDB-Technical-Whitepaper.pdf


Daytona from AT&T Labs built the database into the OS...I don’t think it worked very well. The inverted model makes more sense. A process per query isn't very scalable.


Go on? PostgreSQL does process-per-session/query, and it scales ridiculously well, especially with the hilariously multi-core systems we have today.

Although it doesn't handle unbounded concurrency, there are techniques (like manual locking, or pgbouncer) to deal with this.


So Daytona compiles and runs a new binary for each query, not just for each connection, and does all its coordination through files + pipes + shared memory. Query binaries even fork themselves for parallelism(!). My understanding is that the cost of switching between kernel and user space is pretty overwhelming. Maybe STM would help that someday but now you’ve implemented an actual database in the kernel...

PostgreSQL’s process per user model makes way more sense but as you say still has upper limits on concurrency.


Not sure how to read that. Isn't it simply one process per connection?


Yes. But you can't run more than one query per connection.


One executing query per connection, meaning a connection maintains a serialized query processing model.


Technical issues aside: for a db service platform provider, the statement in its Terms, "FI adheres to policies and procedures to prevent data and information loss, including a daily back-up regime, but does not make any guarantees that there will be no loss of data or information, or that FI policies and procedures will conform to industry best practices. FI expressly excludes and disclaims any liability for any loss of Licensee's data and information, no matter how such loss was caused." will make any serious user think twice or more. I'd welcome further comments.


Sounds like standard legal boilerplate. AWS has similar wording in their terms. https://aws.amazon.com/service-terms/ Search for loss of data.


The "starting from thinking about this like it's Linux instead of thinking of it like it's Oracle" description is one I regularly use in Hadoop talks for less technical audiences.

The paper is interesting, it leaves me with more questions than answers.

The biggest question is: is this a new beast from the ground up, or more efficient packaging around existing distributed computing libraries?

For example, when I hear a company say something like:

> based on log-structured merge trees (similar to Google Bigtable

I want to understand whether they are just rebranding HBase for that part of the tool.


It is a new beast (good pun, by the way). There are no service dependencies like HBase or Zookeeper, and there is not really any packaging to speak of since it's just a JAR file.

Some open source is used internally, especially Netty, but our goal has always been to build a tightly coupled system for performance reasons. For example, the entire replication pipeline including the optimized Raft implementation is from scratch, as is the scheduler.

LSM trees perform well especially on SSDs, but we probably will migrate away from them eventually to something closer to LMDB in order to get a zero-copy read path.
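For readers unfamiliar with the trade-off: an LSM tree buffers writes in a memtable and flushes immutable sorted runs, which is write-friendly but forces reads to consult multiple levels, whereas a memory-mapped design like LMDB can hand back pointers into the mapped file (the zero-copy read path mentioned above). A toy memtable/segment sketch, not FaunaDB's storage engine:

```python
# Toy LSM tree: writes go to an in-memory buffer (memtable) and are
# flushed as immutable sorted segments; reads check the memtable first,
# then segments newest-first. Illustration only.

class TinyLSM:
    def __init__(self, flush_threshold=2):
        self.memtable = {}
        self.segments = []  # immutable sorted runs, oldest first
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Flush: freeze the memtable as an immutable sorted segment.
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for seg in reversed(self.segments):  # newest segment wins
            if key in seg:
                return seg[key]
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)   # reaches the threshold and triggers a flush
db.put("a", 3)   # newer value shadows the flushed one
assert db.get("a") == 3 and db.get("b") == 2
```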


"FaunaDB implements a process scheduler that dynamically allocates resources and enforces quality of service...Execution proceeds via cooperative multitasking."

Just like Windows 95! I don't think I've ever seen a database do this. Apparently it operates per-query or at least per application. Operating system seems right; this is a long paper.


Presto does this as well. Execution is done by "drivers" which move data pages between operators. It's a cooperative push model rather than a Volcano (iterator) style pull model. There are typically more drivers than worker threads. Drivers are assigned to threads using multilevel feedback queues (inspired by OS schedulers).
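The driver/queue model can be sketched with generators standing in for drivers (illustrative, not Presto's actual code), using a two-level feedback queue: tasks that keep running are demoted below fresh tasks, as in OS MLFQ schedulers.

```python
from collections import deque

# Cooperative multitasking sketch: "drivers" are generators that yield
# at safe points; a task that keeps running after its quantum is demoted
# to the low-priority queue, so short tasks finish quickly.

def driver(name, steps, finished):
    for _ in range(steps):
        yield  # cooperative yield point between work quanta
    finished.append(name)

finished = []
high = deque([driver("short", 1, finished), driver("long", 5, finished)])
low = deque()

while high or low:
    queue = high if high else low
    task = queue.popleft()
    try:
        next(task)        # run one quantum
        low.append(task)  # still running: demote below fresh tasks
    except StopIteration:
        pass              # driver finished

assert finished == ["short", "long"]
```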


SQL Server also uses its own scheduler inside ;)


Dynamic isolation sounds cool especially without the overhead of containers. Seems obvious in hindsight I guess.


So, can we get this running on a microkernel? Or, Fauna-on-Xen, anybody? :]


How does FaunaDB plan to support or enable many OLAP use cases? As far as I understand it, there are no interactive query sessions due to Calvin, but can I easily e.g. ETL things into a data warehouse?


A better way to think about how Calvin fits into our stack is as a building block similar to two-phase-commit, on which we've built the higher level transaction semantics. And while we do not have any plans on supporting session transactions, this is more of a design constraint to support the best possible OLTP experience rather than a hard technical limitation.

Since our focus has been getting OLTP right, ETLs are currently handled by chunking record imports across multiple transactions. (We have internal tooling for this that we've used to help customers move their data into FaunaDB; we are planning on open-sourcing it soon.)
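Chunking a bulk import across transactions is straightforward to sketch; something along these lines (generic, not Fauna's internal tool):

```python
def chunked(records, size):
    """Yield fixed-size batches so each batch fits in one transaction,
    keeping individual transactions small during a bulk import."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # trailing partial batch

# Each batch would be submitted as a single transaction to the database.
batches = list(chunked(range(10), size=4))
assert batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```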

However, first-class support for OLAP use-cases is on our roadmap and something we are excited about providing, so expect more from us about this in the coming months.


Thanks!


Wasn't the Informix Database effectively an operating system? I'm sure it wasn't the only one. This paper makes it sound like they're the only one.


Detailed consistency model explanation starts on page 15.


Performance numbers anyone?


Write benchmarks, although the latest numbers are better still: https://fauna.com/blog/distributed-acid-transaction-performa...

Read benchmarks are coming soon.


Great, thank you. Apologies for not going fishing for it myself. :+1:


Some older operating systems used a database-filesystem or database like filesystem as storage.

I heard older IBM systems had something like this. BeOS had BeFS with a built-in metadata index. NTFS plus the WinFS user-land vision was a bit like this. Office Server, a.k.a. MS SharePoint (which implemented the WinFS vision for the intranet), stores all files in a database.


> I heard older IBM systems had something like this.

Most (non-Unix) mainframe and minicomputer operating systems – not just IBM's – have support for record-oriented files and indexed files in the filesystem. The functionality is roughly equivalent to that of non-relational "flat file" databases or key value stores. Examples include ISAM and VSAM under z/OS (aka MVS), and RMS under OpenVMS. This is in contrast to Unix and Windows, where the filesystem itself only supports files as unstructured bytestreams, and any record structure or index structure is imposed by higher levels such as applications or shared libraries.

You might also be thinking of the IBM minicomputer operating system OS/400 (nowadays called "IBM i"), which embeds a relational database (a variant of DB2) into the operating system. (I have often wondered how deep the integration actually is–I believe it is something deeper than just bundling a relational database with an OS, like how many Linux distributions include MySQL or Postgres, but I don't really know.)



