Show HN: An experimental distributed SQL database from scratch in Go (github.com/tomarrell)
104 points by slicedbrandy on Dec 27, 2019 | 14 comments



Hey HN,

Work is coming along on this project to build a distributed SQL database from scratch, mostly as a reference to give newcomers an idea of the inner workings.

Looking for contributors interested in anything from parsers, disk paging, and building out a REPL to defining an IR grammar, implementing consensus, and more!

Anyone interested in contributing in these areas is more than welcome!


It takes 5-10 years to write, test and productionize a database or filesystem. If you're planning to invest that kind of time and effort, some suggestions are:

* if you're writing a distributed database, a novel and valuable feature would be to consider the network partition case foremost. For example, design the database from the standpoint of a node being down for a month.

* do adequate logging so that an operator can understand what is going right or wrong

* how can terminated nodes be rebuilt efficiently and automatically?

* all configuration settings should be dynamic (a minimal sketch follows this list)
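
To make the last point concrete, here's a minimal Go sketch of hot-reloadable settings (my own illustration, not the project's code; the Config fields are hypothetical):

    package config

    import "sync/atomic"

    type Config struct {
        ReplicationFactor int
        ElectionTimeoutMs int
    }

    var current atomic.Value // always holds a *Config

    func init() {
        current.Store(&Config{ReplicationFactor: 3, ElectionTimeoutMs: 150})
    }

    // Load returns the current settings; callers re-read on every use
    // instead of caching them at startup.
    func Load() *Config { return current.Load().(*Config) }

    // Update swaps in new settings without a restart, e.g. from an
    // admin RPC or a watched config file.
    func Update(c *Config) { current.Store(c) }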

Source: experienced DBA; worked with Cassandra, InfluxDB, and most SQL RDBMSs


> if you're writing a distributed database

Is this really a requirement for all distributed databases, or only decentralized ones? If I'm using a distributed database and I control all the nodes, and one of them goes down, I don't plan on spinning it back up (assuming the data is replicated across the other nodes). I guess what I'm saying is: if I need 20 nodes, I'm going to make sure I always have 20 nodes running.


Yes, this applies to any distributed system.

It's a basic truth about computer networks that if you send a message to a machine and don't hear back, you don't know whether it's actually dead or just unable to communicate with you. If a machine finds itself in that situation, one option is to simply wait until the peer eventually comes back to life, or to retry enough times to eventually get through.
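
A toy Go sketch of that ambiguity (rpcCall is a made-up stand-in for any request/response call, not a real API): a timeout looks exactly the same whether the peer crashed, the request was dropped, or only the reply was lost.

    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    // rpcCall stands in for any request/response RPC; here the peer
    // simply never answers in time.
    func rpcCall(ctx context.Context, addr, msg string) error {
        select {
        case <-time.After(5 * time.Second): // a reply that never arrives
            return nil
        case <-ctx.Done():
            return ctx.Err()
        }
    }

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()
        err := rpcCall(ctx, "10.0.0.2:9000", "ping")
        if errors.Is(err, context.DeadlineExceeded) {
            // Indistinguishable cases: peer crashed, request dropped,
            // reply dropped, or peer merely slow.
            fmt.Println("no reply: dead or partitioned? can't tell")
        }
    }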

If it takes any other action, it has to make sure that action is "safe" (w.r.t whatever guarantees it's trying to provide) under either scenario. That's all "partition tolerance" really is: a statement that you, as the developer, have a sort of burden of proof to make sure you've considered all the possible failure scenarios. (If that seems like a tautology, well, that's why we usually don't bother talking about the "P" in the CAP theorem.)

As your comment alludes to, one can often simplify the problem a bit by reducing the number of possible scenarios: assume that crashed nodes never recover and are only ever replaced by new nodes. But that still doesn't make the problem go away. Unless your network (including both the hardware and the OS network stack!) is infallible, you can't reliably know whether a remote machine has crashed in the first place.

Consider the common situation where you want to provide some kind of transactional guarantees; for instance, transactions should appear to complete in causal order, and their effects should not disappear once committed. That implies that even if a node "looks" dead, it's not safe for another one to take over its role unless it's really, truly dead, or you would risk returning stale (transactionally invalid) results.
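
A standard mitigation here (my own illustration; not something the original post implements) is a fencing token: the storage layer rejects any write carrying a stale epoch, so an old leader that wakes up after being presumed dead can't clobber newer data.

    package main

    import "fmt"

    // Storage accepts a write only if its fencing token (epoch) is at
    // least as new as the last one it has seen.
    type Storage struct{ epoch uint64 }

    func (s *Storage) Write(token uint64, key, val string) error {
        if token < s.epoch {
            return fmt.Errorf("stale token %d < epoch %d: writer was superseded", token, s.epoch)
        }
        s.epoch = token
        // ...apply the write here...
        return nil
    }

    func main() {
        s := &Storage{}
        _ = s.Write(2, "k", "v")            // new leader holds epoch 2
        fmt.Println(s.Write(1, "k", "old")) // old leader wakes up: rejected
    }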


> design the database from the standpoint of a node being down for a month.

It uses Raft, so this should be handled by design. If you are referring to writes to said node (not that Raft would allow it), you are delving into the CAP theorem, and what you are suggesting is impossible (unless you don't care about consistency).


Raft is a consensus protocol. It doesn't tell you what to actually do during long partitions (stop writes if quorum is lost? indicate that consistency may be compromised in some other way?), or how to handle long-lost nodes rejoining (do they rejoin no matter how long they've been gone and serve stale reads? does the cluster "forget" them after some period of time? if so, how do you synchronize the act of forgetting so you don't end up in split brain if only some healthy nodes have forgotten their long-lost brother when it comes back online?)

Raft is not "use this library and your distributed system Just Works". It's a low level building block, like TLS--and nobody says "just use OpenSSL and your app is fully secure".
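
To make the "forgetting" problem concrete (made-up types, not any real Raft library's API): removing a long-dead node has to be a command in the replicated log itself, so every healthy node forgets it at the same point in history and there's no window where only some nodes remember it.

    package cluster

    // NodeID and the types below are illustrative only.
    type NodeID string

    // ForgetNode is an ordinary command, committed through the log by
    // quorum like any other write.
    type ForgetNode struct{ ID NodeID }

    type StateMachine struct {
        members map[NodeID]bool
    }

    // Apply runs in log order on every node, so all replicas drop the
    // dead member at the same log index; there is no window where only
    // some nodes have "forgotten" it.
    func (s *StateMachine) Apply(cmd interface{}) {
        if f, ok := cmd.(ForgetNode); ok {
            delete(s.members, f.ID)
        }
    }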


If it is intended to be a reference project, I suggest writing the README sections first, highlighting the main areas and then linking to code sections to ease navigation.

You may or may not be aware, but Andy Pavlo records his courses and puts them on YouTube. His latest playlist covers the main database topics, with the last five or so lectures covering distributed databases: https://www.youtube.com/playlist?list=PLSE8ODhjZXjbohkNBWQs_...

edit: ^ suggesting Pavlo's work since he introduces database concepts very well; it may be worth structuring the reference architecture in the same way.


How are the goals of this project different from CockroachDB's?


I am embarking upon a similar project: I’m building a privacy-focused product similar to Google Photos and Flickr. My goal is not only to develop a usable product, but also to document the process in as much detail as possible in order to enable others to build similar things.

I feel strongly about spreading the skills required to build non-trivial products. I hope to enable others to deliver similar technical projects.


Have you seen Perkeep? https://perkeep.org/ Mainly built by Brad Fitzpatrick, in Go.


Unfortunately I have not. I should do better at researching prior art. This is something I’ve wanted to do for a while, but never managed to carve out enough time. It’s great that there’s an open source project that I can simply use.


That’s awesome, I look forward to your forthcoming Show HN.


Looking at your code, it seems your B-tree is an in-memory implementation, correct?

I guess that's important information to put in the README, along with whether you have any plans to create a persistent one and how you would approach it (e.g. whether you would try to replicate the SQLite B-tree here).

Also, I guess the easiest way to make a persistent B-tree happen is to take the one from that neat key-value store in Go that mimics the LMDB B-tree (I don't recall the name right now).

The algorithm behind the LMDB kind doesn't need journal files, so it will be reliable AND fast (is it inspired by Rodeh's B-trees?). I just don't know about the paging aspect, and whether it's as good as the one SQLite uses.
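
For what it's worth, the usual first step from an in-memory B-tree toward a persistent one is a pager over fixed-size pages; here's a rough Go sketch (my own, not the project's code):

    package pager

    import "os"

    const PageSize = 4096

    type Pager struct{ f *os.File }

    func Open(path string) (*Pager, error) {
        f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
        if err != nil {
            return nil, err
        }
        return &Pager{f: f}, nil
    }

    // ReadPage loads page n from disk into a fixed-size buffer.
    func (p *Pager) ReadPage(n uint32, buf *[PageSize]byte) error {
        _, err := p.f.ReadAt(buf[:], int64(n)*PageSize)
        return err
    }

    // WritePage writes page n back in place; a real engine would add a
    // journal or copy-on-write (as LMDB does) before overwriting.
    func (p *Pager) WritePage(n uint32, buf *[PageSize]byte) error {
        _, err := p.f.WriteAt(buf[:], int64(n)*PageSize)
        return err
    }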


I’ve also noticed that. As far as I can tell, there isn’t actually any persistence, even for the tables.



