Hacker News new | past | comments | ask | show | jobs | submit login
LearnDB: Learn how to build a database (learndb.net)
269 points by niklabh on Nov 29, 2018 | hide | past | favorite | 53 comments



For complicated reasons I was involved in a successful project to develop an open-source, Oracle-SQL-compatible, transactional-integrity-preserving, extensible database designed for in-memory performance in Java. In the process of this development "suddenly" the reasons for a whole bunch of performance and syntax oddities in databases like Oracle and PostgreSQL became very clear.

I was shocked to learn that was 15 years ago when I looked up the link to share, but if you are interested in the topic of "how to implement a database" it may be worth a look.

For what it is worth it was listed (not by me) on the C2 Wiki on the "Programs to Read" Page, where it was described as "[A] database written in Java with good unit tests and ShortMethods." [1]

Both statements are true, for a complete working example of a production database (it supported a commercial product for at least 10 years) it is actually a pretty accessible and well documented code-base.

The project is called AxionDB and can be found at [2].

[1] http://wiki.c2.com/?ProgramsToRead [2] http://axion.tigris.org/source/browse/axion/


On the subject of readable database source code:

* LMDB is legendary for its performance, but it is source code is kind of hard to read. Found there's a pretty decent port to Java which looks a lot more readable [1].

* LMDB doesn't support SQL. But there's this SQL implementation: Apache Calcite [2]. Creating a DB using llmdbjava + Calcite sound like an interesting project.

1: https://github.com/lmdbjava/lmdbjava/tree/master/src/main/ja...

2: https://calcite.apache.org/


Maybe it would be better to start with https://github.com/leo-yuriev/libmdbx


This looks super interesting! Although it still looks more complicated to understand than lmdbjava. I feel any C codebase will be more difficult to follow since there's gonna be a lot more code related to memory management that happens automatically in a GC language like java.


JVM and .NET ports - https://github.com/castortech/mdbxjni and https://github.com/wangjia184/mdbx.NET are listed there as well.


Wrt JVM, not a port, but a JNI interface.


Could you provide some examples of oddities?


"Oddities" might be overstating it, but I remember a number of times a light went off and I though "oh, so _that's_ why Oracle does X". It was the same kind of information you might deduce from looking at a "query plan" from `EXPLAIN` or similar, and the kind of thing you might deduce from a careful review of topics like transaction isolation levels.

It's been a very long time so I'm not sure I can name many specific examples, but IIRC some of the topics that became very clear were things like:

* the importance of the order in which JOINs and WHEREs are applied (to limit the number of rows being accessed per step)

* the relationship between columns that are selected and those that appear in WHERE and ORDER BY clauses.

* the value of tables with a small number of columns (limiting data that must be read per row)

* the cost/complexity of variable-width columns (VARCHAR vs a fixed-length string), again because of the time and complexity required to do something like "skip ahead three rows" in a data file

* the behavior of CLOB/BLOB types (stored external to the "main" table content)

* the various types of transaction isolation levels and the conditions in which it becomes hard or impossible to guarantee isolation without table or row locking

  * etc.
The references at http://axion.tigris.org/readings.html describe some of the theoretical concepts we seemed to think were important or useful at the time.

Also, there's a post at http://heyrod.com/articles/radio-blog/pleasures-of-profiling... that describes a session of performance-tuning on the index implementation that gets into some of the nitty-gritty implementation topics. (But a lot of the issues addressed there were an artifact of Java's primitive vs. object representation, which has been reduced by things like generics.)


Yup, a good way to understand why RDBMS are they way they are is to try to make your own in whatever language you choose. I've done this on a very elementary scale making a built in DB for a past project, at first it was single threaded, but when you start considering indexing and the datas structures necessary, it makes you understand and appreciate the man hours that have gone into projects like SQLite, PostgreSQL etc. At first you think, oh I can just write my own database, but when you start getting into areas and the CAP theorem starts to create problems, you really understand why anything past a local machine only ephemeral KV store is asking for trouble.


Exactly. We sort of backed into this project by starting with HSQL (http://hsqldb.org/) a low-end pure-Java "database" that was popular at the time.

The more we used HSQL the more it became clear that it was more like a SQL-parser wrapping a simple key/value store than a full-on ACID database. We created AxionDB precisely because we needed the kinds of capabilities you mention and at the time HSQL did not provide them and wasn't remotely architected to support them.

To be honest though, creating a moderately robust RDBMS from "scratch" turned out not to be the most ambitious or complex part of the overarching project that spawned AxionDB. The harder part was trying to use Java's primitive, built-in HTML-renderer to create something approximating a fully featured browser. The effort and complexity behind something like Gecko, WebKit, Edge, Blink, etc. is very easy to underestimate. It's a hard problem, made much harder by having to tackle the kinds of content you find "in the wild. Frankly building a database was a much more straightforward problem than that.


Aside: found about your GraphViz cookbook on your site [1]. Looks super useful. Is it coming any time soon?

1: http://heyrod.com/projects/gv-cookbook.html


Do you have any more details on “the reasons” for the performance and syntax oddities?


Hi Denzel. I replied to a very similar question at https://news.ycombinator.com/item?id=18562007 so rather than repeating myself I'll just point you to that.


How is this #2 on Hacker News? There's literally nothing here about how to actually build a database (yet?)

Instead, there's just a key-value store implemented on top of a javascript hashmap and a filesystem.


This has to be a joke? This looks more like 'how to use a k-v store 101 for non programmers'

I don't want to put more content under a terrible post, but the best resource for this material Jennifer Widom's MOOC.

https://cs.stanford.edu/people/widom/DB-mooc.html


I took this back in 2011 and don’t remember there being much on “building a database”. Is there more advanced content now?


I'd be very interested in a book or blog series similar to the scads of "how to build an os from scratch" or the raytracer projects, like Peter Shirley's little One Weekend series. Is there anything that scatches that itch?


Nand2Tetris is a fan-favorite: https://www.nand2tetris.org/

Build a working Tetris game from literal logic gates up.


I closed the page right after I read JavaScript.


Why would a pedagogical project choose JavaScript of all languages?

Unless the project specifically needed to leverage features of the language, or a web browser, it's an incredibly poor choice for building anything with well maintained abstractions. Or anything at all ready, when the language is covered in warts.

I imagine the author hasn't yet discovered for themselves why it's a poor choice, given only a key-value store has been implemented (using JSON.stringify, no less)


JavaScript is extremely accessible and the warts can easily be worked around.

I spent a lot of my life being a JavaScript hater and after working with it for four years I can definitely see how it is really a simple starting point for a lot of concepts. Sure it has warts. What doesn’t?

Have you worked with a JS language for any extended period of time? What about that made you not even want to consider it for anything? I’m really curious, I’m not trying to set you up - I’m really not exeperienced enough to do so :)

Edit: to be clear because of general curiosity and because I hate bringing work home, most side projects I work on are in some other language.


> and the warts can easily be worked around

Not really. The author doesn't leverage any kind of static typesystem, which would at least mitigate some of the warts. He's chosen Node.js instead of leveraging the browser, so no free visualization layer for doing anything interesting with, unfortunately.

> Have you worked with a JS language for any extended period of time?

Most of my day job is working in a really, really big Javascript codebase. I can say with confidence that the language is something that we're absolutely stuck with, and I still have no idea why somebody would implement a pedagogical database with it (well, then again, they haven't; they've implemented a key-value store, which is trivial).

To offer an alternative, they could have chosen something boring but everywhere like Java, which is just as accessible to those with less experience. Then they'd have the possibility of doing fine-grained parallelism, file access, designing abstractions that fit within a statically typed language.


No stupid questions time, but given all languages, would an array based language like J be a decent choice for building a database from scratch?

I mean, tables seem like doubly indexed arrays.

In my mind it wouldn't be readable but it'd be maybe less lines of code than other languages?

I have never thought about building a database that takes into consideration CAP theorem and ACID compliance, sounds super tough. I'm usually happy enough when I learn something neat about Postgres.

I thought this would be neat but it's just a kv store in JavaScript.


This was my thought exactly. For many high school and university students in the US, Java is their first language and introductory courses usually deal in unsatisfying toy applications. I could imagine a great primer into building a database that introduces all kinds of interesting computer science concepts as well as practical programming techniques (e.g. organizing a project as it grows).


How is JavaScript accessible? Because of websites like the linked article, which covers the basics, that lets you build things? Because Udemy, codeacademy and all the others will teach you how to google program using it?

Maybe you’re right, but this page doesn’t actually teach you databases, and most of those courses don’t actually teach you programming.

At least not efficiently.

I have worked with JS for a long time, and I don’t hate it, but it’s such a terrible language and environment that the most popular part of it is literally a strict syntactical superset of of it.

I don’t particularly like Typescript by the way, and I don’t think it’s really that useful, but you can’t deny that most people do.

I think 95% of all projects would be better of using something that wasn’t JavaScript for the whole stack, and I think that’s the reason so few things are build on node. I like graphql as much as you do, I also think Prisma is okish, but I’d rather use Django, Flask, Java Spring it .Net Core because they are so much more efficient in the long term.

Of course on the client side, all the innovations and all the talent lies with JS, and there is some advantages for using JS for your whole stack. Those advantages end at the DB though, at least in my opinion.

This isn’t a problem, you can use Prisma for Postgres, and there are decent drivers for mssql, but I’d never advise people to use nosql unless they had a very specific reason for doing so, and I can’t think of one.


js is perfect for the resource constrained student of today ie their primary computing device is a mobile.mobiles ship with a js environment built in regardless of platform. i used to be like you until i had to start advising kids from low income environs in india on what language to learn(from a personal desire to create hackers with low resource access and seeing what happened+give them access to coding as a skill).

js really is the most accessible


Do they have internet? Because if they do, then cloud9 IDE is an excellent way to teach, any language too.

One of the best examples of this is CS50x from Harvard, which uses it (and other cloud services) to give students an IDE, automatic helpers, debuggers and code checkers.

Node is easy to set up and get going with, but if you’re teaching people how to hack, maybe it’s better to teach them something much more low level so they can actually build real things? I mean, if you want to build a solar powered aquafarm, or water pump, then C or Python and a raspberry gets you a lot further than JavaScript, and with C you actually learn how computers work.

But I guess it depends on the circumstances, and not knowing yours, then perhaps JS is a good choice.


I don't think the average case for new programmers looking to hack something is to start out building a solar powered aquafarm. It's better to teach the fundamentals first, which are easy enough to pick up from Javascript. Most programming courses start with something easy like Java, Python or Pascal for exactly this reason. Then later if they want to complete a more advanced project, they can pick up low level coding - it's not much of a leap, despite how fawny and melodramatic people get over coding in C/assembly/whatever.


C/assembly are great starters because they teach you how computers actually work.

Even a simple function in JS taking one variable abstracts a good deal of that away from you because it handles things like memory allocation for you.

A lot of modern hopeful programmers that make it into our interviews can barely explain what the new keyword of a OOP language actually does, and way too many have no idea what a stack is.

This isn’t useful when you write CRUD web-applications for a few thousands users, which is arguably a lot of modern programming is, but it’s extremely useful if you ever want to build something original.

Which is actually the key issue. You seem to think building a water pump isn’t a beginner project, but why isn’t it? It’s one of the most basic programs you could write. In fact it’s so basic that you could solve it mechanically, without the use of programming, if you have running water to power the timed open close mechanism of the pump and pull the water.

By contrast, a web site is infinitely more complex. Only you think it isn’t, because other programmers have build most of your tools for you. Which is great, you should stand on the shoulder if giants every time you can. What isn’t going to be great is when you’re tasked with solving a problem no one has solved for you first.


for the vast majority of modern projects, you don't need the performance that manual memory management provides. I would bet that 90%+ of modern software is written in a memory-managed language. If you need to leap from JS to C, it's really not that hard. But trying to learn about stack frame allocation and RAII and cache line optimisation is a lot to take on when you've just learned what a variable is.

The abstractions are what's important, memory layout is an implementation detail. Even when you're writing C or assembly, you're still thinking in terms of data flow and logic, you're just having to do a lot of the work manually. Learning the abstractions without the baggage of the implementation detail will get 90% of people 90% of the way. Once you have that solid foundation, you can go on to learn assembly or C if you need it because it's not that big a leap from a coding mindset to a hardware control mindset. But to go straight from no coding experience to hardware control is going to be more difficult for a new student to wrap their head around.


I think you misunderstood me. It’s not the abstraction that’s important. It’s the problem solving that comes with knowing how the understanding.

How can you really be tasked with solving problems if your first response is to look for a node package that does it for you? And that’s what JavaScript teaches you.

I’m not saying you shouldn’t use node packages. But learning to do that as your first steps into programming is robbing you of learning how to use code to solve problems.


Ah I see - so your issue is with the proliferation of tiny libraries associated with NPM?

That's not inherently an issue with Javascript. I agree that solving every issue with a library is bad in the same way that solving every issue by copy-pasting from stack overflow is bad, but it's not inherent to the language. I primarily work in Javascript these days and I've never actually seen a project that relies on an excess of micro-libraries. A good teacher will teach students to understand programming logic, not to lean on excessive crutches.


>Do they have internet?

no


This was my thought as well. I'm curious to know if the author has previous experience building actual databases or if this is a case of follow-along-while-I-figure-this-out.


> Why would a pedagogical project choose JavaScript of all languages?

I don't think the JS part is the issue. Node.js does I/O, files and co. The issue is that the article doesn't teach how to build a database at all.

It doesn't explain how to efficiently persist and fetch data from a file, indexing strategies with trees, concurrent file access,locking, basic transaction... that's what I expect from a tutorial about how to build a basic DB system.


Plenty of languages do I/O, files, and co.

Why choose a language plagued with warts and a half-baked ecosystem for something pedagogical? A language with all sorts of peculiarities from the '95 browser era, and not at least leverage browser technology? A language which gives you very non-interesting coarse control over resource, and has no concept of parallelism?


Moreover, Node.js's non-blocking, single-threaded nature is great for the kinds of plumbing that things web services and related applications require (The stuff we used to call "database-to-web" applications -- i.e., fetch some content from a data-store and render it and/or accept some content from a client and store it) but seems like a pretty poor match for disk I/O and CPU-bound applications like a database.

And I'm saying this as someone that's actually a big fan of JavaScript and that does a fair amount of enterprise-scale work with it.

EDIT: I meant to add: But, I think this might be beside the point. If you are looking for a language than a large number of people can more-or-less read and write, JS is a pretty good choice. If your objective is teachablity/readability rather than production-quality performance or capabilities, I think JS is probably on the short-list of candidate languages.

I can't find the quote offhand but I think Douglas Crockford or someone like that described JavaScript as "the only language people feel comfortable using without learning the syntax". (This was especially true prior to server-side applications like Node.js, when JS was largely considered a toy-like language for scripting basic input validations). If nothing else, this makes JS a reasonable "executable pseudo-code"--something practically any programming can read and probably most can execute.


Easy access and fast results. Learning isn't about creating the best possible implementation. Python wpuld have been another suggestion from me personally. What language would you have used?

A lot of people without a formal educational background start with established scripting languages.

I you handle data, chances are that you will get in contact with javascript at some point.

Using coffeescript or typescript would be a worse choice for learning, even if they counter some disadvantages of the base language in question.


I realized how little i knew about how databases work until I watched this lecture series. https://www.youtube.com/channel/UCHnBsf2rH-K7pn09rb3qvkA

Does anyone have DB internal book recommendation that inst' boring as hell.


You might try this collection of papers about DB technology.

http://www.redbook.io/

Fair warning: this is not for beginners. If you find the going too hard, start with a DB textbook.

Also, that site doesn't seem to include the papers themselves. But most of those papers are very famous, so if you search for the titles, you should find copies. Worst case, you might need to visit a university library.


> Fair warning: this is not for beginners. If you find the going too hard, start with a DB textbook.

which one can teach how to build a DB system from scratch? I'm not talking about SQL theory or implementing a SQL parser but the actual persistence, indexing part.


Both of them, really. There's a lot to know if you want to build a modern database system. But no work I am aware of addresses the specific question of how to build a database system from scratch.

A reasonable approach might be to start with this high-level paper, and follow papers they reference until they get specific enough to address your specific questions:

    Joseph M. Hellerstein, Michael Stonebraker, James Hamilton. 
    Architecture of a Database System. Foundations and Trends in Databases, 1, 2 (2007).
For the sort of low-level issues that you are interested in, it might be useful to study a well-regarded persistence package, such as Berkeley DB.

https://en.wikipedia.org/wiki/Berkeley_DB

Another system that might be worth studying is SQLite.

https://www.sqlite.org/index.html


If the Hellerstein/Stonebraker/Hamilton paper is the kind of overview you are looking for, then a better link for the SQLite equivalent is

https://sqlite.org/arch.html


After hearing DDIA (https://dataintensive.net) recommended a few times, I picked it up and have been working through it recently. The book includes an incredible amount of references that provide further reading on individual areas.


I confirm DDIA is awesome book, I would recommend to anyone working with database and distributed systems.


Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery (The Morgan Kaufmann Series in Data Management Systems) ISBN-13: 978-1558605084

https://www.amazon.com/dp/1558605088/?coliid=IYEILMZI5DVNM&c...

and

Transaction Processing: Concepts and Techniques (The Morgan Kaufmann Series in Data Management Systems) ISBN-13: 978-1558601901

https://www.amazon.com/dp/1558601902/?coliid=I2GWJZ9XJ5D4JI&...


Pro SQL Server Internals by Dmitri Korotkevitch

https://www.apress.com/gb/book/9781484219638

It's specific to MS SQL Server but I found the chapters on log file management, indexes & isolation levels to be accessible and enjoyable.


I found this one very good and easy to read:

https://www.amazon.co.uk/Database-Management-Systems-Raghu-R...


Not a lot of content yet, but building databases (even toys) is an incredibly interesting exercise which I encourage many more people to try out, so +1!

For anyone else who is interested in learning how to build a database, can I thoroughly recommend following along with Andy Pavlo's Advanced Database Systems course from CMU[1]. Every lecture is accompanied by reading lists, notes, and assignments. Whats more, I find Andy's style to be very easy to parse even on complex topics.

Even if you think you know a fair bit about this domain, you will likely learn a lot!

[1]https://www.youtube.com/playlist?list=PLSE8ODhjZXjYplQRUlrgQ...


Why the votes? There’s nothing there.


Since you are creating a new file for each entry, I think BTRFS would be a good choice.

https://en.wikipedia.org/wiki/Btrfs


just vote for the title.. this place is really very funny..


why the hell would you share this when its nowhere near done




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: