We originally used szferi/gomdb which worked well but it didn't give us performance visibility with "go pprof" and it was difficult to debug at times. Bolt also takes the approach of simplifying the LMDB API significantly. There are a lot of advanced, unsafe features that aren't available in Bolt (e.g. WRITEMAP, NOMETASYNC). There's more detail in the README:
This is a fairly minor nit, but a persistent one nonetheless: you keep claiming to have removed unsafe features from LMDB, yet you left the most dangerous one in there: NOSYNC provides no consistency guarantees whatsoever, whereas with e.g. NOMETASYNC you merely lose durability. In particular, NOMETASYNC retains crash safety, which you don't get at all with NOSYNC.
The way this is worded, it sounds like an improvement, but in reality Bolt is no safer to use than LMDB was; it just lacks some of the configurable tradeoffs that were present in the original engine, tradeoffs that were safer than those that remain in Bolt.
That's a fair point. Although I'd argue that WRITEMAP and APPEND are more scary. :)
I try to point out in the docs that DB.NoSync should only be used for offline bulk loading where corruption is typically less of an issue (since you can simply rebuild).
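For illustration, an offline bulk load looks roughly like this (a sketch against the boltdb API; the file name, bucket name, and key scheme are made up):

```go
package main

import (
	"log"
	"strconv"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("bulk.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	db.NoSync = true // skip fsync on commit; only acceptable for offline loads

	if err := db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("bulk"))
		if err != nil {
			return err
		}
		for i := 0; i < 1000000; i++ {
			k := []byte(strconv.Itoa(i))
			if err := b.Put(k, k); err != nil {
				return err
			}
		}
		return nil
	}); err != nil {
		log.Fatal(err)
	}

	db.NoSync = false // restore durability for normal operation
	if err := db.Sync(); err != nil { // force one fsync so the load is on disk
		log.Fatal(err)
	}
}
```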
The wording was intended to make it sound like an improvement for people wanting smaller, more concise code bases. Bolt is 2KLOC vs LMDB's 8KLOC. A smaller surface area makes it easier to test thoroughly. It's definitely a tradeoff for people who want that really low-level control, though.
LMDB author here - what have you got against APPEND? Fast bulk-loading is pretty important when you're initializing a system.
Note that APPEND mode won't allow you to corrupt the DB, despite what the doc says - you'll get a KEYEXIST error if you try to load out of order. We just left the doc unchanged to emphasize the importance of using sorted input.
Hey Ben, what (refs/books/lectures) would you recommend for someone who is interested in learning the theory behind these kinds of DBs and implementing one from scratch?
I originally started Bolt to learn about the implementation details of low-level data stores. It started as a pet project but became more serious as I filled it out and added additional layers of testing.
I'd honestly suggest reading through the source code of some database projects. That's the best way to learn the structures and understand the engineering tradeoffs. Some are more approachable than others though. Bolt and LMDB are 2KLOC and 8KLOC, respectively. LevelDB is 20KLOC and RocksDB is 100KLOC, IIRC.
If you want to go further, I'd suggest porting one of those over to a language you know well. It doesn't have to be code you release to the world or use in production; just porting the code really helps to cement the concepts in your head.
Understanding why the LMDB implementation does what it does requires much more practical knowledge than theory. The references in my MDB papers (and the papers themselves, of course) will give you more of that: http://symas.com/mdb/#pubs
B+trees are trivially simple, in theory. Writing one that is actually efficient, safe, and supports high concurrency requires more than theory. It requires understanding of computer system architecture: CPU architecture/cache hierarchies, filesystem and I/O architecture, memory, etc.
It does not require 100KLOC though. RocksDB is just a pile of gratuitous complexity.
Thanks for the good points! Bolt and LMDB seem approachable, and I'm especially interested in how ACID compliance is implemented and how to check/prove its correctness...
Can you please compare the locking granularity differences between Bolt and its competitors? File-level locking is pretty coarse-grained and less concurrent than other approaches. Compare with mdbm (recently discussed here) which uses page-level locking.
I've looked briefly at the Bolt documentation a few times and I can't say that I like the idea of giving a function as a parameter to the Update and View functions. It results in some weird-looking code, with functions within functions, that can't really be refactored out or reused.
Pretty much JavaScript callback spaghetti in Go. There might be a way around it, but I haven't found any examples. Perhaps there could be a way where transactions are manually managed.
Nested closures are a different problem than "callback hell". Callback hell is constantly dropping context on the floor with the stack frames, making every event handler like the "goto" that Dijkstra considered harmful. Nested closures, by comparison, are merely messy, but do not cause context loss. They can be refactored; it's just one more level of abstraction than you may be used to, and is often probably not worth it. (For example, do you really need a function that you can pass a closure into and get another one back? Maybe, but that's a level of abstraction that can be very cognitively costly; you'd better be getting some serious benefit for that cost.)
I have observed before that this sort of DSL-like usage can sometimes make people forget that they are dealing with plain old-fashioned language constructs that can be manipulated with other constructs. Languages like Ruby that allow this fact to disappear from the source code completely seem to make this more likely. Go won't let you forget you're just creating a bunch of closures, but if you've been trained in an environment where a "block" is something special, or in such a Ruby "DSL", you may have a way of thinking about these issues that inhibits proper refactoring. This is the general case of what okatsu observes before me: nothing stops you from using real functions or methods as the callbacks.
Also, Go's relatively recent "bound methods" are really useful for this sort of thing.
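E.g. (a sketch; the type and bucket name are invented for illustration):

```go
import "github.com/boltdb/bolt"

type visitStore struct {
	bucket []byte
}

// createBucket has the func(*bolt.Tx) error signature that DB.Update expects.
func (s *visitStore) createBucket(tx *bolt.Tx) error {
	_, err := tx.CreateBucketIfNotExists(s.bucket)
	return err
}

// The method value s.createBucket closes over s, so it can be passed
// directly instead of wrapping it in an anonymous function:
//
//	s := &visitStore{bucket: []byte("visits")}
//	err := db.Update(s.createBucket)
```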
Bolt author here. The DB.View() and DB.Update() functions are simply wrappers around DB.Begin(). The View/Update functions were not part of the original API but they're incredibly useful in practice and make the application code more idiomatic.
There are several benefits to using the wrapper functions:
1. It lets you return an error to roll back a read-write transaction. This is typically what you want. That error will get passed through as a return from DB.Update().
2. Because of #1, you can have one error check covering both application errors and system errors (e.g. disk full) after a transaction commits, instead of separate error checks for DB.Begin() and DB.Commit().
3. If you forget to close a read transaction, it will keep old pages around to maintain consistency, which will cause your database to grow without bound. If you use the helper functions, you're protected from this even in the event of a panic().
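Concretely, #1 and #2 look like this (a sketch; db is an open *bolt.DB, and the bucket/key names are just for illustration):

```go
err := db.Update(func(tx *bolt.Tx) error {
	b, err := tx.CreateBucketIfNotExists([]byte("widgets"))
	if err != nil {
		return err // returning a non-nil error rolls the transaction back
	}
	return b.Put([]byte("foo"), []byte("bar")) // returning nil commits
})
if err != nil {
	// one check catches both application errors and system errors
	// (e.g. disk full) surfaced by the commit
	log.Fatal(err)
}
```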
The API is modeled around database/sql where applicable. Let me know if you find anything confusing and I can update the documentation to make it more clear.
It would be nice if you mentioned DB.Begin() etc. in the README, not only Update() & friends, so that a casual reader would know they have a choice here. I couldn't find it mentioned there, so how could I know I had to search for it among dozens of functions in godoc?
I've never really tried it yet, but I think when using database/sql.Tx, the following snippet would be quite idiomatic, and I believe it should be perfectly safe:
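```go
// The usual database/sql idiom (table and column invented for illustration):
func insertWidget(db *sql.DB, name string) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // after a successful Commit this just returns sql.ErrTxDone

	if _, err := tx.Exec("INSERT INTO widgets (name) VALUES (?)", name); err != nil {
		return err
	}
	return tx.Commit()
}
```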
The "defer tx.Rollback()" makes me cringe a little since it's confusing when you read the code top-down. It also throws out the error returned from Rollback(). Granted, the only possible error is if the DB is closed so it's not a big deal in most situations.
I would actually prefer to have Begin and Commit in the README, and then a recommendation to use Update and View. It might just be a difference in learning/understanding, but I understand the reasoning behind Update and View after reading the comments regarding the caveats of manually dealing with transactions.
> Perhaps there could be a way where transactions are manually managed.
If you take a look at the code, db.Begin is available and gives you access to the underlying transaction. You can use it however you want, but you'll need to take care of catching errors, rolling back, committing, etc. All of that is taken care of for you in db.Update (or db.View for read-only operations).
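Roughly (a sketch of what db.Update otherwise does for you; the names are illustrative):

```go
func putWidget(db *bolt.DB) error {
	tx, err := db.Begin(true) // true = read-write transaction
	if err != nil {
		return err
	}
	defer tx.Rollback() // harmless after a successful Commit (returns ErrTxClosed)

	b, err := tx.CreateBucketIfNotExists([]byte("widgets"))
	if err != nil {
		return err
	}
	if err := b.Put([]byte("k"), []byte("v")); err != nil {
		return err
	}
	return tx.Commit()
}
```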
This style is also consistent with the rest of the standard library, such as filepath.Walk or sort.Search.
Ah, you're right. But it would be nice if they showed this in the README, alongside the "wrapped" examples.
Actually, going the "db.Begin" way would be more consistent with the rest of the standard library.
Most Go APIs go to great lengths to be "external" (i.e. iterator-like), e.g. bufio.Scanner, and (of special note in this case!) database/sql.Tx. It appears to me that "internal"/callback APIs are introduced only where there are really compelling reasons that justify breaking consistency: e.g. filepath.Walk involves recursion, so the callback is simply the K.I.S.S. approach; in sort.Search, the callback avoids the need for generics/reflection.
> I can't say that I like the idea of giving a function as a parameter to the Update
I remember that was a common pattern at Google when creating APIs interfacing with BigTable, which is also a KV store. The premise is that you don't need to load everything into memory at once; instead you do your work (for Update operations) or pick out just the interesting fields (for View operations) in smaller increments in your callback function.
> functions within functions, that can't really be refactored out or reused
This is functional programming, but the choice of paradigm does not impact code reuse. You can pass proper functions or methods; you don't have to stick to lambdas.
I'm still new to Go, but the functions don't have to be anonymous, right? What would be wrong with defining all of your functions somewhere (for updates, iterations, etc.) and passing them by name to Bolt's functions?
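Something like this, I'd imagine (a sketch against the bolt API; bucket name invented, assumes "fmt" and github.com/boltdb/bolt are imported):

```go
// A named function with the right signature can be passed to db.View directly.
func printWidgets(tx *bolt.Tx) error {
	b := tx.Bucket([]byte("widgets"))
	if b == nil {
		return nil // bucket doesn't exist yet
	}
	return b.ForEach(func(k, v []byte) error {
		fmt.Printf("%s = %s\n", k, v)
		return nil
	})
}

// elsewhere:
//	err := db.View(printWidgets)
```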
Note that using a sqlite wrapper like https://github.com/mattn/go-sqlite3, you also don't have any external dependencies and your application still ships as a single binary. However, there are other advantages of a Go-native storage library (e.g., doesn't require cgo, better integration with tools like pprof, etc.)
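E.g. (a sketch; the blank import is how the driver registers itself with database/sql):

```go
import (
	"database/sql"

	_ "github.com/mattn/go-sqlite3" // registers the "sqlite3" driver (requires cgo)
)

func open() (*sql.DB, error) {
	return sql.Open("sqlite3", "state.db")
}
```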
My use case would be small local tools that need to keep some state. Essentially anywhere SQLite would be useful. I don't think this was designed to serve as a database to back anything more 'serious'.
It's not the data authority. It feeds in data from Kafka and structures it in an optimized way for querying. If a node dies then it can copy from another node and restart from an offset in Kafka. Kafka essentially works as a distributed write-ahead log (WAL).
Site usage analytics has a lot to do with state, in a way. How about a visit counter? Something lightweight that would persist page-count/IP-address/last-date values keyed by IP address.
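That counter would only be a few lines with Bolt (a sketch; the bucket name and big-endian counter encoding are just one way to do it, and the last-date value is left out for brevity):

```go
import (
	"encoding/binary"

	"github.com/boltdb/bolt"
)

// recordVisit bumps a per-IP counter; the IP address is the key.
func recordVisit(db *bolt.DB, ip string) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("visits"))
		if err != nil {
			return err
		}
		var n uint64
		if v := b.Get([]byte(ip)); v != nil {
			n = binary.BigEndian.Uint64(v)
		}
		buf := make([]byte, 8)
		binary.BigEndian.PutUint64(buf, n+1)
		return b.Put([]byte(ip), buf)
	})
}
```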
I haven't used GAE's Datastore API but using protobufs on top of Bolt is trivial. Bolt just uses []byte for keys & values. I use gogoprotobufs with Bolt often.
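E.g. (a sketch; "User" and its fields stand in for any gogoprotobuf-generated message):

```go
import (
	"github.com/boltdb/bolt"
	"github.com/gogo/protobuf/proto"
)

// putUser marshals a generated message to []byte and stores it under its ID.
func putUser(b *bolt.Bucket, u *User) error {
	data, err := proto.Marshal(u)
	if err != nil {
		return err
	}
	return b.Put([]byte(u.Id), data)
}
```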
If you're interested in working on projects like this, we are hiring - contact me tom at shopify.com.