Badger – A fast key-value store written natively in Go (dgraph.io)
355 points by michaelangerman on May 14, 2017 | 96 comments



There is so much negativity in these comments! This project is really cool!

I really appreciate the trend to rewrite C/C++ libraries in Go. It has always been really frustrating that hacky library wrappers for other languages leave performance and features on the table because they're either incomplete or just too hard to implement in the host language. For most of the languages out there, it is always better to have a native implementation.

There are now a number of great embedded key/value stores for Go, which make it really easy to create simple, high performance stateful services that don't need the features provided by SQL.


What negativity are you referring to? Most comments seem to be positive or something to the effect of "how do B+ and LSM compare for this use case?".

> For most of the languages out there, it is always better to have a native implementation.

I'm not sure that's true...imagine a K/V store written natively in Ruby, Python, JS... People have done what DGraph did before, in other languages -- writing log-structured databases in Haskell, &c -- but without a very large community you didn't get the level of improvement and testing that you see with something written in C and used across many languages. The JVM is an example of an environment where the "native" approach has worked out well; but is that "most of the languages"?


Can C++ make use of this new Go library? It is my understanding that Go can now create shared libraries. That was what held me back from adopting Go earlier on... it wasn't a good choice for writing libraries in.


You've been able to do that since 1.5

```
go build -o libWithExport.so -buildmode=c-shared myProject
```
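For reference, the Go side needs cgo export annotations for anything you want callable from C; a minimal sketch (the function name and package layout here are made up):

```
package main

import "C"

// Add is exported to C via the //export directive; the generated
// header file will declare it for C and C++ callers.
//export Add
func Add(a, b C.int) C.int {
	return a + b
}

// A main function is required for -buildmode=c-shared, even though
// it is never called when the code is loaded as a shared library.
func main() {}
```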


> I really appreciate the trend to rewrite C/C++ libraries in Go

If we're going to rewrite libraries which may be used by other applications, which get deployed on servers, where people will need to update things in the name of security...

It would be nice if those libraries weren't statically linked, and thus didn't need to be rebuilt, pushed and redeployed every time one of their dependencies has a security issue.

The only place I can imagine Go-based software being useful is for code you've built yourself, for use in an always-rolling-forward cloud environment you maintain yourself, where updating dependencies and rebuilding is part of day-to-day operations.

Trusting Go-code written by other people for anything you deploy publicly sounds like something I'd rather never do. It sounds like a security nightmare.


"The only place I can imagine Go-based software being useful, is for code you've built yourself, for use in a always rolling forward cloud-environment you maintain yourself, where updating dependencies and rebuilding is part of the day to day operations."

Which is, you will notice, the core of where Go comes from.

Also, you may notice that "server" software is generally moving in this direction very quickly. "Containers" are basically static linking writ large. We're still early in the curve on this but I expect in the next few years we're going to see more people start pointing out that dynamic linking is basically obsolete nowadays (the first few brave souls have already started mumbling it), and even the only putative reason that people keep citing for it being a good idea, the ability to deploy security updates (but not ones that require ABI changes), is fading fast and was less important than people made it out to be anyhow. First, lots of people still basically don't do those updates. Second, even if you deploy the updates you still have to take the service down to restart it. Third, by the time you've instantiated the architecture that lets you manage your dynamic library updates at scale you've also instantiated the architecture you will need to simply rebuild your static-everything containers and redeploy them.

In another ten years I expect it to be simply common knowledge that if you can't completely rebuild and redeploy everything easily for a fix, you're not really operating at a professional level.


Go tooling already supports dynamic linking, just not in all platforms.


Was with you until you said you don't need SQL.


For really simple applications, omitting SQL is a fine choice, especially since in Go you don't have the comfort of a well-established ORM.

There are libraries that do provide an ORM, but since the declarative side of Go is limited compared to languages like Python, I find them unwieldy. Also, the current trend shuns the usage of an ORM in Go and encourages directly or indirectly writing SQL queries and interacting with the database through database/sql and the libraries that extend it, like jmoiron/sqlx.
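For illustration, this is roughly what the database/sql style looks like (the driver choice, DSN, and query here are just placeholders):

```
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // placeholder driver; any database/sql driver works
)

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Plain SQL instead of an ORM: scan the result straight into variables.
	var name string
	if err := db.QueryRow("SELECT name FROM users WHERE id = $1", 1).Scan(&name); err != nil {
		log.Fatal(err)
	}
	log.Println(name)
}
```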


For a really simple application, omitting a K/V store is even better. Why isn't a simple map good enough?


Lack of persistence is one reason. Go's built-in maps are also not safe for concurrent use, which limits their usefulness as a key/value store in many applications.

Naturally, it's possible to build a solution to persist Go maps and to make them safe to access across goroutines, but by that point you've built a persistent key/value store.
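For what it's worth, even the in-memory half is boilerplate you end up writing yourself. A minimal sketch of a goroutine-safe map, with the persistence part (the actually hard bit) omitted:

```
package kv

import "sync"

// Store is a trivial goroutine-safe string map. Persistence, iteration,
// TTLs, etc. are left as an exercise.
type Store struct {
	mu sync.RWMutex
	m  map[string][]byte
}

func NewStore() *Store {
	return &Store{m: make(map[string][]byte)}
}

func (s *Store) Get(key string) ([]byte, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[key]
	return v, ok
}

func (s *Store) Set(key string, value []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[key] = value
}
```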


Depends on the requirements (persistence, concurrency, etc.). No offense, but it's a bit pointless to discuss it further without deeper context.


Now let somebody run a QuickCheck on it, like they did with LevelDB: http://htmlpreview.github.io/?https://raw.github.com/strange...


All OSS projects should be tested with QuickCheck! I really hope this is a testing system that gets copied by other languages.

I'm curious, is it a requirement to have types in order for QuickCheck to make sense? So you know what type of data to hammer a function with for example.



QuickCheck is awesome, but is there a good Go equivalent? Anyone tried Gopter? https://github.com/leanovate/gopter


It's interesting that the main motivation for this was that the Cgo interface to RocksDB wasn't good enough. I hate to turn this into a language war, but there's a big difference between Rust/Nim/etc., where you just call C code more or less directly, and Go/Java/etc., which need a shim layer to bridge between C code and their language runtime.

And possibly Go has the right approach! Is it better to make C integration as simple and smooth as possible, to leverage existing libraries, or is it better to encourage people to ditch all that unsafe C code and write everything in Go?


I've thought about this a bit, because I use a lot of Go at work.

The reason calling C code from Go needs a shim is mostly that Go has a different ABI: it does weird things with the stack, calling conventions, etc. But that doesn't stop you from calling assembly written to that ABI directly from Go. In fact, that is done a lot in the standard library. It would be possible to create a wonky, C-like, low-level language that compiles into machine code with the Go ABI. Let's call this language Go-- :). With something like that, it might be possible to translate existing C code into Go--. However, the most likely use for Go-- would be in performance-critical sections of your Go code.


I'm skeptical that that would enable reuse of existing C code.

If the workflow requires you to modify or annotate the C code, that's a lot of work (and risky) and you might as well just rewrite it in Go.

If the workflow is totally automatic, so you can just use off-the-shelf C code, that's great! But in that case it's effectively just a C compiler, the "Go--" bit seems like a red herring.

It reminds me a bit of "C--", Simon Peyton Jones' suggestion for a low-level target for Haskell and similar languages. It would remove some C functionality that language runtimes don't really need, and add some extra low-level stuff like register globals and tail calls. I don't think that got much traction, but it may have influenced the design of LLVM.


I am excited about the fact that the design is focused on SSD from day one (not just optimised for SSD). Wondering whether the authors have plans to optimise it for Intel Optane, which has lower latency and much higher random IOPS at low QD. Currently I am using a cgo-based RocksDB wrapper with the WAL stored on Intel Optane; the rest of the data is on a traditional SSD.

It would also be great to have some comparison of write performance with fsync enabled.

Overall, very interesting project, bookmarked, will definitely follow the development!


Range iteration latency is very important and might be limited by concurrency. I think you can only get 100K IOPS on Amazon's i3.large when the disk request queue is full.

fio [1] can easily do this because it spawns a number of threads.

While working with RocksDB we also found that range iteration latency was very bad compared to a B+-tree, and that RocksDB gets its good read performance mostly on random reads because it uses Bloom filters.

Does anyone know if this got fixed somehow recently?

[1] https://linux.die.net/man/1/fio


(Badger author) We have tried huge prefetch size, using one Goroutine for each key; hence 100K concurrent goroutines doing value prefetching. But, in practice, throughput stabilizes after a very small number of goroutines (like 10). I suspect it's the SSD read latency that's causing range iteration to be slow; unless we're dealing with some slowness inherent to Go. A good way to test it out would be to write fio in Go, simulate async behavior using Goroutines, and see if you can achieve the same throughput.

If one would like to contribute to Badger, happy to help someone dig deeper in this direction.


To fill the queue on Linux, goroutines won't be enough; you would need to use libaio directly.

sudo apt-get install libaio1 libaio-dev.


Go has no native support for aio. Based on this thread, Goroutines seem to do the same thing, via epolls. https://groups.google.com/forum/#!topic/golang-nuts/AQ8JOHxm...

I think the best bet is to build a fio equivalent in Go (shouldn't take more than a couple of hours), and see if it can achieve the same throughput as fio itself. That can help figure out how slow Go is compared to using libaio directly via C.
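Something like the following is roughly what's being suggested; the file path, block size, worker count, and duration are arbitrary placeholders:

```
package main

import (
	"fmt"
	"math/rand"
	"os"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const (
		workers   = 64              // concurrent readers, analogous to fio's iodepth/numjobs
		blockSize = 4096            // 4 KB random reads
		duration  = 10 * time.Second
	)

	// Pre-created large file on the SSD under test.
	f, err := os.Open("/path/to/testfile")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, _ := f.Stat()
	blocks := fi.Size() / blockSize

	var ops int64
	deadline := time.Now().Add(duration)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(seed int64) {
			defer wg.Done()
			rng := rand.New(rand.NewSource(seed))
			buf := make([]byte, blockSize)
			for time.Now().Before(deadline) {
				// Random block-aligned read, like fio's randread workload.
				off := rng.Int63n(blocks) * blockSize
				if _, err := f.ReadAt(buf, off); err != nil {
					return
				}
				atomic.AddInt64(&ops, 1)
			}
		}(int64(i))
	}
	wg.Wait()

	fmt.Printf("%.0f random read IOPS\n", float64(ops)/duration.Seconds())
}
```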


While network sockets in Go use epoll automatically, files do not. Looking at the Badger code, for example, fd.ReadAt(buf, offset) would block.

See this issue: https://github.com/golang/go/issues/6817


Probably a naive question, but how does it compare to Redis? When would someone look for a K-V store written in X instead of the already mature Redis?


Badger is designed to store data on disk (like RocksDB or LevelDB), while Redis is an in-memory store which can't hold data sets larger than memory.


You can persist Redis to disk too. Not that you gain much from doing so, as persistence can be better achieved by clustering, and I've never run into issues where memory was a limiting factor, even with millions of records in Redis. Frankly, if memory were a limiting factor then you'd probably want your KV store separate from your application anyway, rather than the embedded approach that Badger takes.

I think the real advantage of Badger is that it's not as sophisticated as Redis. ie you can have Redis-like functionality compiled into your application so less faffing about setting up another daemon / cloud micro-service inside your NAT / VPC / whatever.


> persistence can be better achieved by clustering

If you want persistence, then I'd recommend persisting to disk. While I've not had this fun with Redis, I've written code that took out an entire Cassandra ring. Had the data been only in memory, it would not have been pretty. Just because something is distributed doesn't mean it's guaranteed to never go completely down.

(That said, if you're using Redis as an in-memory cache, this is a potentially acceptable tradeoff.)


Redis is a database management system (DBMS). Badger is a storage engine.

An application developer would not choose Badger, but instead would pick a DBMS such as Redis.

A database engineer would use Badger to develop a DBMS that the application engineer could use. If the database engineer so chooses, they could expose a Redis-compatible API.


Yes, this is more comparable to LevelDB or RocksDB, or at a higher level Cassandra, ScyllaDB, DynamoDB, etc.


It's comparable to LevelDB and RocksDB. However, Badger is not comparable to Cassandra, ScyllaDB or DynamoDB as they are distributed database management systems.

Cassandra, for example, has its own storage engine that's responsible for writing bits to disk.


redis is not an embedded database, while rocksdb/boltdb/badger are.


Can you say what exactly is an "embedded workload"? I have seen this a few times now and tried googling but only ever come up with references to embedded systems and I'm guessing that this is not the same context.

I know rocksdb is LSM-based and was built by FB to address write amplification on SSDs.


It has no server that runs independently and that applications connect to; instead you integrate (embed) it into your program as a library.


Thanks


The terminology can be a bit confusing. Database Management System is what people often call a database. Unfortunately, the storage engine is also called a database.

Badger, RocksDB, LMDB, etc. are storage engines. A process uses these storage engines to write data to memory (note: the storage engine may support persistent, volatile, or both types of memory).

A database management system (DBMS) is a higher level concept that often has multiple processes either on a single server or distributed across multiple servers. Simply stated, each process within a DBMS uses the storage engine to read/write to/from memory (persistent or volatile).

It's important to note that storage engines are a specialized area and require different skills from writing a DBMS. It's a big deal when other people write high quality storage engines because it makes it a lot easier to write a DBMS.


Silly question: if SSDs or even the motherboard had persistent storage for a couple of 4 KB blocks, where you could fsync less than 4 KB (unfinished, not-yet-full pages) of data fast (DRAM + battery), would that setup speed up writes in databases?

It seems that databases often want to persist/flush data in unfinished (4 KB or 8 KB) pages while they're being built; once pages are full they don't change much and could be persisted normally. Another kind of page is one that changes very frequently, e.g. a single "root" page which keeps counters or other metadata.

It seems a bit wasteful that multiple "checkpoints" (flushes/fsyncs) on partial blocks trigger whole-block rewrites in hardware. Similarly, "root"/"meta" pages that keep track of just a few frequently changing bytes trigger similar whole-page rewrites.

To be honest, even some kind of PCI card with a little battery and a slot for DDR4 would probably do the trick, no? The rest could be implemented in software; as long as you have access to fast flushing with battery-backed memory that survives a hard crash, it should be fine.

Is this a silly idea?


There used to be a lot more innovation and variety in hardware following similar approaches. However, commodity gear was an easier deployment target for software and economics ensured commodity gear provided better performance per dollar than proprietary solutions.


A lot of higher end HBAs have battery backed DRAM write buffers for this reason.


This could not have come at a better time, I've been looking for a fast, simple, pure Go key-value store.

Few questions:

- Does this have a log file? If so, what does it log and can it be disabled?

- How is the data stored? (Single file, multiple files, etc.)

- How is the RAM usage?


> I've been looking for a fast, simple, pure Go key-value store.

Shameless plug: if you have a write-once (or seldom-write), read-often kind of access pattern for JSON documents, I wrote a simple (619 LOC), pure Go key-value store that supports this use case: microblob [1]. It logs in common format, uses a single-file backend, and scales up and down with RAM.

[1] https://github.com/miku/microblob


Not for this particular project, but I'll keep it in mind for if/when I ever need it!


Off topic: flipping through the Dgraph code, I noticed their licensing switch from Apache 2 to AGPLv3. Anyone involved around to comment? Adding a draconic open source license is an unwise decision for an early-stage database product, IMO.

(https://open.dgraph.io/licensing is a dead link)


> Adding a draconic open source license is an unwise decision

AGPLv3 is just a license. It is not "draconic" or "cancer" (despite what Ballmer wants you to believe); it simply represents an ethical/moral agreement with the ideology of software freedom, and since it is their code, they are free to express what they believe via the appropriate license. Nobody is making them do it, nobody is forcing you to use it, and nobody is demanding you agree with it. It was a free choice and is therefore as far from "draconic" as one can possibly be, unless you believe that everybody who doesn't subscribe to your worldview is "draconic" by definition.


I think what fortytw2 was trying to say is that the AGPL is not a wise choice for software that wants to gain as much popularity and usage as possible, since usage of AGPL-licensed software is categorically banned (even more so than GPLv3) by some companies.

As an aside, one can certainly describe something as draconic (or whatever else) if one views it as such; it's just an opinion.


Having just gone through a lengthy review process to identify a suitable license for our soon-to-be open-source software, which has a commercial aspect, and having selected AGPLv3, I'm very curious to know which companies have it "categorically banned". We did some research and didn't find that anyone had an issue with it. Whilst AGPL does open up some grey areas which aren't as well understood as the GPL, the general reason to use it seems to be as part of a dual-licensing scheme where companies with AGPL issues can simply purchase a non-transferable limited MIT license or similar.

Can you give me any more info on your sources?


Google bans it for example: https://opensource.google.com/docs/using/agpl-policy/.

I would also be surprised if Apple (being so allergic to the GPLv3) used AGPLv3 software, though that's just speculation.


There's a perennial discussion of AGPLv3 here on hacker news. A surprising number of projects select it, then revert to something less toxic to corporations.


> I think what fortytw2 was trying to say is that the AGPL is not a wise choice for software that wants to gain as much popularity and usage as possible

Why should that be a laudable goal? Not all projects are megalomaniac.


Three separate points here:

1) copyleft versus non-copyleft

2) which copyleft license to choose

3) the strategy of the dgraph

Regarding 1), non-copyleft leads to higher short-term adoption, but copyleft is often the better choice long-term. Moreover, history has shown that if you switch from a non-copyleft to a copyleft license, people will feel tricked. So the "early stage" argument doesn't hold. If you want to use copyleft long-term, better be honest and do so upfront.

Regarding 2), whenever you ask a lawyer in that field, they usually tell you that AGPLv3 is almost always what you want, preferable to GPLv3 and most other alternatives. So the "draconic open source license" argument doesn't hold. AGPLv3 just closes large holes which GPLv3 left open, to ensure that people actually stick to the copyleft principle. So if you want copyleft, choose AGPLv3 unless you have a very compelling reason not to.

Regarding 3), the dgraph people seem to see it a similar way:

https://open.dgraph.io/post/licensing/


The AGPL for database products isn't unheard of. See also Neo4J.

It makes sense from the host business's point of view. If you are the sole contributor, then you're entitled to do what you like. Moreover, you're also free to charge for commercial licences.

This licencing model wouldn't be appropriate for a community-centric database, such as PostgreSQL. With many contributors to core, no one would be able to arbitrage the situation.


> The AGPL for database products isn't unheard of. See also Neo4J.

And Mongo. Also RethinkDB until the parent company folded.

> This licencing model wouldn't be appropriate for a community-centric database, such as PostgreSQL.

I don't disagree.

It's interesting, though, how counterintuitive this is. I would think that GPL wouldn't be a problem for individual contributors (the types of participants I imagine when I think of a "community"), but for business contributors who don't want competitors to take advantage of their modifications. And yet, anything a business contributes back to an MIT/Apache project is actually less protected than a contribution to a GPL project.


> > This licencing model wouldn't be appropriate for a community-centric database, such as PostgreSQL.

> I don't disagree.

> It's interesting, though, how counterintuitive this is. I would think that GPL wouldn't be a problem for individual contributors (the types of participants I imagine when I think of a "community"), but for business contributors who don't want competitors to take advantage of their modifications. And yet, anything a business contributes back to an MIT/Apache project is actually less protected than a contribution to a GPL project.

Well, a lot of them want to, at least temporarily, distribute some features without releasing them. And that simply doesn't work for GPL projects, unless there's a sole owner and all external contributions are made under some form of CLA. There's a lot of open-core type projects, but in my experience they're on average less healthy than projects with multiple contributing entities.

For PostgreSQL there've been a lot of closed source forks, but a lot of them folded and/or couldn't keep up with the amount of changes and thus are based on some super old version (hello Redshift, hello Greenplum). The only ones that appear to be able to keep up are ones 1) that move more invasive changes upstream after a while and religiously rebase after every release, never delaying, or 2) move their modifications into extensions, possibly adding the necessary extension APIs to core PostgreSQL.


I think the problem is that they will never contribute any modifications back upstream. Pick the right license for the project: Apache is for market share, GPL is the perfect framework for community involvement, AGPL is best for control.


> AGPL is best for control

In my personal opinion, AGPL is largely chosen to keep various cloud providers from profiting significantly from $product without ever giving back. That's control, but a very specific form of it. I don't personally like the AGPL's legalese; it's very imprecise.


From what I've heard from lawyers in the Free Software world, they mostly applaud AGPLv3 for its clarity and precision.

Maybe it just seems imprecise to a layperson because they don't know the well-defined meaning of various legal terms? (... and confuse these with their fuzzy meaning in ordinary language)


> From what I've heard from lawyers in the Free Software world, they mostly applaud AGPLv3 for its clarity and precision.

You're sure they were talking AGPLv3 and not [L]GPLv3?

The definition of what constitutes an interactive program is quite vague (sections 0 and 13). Let's say you have a database server under the AGPL (Mongo, or say Citus). Clearly it supports interactive access in some form, but from the perspective of a user of an application using said database, access is not interactive, nor is it clear how the database could provide such an interactive notice. Various vendors have addressed that issue with clarifying notes about their understanding, but that definitely increases the doubts of potential users, including their lawyers.


Amazon is a huge threat for any infrastructure company using Apache 2.0. If you gain popularity, then Amazon will be a direct competitor once they host your project. Given that Amazon's services benefit from the IE effect, it's not irrational for an open source infrastructure company to eliminate such a threat via licensing.


Just to be clear, Badger's license is still Apache 2.



There's that word again, natively. Should just say written in Go. The native part is redundant.


I believe the author is using the term 'natively' here to describe the project being written purely in Go, rather than using CGO (i.e. being a wrapper to some database written in C).

I agree 'written purely in Go' would have been a better choice of words, though.


Or even "written in pure Go"


But if it was a wrapper, it would be wrong to say "a key-value store written in Go" (even without "purely").


Does "written natively" even mean anything?


How does it compare to LMDB?


Why not boltdb?


There are basically two usecases for LSM trees vs B+trees:

- Either you have a lot of random new writes and not so many updates, in which case LSM trees will ingest new data as fast as the disk can store them

- Or you care more about read performance, and LMDB and BoltDB will have more predictable (and arguably better) performance

InfluxDB is a timeseries database, so they write a lot of stuff and don't even read all of it. As data gets older it can be pruned efficiently with an LSM-based design, not so easily with a B+ tree. Dgraph, on the other hand, seems to sit right in the middle, as it wants to be a general-purpose database, so there's no easy winner here. Hopefully the choice was correct for most use cases.


InfluxDB went from RocksDB (LSM) -> BoltDB (B+ tree) -> custom (LSM again).

Here is a pretty good writeup: https://docs.influxdata.com/influxdb/v0.9/concepts/storage_e...


They're based on different technologies. Bolt uses B+ trees and Badger/RocksDB/LevelDB use LSM trees.

Bolt also uses insane amounts of RAM and writes get slower and slower as the size of the database increases. (Personal experience, don't have benchmarks. Take this at face value.)


We've experienced these downsides as well, although it seems that boltdb's memory usage can be deceptive due to mmap'ing the db file.


(Badger author here) BoltDB is super slow. Doesn't even come close to the performance we need, which is why we decided to write Badger from scratch.


Thoughts on implementing transactions on top of Badger? I can see a bunch of ways, but it seems you'd have to build an entire high-level transaction model on top in order to get isolation, read consistency and atomicity. You'd also want to do a lot of caching, I think, to make it fast.

If huge, long-running transactions aren't needed, then it may be easier to just let transactions happen in RAM, with a cache to enforce isolation and emulate atomicity. So for example, while a transaction is committing, the system would need to cache the previous (that is, currently committed) version of every key, to avoid having other transactions read the on-disk data, which is in the process of being updated. It would also need either a redo log or an undo log so that, in the event of a crash, any half-written transactions can be either replayed or rolled back.
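A toy sketch of the in-RAM approach described above, just to make its shape concrete; the KV interface and names are made up, and there's no crash-recovery log or conflict detection here:

```
package txn

import "sync"

// KV is whatever underlying key-value store is used (hypothetical interface).
type KV interface {
	Get(key string) ([]byte, bool)
	Set(key string, value []byte)
}

// Txn buffers writes in memory; reads inside the transaction see the
// transaction's own writes first, then fall through to the committed store.
type Txn struct {
	store   KV
	pending map[string][]byte
}

// Store wraps a KV with a commit lock so commits apply atomically with
// respect to each other (isolation here is deliberately very coarse).
type Store struct {
	mu sync.Mutex
	kv KV
}

func (s *Store) Begin() *Txn {
	return &Txn{store: s.kv, pending: make(map[string][]byte)}
}

func (t *Txn) Get(key string) ([]byte, bool) {
	if v, ok := t.pending[key]; ok {
		return v, true
	}
	return t.store.Get(key)
}

func (t *Txn) Set(key string, value []byte) {
	t.pending[key] = value
}

// Commit applies buffered writes under the store-wide lock. A real
// implementation would add a write-ahead log and conflict detection.
func (s *Store) Commit(t *Txn) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for k, v := range t.pending {
		s.kv.Set(k, v)
	}
}
```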


I'm curious if you have any more details here, or comparable benchmarks to share. BoltDB uses B+ trees and you say that a B+ tree approach is worth investigating due to improvements in SSD random write performance, so does BoltDB falsify that hypothesis or do you think it's just not well implemented and the idea still has potential?


We have tried with BoltDB. Its performance is really bad. It is just badly implemented, acquires a global mutex lock across all reads and writes. We wouldn't have written Badger if BoltDB worked for us.

RocksDB performs much better and is the most popular and efficient KV store on the market, being used at both Google (LevelDB) and Facebook. Therefore, the benchmarks are against that. Without spending time generating benchmarks, I'll bet that Badger would beat BoltDB any day.

The idea of using B+-trees with a value log has potential. One would need to do some obvious optimizations to decrease the frequency of writes to disk, because SSDs have to run garbage collection cycles, which can affect write performance in a long-running task.

But, I think it would make a great research project. And if it comes out to be better than our current approach of LSM tree in read-write performance, I'd switch in a heartbeat; because I think B+-tree might be a simpler design.


> We have tried with BoltDB. Its performance is really bad.

BoltDB author here. I agree with you that Badger would beat Bolt in many performance benchmarks but it's an apples and oranges comparison. LSM tree key/value stores are write-optimized and generally lack transactional support. BoltDB is read-optimized and supports ACID transactions with serializable isolation. Transactions come with a cost but they're really important for most applications.

Regarding benchmarks, LSMs typically excel in random and sequential write performance and do OK with random read performance. LSMs tend to be terrible for sequential scans since levels have to be merged at query time. B+trees are usually terrible at random write performance but can do well with sequential writes that are batched. They tend to have good random read performance and awesome sequential scan performance.

It comes down to using the right tool for the job. If you don't need transactions, Bolt is probably overkill. There's a whole section in the BoltDB README about when not to use Bolt:

https://github.com/boltdb/bolt#caveats--limitations

> It is just badly implemented, acquires a global mutex lock across all reads and writes.

You're welcome to your opinion about it being "badly implemented" but the global mutex lock across reads and writes is simply untrue. Writes are serialized so those have a database-wide lock. However, read transactions only briefly take a lock when they start and again when they stop so they can obtain a snapshot of the root node. That gives the transaction a point-in-time snapshot of the entire database for the length of the transaction without blocking other read or write transactions.
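For readers unfamiliar with Bolt, this is roughly what the transactional API looks like (adapted from the project README linked above; bucket and key names are made up, error handling trimmed):

```
package main

import (
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("example.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Read-write transaction: serialized with other writers,
	// committed atomically when the function returns nil.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("widgets"))
		if err != nil {
			return err
		}
		return b.Put([]byte("answer"), []byte("42"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read-only transaction: sees a consistent snapshot of the DB
	// without blocking other readers or the writer.
	db.View(func(tx *bolt.Tx) error {
		v := tx.Bucket([]byte("widgets")).Get([]byte("answer"))
		fmt.Printf("answer=%s\n", v)
		return nil
	})
}
```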


Very interesting, thanks for the answer! For the application and data structure I have in mind, I expect to need lots of range iteration over small values (<32 bytes?) and relatively infrequent writes, so Badger might not be a good fit. At a minimum sounds like I'd need to set up my own benchmarks.


For our system database [0] I am using BoltDB. It is a search engine with the database updated mostly in batches, so write performance is a non-issue. The Go/BoltDB custom index has been running without any issues under moderate load for 2 years and the performance is great (a molecule information page can be delivered to the end user in less than 50 ms at the 99.9th percentile).

So, you should really test your read/write ratio with the size of the keys and payloads before selecting one solution or another. Even on the Badger tests, you can see that it can vary a lot.

[0]: https://www.chemeo.com


I was wondering too, but Badger doesn't offer transactions and stuff like that (on purpose). It seems to be more low-level in some regards.

https://github.com/boltdb/bolt


Figures per minute? A same-process key-value store should give millions of operations per second (using one thread). And access via the network, e.g. a la Redis, should give hundreds of thousands per second [1].

[1] https://redis.io/topics/benchmarks


You are comparing a memory-first store against a disk-first store. Everything is much faster if it only has to be stored in RAM before the update is considered successful.


How do I use it? There are zero docs on how to even get started or what the API is like (is there even one?). This could be worth all the peanuts in the world, but if the only way to learn how to use it is to read the Go source, very few people will bother.


There is a link to the docs on the github page.

https://godoc.org/github.com/dgraph-io/badger


Go projects typically use godoc to autogenerate documentation from comments. This has the added benefit that a single service can maintain updated docs for every open source Go package in existence, e.g. https://godoc.org/



AH! I was looking in the docs/ dir, and found them to be completely missing. Thanks :)


This is pretty exciting. I would love to see comparisons between badger and goleveldb


LevelDB should be slower than RocksDB.


The world would benefit from some sort of map (or catalogue) of database engines and systems currently available.

Anyone aware of such a thing?


A quick search shows there are such things; there's one on Wikipedia, for instance [1]. But it's hard to assess everything in a single catalog.

[1] https://en.wikipedia.org/wiki/Comparison_of_relational_datab...


How does this compare to boltdb?



Nobody has said it, so I must: Badgers? We don't need no stinking badgers!


Worth noting Badger here is a reference to the Wisconsin Badgers, since it's based on a paper from UW-Madison.


Actually, no relation to Wisconsin Badgers. Badger is (and has been) Dgraph's mascot. So, we thought it would be nice to name the key-value store "Badger."



