As someone who generally finds "couple million rows? Sqlite is fine" works as a philosophy, what generally should I know here?
I get that there are huge timeseries based data flows going round the world (transactions, metrics, weather, self driving telemetry) and sticking those into something is useful.
I understood the world has settled on "use hdfs to store the data, use apache spark to run over those files and reduce your query data set till it is small enough to fit into pandas and then process"
This looks like it is trying to cut out the first two steps?
Ingress endpoints are usually the more annoying component, rather than the exact storage format.
> As someone who generally finds "couple million rows? Sqlite is fine"
There are still open annoyances with analog timeseries measurements which aren't easily handled in SQL with columns plus a btree index.
So combining an airplane altitude sensor sampled at 0.1Hz with a 10Hz fuel sensor starts giving you trouble in SQL, because the question someone might ask is "when the plane was at 31,000 feet, at 0.85 mach, what was the fuel rate?", while the plane might have gone from 30,999 to 31,001 feet without ever recording the intermediate point.
Put another way, you can do range scans for each measurement, but only if you treat time as something different from a generic column value in a generic SQL engine, so that rows can be joined by temporal proximity.
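For illustration, the "join by temporal proximity" idea is essentially an as-of join: for each fast sample, pick the most recent slow sample at or before it. A minimal sketch in plain Python (names and data are illustrative; a real engine does this inside the storage layer):

```python
import bisect

def asof_join(slow, fast):
    """For each (t, value) sample in `fast`, attach the most recent
    `slow` sample at or before t (an as-of join by temporal proximity).
    Both inputs are lists of (timestamp, value) sorted by timestamp."""
    slow_times = [t for t, _ in slow]
    joined = []
    for t, fv in fast:
        i = bisect.bisect_right(slow_times, t) - 1
        sv = slow[i][1] if i >= 0 else None  # no slow sample yet
        joined.append((t, fv, sv))
    return joined

# Toy data: 0.1 Hz altitude samples vs. 10 Hz fuel-rate samples
altitude = [(0.0, 30999.0), (10.0, 31001.0)]
fuel = [(9.9, 0.84), (10.0, 0.85), (10.1, 0.86)]
print(asof_join(altitude, fuel))
# [(9.9, 0.84, 30999.0), (10.0, 0.85, 31001.0), (10.1, 0.86, 31001.0)]
```

Note that even this sketch cannot answer "what was the fuel rate at exactly 31,000 feet" without interpolation, which is exactly the annoyance described above.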
The ASF proposal from IoTDB[1] might make interesting reading on the topic (the frequencies part), as might the Boeing talk from XLDB 2018 [2].
> So combining an airplane altitude sensor sampled at 0.1Hz with a 10Hz fuel sensor starts giving you trouble in SQL, because the question someone might ask is "when the plane was at 31,000 feet, at 0.85 mach, what was the fuel rate?", while the plane might have gone from 30,999 to 31,001 feet without ever recording the intermediate point.
PostgreSQL allows for all of that: user-defined types with user-defined index structures and whatnot. PostgreSQL is great and flexible, yet transaction safe and fast. You should really try it. It is one of the highest-quality FLOSS projects I have ever seen, not just in terms of code quality, but also in terms of project and community management.
I did a startup solving this problem (yet again) in an APL; we sucked at enterprise sales, but our data structures went rather faster than BTrDB's. The most famous example of solving this problem is Kx Systems. Less famous, but basically using the same ideas (these guys all worked with Art and the Iversons):
http://marketgridsystems.com/
And no, I won't explain it in detail. I consider the spark team basically cretins, and it would ruin several of my pals businesses while transferring wealth to hedge funds, exchanges and advertising groups making solutions to this problem available for free. I'll continue to snigger up my sleeve at the fact that decades of open source development still haven't noticed things which were obvious in the 1960s.
Edit add: a great example of this kind of thing, if not this exact thing: some poor guy trying to solve a 25TB problem using Spark and Parquet files ... finally solving it with ... Awk. Old tools written when 1MB was a lot of memory (sed, awk, tr, cut) are vastly better than new tools.
The nodes of the tree resemble Clojure's HAMTs, just skipping the hash. It's interesting, because Clojure itself opts for RB trees when building sorted maps, but for keys that can be cast to a bit sequence (strings, integers), AMTs are a pretty good option.
My side project is in some respects quite close to BTrDB; tree based analytics. Stumbling across it made my day, because it appears to have vindicated some design choices I was unsure about.
I've reimplemented Clojure's data structures in Clojure, and made the nodes serialize to a garbage collected KV backend (ardb or Redis depending on requirements). This means that they can persist between program runs, be sent over a network, and don't need to fit in memory. Using structural sharing, you can take a big hash map, add a single key to it, and have both the old and the new side by side.
The next piece is cached map/filter/reduce, which saves intermediate results on each node, making batch jobs compute incrementally. You can take a big hash map on disk (my main one contains ~20GB, a map of string -> set of data points), add some more values to it, and recalculate the analytics of the new map in a few milliseconds, having both side by side.
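The structural-sharing-plus-cached-aggregate idea can be sketched with a toy persistent tree in Python (a plain unbalanced BST caching a sum; the names and structure are mine, not the actual implementation being described):

```python
class Node:
    __slots__ = ("key", "val", "left", "right", "total")
    def __init__(self, key, val, left=None, right=None):
        self.key, self.val, self.left, self.right = key, val, left, right
        # cached aggregate: sum of vals in this subtree, computed once
        self.total = val + (left.total if left else 0) \
                         + (right.total if right else 0)

def assoc(node, key, val):
    """Return a new root. Only the O(depth) nodes on the path are
    rebuilt (and recompute their cached total); every other subtree
    is shared verbatim with the old version."""
    if node is None:
        return Node(key, val)
    if key < node.key:
        return Node(node.key, node.val, assoc(node.left, key, val), node.right)
    if key > node.key:
        return Node(node.key, node.val, node.left, assoc(node.right, key, val))
    return Node(key, val, node.left, node.right)

old = None
for k in [4, 2, 6, 1, 3, 5, 7]:
    old = assoc(old, k, k)        # total = 1+2+...+7 = 28
new = assoc(old, 100, 5)          # one extra entry

assert old.total == 28            # the old version is still intact
assert new.total == 33            # the new version sees the update
assert new.left is old.left       # untouched subtree is shared, not copied
```

Recomputing an aggregate after a small change only touches the rebuilt path, which is what makes "recalculate the analytics of a 20GB map in milliseconds" plausible.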
The ultimate goal is to enable you to write more or less idiomatic Clojure programs that behave the same whether you're reducing a map with a thousand entries, or a trillion (e.g. by firing up a bunch of AWS Lambda instances).
Not yet. I'm using it in prod for my own analytics, so I know it works, but it'll need some clean up and features before release. I do plan on it though.
Those interested in TSDBs might also be interested in M3DB.io, which is Apache 2 licensed and supports a modified TSZ compression implementation with an 11x compression ratio, replication of data within the cluster, streaming of data between nodes on add/remove, and a Kubernetes operator.
The compression technique to achieve the 11x compression ratio is pretty clever and is detailed in the Gorilla TSDB paper. Timestamps are encoded with a delta of delta encoding scheme which allows 96% of all timestamps to compress to a single bit within a compressed block.
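A sketch of the delta-of-delta idea in Python (illustrative only; the actual Gorilla format additionally variable-length bit-packs each term, which is where the single-bit encoding of zero comes from):

```python
def delta_of_delta(timestamps):
    """Emit the first timestamp, then the first delta, then the
    delta-of-delta for each subsequent point. Regularly sampled data
    produces long runs of 0, which Gorilla encodes as one bit each."""
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], None
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta if prev_delta is None else delta - prev_delta)
        prev, prev_delta = t, delta
    return out

# A 10-second scrape interval with one second of jitter
ts = [1000, 1010, 1020, 1031, 1041]
print(delta_of_delta(ts))  # [1000, 10, 0, 1, -1]
```

The decoder reverses the process by accumulating the deltas, so the stream is lossless despite most entries being tiny or zero.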
It is definitely a huge step forward for time series data collection at scale, great props to the authors.
The major difference between M3TSZ and vanilla TSZ is the value component: instead of XORing the float bits of each datapoint, the encoder checks whether the value's precision fits within a few significant digits. If so, the value is converted to an int representation, and delta-of-delta of the ints from value to value is used to represent the changes, rather than XORing the float bits. The encoder also tracks the number of significant digits and damps the need to re-encode changes to it from value to value.
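To illustrate the value path, here is a sketch of the idea in Python (this is not M3TSZ's actual bit format; the function name, the digit cap, and the float tolerance are all mine):

```python
def int_delta_of_delta(values, max_digits=6):
    """If every value fits within `max_digits` decimal digits of
    precision, scale to ints and return (digits, delta-of-delta stream);
    otherwise return None, signalling a fallback to float-XOR encoding."""
    for digits in range(max_digits + 1):
        scale = 10 ** digits
        # tolerance absorbs float representation error (e.g. 0.85 * 100)
        if all(abs(v * scale - round(v * scale)) < 1e-9 for v in values):
            break
    else:
        return None  # too much precision for the int path
    ints = [round(v * scale) for v in values]
    out = [ints[0]]
    prev, prev_delta = ints[0], None
    for i in ints[1:]:
        delta = i - prev
        out.append(delta if prev_delta is None else delta - prev_delta)
        prev, prev_delta = i, delta
    return digits, out

# A slowly drifting sensor reading with one decimal digit of precision
print(int_delta_of_delta([20.5, 20.7, 20.9, 21.0]))  # (1, [205, 2, 0, -1])
```

Sensor data with fixed precision and smooth drift collapses into tiny integers this way, which compress far better than XORed float bits.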
We should write up the specific algorithm differences for the value component encoding.
Thanks, I learned something new. It's not general purpose and has some strict limitations. As usual, this seems to skip concepts like the quality/status of a sensor (maybe qNaN can be used as a sort of sentinel value), but it is an interesting newer entry in the field.
It looks very interesting, and I am a bit surprised to see how a tree structure can deliver such high performance in retrieving the data. Querying a long period of time means that each top node has to be walked to the bottom, right? Isn't that fairly expensive?
The aggregations at node level are also interesting. It is something that the company I work for (https://www.energy21.com) also does for calculations, but we use a relational database and a notification-based system to perform the recalculations and keep the aggregations up to date.
I read about the kdb+ database in a post the other day calling it "the fastest" database, and the numbers given did seem incredibly impressive. How does this compare?
> This is an example of running the benchmark in a MacBook Air 11" using a pipelining of 16 commands:
$ redis-benchmark -n 1000000 -t set,get -P 16 -q
SET: 403063.28 requests per second
GET: 508388.41 requests per second
Testing against a MacBook Air is reasonable, as most developers will have access to one, or to a Windows machine of similar specs.
For instance, I have a 5 year old Macbook Air, and I run this test to benchmark (baseline) my database:
My / Redis test is not very interesting (not real-world use case), but it at least gives a baseline that removes a lot of variables.
Then every database should implement Jepsen-like tests for everything else. That is what we're trying to do by creating a Jepsen-like distributed-systems correctness and load-testing tool (https://github.com/gundb/panic-server) in a language that is easy for any developer to access and use.
> ... as most developers will have access to one or a windows machine of similar stats.
That's a bold assumption. Looking around my office in India here, I find not a single Macbook, a Windows machine, or any machine from '11. It would also be trouble to get my hands on a Macbook Air '11, if I had to do it for the purposes of reproducing a benchmark. I certainly couldn't buy one today, and even if I were to buy a currently-in-stores Macbook it'd be awfully expensive.
A much better proposition would be something far more ubiquitous and accessible, something that can be borrowed trivially or bought easily by the vast majority of people, something that stays stable and available for a number of years. An AWS instance or a Raspberry Pi (1/2/3), for instance.
Thank you for calling this out. It was the spirit of what I was trying to go for, because the website uses "60GB of RAM (EC2 c3.8xlarge)", which I wouldn't even pay for.
I totally agree with the RasPi suggestion. Though note, the RasPi 3 B+ is actually about as powerful as my 5 year old air, 1.4GHz quadcore vs 1.6GHz quadcore.
> "60GB of RAM (EC2 c3.8xlarge)", which I wouldn't even pay for.
The point isn't to restrict benchmarks to only high-end machines. The point is to use a stable standard that people wouldn't have trouble running for a few hours. I can rent a machine of that identical type for 20 hours, or buy an RPi, for the low price of $35.
> about as powerful as my 5 year old air, 1.4GHz quadcore vs 1.6GHz quadcore
Core counts and clock speeds do not translate to comparable computing power across processor families and instruction sets. E.g., the RPi 4 (1.5GHz, quadcore) is announced to deliver 3x the performance of an RPi 3B+ (1.4GHz, quadcore), and that's on the same instruction set.
> Though note, the RasPi 3 B+ is actually about as powerful as my 5 year old air, 1.4GHz quadcore vs 1.6GHz quadcore.
I don't think that's an apples-to-apples comparison. The Pi 3 doesn't have an out-of-order CPU, so it can't exploit much instruction-level parallelism. The new Pi 4 is better in this regard, but both are handily beaten by Intel CPUs from the last 10 years.
So even Berkeley has abandoned BSD? Or is GPLv3 part of the U.S. Department of Energy ARPA-E program (DE-AR0000430), National Science Foundation CPS-1239552 requirement?
BSD UNIX used the BSD license (among other things).
There are/were/always will be many projects, departments, professors, etc. doing things needing licenses at UCB, and each will choose the appropriate one for their circumstances.
That's an interesting choice for a system intended to be high performance. Go's gRPC implementation has been suffering from extremely high resource usage: https://github.com/grpc/grpc-go/issues/1455
Are use case driven DBs helpful? Is the use case and possible adoption crowd large enough to sustain a product this critical down the road?
I worry about the proliferation of infra level tools or adopting a new one. Why not make an extension or fork postgres to demonstrate your concept? (Just an example)
Timeseries databases have an entirely different set of technical challenges from relational databases. Also, I wouldn’t exactly call this proliferation. There are relatively few decent timeseries databases in existence, some of which have been around since 1999.
So yes, timeseries databases are incredibly helpful if you have to store, well, large amounts of timeseries data quickly.
For a view into the original Berkeley DB as still incorporated in the FreeBSD base system (BSD Berkeley DB v1; not the later lineage by the same authors which led to Sleepycat / Oracle Berkeley DB).
> Why not make an extension or fork postgres to demonstrate your concept?
This isn't that kind of database.
This is a k/v store optimized around a particular use case. K/v stores are what was more traditionally considered a 'database', before the term colloquially came to imply 'relational database management system' (RDBMS, i.e. a system to manage your k/v database files relationally).
Well, technically a 'database' is any 'base of data'. But yes: 'base of data' -> 'k/v store' -> 'RDBMS'.
Conceivably, if desired, one could take this library and integrate it into a higher-level application, whether that be a custom application datastore or a new backend for something like Postgres.
> BTrDB uses a K-ary tree to store timeseries data. The leaves of the tree store the individual time-value pairs. Each internal node stores associative statistics of the data in its subtree; currently, the statistics are the minimum value, mean value, maximum value, and number of data points. BTrDB uses the internal nodes to accelerate processing of these statistical aggregates over arbitrary time ranges.
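The quoted design can be sketched as a toy in-memory analogue: a binary tree (rather than K-ary) over an array of values, where each internal node caches associative statistics so a range query touches only O(log n) nodes. I store the sum instead of the mean, since sums compose associatively and the mean falls out as sum/count; none of the names or structure here are BTrDB's actual format.

```python
class StatNode:
    def __init__(self, lo, hi, values):
        self.lo, self.hi = lo, hi          # this node covers indices [lo, hi)
        if hi - lo == 1:                   # leaf: a single time-value pair
            v = values[lo]
            self.children = []
            self.mn = self.mx = self.sm = v
            self.cnt = 1
        else:                              # internal node: cache subtree stats
            mid = (lo + hi) // 2
            self.children = [StatNode(lo, mid, values),
                             StatNode(mid, hi, values)]
            self.mn = min(c.mn for c in self.children)
            self.mx = max(c.mx for c in self.children)
            self.sm = sum(c.sm for c in self.children)
            self.cnt = hi - lo

    def query(self, lo, hi):
        """(min, max, sum, count) over [lo, hi). Any subtree fully inside
        the range answers from its cached stats without descending."""
        if hi <= self.lo or lo >= self.hi:                 # disjoint
            return (float("inf"), float("-inf"), 0.0, 0)
        if lo <= self.lo and self.hi <= hi:                # fully covered
            return (self.mn, self.mx, self.sm, self.cnt)
        parts = [c.query(lo, hi) for c in self.children]   # boundary: recurse
        return (min(p[0] for p in parts), max(p[1] for p in parts),
                sum(p[2] for p in parts), sum(p[3] for p in parts))

vals = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
root = StatNode(0, len(vals), vals)
mn, mx, sm, cnt = root.query(2, 7)
print(mn, mx, sm / cnt)  # min, max, and mean over indices 2..6
```

Only the subtrees straddling the range boundaries are descended; everything strictly inside is answered from cached statistics, which also addresses the "walk every top node to the bottom" worry raised elsewhere in the thread.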
It's the same meme - a stack of pancakes as a visual pun on "software stack"/"full stack." Not the only place I've seen that before, either - I've seen the same idea on T-shirts and laptop stickers.
Funnily enough, I went to sequelpro.com/legal and found that while their software is under the MIT license, as opposed to the more restrictive GPL license used by BTrDB, the website specifically calls out the icon as property of the project and disallows its use in forks. So it looks like they really would care about the logo being ripped off more than anything.