Storage engine design (akumuli.org)
179 points by chaotic-good on May 2, 2017 | 28 comments



It looks like Yandex recently open-sourced the Graphite engine they built on top of ClickHouse:

https://github.com/yandex/graphouse

Looks really interesting for Graphite-like use cases.


I don't think you want to measure your timeseries datastructure against LSM-trees, because the latter is inherently a pretty bad structure to use for timeseries (which are mostly append-only), as a few projects have painfully found out.

Anyways, I'm interested in timeseries, so I read the article and tried to understand the datastructure, but to be honest it opens up more questions than it answers. That said, I applaud people who try to describe the datastructures that are core to their app. Thanks for that.

  1. What is the exact datastructure of a leaf? You mention a leaf node can hold 1000 datapoints.
  2. Why is using a WAL impossible? That should be possible for any datastructure.
  3. In your example if level 3 is full and everything gets merged into level 4, there are no levels 1-3. How does one query the datastructure? Does it maintain a pointer from the root to level 4?
  4. Related to above: if I want to query all data from last month until now, which happens to be the start of level 4, it will first go to root, then to the level 2 Leaf, from there to the level 2 SBlock, from there to the level 3 Leaf, then level 3 SBlock, then level 4 Leaf, then level 4 SBlock, then the first level 4 leaf? That seems like a lot of random access. How many IOPS does a lookup need?
  5. SBlocks need to be kept in memory. If I have ten million timeseries (not uncommon, can be much more), each with 3 levels, then upon startup the app has to load 30M SBlocks into memory?
  6. You say that several trees can be interleaved in a single file; how does that not break the linear IO advantage of columnar layouts?
  7. How do you store information in the "inner nodes" (SBlocks?) to speed up aggregation? Do you store every possible aggregation in there? E.g. sum, avg, min, max, stddev, ...
  8. The storage format of an individual time series is only part of what a TSDB needs to do; another part is how to find all the timeseries for a particular query. How does that work?
And in general I think you can't have all three:

  A) good write performance
  B) no write amplification
  C) low RAM usage
... because you have to either write data out immediately (and get either 2x+ write amplification or lots of random writes/reads) or buffer it in RAM to batch linear writes.
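To make the second branch concrete, the "buffer it in RAM to batch linear writes" option looks roughly like this toy sketch (the names here are made up, purely for illustration):

  // Toy sketch: trade RAM usage (C) for big linear writes (A, B).
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  struct Point { uint64_t ts; double value; };

  class SeriesBuffer {
      std::vector<Point> buf_;          // the RAM cost lives here
      std::size_t flush_threshold_;
  public:
      explicit SeriesBuffer(std::size_t n) : flush_threshold_(n) {}

      template <typename Sink>
      void append(Point p, Sink&& write_block) {
          buf_.push_back(p);
          if (buf_.size() >= flush_threshold_) {
              write_block(buf_);        // one large, linear write
              buf_.clear();
          }
      }
  };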

I think there are some interesting ideas in this structure; it looks to me more like a linked list of one-level-deep B+trees than one big overall tree.


1. Each leaf node is a fixed-size block that contains compressed values and timestamps. 1000 values is just an example; the number of values in one leaf node is variable.
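Roughly, a leaf looks something like this (a simplified sketch, not the actual on-disk layout; the field names and the 4KB block size are just for illustration):

  // Simplified sketch of a leaf block; field names and the 4KB
  // block size are illustrative, not the actual layout.
  #include <cstddef>
  #include <cstdint>

  constexpr std::size_t BLOCK_SIZE = 4096;

  struct LeafHeader {
      uint64_t series_id;      // which series this block belongs to
      uint64_t begin_ts;       // timestamp of the first point
      uint64_t end_ts;         // timestamp of the last point
      uint32_t num_points;     // how many points are packed below
      uint32_t payload_bytes;  // size of the compressed payload
  };

  struct LeafBlock {
      LeafHeader header;
      // Compressed timestamps and values share the rest of the
      // fixed-size block; how many points fit depends on how well
      // the data compresses, which is why "~1000" is only typical.
      uint8_t payload[BLOCK_SIZE - sizeof(LeafHeader)];
  };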

2. Because there are a lot of data structures. I'm using a tree per series, and the database can easily store hundreds of thousands of series. Creating a WAL per series is not feasible.

3. It maintains a list of roots.

4. One I/O operation per node. You will fetch a leaf node for every ~1000 data points and a superblock for every 32 leaf nodes. It's not as bad as it sounds, because you will read data for one series only. To span 4 levels the series has to contain tens of millions of points.
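Back-of-the-envelope, using those rounded per-node figures (a sketch, not actual code):

  // Rough I/O estimate for a range scan over one series, using the
  // ~1000 points/leaf and 32 leaves/superblock figures above.
  #include <cstdint>
  #include <cstdio>

  uint64_t node_reads(uint64_t points) {
      uint64_t leaves  = (points + 999) / 1000;  // one read per leaf
      uint64_t sblocks = (leaves + 31) / 32;     // one read per superblock
      return leaves + sblocks;
  }

  int main() {
      // Scanning 10M points: ~10,000 leaf reads plus ~313 superblock
      // reads, so the superblock overhead is only about 3%.
      std::printf("%llu\n", (unsigned long long)node_reads(10000000));
  }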

5. Yes. You will need a beefy machine for this with a lot of RAM.

6. Random reads are fast on modern SSDs. It's optimized for SSDs (I simply don't have a computer with an HDD).

7. It stores only composable aggregations - min, max, count, sum, min/max timestamps.
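Composable means a parent's summary can be produced by merging its children's summaries, roughly like this (a simplified sketch, not the actual code):

  // Sketch of composable (mergeable) aggregates.
  #include <algorithm>
  #include <cstdint>

  struct Aggregates {
      double   min, max, sum;
      uint64_t count;
      uint64_t min_ts, max_ts;
  };

  // A superblock's summary is just the merge of its children's summaries.
  Aggregates combine(const Aggregates& a, const Aggregates& b) {
      return Aggregates{
          std::min(a.min, b.min),
          std::max(a.max, b.max),
          a.sum + b.sum,
          a.count + b.count,          // the mean is derived later as sum / count
          std::min(a.min_ts, b.min_ts),
          std::max(a.max_ts, b.max_ts),
      };
  }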

8. All series names are stored in memory. At query time this memory is scanned using a regexp to find the relevant series names and their ids. This is a kind of temporary solution; it works well enough for datasets with small cardinality (around 100K series).
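Conceptually the lookup is just a linear scan with a regex over an in-memory map, something like this (simplified; the names here are made up):

  // Simplified sketch of the "scan all series names with a regex" lookup.
  #include <cstdint>
  #include <regex>
  #include <string>
  #include <unordered_map>
  #include <vector>

  std::vector<uint64_t> find_series(
          const std::unordered_map<std::string, uint64_t>& name_to_id,
          const std::string& pattern) {
      std::regex re(pattern);
      std::vector<uint64_t> ids;
      for (const auto& [name, id] : name_to_id) {
          if (std::regex_search(name, re))  // O(N) over all series names,
              ids.push_back(id);            // fine for ~100K series
      }
      return ids;
  }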


Akumuli is designed for SSD and NVMe drives, so I chose to have a lot of random reads and writes. My laptop's NVMe drive has a random write throughput of around 400MB/s (AFAIR), and my heaviest performance test wrote data at a rate of about 70MB/s (16M data points per second).


This is perhaps the most interesting aspect of it to me. When we relax the constraint that 'mass storage access must be as linear and infrequent as possible', what sort of possibilities does that open up in the design space that were previously untenable?

Nice work and thank you for sharing it.


It's not that easy, actually. The simplest method that can utilize the full throughput of the drive is to use large writes (1MB or larger). This is the fastest possible way to write data to the SSD, period. This method also creates the simplest possible FTL mapping table.

Random reads and writes are significantly slower if you issue everything from one thread. To speed things up you should write in parallel (for example using Linux AIO + O_DIRECT, or libuv + O_DIRECT). OS-level buffering and many OS threads will deliver good random write throughput as well.
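For reference, a single aligned O_DIRECT write on Linux looks roughly like this (a minimal sketch; real code keeps many such writes in flight via AIO or a thread pool):

  // Minimal sketch of one aligned O_DIRECT write on Linux.
  // Real code would issue many of these in parallel to keep the
  // SSD's queues full.
  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE
  #endif
  #include <fcntl.h>
  #include <unistd.h>
  #include <cstdlib>
  #include <cstring>

  int main() {
      const std::size_t kChunk = 1 << 20;  // 1MB writes, as discussed above
      int fd = open("data.vol", O_WRONLY | O_CREAT | O_DIRECT, 0644);
      if (fd < 0) return 1;

      void* buf = nullptr;
      if (posix_memalign(&buf, 4096, kChunk) != 0)  // O_DIRECT needs alignment
          return 1;
      std::memset(buf, 0, kChunk);

      // With O_DIRECT the buffer, length and file offset must all be
      // block-aligned.
      ssize_t n = pwrite(fd, buf, kChunk, 0);

      std::free(buf);
      close(fd);
      return n == (ssize_t)kChunk ? 0 : 1;
  }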

There are other effects to consider, e.g. read-write interference.


I understand. I would expect that you will get an additional boost if you target Intel's 'Optane' technology which, by its design, allows for a much faster channel turnaround and so less interference. And in the fairly recent past other vendors like Texas Memory Systems developed strategies which were all RAM plus a bit of cleverness to snapshot to HD when the power fails. The point being that with enough money you could brute-force the solution, but now the money required is decreasing and so new strategies are opening up.


If I understand this right, with Intel's Optane you will eventually need to write everything to HDD because data collection happens at a steady pace and the cache size is limited.


Depends on the size of your data set. Intel's plan, according to their website, is to replace SSDs (especially NVMe ones) with Optane-based solid-state memory. The road map has them shipping exabytes of the stuff eventually.

So as I see it you'd be constrained by 32GB Optane modules today, but they will eventually (one, maybe 2 years) be 2 TB modules like the Samsung 960 Pro modules are today. And an M.2 port is really just a PCIe slot so you're looking at systems with maybe 32 TB of Optane storage on the high end within the next 5 years.


My understanding is that even SSDs perform somewhat better sequentially (throughput-wise), though the difference isn't quite as dramatic as with HDDs. That said, the 400 MB/s random write speed mentioned for NVMe is plenty faster than the sequential write speeds most people had access to until recently with SSDs, so that's pretty interesting.


Good content on building time-series database engines that was just posted in another HN thread:

https://medium.com/slicingdice-com-blog/want-to-build-a-new-...

https://news.ycombinator.com/item?id=14246189


What I've found matters in this area is the mismatch in locality between elements in read batches and elements in write batches. It'd be nice if the emerging DBs that deal with these issues put at least a gloss on the information model and write -> read lifecycle they're targeting.

Otherwise, a lot of these "actually you need X for time series!" arguments are just talking past each other, because "time series" can mean any number of actual patterns.


What do you mean by gloss?


FYI I found this a great article otherwise.


Summarize.


It seems like a query for N time series at a specific timestamp, or across a small span, will be inherently slow because it requires O(N) block reads? Is there some way to support this kind of query efficiently?


Fascinating stuff! How does this compare to other TSDBs like Influx (which uses an LSM-tree) and Druid?


Akumuli is quite different from InfluxDB. It focuses on single node performance and operational simplicity. Essentially I'm trying to make it a "fire and forget" kind of app. No idea about Druid.


and prometheus?


Prometheus is a monitoring system (pull based), Akumuli is a TSDB (push based). I believe that one can use Akumuli as long-term storage for Prometheus.


Prometheus is fundamentally a TSDB, see https://fabxc.org/blog/2017-04-10-writing-a-tsdb/ for the design of the next version.


This is one more example of a design in which one file holds many series and everything is chunked by time:

- "there is no longer a single file per series but instead a handful of files holds chunks for many of them"

- "We partition our horizontal dimension, i.e. the time space, into non-overlapping blocks. Each block acts as a fully independent database containing all time series data for its time window."

I don't believe this will work out well because it will introduce read amplification at query time (compared to the file-per-series approach that they're using now). And I'm really curious how they managed to get 20M writes per second on a laptop. The article states that they're using the compression algorithm from the Gorilla paper, and the Gorilla paper's authors claim that they managed to get 1.5M on a single machine.


It seems very much like the B+ tree approach is just a mental model put on top of the exact same idea that is being argued against. The initial list of "bad things about LSM approaches" has almost exactly the same items on it as the list of features the B+ approach claims to achieve.

Maybe I'm getting this all wrong, but aren't the leaves also representing chunked data, which is compressed?

The Prometheus solution also sequentially places compressed chunks for the same series. The time slicing actually has a lot of benefits and can simply be seen as the first level of the described B+ tree. An index of chunks for a series can then be seen as the second level.

The potential read amplification here seems completely equivalent. Just from my high-level view, all properties of the read and write path seem almost identical.


>> Maybe I'm getting this all wrong, but aren't the leaves also representing chunked data, which is compressed?

Leaf nodes contain data from one series (data that should be read together), whereas an SSTable with time-series data contains many series, and there is no guarantee that all of those series will be used by the query.

>> The Prometheus solution also sequentially places compressed chunks for the same series.

I'm not really that familiar with Prometheus internals, especially the indexing part. As I understand it, it doesn't align writes, so there is a lot of write amplification at the lower level, which translates to cell degradation and non-optimal performance, but I could be wrong here.


> I don't believe this will work out well because it will introduce read amplification at query time (compared to the file-per-series approach that they're using now).

It'll end up about the same in practice; only the time series data that needs to be read is read.

Query performance is looking quite a bit better with this design.

> And I'm really curious how they managed to get 20M writes per second on a laptop.

I understand that was a micro-benchmark of one part of the system. The whole system is looking to be roughly in line with the Gorilla numbers.


> I understand that was a micro-benchmark of one part of the system. The whole system is looking to be roughly in line with the Gorilla numbers.

This makes sense now. I've found that the compression algorithm's performance affects the overall performance in a big way. On a modern SSD the entire workload is CPU-bound.


Isn't that an implementation detail? Both have to store time series.


Yet another project with an Esperanto name



