Time-Series Compression Algorithms (timescale.com)
190 points by todsacerdoti on May 14, 2022 | 30 comments



I found TerminusDB's white paper on graph compression interesting, and I think it could be useful for time-series compression:

Succinct Data Structures and Delta Encoding for Modern Databases - https://assets.terminusdb.com/research/succinct-data-structu...


Navarro's book Compact Data Structures is an excellent survey of this literature! https://www.cambridge.org/core/books/compact-data-structures...

I love the perspective of computing the information theoretic lower bounds for a data structure and then trying to achieve that with as little overhead as possible while still supporting efficient operations on the structure.


Question for people who've implemented these kinds of compression schemes: some of these mechanisms (RLE and S8b) require local context, so how do you handle random access in these cases? Is the metadata embedded as a sort of key frame to which we have to advance before decoding the value? Or as a separate block in the headers?

RLE especially sounds like it could degenerate into a binary search per lookup, maybe aided by caching the bounds of the last run and assuming locality and linear-ish usage?
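
For illustration (not from the article), here's a minimal sketch of that binary-search approach in Python: store the starting index of each run, then bisect to find the run covering a given position.

    import bisect

    # Runs as (value, length); starts[i] is the index of the first
    # element covered by runs[i].
    runs = [("a", 5), ("b", 2), ("c", 9)]
    starts = []
    pos = 0
    for _, length in runs:
        starts.append(pos)
        pos += length

    def rle_lookup(i):
        # Rightmost run whose start is <= i.
        r = bisect.bisect_right(starts, i) - 1
        return runs[r][0]

    assert rle_lookup(6) == "b"
    assert rle_lookup(15) == "c"

Caching the bounds of the last run, as you suggest, turns mostly-sequential access back into O(1) for the common case.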

This might sound obvious, but it wasn't mentioned in the article: you can also apply integer-based compression schemes to dictionary-encoded data.

And floating-point values don't always need all 64 bits when the use case and input data don't require that level of precision.
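
As a rough sketch of both points (names made up): quantize floats at a known precision into small integers, which can then go through the same integer codecs as dictionary codes.

    # Fixed-point quantization so float data can reuse integer codecs.
    # A scale of 100 keeps two decimal places.
    SCALE = 100

    def quantize(values, scale=SCALE):
        return [round(v * scale) for v in values]

    def dequantize(ints, scale=SCALE):
        return [i / scale for i in ints]

    temps = [21.37, 21.41, 21.38]
    ints = quantize(temps)   # [2137, 2141, 2138] -> delta/S8b-friendly
    assert dequantize(ints) == temps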


Most applications don't require true random access.

Instead you're typically reading a range of data, and then you can decompress just the blocks required for the data you want to see.

Caching of partial queries can also help substantially. For example, if many queries involve querying the max() of some per-second data grouped by minute, it is well worth caching that rather than reading the source data every time to calculate the max().

Typically the query engine can keep counts of every subquery, how frequently it's used, and how many data points it involves to decide how long to cache it for. As far as I'm aware no open-source TSDB does this, despite it being a massive, simple win, especially for alerting systems and dashboards that run very similar queries frequently.
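
A toy sketch of that idea (run_subquery is a hypothetical stand-in for hitting the compressed store): cache aggregate results keyed by subquery and count hits to decide retention.

    from collections import Counter

    cache, hits = {}, Counter()

    def cached_subquery(key, run_subquery):
        # key might be ("max", series_id, minute_bucket)
        hits[key] += 1
        if key not in cache:
            cache[key] = run_subquery(key)  # read compressed data once
        return cache[key]

    # Eviction sketch: periodically drop keys whose hit count no
    # longer justifies the memory, e.g. hits[key] == 1.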


I was thinking the same thing as I was reading this. I doubt you could retain 100% random access. Instead, I think you would generally create "blocks" that are compressed in a time range that is compatible with your application (e.g., 1-day blocks or 1-week blocks) and then, when picking dates/times, you take X blocks and further refine the query after decompression (example: load May 5th block --> May 5th 8:10AM - 11:13AM). At least, that is what I have done in the past in my own home-grown apps. Essentially each block then starts and terminates compression; the number of blocks is a trade-off between compression efficiency and granularity.
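
A minimal sketch of that block scheme (made-up names, day-keyed blocks, zlib standing in for the real codec):

    import json, zlib

    blocks = {}  # "YYYY-MM-DD" -> compressed samples for that day

    def store_day(day, samples):
        # samples: list of [epoch_seconds, value] pairs
        blocks[day] = zlib.compress(json.dumps(samples).encode())

    def query(days, t_start, t_end):
        out = []
        for day in days:  # only decompress candidate blocks
            if day in blocks:
                samples = json.loads(zlib.decompress(blocks[day]))
                out += [(t, v) for t, v in samples if t_start <= t <= t_end]
        return out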


Correct, almost all time-series databases divide the data into shards / partitions / whatever-it’s-called, which are then split by column, which are then compressed as a single unit.

Some databases use a fixed block size (e.g., as you mention, “1 day”), which is simple and stateless to manage, while others dynamically “split” blocks into smaller blocks (frequently called “ranges”), or merge them back later. The latter is significantly more complex, but is a much better approach for varying workloads where you don’t know the right shard size in advance, or where you need to deal with highly varying workloads, e.g., a lot of traffic at specific times of day or days of the week/month/year.


> (...) how do you handle random access in these cases?

Given we're discussing time series, random access is something that's not required at all. There is no use case where you're looking for a single, specific data point of a time series. Time series are used to store and retrieve batches of data points in order to process and/or analyze/visualize them in batches. Time-series queries consist mainly of "get me all the data points of a time series reported between timestamps A and B".


> There is no use case where you're looking for a single, specific data point of a time series.

I'm no expert but it seems trivial to come up with some. The article talked of storing temperature and humidity, so in relation to that:

- "What was the hottest day in the past <variable time period>" will require you to start decompressing at $now-$period. Having to start at the dawn of time is kind of a pain

- "I was so warm last night that I even woke up from it at 5am, what were the temperature and humidity then?"

- Record highs are not going to be at night, so if you want to find that, we can skip a lot of data there. Now, temperatures don't change every millisecond so it's not a lot of data even if you would have to start decompressing 50 years back, but in general, this kind of optimization is only possible if you have random access. (A better example here might be if you have petabytes of LHC data compressed this way and want to find something in it.)

Random access is not the common case, but it's not a bad feature to have, either. The comment kind of acknowledges this by mentioning batches, but starts by dismissing the problem.


The commenter you're responding to isn't saying that it's literally impossible to retrieve a single point without doing a full scan of the data, but rather that it's not the top priority of these data structures to support queries for individual points.

The GP comment even says:

> Time-series queries consist mainly of "get me all the data points of a time series reported between timestamps A and B"

which seems exactly like what you mentioned with "$now-$period".

These kinds of data structures, often found in OLAP databases, assume that point queries will be less common, and they accept a bit higher latencies for those queries. But those latencies are still fairly small.


> "What was the hottest day in the past <variable time period>" will require you to start decompressing at $now-$period.

Your example relies on a query to retrieve all data points between timestamp A (X days ago) and B (now).

> - "I was so warm last night that I even woke up from it at 5am, what were the temperature and humidity then?"

What's your definition of then?

In time series, it's typically a statistic calculated from all data points within a time window. It might be a max, it might be a percentile, it might be a summary statistic of some kind.


I have found that a very good approach is to apply some very simple transformations such as delta encoding of timestamps, and then letting a good standard compression algorithm such as zstd or deflate take care of the rest.

Delta encoding of timestamps helps a lot though, because it makes the redundancy more visible to a general purpose compression algorithm.
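
A minimal sketch of that pipeline, with stdlib zlib standing in for zstd/deflate:

    import struct, zlib

    def compress_timestamps(ts):
        # Delta encode: regular sampling becomes long runs of the same
        # small value, which the general-purpose compressor squeezes well.
        deltas = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]
        return zlib.compress(struct.pack(f"<{len(deltas)}q", *deltas))

    def decompress_timestamps(blob):
        raw = zlib.decompress(blob)
        out = []
        for d in struct.unpack(f"<{len(raw) // 8}q", raw):
            out.append(d if not out else out[-1] + d)
        return out

    ts = list(range(1_600_000_000, 1_600_100_000, 10))
    assert decompress_timestamps(compress_timestamps(ts)) == ts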

I used this for the telemetry storage of the Columbus module of the International Space Station, back in ~2010, and then a few times since.

http://blog.klaehn.org/2018/06/10/efficient-telemetry-storag...

https://github.com/Actyx/banyan


> delta encoding of timestamps, and then letting a good standard compression algorithm such as zstd or deflate take care of the rest

This can cut both ways. If the integer data is irregularly distributed, you may increase entropy and give the compressor a harder time than it would have had just looking for slightly longer common substrings.

I've lost count of the number of compression experiments where some fancier encoding that greatly reduced redundancy in the uncompressed representation also happened to worsen (or at least have no effect on) the compressed output size. I'd even be tempted to say it makes things worse more often than not.


I tried this with real data (satellite and Columbus-module telemetry). At least for the timestamps, it is always a big win, because values are typically sampled at a precise, or at least somewhat regular, frequency.

For sample values, it is not as clear cut. The best way I found is to just try both approaches and use the one that results in better compression, if you can afford it. But for values I usually don't bother.
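
That "try both, keep the winner" step can be as small as this sketch (both arguments are hypothetical byte encodings of the same values):

    import zlib

    def compress_best(raw_blob, delta_blob):
        a, b = zlib.compress(raw_blob), zlib.compress(delta_blob)
        # Prefix a 1-byte flag so the reader knows which transform won.
        return (b"\x00" + a) if len(a) <= len(b) else (b"\x01" + b)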


For Wavefront I got this down to about 1.5 bytes per metric reported (on average across customers) where each metric was a tuple of:

name, source, timestamp (s), value (int) and a set of name/value pairs

Key to optimizing it was having a database that supported different compression strategies based on the data.


> Key to optimizing it was having a database that supported different compression strategies based on the data.

I keep wondering why automatic detection of a compression strategy is hardly ever used. Different algorithms are just better at different things.

I thought timsort was one of the few things that does this, but a quick look at Wikipedia indicates that I misremember. With that gone, off the top of my head I cannot think of any utility or library that does this sort of thing. It doesn't seem hard to just take a few kilobytes of data, try a handful of good algorithms, pick the best one, and repeat after a few megabytes. I imagine the gains would be double-digit percentages for at least a fifth of the datasets out there, without necessarily losing out on (de)compression speed. Curious if anyone ever looked into this, though there will always be some uncertainty about bias in the dataset selection.
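
A rough sketch of that probing idea with three stdlib codecs; a real system would also weigh (de)compression speed, not just ratio:

    import bz2, lzma, zlib

    CODECS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}

    def pick_codec(data, sample_size=4096):
        # Compress a small prefix with each codec and keep the best ratio.
        sample = data[:sample_size]
        return min(CODECS, key=lambda name: len(CODECS[name](sample)))

    # Re-run pick_codec every few megabytes and switch codecs if the
    # winner changes.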

E.g., 9 times out of 10 xz is clearly better than bzip2 (both at max compression setting), but sometimes bzip2 actually beats xz by some margin. I don't know when or why, but it has me trying both options whenever I care about the final compressed size. I've also seen a lower gzip level compress better than a higher one (I think it was -8 vs. -9, but it may also have been between -1 and -3 or so).


I've always liked this kind of thing. I've also done some experiments with automatically memorizing better starting dictionaries for a corpus, for cases where the individual pieces of data are small:

https://github.com/spullara/corpuscompression

Another interesting case is using a compressed stream to flag anomalies in the data when the compression ratio spikes down, as in log analysis.


That actually sounds pretty bad to me, when many metrics will be constant or very rarely changing, or perfectly correlated with another metric.

Were you arithmetic coding stuff, or still doing bit-twiddling tricks?


If you think that sounds bad, I would love to see what you think is good; show your work, and they can just add it as one of the options. It does basically all the tricks in the article and then some. This is a mean, not a median. The median is likely much lower, but no one cares about the median when calculating how much space you need.

RE: correlated with another metric: that would be an extremely expensive one to calculate. You need an algorithm that can operate while ingesting millions of points per second, 24/7.


This was a great read. Previously I implemented some IoT telemetry with a simple protobuf spec, using delta encoding and batching a few samples together for each send, so it sends a lot of zeros.

Protobuf uses variable-length encoding, so most small integers will be sent as 1 byte. The first 15 field numbers in the spec use 1 byte of overhead to show there is a non-zero value, and subsequent ones use 2 bytes. So you want to pack all your mostly-zero (or zero-delta) values after the first 15 field numbers, and use those first 15 for your commonly changing values. And it's always useful to determine your desired data resolution and convert a potential 8-byte double to a 2- or maybe 3-byte integer (with a fixed multiplication). This also works when storing in SQLite: use a view to read out the values in the end format, and store them as integers in a raw table.
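
For reference, a sketch of the varint scheme protobuf uses for unsigned values (7 data bits per byte, high bit marks continuation):

    def varint_encode(n):
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)  # more bytes follow
            else:
                out.append(byte)
                return bytes(out)

    assert varint_encode(1) == b"\x01"
    assert varint_encode(300) == b"\xac\x02"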

This was _good enough_ for what I was doing, but it's great to see how far you can go if you need something extreme.


I didn't see binary interpolative coding (BIC) referenced. It is one of my favorites, introduced to me by the book "Managing Gigabytes" by Witten, Moffat, and Bell [1]. It has a great compression ratio for sequences and is commonly used in inverted indexes.

There is a neat implementation [2] and a technical paper [3] by Giulio Ermanno Pibiri, which I just found today by looking for it. A simplified sketch of the recursive idea follows the references.

[1] https://people.eng.unimelb.edu.au/ammoffat/mg/

[2] https://github.com/jermp/interpolative_coding

[3] http://pages.di.unipi.it/pibiri/papers/BIC.pdf
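
A simplified sketch of recursive BIC (plain fixed-width binary here; real BIC uses minimal binary codes, so treat this as illustrative only):

    import math

    def bic_encode(vals, lo, hi, bits):
        # vals: sorted, distinct ints, all within [lo, hi].
        n = len(vals)
        if n == 0:
            return
        m = n // 2
        x = vals[m]
        # The middle element is confined to [lo + m, hi - (n - 1 - m)],
        # a range that shrinks quickly and often needs very few bits.
        low, high = lo + m, hi - (n - 1 - m)
        if high > low:
            width = math.ceil(math.log2(high - low + 1))
            bits.append(format(x - low, f"0{width}b"))
        bic_encode(vals[:m], lo, x - 1, bits)
        bic_encode(vals[m + 1:], x + 1, hi, bits)

    bits = []
    bic_encode([3, 4, 7, 13, 14, 15, 21, 25], 0, 31, bits)
    print(sum(len(b) for b in bits), "bits")  # 28 bits vs 40 raw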


Are there similar algorithms useful for compressing stuff like "a list of the top 100 scores in a game?" We already store these as diffs of the list, but we could probably do better than zstd-ing the list-diffs.


This discussion about a Barclays Bank JSON list of 74,000 numbers might have some ideas: https://news.ycombinator.com/item?id=28326036

I was able to get it from a 1.3mb json file (uncompressed) to a 151k uncompressed but 4k compressed file using mostly deltas. https://news.ycombinator.com/item?id=28348965


This is great! I'll have to think about how to combine "storing deltas within the list" and "storing deltas of the list," but there should be some significant gains available


You could look into multiset sequence compression methods, where the sequences are max_score(i, t) over time t, one for each rank i: https://arxiv.org/abs/1401.6410


On a related note, we've been using InfluxDB and running into an issue where the instance will occasionally have writes take several orders of magnitude longer than expected (to the point where a single write can take upwards of a second).

Has anyone else encountered behavior like this?


One notable omission from this piece is a technique to compress integer time series with both positive and negative values.

If you naively apply bit-packing using the Simple8b algorithm, you'll find that negative integers are not compressed. This is due to how signed integers are represented in modern computers: negative integers will have their most significant bit set [1].

Zigzag encoding is a neat transform that circumvents this issue. It works by mapping signed integers to unsigned integers so that numbers with a small absolute value can be encoded using a small number of bits. Put another way, it encodes negative numbers using the least significant bit for sign. [2]
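
For concreteness, zigzag encoding and its inverse are each just a shift and an xor (64-bit variant; Python's arbitrary-precision ints make the arithmetic shift work for any magnitude):

    def zigzag_encode(n):
        # Maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...
        return (n << 1) ^ (n >> 63)

    def zigzag_decode(z):
        return (z >> 1) ^ -(z & 1)

    assert [zigzag_encode(n) for n in (0, -1, 1, -2, 2)] == [0, 1, 2, 3, 4]
    assert all(zigzag_decode(zigzag_encode(n)) == n for n in range(-1000, 1000))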

If you're looking for a quick way to experiment with various time-series compression algorithms, I highly recommend Daniel Lemire's FastPFor repository [3] (as linked in the article). I've used the Python bindings [4] to quickly evaluate various compression algorithms with great success.

Finally I'd like to humbly mention my own tiny contribution [5], an adaptation of Lemire's C++ Simple8b implementation (including basic methods for delta & zigzag encoding/decoding).

I used C++ templates to make the encoding and decoding routines generic over integer bit-width, which expands support up to 64-bit integers and offers efficient usage with smaller integers (e.g. 16-bit). I made a couple of other minor tweaks, including support for arrays up to 2^64 in length, and tweaking the API/method signatures so they can be used in a more functional style. This implementation is slightly simpler to invoke via FFI, and I intend to add examples showing how to compile for usage via JS (WebAssembly), Python, and C#. I threw my code up quickly in order to share it with you all; hopefully someone finds it useful. I intend to expand on usage examples/test cases/etc., and am looking forward to any comments or contributions.

[1] https://en.wikipedia.org/wiki/Signed_number_representation

[2] https://en.wikipedia.org/wiki/Variable-length_quantity#Zigza...

[3] https://github.com/lemire/FastPFor

[4] https://github.com/searchivarius/PyFastPFor

[5] https://github.com/naturalplasmoid/simple8b-timeseries-compr...


I love zigzag encoding. The way it uses shift and xor is beautiful. It has a performance edge of 3 cycles on my CPU versus signed-LEB encoding. One thing I did once, to encode Unicode lookup tables in a smaller form (to reduce the Python binary size), was find out it can be very profitable to (1) delta encode, then (2) zigzag encode, and finally (3) apply deflate. For example:

    _PyUnicode_PhrasebookOffset2: size         is      178,176 bytes
    _PyUnicode_PhrasebookOffset2: packed size  is      100,224 bytes
    _PyUnicode_PhrasebookOffset2: rle size     is      282,216 bytes
    _PyUnicode_PhrasebookOffset2: deflate size is       52,200 bytes
    _PyUnicode_PhrasebookOffset2: bz2 size     is       76,876 bytes
    _PyUnicode_PhrasebookOffset2: δleb size    is       47,198 bytes
    _PyUnicode_PhrasebookOffset2: δzd size     is       12,748 bytes
It's so cool.


For anyone else not familiar: δ is a lowercase delta (at least, I had to look it up)


Simple-8b isn't covered on Wikipedia.


Feel free to contribute a paragraph or article.



