Zabbix, Time Series Data and TimescaleDB (zabbix.com)
121 points by RobAtticus on May 7, 2019 | 43 comments



Every time TimescaleDB is brought up, I feel the need to point people to their shadily worded proprietary licence[0], and pg_partman[1].

Do the same benchmarks against a pg_partman managed partitioned db and you'll get the exact same performance. We do, at least - 150k or so metrics per second, 10 columns per metric.

Not trying to crap on the TimescaleDB guys, I've found a lot of their writeups extremely useful and can totally see how their commercially supported product fits. However, I like to see pg_partman at least mentioned somewhere in the article/comments. It's awesome and does the same job.

[0]https://github.com/timescale/timescaledb/blob/master/LICENSE

[1]https://github.com/pgpartman/pg_partman


(Timescale cofounder here)

Hey, just wanted to clarify: the vast majority of TimescaleDB code is Apache2, and you can easily compile (and we ourselves build & distribute) Apache2-only binaries.

When we announced a new license in December, we didn't relicense any code, we just said that some future features will be available under a Community or Enterprise License. The code under this "Timescale License" is clearly marked and in a separate subdirectory, and for virtually all users (except the public cloud DBaaS providers), the community features are free.

This is the actual top-level LICENSE file in the repo: https://github.com/timescale/timescaledb/blob/master/LICENSE

And here's a blog post discussing in more depth: https://blog.timescale.com/how-we-are-building-an-open-sourc...


Timescale user here. I actually think your "community" TSL license and Enterprise licenses are a good compromise.

Perhaps what's not immediately obvious is that the TSL license is there to protect against cloud providers offering hosted TimescaleDB without contributing back - systems that add value (e.g. are backed by TimescaleDB for DML) can use TSL-licensed code without any issue.


Thanks for the response & clarification, I'll give the blog post a read.


(TimescaleDB engineer here) There are major features and capabilities available in TimescaleDB that are not available in pg_partman.

On the query side we implement a whole bunch of planner and execution time optimizations that don't come with plain PostgreSQL (and pg_partman does not implement any query optimizations AFAIK). These include optimizations that have to do with ordering based on time_bucket/date_trunc, execution-time chunk exclusion, etc. These result in query speedups of more than 1000x on many common time-series queries.
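As a rough illustration of the chunk exclusion idea (not TimescaleDB's actual planner or executor code), here is a minimal Python sketch. It assumes each chunk covers a half-open time range; the `Chunk` class and `chunks_to_scan` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    """A hypothetical time partition covering [start, end)."""
    name: str
    start: int  # inclusive, e.g. epoch seconds
    end: int    # exclusive


def chunks_to_scan(chunks: List[Chunk], query_start: int, query_end: int) -> List[Chunk]:
    """Keep only chunks whose range overlaps the query range [query_start, query_end)."""
    return [c for c in chunks if c.start < query_end and c.end > query_start]


# Three one-day chunks; a query inside the second day touches only one of them.
DAY = 86_400
chunks = [Chunk(f"day{i}", i * DAY, (i + 1) * DAY) for i in range(3)]
selected = chunks_to_scan(chunks, 90_000, 100_000)
```

The point is only that a time predicate lets whole partitions be skipped before any rows are read; the real optimizations also cover ordering and planning.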

TimescaleDB is much more automated than pg_partman and thus easier to maintain and administer. There are far fewer knobs to tune and far fewer things to go wrong in TimescaleDB.

We implement analytical features necessary for time-series analysis: gap-filling, common time-series functions like time_bucket, first, last, etc.

We also implement a lot of data management functionality geared towards time-series data: scheduled data reordering, scheduled data dropping/expiration, etc.

This past Monday we released a major feature called continuous aggregates. It automatically maintains a materialized view of aggregates over your time-series data, updating it as new data comes in and correctly handling backfilled data as well.
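To make the idea concrete, here is a deliberately simplified Python sketch of an incrementally maintained per-bucket average. This is not how TimescaleDB implements continuous aggregates; it is a swapped-in toy technique (keeping a running sum and count per bucket), and `RunningBucketAverage` is a hypothetical name. It only shows why late, backfilled points can still land in the right bucket.

```python
from typing import Dict, Tuple


class RunningBucketAverage:
    """Toy aggregate: one (sum, count) pair per fixed-width time bucket."""

    def __init__(self, width: int) -> None:
        self.width = width
        self._state: Dict[int, Tuple[float, int]] = {}

    def insert(self, ts: int, value: float) -> None:
        # Bucket start is the timestamp rounded down to a multiple of width.
        bucket = ts - (ts % self.width)
        total, count = self._state.get(bucket, (0.0, 0))
        self._state[bucket] = (total + value, count + 1)

    def average(self, bucket: int) -> float:
        total, count = self._state[bucket]
        return total / count


agg = RunningBucketAverage(width=60)
agg.insert(61, 10.0)   # falls in bucket 60
agg.insert(125, 4.0)   # falls in bucket 120
agg.insert(70, 20.0)   # backfilled point, still lands in bucket 60
```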

The two projects are really not comparable in breadth or scope IMHO.


Link to a file that actually contains said license: https://github.com/timescale/timescaledb/blob/master/tsl/LIC...


Clarification: this license is only for Community and Enterprise Features that live in the `tsl` subdirectory. The vast majority of our code (everything not under `tsl`) is licensed under Apache 2.


TimescaleDB confuses me. Postgres is an OLTP database and their disk storage format is uncompressed and not particularly effective.

By clever sharding, you can work around the performance issues somewhat but it'll never be as efficient as an OLAP column store like ClickHouse or MemSQL:

- Timestamps and metric values compress very nicely using delta-of-delta encoding.

- Compression dramatically improves scan performance.

- Aligning data by columns means much faster aggregation. A typical time series query does min/max/avg aggregations by timestamp. You can load data straight from disk into memory, use SSE/AVX instructions and only the small subset of data you aggregate on will have to be read from disk.
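A minimal Python sketch of delta-of-delta encoding for timestamps, to show why regularly spaced series compress so well. It is an illustration only, not the on-disk format of ClickHouse, MemSQL, or any other system mentioned here; the function names are made up for this example, and a real encoder would then pack the small residuals into variable-width bit fields.

```python
from typing import List


def delta_of_delta_encode(timestamps: List[int]) -> List[int]:
    """Return [first value, first delta, then the change in delta for each later point]."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev_delta = 0
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta - prev_delta)
        prev_delta = delta
    return out


def delta_of_delta_decode(encoded: List[int]) -> List[int]:
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out


# A series sampled every 10 seconds, with one sample arriving a second late.
timestamps = [1000, 1010, 1020, 1030, 1041]
encoded = delta_of_delta_encode(timestamps)
```

For a perfectly regular series, everything after the first two entries is zero, which is what makes the subsequent bit-packing so effective.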

So what's the use case for TimescaleDB? Complex queries that OLAP databases can't handle? Small amounts of metrics where storage cost is irrelevant, but PostgreSQL compatibility matters?

Storing time series data in TimescaleDB takes at least 10x (if not more) space compared to, say, ClickHouse or the Prometheus TSDB.


(TimescaleDB co-founder)

TimescaleDB is more performant than you may think. We've benchmarked this extensively, e.g. outperforming InfluxDB [1] [2], Cassandra [3], and Mongo [4].

We've also open-sourced the benchmarking suite so others can run these themselves and verify our results. [5]

We also beat MemSQL regularly for enterprise engagements (unfortunately can't share those results publicly).

I think the scalability of ClickHouse is quite compelling, and if you need more than 1-2M inserts a second and 100TBs of storage, then that would be one case where I'd recommend another database over our own. But horizontal scalability is something we have been working on for nearly a year, so we expect this to be less of an issue in the near future (will have more to share later this month).

You are correct however that TimescaleDB requires more storage than some of these other options. If storage is the most important criteria for you (ie more important than usability or performance), then again I would recommend you to one of the other databases that are more optimized for compression. However, you can get 6-8x compression by running TimescaleDB on ZFS today, and we are also currently working on additional techniques for achieving higher compression rates.

[1] https://blog.timescale.com/timescaledb-vs-influxdb-for-time-...

[2] https://blog.timescale.com/what-is-high-cardinality-how-do-t...

[3] https://blog.timescale.com/time-series-data-cassandra-vs-tim...

[4] https://blog.timescale.com/how-to-store-time-series-data-mon...

[5] https://github.com/timescale/tsbs


(MemSQL co-founder here)

How can I not respond to that!

As far as I know we've only faced off against TimeScaleDB on one small account in the IoT space.

You can't really compare columnstore storage (MemSQL) to rowstore storage (Timescale) for scanning and filtering large amounts of data for analytics use cases (of which time series use cases are a subset). I think this fact is reasonably well established at this point (the idea was popularized by the CStore project a decade ago[1]). Even at the small end, scanning compressed data in columnstore format is so much faster than rowstore [2] (the data fits nicely into CPU caches and is well suited for SIMD instructions).

I would be happy to compare public customer references with Timescale though. MemSQL is well established in the Fortune 100 at this point:

  - https://www.memsql.com/blog/real-time-analytics-at-uber-scale/
  - https://www.memsql.com/blog/pandora/
  - https://www.memsql.com/blog/pinterest-apache-spark-use-case/
  - https://www.memsql.com/releases/akamai-real-time-analytics/
  - https://www.memsql.com/blog/real-time-stream-processing-with-hadoop/
  - https://www.datanami.com/2018/05/14/how-disney-built-a-pipeline-for-streaming-analytics/

  [1]: http://db.csail.mit.edu/projects/cstore/vldb.pdf
  [2]: https://www.memsql.com/blog/memsql-processing-shatters-trillion-rows-per-second-barrier/


Thank you for this informative response!

At my previous job we implemented custom sharding and aggregation on top of Postgres 9.4 for timeseries for a monitoring product. We did it to simplify operations as we built a new product (team <4) and we knew it would be years before our scale motivated us to adopt a specialized store.

We were pleasantly surprised, however, with how far this solution took us. 3 years later we were pushing ~30 TB every two weeks and Postgres was handling it well with predictable performance characteristics. We still didn't feel a pressing need to replace Postgres (although we were moving in that direction).

It's also worth mentioning that this was Postgres 9.4, which predates the partitioning and parallelization improvements that have been landing since 9.6. So I would expect vanilla Postgres today to handle it even better.

Anyway, I'm a fan of Timescale's work and share your sentiments here almost exactly.


> You are correct however that TimescaleDB requires more storage than some of these other options. If storage is the most important criteria for you (ie more important than usability or performance), then again I would recommend you to one of the other databases that are more optimized for compression. However, you can get 6-8x compression by running TimescaleDB on ZFS today, and we are also currently working on additional techniques for achieving higher compression rates.

This is a weird answer since compression is used by columnar databases like MemSQL and ClickHouse to both save on storage and accelerate queries. Compare this to using generic filesystem compression, which would both compress worse and make the system slower.


We haven't really found it to be the case that the system is slower with ZFS. As the sibling mentions, you are trading some CPU for better I/O. We usually see better insert performance and similar/better query latency.


Compression may or may not be worse with ZFS defaults, but performance will almost certainly be _better_ with the default ZFS compression settings than an uncompressed filesystem. You're trading a small amount of CPU for IO, and that's usually a really good trade.


We use TimescaleDB with databases between 1-100 million rows (small by some standards, but certainly not tiny) - I love it!

- we use Postgres as our main database, so being able to keep our time-series data in the same place is a big win

- perhaps because it's a Postgres extension, the learning curve is small

- it keeps timerange-constrained queries over our event data super fast, because it knows which chunks to search across

- deleting old data (e.g. for a data retention policy) is instantaneous, as TimescaleDB just deletes the physical files that back the timerange being deleted

- it has some nice functions built-in, like `time_bucket_gapfill`. Yes, you could write your own functions to do this, but it's nice to have maintained, tested functions available OOTB
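For anyone wondering what gap-filling buys you, here is a rough pure-Python sketch of the idea: average values per fixed-width bucket and emit a row for every bucket in the range, even the empty ones. It only loosely mirrors what `time_bucket_gapfill` does and is not its implementation; `bucket_avg_gapfill` is a hypothetical helper written for this example, and it assumes integer timestamps.

```python
from typing import Dict, List, Optional, Tuple


def bucket_avg_gapfill(
    points: List[Tuple[int, float]], width: int, start: int, end: int
) -> List[Tuple[int, Optional[float]]]:
    """Average values per bucket over [start, end); empty buckets yield None."""
    values_by_bucket: Dict[int, List[float]] = {}
    for ts, value in points:
        if start <= ts < end:
            bucket = ts - (ts % width)  # round down to the bucket start
            values_by_bucket.setdefault(bucket, []).append(value)

    result: List[Tuple[int, Optional[float]]] = []
    first_bucket = start - (start % width)
    for bucket in range(first_bucket, end, width):
        values = values_by_bucket.get(bucket)
        result.append((bucket, sum(values) / len(values) if values else None))
    return result


# No events between t=10 and t=20, so that bucket is reported as None
# instead of silently disappearing from the output.
rows = bucket_avg_gapfill([(0, 1.0), (5, 3.0), (25, 4.0)], width=10, start=0, end=30)
```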


There are a ton of projects that will never outgrow TimescaleDB, so if you have in-house PostgreSQL expertise, it looks like a very decent option.


There's an interesting benchmark from VictoriaMetrics against TimescaleDB and InfluxDB, in which it seems to do better on performance and disk space than both. I'm considering using it as remote storage for Prometheus.

https://medium.com/@valyala/high-cardinality-tsdb-benchmarks...


I can't say for sure, but shouldn't insertions be way quicker in Timescale, because the index-changes are limited to the most-recent subtable only, and it's still row-based?

We're considering a move from OpenTSDB to Timescale currently, and something that stands out in Timescale is the wide-table format; we get bundles of metrics at each tick, and having them aligned makes usage easier, and perhaps also saved us some space over having the timestamps repeated per metric.
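A small Python sketch of the narrow-versus-wide difference, assuming narrow rows are (timestamp, metric name, value) tuples; `narrow_to_wide` and the metric names are hypothetical. In the wide form each timestamp is stored once per tick rather than once per metric.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def narrow_to_wide(rows: List[Tuple[int, str, float]]) -> List[dict]:
    """Pivot (timestamp, metric, value) rows into one dict per timestamp."""
    by_time: Dict[int, Dict[str, float]] = defaultdict(dict)
    for ts, metric, value in rows:
        by_time[ts][metric] = value
    return [{"time": ts, **by_time[ts]} for ts in sorted(by_time)]


narrow = [(1, "cpu", 0.5), (1, "mem", 0.7), (2, "cpu", 0.6)]
wide = narrow_to_wide(narrow)
```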


Consider moving to a time series database with PromQL support. It is much easier to write typical queries over time series data in PromQL than in SQL or Flux. See https://medium.com/@valyala/promql-tutorial-for-beginners-9a...


My understanding is that it just gives you a bit of extra room if you have a small to mild timeseries problem. That's still a lot of use cases, but you're right it will never work for larger use cases.

They said they didn't want to reinvent a database engine to solve the timeseries problem, so you have what you pay for.


We are actually doing a fair bit to address larger use cases. We also had customers who are doing 100s of billions of points successfully, so I guess it depends on what you mean by larger.

As our CEO mentioned in a sibling, we are working on a horizontal/scale-out solution for even higher ingest rates, as well as sharding. We're also doing some work for better compression to reduce our disk footprint.

Also since 1.2, we have support for automatic retention policies that help keep the disk usage in check. Yesterday we released 1.3, which contains our first iteration of continuous aggregates, which lets you materialize aggregates over the raw data for faster querying. In a future iteration, we'll also allow you to remove the underlying/raw data but keep the aggregates -- another way to improve the disk usage of your data.

All that is to say we do consider ourselves useful for larger use cases, and have a lot of features coming down the pipe to make it even better.


Experiences with Zabbix? I tried it back around a decade ago and wanted to like it, but didn't find it very reliable. And now the details are escaping me. I ended up sticking with Nagios and Opsview. Around 5 years ago I switched to a templated Icinga2 config and have been pretty happy with that, but it's pretty low level.


Surprised to see Prometheus hasn't been mentioned yet, and even Nagios is being mentioned as a better alternative. My company (higher-ed, ~100k combined students/fac/staff) is desperately trying to get away from Nagios. Once you get Nagios to the scale where you have to implement mod_gearman, you've gone too far.

I'd recommend taking a look at Prometheus[1]. It has its own _very_ performant TSDB, there are exporters for just about everything, it's the de facto way that things like Kubernetes expose metrics, and it has first-class support in Grafana for visualization.

We POC'd Zabbix, Icinga, ScienceLogic, Instana, Sensu, and Prometheus. Prometheus was our favorite. Take a look at the comparison between it and other popular monitoring products to see if it fits your needs though [2].

[1] https://github.com/prometheus/prometheus [2] https://prometheus.io/docs/introduction/comparison/


The problem I have with Prometheus is that most of my nodes are in very closed networks I don't control (healthcare), and I can't set up proxies so that Prometheus can reach them; I can only make outbound connections. So, for now, my best option seems to be InfluxDB, which doesn't look bad to me.


I've been using InfluxDB for ~3 years now for storing metrics (almost exclusively via Telegraf, a few custom ones), and it has been great! It replaced a collectd setup and dramatically decreased load across my fleet.

When I first started using it, it was pretty early and had some issues. In fact, I nearly trashed it. I also didn't like the pull vs. push model from Prometheus. They ended up resolving the InfluxDB issues I was having right as I was about to give up on it, and it's been solid since. I use it with Grafana to generate graphs of system use. I set it up before TICK was a thing.


I was about to like InfluxDB, but ever since people started saying it eats memory and your data, I stopped caring.

https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/FAQ

("How does VictoriaMetrics compare to InfluxDB?")


That hasn't been my experience. I've been running it for ~3 years in our dev, stg, and prod environments. Prod is using 1.5GB of RAM on a 5GB instance. I've never had a data loss issue.



I'd recommend giving VictoriaMetrics a try. It requires fewer hardware resources (RAM, CPU, disk) than InfluxDB [1], and it supports PromQL, a much nicer query language for typical time series queries than InfluxQL or Flux [2]. It may be used as a drop-in replacement for InfluxDB on the ingestion path [3].

[1] https://medium.com/@valyala/insert-benchmarks-with-inch-infl...

[2] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...

[3] https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/Sing...


Zabbix is a bit opaque to tune, and the support forums aren't super helpful, unlike, say, those for Nagios.

That said it does some really cool stuff like tree walking across all the HP switches on our network, auto monitoring all ports it finds and then reporting on their stats and on any UP/DOWN states for every port.

Good for detecting unauthorized usage or a device which is rebooting itself.

Its IPMI support is also pretty good, we had it monitoring Supermicro IPMI interfaces with zero issue.

It handles vSphere and auto scans the entire cluster, adding all guests and monitoring them without needing to install an agent on every VM.

All in all a very good solution with some very cool features, but a steep learning curve and not much help on their forums although the docs are pretty good.


I’ve run Zabbix with thousands of monitored hosts. It’s not perfect, and it requires some bending to just how Zabbix wants things done, but it’s nice. We have it monitoring all manner of stuff, hardware, power, cooling, services, batteries, weather, network, disks, etc


Absolutely terrible. Like you said, totally unreliable. Scaling it is insanely difficult. Documentation is weak. Their APIs seem like an afterthought and performance was pretty bad.

Nagios is not great, but it’s reliable and when it breaks you can figure it out.


One thing I like about zabbix is the excellent grafana plugin, which provides a very good ability to view and ack host-by-host alerts from within grafana.

That said, I'm not very familiar with the alternatives.


I've had a similar experience to you with Zabbix in the past. We have it bundled with some HPC stuff I support currently. It's OK, but I prefer the Sentry/TICK/Prometheus-shaped things that we also run.

If you're happy with Icinga2, stick with that. I've used that too at a previous gig and found it better than Zabbix, but that's just my personal take. YMMV


My experience with Zabbix has been positive. I deployed it 3 years ago and it's been solid since then. It monitors a few dozen CentOS VMs and a bunch of JBoss/JMS instances.

One feature I particularly like is the zabbix_sender command, which I use to push the status of shell-scripted Borg backup jobs into Zabbix.
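For anyone who hasn't used it, the pattern looks roughly like this. The sketch below is in Python rather than shell and wraps the `zabbix_sender` CLI with its `-z` (server), `-s` (host name as configured in Zabbix), `-k` (item key) and `-o` (value) options. The item key `backup.borg.status`, the host names, and the helper functions are hypothetical, and the item would need to exist on the Zabbix side as a trapper item.

```python
import subprocess
from typing import List


def build_sender_command(server: str, host: str, key: str, value: str) -> List[str]:
    """Build the zabbix_sender argument list for a single value."""
    return ["zabbix_sender", "-z", server, "-s", host, "-k", key, "-o", value]


def report_backup_status(server: str, host: str, exit_code: int,
                         key: str = "backup.borg.status") -> int:
    """Send a backup job's exit code to Zabbix; returns zabbix_sender's own exit code."""
    cmd = build_sender_command(server, host, key, str(exit_code))
    return subprocess.run(cmd, check=False).returncode


# Hypothetical names; building the command does not contact any server.
cmd = build_sender_command("zabbix.example.org", "backup01", "backup.borg.status", "0")
```

A trigger on the item can then alert whenever the last value is non-zero or no value has arrived for too long.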


Zabbix looks like shit and feels like it was made in 1995 but it is great once set up.

If better visuals are needed, I would hook it up to Grafana. I have previously used Grafana with Graphite as backend but it was too unreliable. If it actually works with Zabbix then it could be the perfect match.


Yeah I agree. Zabbix sucked when I tried it many years ago. Definitely not going near it again.


Zabbix is like a step up from Nagios. I don't know how they can even stay relevant with Prometheus.


My experience with TimescaleDB is that it does not support Gorilla encoding, so its storage needs are very high.


Gorilla TSDB format paper for those who might not get the reference: https://www.vldb.org/pvldb/vol8/p1816-teller.pdf


The biggest problem I've had with timescale systems is managing the SSDs/HDDs underneath.

Having to resize/grow/stripe/etc. them is a pain.

So we came up with a clever solution that batches chunks to S3:

https://www.youtube.com/watch?v=x_WqBuEA7s8

$10/day for 100M records (100GB data), all costs!

And best yet, reduced DevOps! Very practical, super simple.


Timescale engineer here. Just want to point out that you can also attach additional disks using tablespaces, which are fully supported on hypertables. With a few simple commands, this allows you to add new disks and move old disks out of rotation while still being able to query the old data on them.


We prefer Google Cloud durable persistent storage. A disk can start at a few GB and then be resized online up to 64TB per instance. This saves money, since disks are resized only when needed. Such disks cost $40/TB/month. See https://cloud.google.com/compute/docs/disks/add-persistent-d...



