I'll answer this here with a similar response that I gave Pradeep (the author) via Twitter.
I think ClickHouse is a great technology. It totally beats TimescaleDB for OLAP queries. I'll be the first to admit that.
What our (100+ hour, 3 month analysis) benchmark showed is that for _time-series workloads_, TimescaleDB fared better. [0]
Pradeep's analysis - while earnest - is essentially comparing OLAP style queries using a dataset that is not very representative of time-series workloads. Which is why the time-series benchmark suite (TSBS) [1] exists (which we did not create, although we now maintain it). I've asked Pradeep to compare using the TSBS - and he said he'd look into it. [2]
As a developer, I'm very wary of technologies that claim to be better at everything - especially those who hide their weaknesses. We don't do that at TimescaleDB. For those who read our benchmark closely, we clearly show where ClickHouse beats TimescaleDB, and where TimescaleDB does better. And - despite what many commenters on here may want you to think - we heap loads of praise on ClickHouse.
As a reader of HackerNews, I'm also tired of all the negativity that's developing on this site. People who bully. People who default to accusing others of dishonesty instead of trying to have a meaningful dialogue and reach mutual understanding. People who enter debates wanting to be right, versus wanting to identify the right answer. Disappointingly, this includes some visible influencers whom I personally know. We should all strive to do better, to assume positive intent, and have productive dialogues.
(This is why one of our values at TimescaleDB is "Assume Positive Intent." [3] I think Hacker News - and the world in general - would be a much better, happier, healthier place if we all just did that.)
I think TimescaleDB is an amazing piece of technology, but I think you're making arguments broader than the facts support.
The results TimescaleDB showed seem to demonstrate that it is better than ClickHouse on the TSBS benchmark (or a particular configuration of it), not for time-series workloads in general.
In my experience, "time series" workloads can be defined very broadly (by a casual user), and querying a log of events can often be seen as such.
If you would like to discuss facts: We have witnessed 100,000s+ of different time-series workloads over the past 4.5 years, and the patterns they share may surprise you. There is much more similarity than you may think - similarities that have been captured in the TSBS (and described by other TimescaleDB users in this discussion thread).
So while we can debate on an academic level what a "time-series" workload is, if we were to look at the facts we would find that the answer is far more specific than you may think.
Also, Peter, I wonder if you should be more forthcoming with your ClickHouse affiliation.
Everyone reading this thread is aware of my bias because I clearly state my TimescaleDB affiliation. But I didn't realize until very recently (when someone pointed it out to me) that you are affiliated with ClickHouse - e.g., perhaps as an investor in or even a founder of Altinity?
It is best practice on Hacker News to be forthcoming with affiliations so that readers can make their own decisions about how to correct for any natural biases held by commenters.
I honestly don't know what time-series databases do that's particularly unique. I've worked with databases for 20+ years, and a date or datetime has always been an integral part of the dataset, so to me everything is time-series. I always see them compared against key-value stores or document-oriented databases or NoSQL platforms, which speaks more to people not knowing how to use the correct datastore in the first place than to any particular feature of a TSDB.
Even looking at your benchmark queries, I'm confused what value it provides over a standard OLTP or OLAP setup.
It might help to think about time-series databases from the requirements they're addressing. "Time-Series Database Requirements" [0] is a good summary of the problem space.
The timeseries databases I have used are good (and fast) at answering queries like "mean value of sensor_1 for every 10-minute bucket". They can handle the fact that some buckets have 1000 points in them while others have 0 or 1. They can calculate the moving average, again correctly with possibly missing/unevenly spaced values. They can calculate the rate of change (the derivative) fast.
Often there is other time-related stuff in there as well, but I think the vast majority of use is fast calculation of "mean/max/first value of sensor(s) for X-second buckets", as in the sketch below.
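For concreteness, a minimal sketch of that kind of query as it might look in TimescaleDB (the sensor_data table and its columns are made up for illustration):

    -- Mean value of sensor 1 per 10-minute bucket over the last day, using
    -- TimescaleDB's time_bucket(); buckets with no points simply produce no row.
    SELECT time_bucket('10 minutes', time) AS bucket,
           avg(value) AS mean_value
    FROM sensor_data
    WHERE sensor_id = 1
      AND time > now() - INTERVAL '1 day'
    GROUP BY bucket
    ORDER BY bucket;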
> I honestly don't know what time-series databases do that's particularly unique.
I understand the "spirit" of your comment and I agree in general.
However, something unique that comes to my mind about some TSDBs is the automated aggregation of data points (using min/max/avg/whatever functions) into something less granular once a metric's data becomes older than X.
By using a TSDB, for example, you'll be able to look at datapoints with a max resolution of 10 seconds for metrics collected during the past 7 days, but after that their max resolution will be (aggregated into) e.g. 60-second intervals, and so on.
I think the theory behind it is that you're probably interested in the details for recent stuff, but the older the stuff gets, the more you just want to look at the general trend without caring about the details. I can agree with that. In the end, by doing these kinds of aggregations, in theory everything should be faster & should use less storage (as for older data there are fewer data points).
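As a rough illustration of that idea, here is how a rollup-plus-retention setup might look in TimescaleDB with continuous aggregates (the metrics table and the exact intervals are hypothetical; Graphite/InfluxDB express the same idea with retention policies):

    -- Continuously maintain 60-second aggregates of the raw metrics...
    CREATE MATERIALIZED VIEW metrics_1m
    WITH (timescaledb.continuous) AS
    SELECT time_bucket('60 seconds', time) AS bucket,
           sensor_id,
           avg(value) AS avg_value,
           min(value) AS min_value,
           max(value) AS max_value
    FROM metrics
    GROUP BY bucket, sensor_id;

    -- ...and drop the raw, high-resolution data once it is older than 7 days.
    SELECT add_retention_policy('metrics', INTERVAL '7 days');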
I did use Graphite/Carbon for some years, but I didn't like its architecture much and had some performance problems => I've replaced it with Clickhouse (I'm not doing any kind of data aggregation) and that's using less space and is quicker (it also uses a lot less CPU & I/O) :)
Time series databases are actually a subset of OLAP databases. The main difference between time series databases and OLTP databases is the amount of data stored and processed. While OLTP databases can process billions of rows per node, time series databases can deal with trillions of rows per node.
The main requirements for time series databases:
- Fast data ingestion (millions of rows per second).
- Good compression for the stored data, since the amount of time series data is usually huge (trillions of rows per node). Compression may also improve query speed, since it reduces the amount of data that needs to be read from disk during heavy queries.
- Fast search for time series with the given labels. For instance, searching for temperature measurements across all the sensors in a given country, out of millions of sensors.
- Fast row processing for the found time series on the given time range. Usually the number of rows to process exceeds hundreds of millions per query.
Typical OLTP databases cannot meet these requirements.
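As an illustration of the label-lookup and compression requirements above, a time series table in a columnar store like ClickHouse might be shaped roughly like this (names, label columns, codecs, and the partitioning scheme are only an example, not a recommendation):

    CREATE TABLE measurements
    (
        metric  LowCardinality(String),          -- series name, e.g. 'temperature'
        country LowCardinality(String),          -- example label columns; real schemas
        city    LowCardinality(String),          -- often use key/value label structures
        ts      DateTime CODEC(DoubleDelta, ZSTD),
        value   Float64 CODEC(Gorilla, ZSTD)
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(ts)
    ORDER BY (metric, country, city, ts);        -- label lookup narrows the scan cheaply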
Perhaps I'm wrong, but "timeseries" databases are typically some combination of LSM-style append-only logs and eventual consistency. In an ACID relational database, I'm not sure you can simultaneously write and read millions of rows per second? If you can, I'd love to learn something new :)
You can write several million rows per second through indexing and storage while reading consistent views but it is not trivial. It requires a pretty sophisticated database kernel design even on modern hardware. LSM-style is not a good choice if you require these write rates. Time-series data models are relatively simple to scale writes for as such things go.
I would not want to try this on a traditional relational database kernel, they are not designed for workloads that look like this. They optimize their tradeoffs for slower and more complicated transactions.
You can do some pretty good optimizations if you know what type of data is being written to a table, and with Timescale you are converting a specific table into a Timescale hypertable, which gives you different tradeoffs from a standard PostgreSQL table. The end result is really great performance for inserts and queries while maintaining ACID guarantees.
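For reference, the conversion itself is a single call in TimescaleDB (the conditions table here is just a made-up example):

    CREATE TABLE conditions (
        time        TIMESTAMPTZ       NOT NULL,
        device_id   TEXT              NOT NULL,
        temperature DOUBLE PRECISION
    );

    -- Turn the plain Postgres table into a hypertable partitioned by time.
    SELECT create_hypertable('conditions', 'time');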
I think you have a marketing problem, not a technical one.
People are cross-shopping Clickhouse/TimescaleDB, rightly or wrongly, and it's not clear to the community when they should use which. What overlaps on the Venn diagram and what doesn't, and where they do overlap, why would I go one way or the other?
You have to do a better job of showing how you're solving customer problems. Benchmarks are next to useless, an unsolvable problem, I wouldn't waste time on it. If customers are succeeding on your platform, you'll succeed.
While I'm not a current customer of Timescale, I do use the open source version of Timescale extensively, so I feel like I can summarize some of the benefits of Timescale over other TSDBs. The company is mid-sized, with an awkward 4+ PB of unstructured data, and our Postgres cluster hosts about 20 TB of data.
The main advantage, from my perspective, is that you can query across business data and time series data with all the advantages that Postgres has. Time series data, while useful on its own, becomes incredibly powerful when it can be combined with your business and production data.
A great example is our outbound network data monitoring. We use pmacct http://www.pmacct.net/ to send network flows from our firewall to Postgres, host inventory data in Postgres, and a foreign data wrapper around our LDAP data to determine user/host assignment, and from that we can correlate every data flow to the user who is assigned to the host that generated that particular flow. This makes for some pretty powerful security reporting. Outside of that, we use Timescale's hypertables in a number of places that aren't explicitly timeseries data, like syslog data, web server logs, etc. This allows for some pretty amazing reporting on time-boxed log data, like "give me all the 500 errors from our HTTP log in the last 3.5 hours that have an IP address in Finland" (did I mention that we load GeoIP data into Postgres every night?).
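A stripped-down sketch of what such a cross-domain query can look like (all table and column names here are invented; the real schemas obviously differ):

    -- Correlate each network flow with the host that produced it and the user
    -- assigned to that host (the LDAP-backed table is a foreign table via FDW).
    SELECT f.flow_start, f.dst_ip, f.bytes, h.hostname, u.username
    FROM netflows   AS f
    JOIN hosts      AS h ON h.ip = f.src_ip
    JOIN ldap_users AS u ON u.assigned_host = h.hostname
    WHERE f.flow_start > now() - INTERVAL '1 day'
    ORDER BY f.bytes DESC
    LIMIT 50;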
Timescale is excellent on its own, and honestly competitive with other TSDBs on its own. Having access to the full Postgres ecosystem alongside your timeseries data puts Timescale way ahead of everyone else. My story might change when I hit the limits of what a single Postgres host can ingest, but I'm not even close to that scale yet.
Another advantage of Timescale is having access to real SQL: you don't have to learn a new domain-specific query language, you can just use SQL. This admittedly can be a double-edged sword. SQL is more complicated than PromQL / InfluxQL, but that comes with quite a lot of extra capability, and the ability to transfer that knowledge into other domains.
I personally really like Timescale, and feel that regardless of anyone's benchmarks, no matter how well thought out or not, the advantages outweigh the disadvantages by a pretty large margin.
This is exactly why I'm trying to add Timescale to our infra. InfluxQL is very limited and relying solely on Flux instead of reinforcing my SQL/Postgres understanding is not an option.
FYI, ClickHouse supports querying PostgreSQL data [1], so it is quite easy to mix both analytical data from ClickHouse and relational data from Postgres in a single query.
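For example, something along these lines works with ClickHouse's postgresql() table function (connection details and table names are placeholders):

    -- Join ClickHouse event data with a users table living in Postgres.
    SELECT e.user_id,
           u.email,
           count() AS events
    FROM events AS e
    JOIN postgresql('pg-host:5432', 'appdb', 'users', 'reader', 'secret') AS u
        ON u.id = e.user_id
    GROUP BY e.user_id, u.email
    ORDER BY events DESC
    LIMIT 10;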
If you ever would like to share hints, tips, or experiences with other Timescale users, please get in touch with Timescale's community manager; email in profile.
I used TimescaleDB in a project and it worked really well. I can't recommend it enough. It only took a few hours to get a working prototype going that, AFAIK, is still being run in production.
I read [0] when it was originally posted here but it didn't convince me.
The article mentioned flaws of Clickhouse which, in my opinion, are irrelevant in the context of a TSDB (e.g. "no transactions", "inability to modify data at a high rate", etc.). I'm saying this because in my mind I associate TSDBs with "server metrics collection", therefore it's no big deal even if some/many datapoints are lost, there is usually no need to modify that data, and so on. But I might be wrong; maybe the use cases that you have in mind are different (e.g. transactional accounting data?).
About deleting data: tables that host timeseries data in Clickhouse are usually partitioned by the fraction (day/month/year) that has to be deleted later => dropping one or multiple such partitions is easy & fast & extremely light on the system (it just gets rid of the underlying files and directories).
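A minimal sketch of that pattern (table name and partitioning granularity are illustrative):

    CREATE TABLE metrics
    (
        ts    DateTime,
        name  String,
        value Float64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(ts)     -- one partition per month
    ORDER BY (name, ts);

    -- Retiring a whole month of data is just a metadata/file operation:
    ALTER TABLE metrics DROP PARTITION 202101;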
The article didn't directly show the SQL for how the tables were defined nor how the tests were performed; you linked your benchmark suite [1] but I honestly don't want to dig into that as it seems to be complex => to be honest it sounds like something engineered to be better than your competitor (even if maybe it's not, dunno).
In general Clickhouse has many knobs & levers that can be changed/tweaked, which can backfire if not set appropriately. I personally think that whoever uses Clickhouse MUST understand it (I did not at the beginning => got totally screwed up), but at the same time those many knobs & levers provide a lot of flexibility. Btw, indirectly they seem like a "filter" to ensure that only people who are able to use that DB will end up using it :P
So, summarized, I raised my eyebrows a couple of times while reading [0]. E.g. Clickhouse does "merges" in the background, and they can queue up (depending on a lot of stuff), and all that can put a lot of stress on the disks & CPU, so I have no clue what was going on when you got your 156% performance advantage against CH. Maybe you're right, maybe you're not, I just don't know, so I didn't trust that article then, nor do I now.
Maybe you'd have a stronger case if you pointed to the "reliability" of your DB? E.g. because of the "merges" that are triggered at "unknown" intervals by Clickhouse in the background, which can in turn create hotspots of CPU & disk on the host(s) and therefore have negative repercussions on many insert/query ops, your TimescaleDB could definitely have an advantage here (if it doesn't perform deferred maintenance like CH does). But if that's true, then in my opinion it was lost in the article [0].
This has been my experience with ClickHouse as well... that is, you can basically close your eyes while writing the schema and still manage to get extremely impressive performance.
That being said, ClickHouse also has a ton of clever levers you can pull to squeeze out better performance and compression which aren't used by default, such as Delta/DoubleDelta CODECs with LZ4/ZSTD compression, etc. Not to mention MATERIALIZED VIEWs and/or the relatively new MergeTree projections feature [1].
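For instance, a sketch of what those levers can look like (column names and codec choices are illustrative; the right codecs depend heavily on the data):

    CREATE TABLE samples
    (
        ts    DateTime CODEC(DoubleDelta, ZSTD),   -- timestamps compress very well with DoubleDelta
        name  LowCardinality(String),
        value Float64 CODEC(Delta, LZ4)
    )
    ENGINE = MergeTree
    ORDER BY (name, ts);

    -- A materialized view that pre-aggregates into hourly buckets at insert time.
    CREATE MATERIALIZED VIEW samples_hourly
    ENGINE = AggregatingMergeTree()
    ORDER BY (name, hour) AS
    SELECT name,
           toStartOfHour(ts) AS hour,
           avgState(value)   AS avg_value   -- read back with avgMerge(avg_value)
    FROM samples
    GROUP BY name, hour;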
I haven't used ClickHouse or TimescaleDB, but I thought TimescaleDB was competing with the likes of InfluxDB, QuestDB & Prometheus. I guess I'm not surprised that it loses to an OLAP database on OLAP queries.
Are people using ClickHouse as their timeseries backend? IIRC, Clickhouse doesn't perform all that well with millions of tiny inserts.
I've had a really positive experience using ClickHouse as an InfluxDB replacement. Initially I used the Buffer table engine to overcome the "tiny inserts" problem, but ultimately just batched up writes in my custom line-protocol TCP server, which translates line protocol to JDBC inserts (RowBinary).
Last time I checked I have a few hundred billion rows in the table with a significant compression ratio (not sure off hand). Most importantly, the table is ordered efficiently enough to allow me to query years of metrics (Grafana plugin) at millisecond speed.
Side note, I recall ClickHouse developers mentioning they are currently working on an implementation change which will allow many tiny inserts to be much more performant and realistic to use in the real-world.
The answer is yes. I used ClickHouse to calculate and forecast sales of products at a dozen or so stores. The compression was huge because it's essentially the same data every day except for changes to the inventory. At the time I checked vanilla PostgreSQL, TimeScaleDB and ClickHouse. It wasn't even close when it came to storage or performance. ClickHouse allowed me to work off of an old workstation where I installed Ubuntu.
In my case the data arrived in CSVs with around 20k SKUs. Had they arrived a couple at a time, I could have created a CSV and written to ClickHouse later, or used any of the other storage methods available in ClickHouse.
First, let's define `time series`. A time series is a series of (timestamp, value) pairs ordered by timestamp. The `value` may contain arbitrary data - a floating-point value, a text, a JSON, a data structure with many columns, etc. Each time series is uniquely identified by its name plus an optional set of {label="value"} labels. For example, temperature{city="London",country="UK"} or log_stream{host="foobar",datacenter="abc",app="nginx"}.
ClickHouse is perfectly optimized for storing and querying such time series, including metrics. It's true that ClickHouse isn't optimized for handling millions of tiny inserts per second; it prefers infrequent batches with a large number of rows per batch. But this isn't a real problem in practice, because:
1) ClickHouse provides the Buffer table engine for frequent inserts (see the sketch below).
2) It is easy to create a special proxy app or library for data buffering before sending it to ClickHouse.
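A sketch of option 1, assuming an existing MergeTree table named metrics in the default database (the buffer thresholds below are illustrative, not tuned values):

    -- In-RAM buffer that absorbs small inserts and flushes them to 'metrics'
    -- in batches once the time/row/byte thresholds are reached.
    CREATE TABLE metrics_buffer AS metrics
    ENGINE = Buffer(default, metrics, 16, 10, 100, 10000, 1000000, 10000000, 100000000);

    -- Applications write to the buffer table instead of the base table.
    INSERT INTO metrics_buffer VALUES (now(), 'cpu_load', 0.42);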
TimescaleDB provides Promscale [1] - a service which allows using TimescaleDB as a storage backend for Prometheus. Unfortunately, it doesn't show outstanding performance compared to Prometheus itself or to other remote storage solutions for Prometheus. Promscale requires more disk space, disk IO, CPU and RAM according to production tests [2], [3].
Full disclosure: I'm CTO at VictoriaMetrics - a competing solution to TimescaleDB. VictoriaMetrics is built on top of architecture ideas from ClickHouse.
I see a lot of really divergent results with these time series database benchmarking posts.
Timescale's open source benchmark suite[0] is a great contribution towards making different software comparable, but it seems like the tasks/metrics heavily favor TimescaleDB.
This article has Clickhouse more-or-less spanking TimescaleDB, but the blog post it references[1] is basically the reverse.
Are the use cases just that different?
As someone who has used both in production environments under various workloads, I can, without a doubt, tell you that Clickhouse spanks the crap out of TimescaleDB.
The only use case where TimescaleDB is more useful is the ability to mutate/delete single rows, but even there, Clickhouse offers some workarounds at the expense of a little extra storage until a compaction is run, similar to VACUUM.
Clickhouse is to TimescaleDB what Nginx was to Apache.
> I can, without a doubt, tell you that Clickhouse spanks the crap out of TimescaleDB.
Same. I'm ready to believe my experience is not representative, but I've rarely heard something different after talking to people who've seriously evaluated both.
> Clickhouse is to TimescaleDB what Nginx was to Apache.
Perfect comparison. Except I don't remember Apache cooking some tests to pretend they are faster than nginx, or astroturfing communities :)
Different tools serve different purposes, simple as that.
If TimescaleDB or Apache does the job for you, stick with them.
When you want to scale, increase performance, or just rewrite, choose the better option of the day.
In 2021, Clickhouse should be a recommended default, like nginx.
I think both Clickhouse and TimeScaleDB are great systems with different design goals and approaches. Specifically, I think Clickhouse is much better suited to "event logs" than "metrics" storage (the ClickHouse-inspired VictoriaMetrics does well in this regard).
I would just encourage all vendors to be more humble in positioning their benchmarks. In my experience, production behavior rarely resembles benchmark results, for better or worse.
> Overall, although some TimescaleDB queries became faster by enabling compression but many others became bit slower probably due to decompression overhead. This may be the reason why TimescaleDB disable compression by default
This matches my experience: ClickHouse is generally faster, and a better solution for time series (more robust, more mature, ...) unless you have a highly specific set of constraints (e.g. must be able to delete individual records, ...) and sacrificing performance for them is an acceptable tradeoff.
I have no doubt that, as usual, akulkarni will make a good PR job / community outreach to explain why, numbers and experience be damned, TimescaleDB is better!
But I suggest interested readers check the history of previous "creative engineering" around tests that has been done to make TimescaleDB come out ahead: https://news.ycombinator.com/item?id=28945903
In 99% of the case, ClickHouse is the right choice, especially if you care about the license not adding too many restrictions.
> I have no doubt that, as usual, akulkarni will make a good PR job / community outreach to explain why, numbers and experience be damned, TimescaleDB is better!
I don't like responding to bullies and people who enter dialogues without good intentions.
But since this is a public forum, I'll answer your comment:
In general: ClickHouse is better than TimescaleDB for OLAP. TimescaleDB is better for time-series. If you don't believe me, that's fine! Each workload is different and you should test it yourself.
p.s. Let's keep HackerNews a more positive place. Negative comments are unnecessary, not productive, and honestly just make the author look immature.
I'm totally in favor of more positivity. It is very easy to criticize something when you don't know what is happening on the other side. HN is a place for hackers to discuss facts, not to imply what they think others think. If two comparisons differ, there might be several reasons, such as a lack of trials on both sides or a lack of common ground for comparison, but a lot of people here are doing their best to make developers' lives easier.
Sometimes, after trying to engage positively and giving the benefit of doubt, I start to notice some disturbing things. When that happens, I speak my mind, and escalate progressively depending on how trustworthy I believe the person I'm talking to is.
Here, I provided a link to the previous discussion because, personally, I do not appreciate being misled. I encourage you to check the technical details there if you don't believe me.
But maybe not being 100% positive and supportive is no longer acceptable in 2021? Or maybe it's the complexity of the issues discussed?
So let's give a simpler message, as rkwasny said it best just yesterday: "It's really quite easy, if you don't need DELETE ClickHouse wins every benchmark" https://news.ycombinator.com/threads?id=rkwasny
It's as simple as that: if you need deletion, consider TimescaleDB.
For every other conceivable scenario, ClickHouse is likely to come out ahead, unless you are doing something very, very wrong with it: a virtualization example would be splitting cores across VMs with no respect for their shared cache.
When people talk about doing millions of tiny inserts, it's a bit like that: a misconfiguration. And that's not how it works in the real world: even with plain Postgres, you often use a middle layer to avoid resource issues (increasing max_connections has a cost; that's why pgpool exists!), either directly in your app, or by putting some kind of buffer in front of the real table.
I have spent some serious time with both, think of me what you may, but the CEO of TimescaleDB saying TimescaleDB performance can withstand comparison with ClickHouse is like Intel's marketing department saying Intel CPUs can withstand comparison with AMD: unless you cook the tests with some highly specific workloads (say with lots of SIMD/AVX-512 stuff, single-core...) so as to be non-representative of the most common scenarios, you're not being honest.
I believe such thinly veiled dishonesty is a much larger problem than a perceived lack of positivity.
This outcome should also be entirely unsurprising and should pass people's basic sniff tests, as Timescale works within the existing, mature architecture of PostgreSQL, whereas ClickHouse is a greenfield single-purpose system. Software makes trade-offs.
If you're going to compare these two, you really ought to get into their materialized views, where the real-world performance comes from, and ideally dive into their respective limitations.
>>especially if you care about the license not adding too many restrictions.
Compression is one of the many closed-source/proprietary features in Timescale. Timescale is a great idea, as it's just a Postgres extension, so there's no need to add another database, but with such an important feature being proprietary, I end up looking at the fully Open Source ClickHouse, and I see the operational overhead of another DB as a reasonable trade-off for keeping my stack Open Source and avoiding vendor lock-in.
Right - did you look at any of the source files in that dir? They all have a header that says they are under the "Timescale License", and if you look it up, you see that the Timescale License is a proprietary, source-available license, not an Open Source one.
What you can also read on that linked page is that the only thing you cannot do under the Timescale License is basically pull an AWS move and sell TimescaleDB as a service. So when you say "closed source" and "proprietary", that's just really not a good description of TimescaleDB imo. (On the other hand, you can grab the Apache version and sell it as a DBaaS, etc.)
>>"closed source" and "proprietary" that's just really not a good description of TimescaleDB
I suppose we can agree to disagree on this. Perhaps "Open Core, Source Available" is a term you can agree to? I think my original comment was clear that part of Timescale is Open Source, or in other words Open Core.
>> the only thing you cannot do when using the Timescale license is basically pull an AWS move and sell TimescaleDB as a service
Actually, by virtue of this, it prevents me from paying someone else to host a Timescale fork for me... this in turn is a major stumbling block to creating a viable fork if my business interests diverge from Timescale's business interests for any reason. Thus leading to vendor lock-in, as per my original comment.
Yeah, I suppose that's a valid concern. Even so, you can still choose between major cloud providers inside Timescale Cloud (AWS / GCP currently, as far as I know), so there's that. But it needs to be through Timescale the company (unless you are OK with managing it yourself, because in that case you can use the Timescale License and go to any hosting provider). So yeah, I agree there's some lock-in, but I don't see any other reasonable option for a company to generate revenue without any vendor lock-in.
Clickhouse has done a performance benchmark with a much bigger dataset and published the results on their website at [1]
https://clickhouse.com/benchmark/dbms
It sounds like ClickHouse is the default OLAP choice and TimeScaleDB is the time-series workload choice.
Does anyone have a TimeScaleDB implementation that they love for time-series workloads that they are so happy with that they don't miss the non-timescale benefits of ClickHouse?
I think this series of posts confirms the first law of benchmarketing: for any system, one can come up with an "unbiased" benchmark which confirms its superiority.
Calling this "benchmarketing" sounds like you're saying the entire thing is disreputable which doesn't seem right. This blog post didn't remotely come off as shilling to me. The author does not (seem to) work for either company. They gave it a shot and shared a result. Whether or not it's a good benchmark or representative for your (anyone's) use case is debatable.
I was rather referring to the original TimescaleDB article, which claims that, unlike some others, these are real benchmarks. I encourage all benchmarks (including ours at Percona) to be taken with a pound of salt, because they tend to have implicit, if not intentional, biases and rarely have real applicability to the real world.
I selected only 11M rows for this blog because I used the dataset linked in the TimescaleDB docs[0]. The dataset linked in the CH docs has 1.2B rows[1]. The goal was to make the comparison on a dataset that both databases' docs agree upon.
While I really enjoyed this read, it'd be nice to see benchmarks which also measure Queries-Per-Second under a highly concurrent workload. I've been using ClickHouse to serve live analytics and this was something that I was most interested in.
Not sure it would change results dramatically but the table schemas do not seem fair.
TimescaleDB schema in this benchmark uses NUMERIC (variable size, exact precision) versus Float32 or Float64 for Clickhouse schema.
It would be interesting to see the results with the TimescaleDB schema updated to the fairer REAL (Timescale/Postgres's float32 equivalent) and DOUBLE PRECISION (float64) columns, as per ClickHouse's schema.
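In other words, something like the following column definitions (hypothetical table names), or an in-place column type change:

    -- NUMERIC is arbitrary-precision and comparatively slow to process;
    -- REAL / DOUBLE PRECISION are the direct analogues of Float32 / Float64.
    CREATE TABLE readings_numeric (ts TIMESTAMPTZ NOT NULL, value NUMERIC);
    CREATE TABLE readings_float   (ts TIMESTAMPTZ NOT NULL, value DOUBLE PRECISION);

    -- Or convert the existing benchmark table in place:
    ALTER TABLE readings_numeric ALTER COLUMN value TYPE DOUBLE PRECISION;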
I remember reading that Clickhouse is quite bad at joins, which can be important if you have to build a snowflake schema. Is that still true? Is this something TimescaleDB would be better at?
If the data you join looks like not-very-huge dictionaries [1] (locations, types, etc.), then ClickHouse can show amazing speed. I haven't had any problems with the speed of ordinary joins, though.
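As a rough sketch of that dictionary pattern (the source, names, and credentials are placeholders), small reference data can be exposed as an in-memory dictionary and looked up with dictGet() instead of an explicit JOIN:

    CREATE DICTIONARY locations_dict
    (
        id   UInt64,
        name String
    )
    PRIMARY KEY id
    SOURCE(POSTGRESQL(host 'pg-host' port 5432 db 'appdb' table 'locations' user 'reader' password 'secret'))
    LAYOUT(HASHED())
    LIFETIME(MIN 300 MAX 600);

    -- Enrich a large fact table with the location name without a JOIN.
    SELECT dictGet('locations_dict', 'name', location_id) AS location,
           count() AS events
    FROM events
    GROUP BY location
    ORDER BY events DESC;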