I just want to say that we wouldn't be here without the support, feedback -- and yes, even the honest critiques -- from the HN community. So thank you everyone.
As we like to say, we've come a long way in the past 5 years, but we're just getting started :-)
And we're hiring globally for our remote-first team! https://www.timescale.com/careers
As a user of Timescale, one of the things I find most lacking (with Postgres too, to be honest) is good educational content on par with MongoDB University: structured courses that teach database design concepts from first principles and then show what problems Postgres/Timescale solve on top of them, with hands-on experience working with datasets and an interactive way of learning about the types of things you can do.
I realize Timescale and Mongo are very different things, but when I got started professionally with software 10 years ago, the MongoDB courses (and the Stanford online Intro to DB course) were immensely helpful. Working with Timescale professionally now, I'm often unsure whether I'm doing things suboptimally, e.g. making tables hypertables when a regular table might be better, understanding the flexibility and capabilities of indexes on hypertables, and application-facing tooling.
Totally fair, and something I'm actually forming a team to work on! We're starting with some very foundational material [1], which may well be review and isn't as formal/professional as Mongo University or the like, but I'll be continuing this course and we'll keep iterating from there. I'd really love some feedback and also your questions, i.e. what you want covered or what you find confusing. You can leave comments on the video or in our community Slack channel [2] or forum [3]. Thanks for the feedback, and I hope we'll be able to do some of that for you over the coming months!
TimescaleDB has some of the best documentation in open source and super helpful developer advocates. They are also happy to reach out to dev teams using TimescaleDB (even the open source version). We recently did a guest blog post with them:
I agree completely with this. The documentation is great and the community on slack has been incredibly helpful when I've posted there. Even still, I personally think a lot of people would benefit from course content like MongoDB provides.
Thanks for making Timescale, my company ThoughtMetric is an early adopter and we currently ingest about 1M rows of data per day using it. One highly requested feature that we desperately need is continuous aggregates that can join multiple tables together. Right now we have to use regular materialized views to accomplish that and it sucks.
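Roughly, the workaround looks like this (table and column names are made up for illustration): a plain materialized view with the join, which we have to refresh ourselves, instead of a continuous aggregate that TimescaleDB would keep up to date incrementally.

    -- Hypothetical schema: "events" is a hypertable, "channels" a small lookup table.
    CREATE MATERIALIZED VIEW daily_channel_stats AS
    SELECT time_bucket('1 day', e.time) AS day,
           c.name                       AS channel,
           count(*)                     AS events
    FROM events e
    JOIN channels c ON c.id = e.channel_id
    GROUP BY 1, 2;

    -- Has to be refreshed manually or on a schedule, and it recomputes everything.
    REFRESH MATERIALIZED VIEW daily_channel_stats;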
Congrats! I was wondering if you could comment on the current team size? I'm looking at the number of contributors who have created pull requests within the last four months, and it is shockingly low (in a good way). Based on the following:
It looks like there have only really been 7-10 full-time contributors, and for you to have raised what you have with such a small team is quite impressive. Is development happening elsewhere, or is my hunch correct?
Edit: Thanks to feedback from mfreed, below is a more accurate picture of development activity:
Hi! So the team is over 100 at this point, but engineering effort is spread across multiple products.
The core timescaledb repo [0] currently has 10-15 primary engineers, with a few others working on our hyperfunctions and function pipelining [1] in a separate extension [2]. I think the set of outside folks who contribute to low-level database internals in C is generally just smaller than for other types of projects.
We also have our promscale product [3], which is our observability backend powered by SQL & TimescaleDB.
And then there is Timescale Cloud [4], which is obviously a large engineering effort, most of which does not happen in public repos.
Interested? We're growing the teams aggressively! Fully remote & global.
Hey, thanks for the insights! I've added all Timescale repos for indexing and should have the bigger picture in a few hours. Thanks again for indulging my curiosity.
Timescaler here. Linking a few posts below [0][1] that answer the majority of your very good questions. The posts detail why and how TimescaleDB started and why the founders chose to build a time-series database on PostgreSQL.
While I think that TimescaleDB is a great technology, the support experience of their Timescale Cloud offering was quite underwhelming.
On one occasion we wanted to create a VPC peering between a database hosted on Timescale Cloud and our Kubernetes cluster hosted on Azure. For this you need to put the Azure resource group name into a form on Timescale Cloud.
It turns out our resource group name contained an uppercase character, and the form on Timescale Cloud had a (broken) validation that required the name to be all lowercase. We couldn't easily change that name, since that would have required us to re-create our production AKS/K8s cluster.
After contacting Timescale support (as a paying customer) the answer was basically: "Well, we require the resource group name to be lowercase, we can't change that, sucks to be you, bye"
We can live without that VPC peering, so we didn't push it further, but there are zero technical reasons for that restriction, and I would bet it's just a broken validation regex in their backend that they are unwilling to fix.
Hi, thank you. I'm really grateful for your answer. The Azure documentation seems to hint at resource group names being case insensitive, so I guess that could actually work. To be honest, I don't know if we tried just using a lowercase version.
Again, I want to emphasize that I find it really great that you took the time to answer here.
Hey there. I'm responsible for support here at Timescale. akulkarni asked me to take a look at this.
To be fair, I don't have a good answer as to why the input box requires the resource group name to be lowercase, but I ran through the steps end-to-end and, as akulkarni said, the names are case agnostic on the Azure side, so you'd just need to enter the resource group name in lowercase to create the peering successfully.
Sincerely, thank you for the feedback, as this is how we learn and improve. If you give this a shot and run into any trouble, please do let us know, and we'd be happy to help troubleshoot.
> Looking ahead, our goal is to keep innovating on top of PostgreSQL and to continue adding breakthrough capabilities
Does Timescale contribute back to PostgreSQL or do they truly only build on top of it? https://www.postgresql.org/community/contributors/ only lists two contributors and they both worked on Postgres before joining Timescale.
I would love to run a time-series benchmark against a good column store like Snowflake to see if purpose-built time-series databases are actually faster. I have a sneaking suspicion that Timescale and the like are just reinventing the column store, and that an appropriate, non-sabotaged benchmark would show this.
> "timescale ... are just reinventing the column store"
Not reinventing but reimplementing it for Postgres, which didn't have serious OLAP capabilities before. Lots of "newsql" systems are combining OLTP and OLAP by starting at one side and adding the other.
So far Timescale has column-oriented compressed storage and scale-out partitioning, and they're working on matching the compute part.
The main selling point is developer experience, I think: rather than building out a bunch of stuff on top of a more general-purpose tool, you use a specialized DB and save time. They also have some benchmarks against ClickHouse, for example.
I have just looked up those charts on Timescale's website [1] and I am a bit surprised. I've never extensively used any of those DBs, but I have seen their sources and must say I expected bigger gaps [2]. Also worth a look: the Taxi Rides benchmark on Postgres vs ClickHouse [3].
There is a lot of value in having a columnar storage that is fully ANSI SQL and supports all of the goodies that you get in the Postgres ecosystem.
NoSQL databases, with their half-assed SQL grammar implementations, are a real pain to use in real applications, where they often have to be handled differently in code than the RDBMS because either their syntax is slightly different or their connection stack is incompatible.
That is, "I have this seldomly-updated list of ~10000 things, and I'm going to need to join it against my time-series data."
With other time-series databases I've dealt with, it's an afterthought at best and the answer is typically "Enrich the data via flink/benthos/etc. on import and avoid using any kind of join."
Does Timescale's use of PostgreSQL circumvent this issue, both in terms of storage of lookup tables, and performance on join?
Yes, we support the rich set of PostgreSQL's JOIN operations, including against hypertables. It's generally smart enough to only apply these JOINs against the right subset of time-series data if you also have any time predicates (due to the way we perform "constraint exclusion" against our hypertable chunks).
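For a concrete (if simplified) sketch, assuming a hypothetical "metrics" hypertable and a small "devices" lookup table like the one described above:

    -- The time predicate lets the planner exclude chunks outside the window,
    -- so the JOIN only touches the recent slice of time-series data.
    SELECT d.name, avg(m.value) AS avg_value
    FROM metrics m
    JOIN devices d ON d.id = m.device_id
    WHERE m.time > now() - INTERVAL '1 day'
    GROUP BY d.name;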
There are other common queries related to what you describe, like a "last point query": Tell me the last record for each distinct object. Here, for example, we've built special query optimizations to significantly accelerate such queries:
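For a rough illustration, with the same hypothetical schema as above, a last point query in plain SQL looks something like:

    -- Latest record per device; written naively this scans a lot of data,
    -- which is why it benefits from dedicated query optimizations.
    SELECT DISTINCT ON (m.device_id) *
    FROM metrics m
    ORDER BY m.device_id, m.time DESC;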
There are plenty of distributed relational columnstores that can do joins. Timescale is bringing that to Postgres but you already have options from Clickhouse to Redshift.
Have you written about this anywhere? I'm sure TimescaleDB would love to signal boost that post, and I separately would love to read about how you have it set up and the nitty gritty of the setup.
How are you dealing with backups/WAL and general DB administration? Are you using Timescale Cloud?
Not OP, but I run a Timescale instance with the same order of magnitude inserts/sec and have been running it for about 2 years now. The database is closing in on 1 TB on disk. We don't use Timescale Cloud; we just host it on a VM in Google Cloud with 8 CPUs and 32 GB of RAM, which seems adequate for now. We do WAL backups using the WAL-G tool, which backs the DB up to Google Cloud Storage.
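For anyone curious, the WAL-shipping side of a WAL-G setup typically looks something like the sketch below (simplified; the GCS bucket and credentials are configured separately through WAL-G's environment variables, and base backups are pushed on a schedule with wal-g backup-push):

    -- Ship every completed WAL segment to the archive via WAL-G.
    ALTER SYSTEM SET archive_mode = 'on';                     -- needs a server restart
    ALTER SYSTEM SET archive_command = 'wal-g wal-push %p';   -- picked up on reload
    SELECT pg_reload_conf();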
Thanks so much for this. 2 years at that insert speed only being about 1TB of data is fantastic.
I've also had many discussions on backups (Barman, pgBackRest, WAL-E/G, etc.) and always feel like I have to look it up afresh every time to get myself back up to speed on which one I should be using.
Ya the backup solutions are complicated. I don't even remember the differences between those different tools. All I remember is that I spent about a day researching the tools and determined that Wal-G was the best one to use for our use case.
Thanks for going into the nitty gritty, super helpful to know what your setup is like.
In the past I've had terrible experiences with on-demand IOPS, but now I feel like even 1000 provisioned (the least you can get) is too much for the workload I was running… the app was mostly idle but had bursts that would overwhelm it.
We enquired about Timescale Cloud to migrate from Cloud SQL, as we could use it to solve a couple of problems in a simpler way and it would give us a chance to create other solutions when needed. We don't have anything in time series yet, but we were hoping to have it there to experiment and start dipping our toes in; however, we found it a bit expensive.
The pricing seemed more of an all-or-nothing investment and didn't make sense if we weren't ready to start using the extra functionality just yet. We were hoping it would be similar to Cloud SQL and just cost extra depending on usage.
Hi there, not sure how you did your price comparison, but typically with native compression, performance improvements, continuous aggregates, etc, you can go much further with the same resources on Timescale Cloud than Cloud SQL (or any other generic PostgreSQL provider). Did your math take those performance improvements into account?
I have written a time-series logging DB with SQLite and believe that approach has the following advantages:
- System performance scales well with the latest SSD hardware, as compared to a cloud-based approach that is limited by network/cloud speed.
- One can store logs per day / week / year in separate DB files as needed (rough sketch below).
- Backups of the small DB files for the last few days/weeks are trivial with rsync.
Love to hear other pro/con arguments from folks who use a Timescale-type approach.
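For example, the per-day layout can be as simple as one SQLite file per day (the file name and schema here are made up for illustration):

    -- Attach (and implicitly create) today's file, e.g. from a small wrapper script.
    ATTACH DATABASE 'logs_2022_02_22.db' AS today;

    CREATE TABLE IF NOT EXISTS today.log (
        ts    INTEGER NOT NULL,   -- unix epoch seconds
        level TEXT,
        msg   TEXT
    );

    INSERT INTO today.log VALUES (strftime('%s', 'now'), 'INFO', 'service started');

    -- Older files are closed and immutable, so rsync of previous days is safe.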
Try time series with DuckDB (duckdb.org) or ClickHouse Local? We use the former for analytical queries (queries over columns instead of rows), at which it is better than SQLite.
Is using a time-series database like TimescaleDB a good fit for a chat platform like Slack/Discord/Mattermost? I've heard that Discord is using ScyllaDB, but I have quite a bit of experience with Postgres, so it would be nice to use just that.
In any case, does anybody know what the advantages of one over the other are? Big thanks!
Insane! $110M towards yet another Postgres extension. Theoretical CS and hardware have advanced so much, but people are still using the same old boring approaches. Truly sad.
Nice share. OrioleDB updated the readme 17 days ago to say they'd release in February 2022, so I'll definitely be watching this space. These sound like some really interesting striking-at-the-heart-of-it changes. I hope we have a long time to see which of these pan out & pay off big.
On a side note, nice interesting brief comments & shares elsewhere on this site. :thumbup:
Great report, but I know B-Trees alone will not be enough. The simplest common solution is to switch to LSM trees for higher write throughput. That's exactly what Yugabyte does, by putting Postgres over RocksDB. Same way Facebook uses MyRocks = MySQL + RocksDB.
Literally anything. There is so much to do better. Faster I/O, kernel bypass and async filesystem, new persistent data-structures, alternative lock-free concurrency resolution schemes…
Disclaimer: I am highly biased, as I am funding/researching/developing a DBMS myself. Out of necessity though, as we constantly hit bottlenecks in the persistent I/O layer. We are not selling or offering anything, but will soon share some fresh internal results on aforementioned topics.
I don't understand what is difficult or non-trivial about these types of databases, and when people try to explain it, it usually just gets more confusing. Filtering values over time is just the same operations that you would find in an audio editor, or a one-dimensional version of what you find in an image editor (weighted averages of values). I wonder how many customers could just use SQLite but don't know anything about computers and end up buying some sort of subscription to a 'new kind of database'.
The web page just drops as many buzzwords as possible - web3, crypto, nfts, monitor soil to fight global warming - it looks like a disaster to anyone who understands the basics of programming.
It does a good job laying out why TSDBs are used and some of the tricks they leverage to store this type of data. See the requirements for the service laid out in the paper:
• 2 billion unique time series identified by a string key.
• 700 million data points (time stamp and value) added per minute.
• Store data for 26 hours.
• More than 40,000 queries per second at peak.
• Reads succeed in under one millisecond.
• Support time series with 15 second granularity (4 points per minute per time series).
• Two in-memory, not co-located replicas (for disaster recovery capacity).
• Always serve reads even when a single server crashes.
• Ability to quickly scan over all in-memory data.
• Support at least 2x growth per year.
Lots of organizations want to adopt an SRE/DevOps model and want a similar system. Also, one thing you should know is that trying to accomplish this with a traditional DBMS is usually possible, but since it is not making specifically optimized trade-offs, it usually ends up more expensive and requires a lot of tuning/expertise.
Lots of organizations (even legacy companies) have a massive need for this kind of service. Also, there are very cheap options out there that can handle the million-metric use case for basically <$100 a month in infra costs. The use case is definitely there, and even if it's possible with traditional DBMS systems, it's usually cheaper and more performant to use a dedicated TSDB.
If your data is small enough, then sure, any number of well tested data platforms will work for you.
The problem something like Timescale tries to solve is dealing with "high cardinality." When you have many unique values, the indexing approaches needed to ensure performance start becoming different. You'll run into write performance issues if you need indexes on every single column, and every single combination of columns, and each column has a large number of unique values. While the common factor many of these datasets share is that they are being constantly generated by some kind of sensor/probe/live system, they tend to have a variety of other dimensions that are also high-cardinality.
There are two different things here - the first is people not needing an elaborate solution because computers are fast and the second is that if someone does need a solution with less overhead, why is that difficult?
Values over time, like audio, are a one-dimensional signal. Seeking is basic data structure stuff; filtering is basic signal stuff. There aren't going to be other dimensions like time, which makes the other values just other channels. If you need to combine dense values, they can be not only filtered, but filtered into individual distributions.
People give abstract descriptions like you have here, but I'm just not seeing a difficult problem in all of this.
Relational databases are historically either OLTP like Postgres, MySQL, SQLite, etc; or OLAP like Vertica, Clickhouse, Greenplum, Redshift, etc.
The latter group is designed to analyze lots of data (calculating aggregations across billions of rows) and has developed features like storing data as columns, using compression, batch/vectorized processing, scaling out across multiple servers, and other techniques to get that performance. Timescale is an extension that brings these capabilities to Postgres, and it is one of the very few relational databases that offer OLTP+OLAP in a single product.
The time-series niche is what they targeted first, and the product offers lots of useful features around time-related data, but it's also a generic analytical database offering at this point.
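For a sense of what that looks like in practice, here is a rough sketch against a hypothetical "metrics" hypertable (the exact options vary by TimescaleDB version) of turning on the native columnar compression:

    -- Store older chunks column-oriented and compressed, segmented by device.
    ALTER TABLE metrics SET (
        timescaledb.compress,
        timescaledb.compress_segmentby = 'device_id'
    );

    -- Automatically compress chunks once they are older than 7 days.
    SELECT add_compression_policy('metrics', INTERVAL '7 days');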
> There are people who actually replied with something worthwhile.
And you dismissed it as still not being difficult.
The Timescale folks did all the hard work of finding product / market fit for you. If this problem is so trivial, you'd make a faster, cheaper, more reliable product and steal their lunch. Not only their lunch but Clickhouse, AWS Timestream, etc..
No, I said I don't understand why it isn't trivial, or whether most of the people buying it don't realize they don't need a specialized database. Only one person was able to come up with anything; other people, including you, just got extremely upset at the question without being able to answer it.
The number that stood out the most was 700 million values per minute, which is about 11.6 million per second. This is still less than the data in a single 4K image. My guess is that none of this is a feat of engineering, but that it's just enough work that many people who need it won't create their own solution.