So does this functional SQL just get "transpiled" into analytical SQL? If so, where does the "transpilation" occur: in the client, intercepted before hitting the query engine, etc.? Or has the query engine itself been modified to handle functional SQL?
If the query engine itself has been modified, are there any performance gains or losses?
This is a pretty interesting and compelling idea. I do wonder if the functional SQL approach might entice developers to write very bad/expensive/ineffective queries. Then again, nothing is stopping developers from doing the same with analytical SQL.
So this is a great question and we may do a whole post on this at some point in the future. There are definitely ways to shoot yourself in the foot, but hopefully we can avoid at least some of them.
The main way this happens now is that the pipeline elements actually return a custom data type that represents the composed set of functions, and we can (for at least part of the pipeline) handle simplification and combination of that into a single set of functions to run over the `timevector`.
With that said, there isn't great support for common subexpression elimination in PG right now. We're hoping to address some of that with both subqueries and maybe some fun stuff in the planner/executor as we think about moving this from experimental into stable.
Right now we hardly modify the query engine at all; these are just normal functions that are run, and they'll probably remain that way for a while. The optimization here will happen more inside the data structure or within a row, ideally to make sure we're not scanning the same `timevector` multiple times. We do hope to add some vectorization and other bits as we keep going, but we're not 100% sure where that's going to go yet. I'm also going to ask a colleague to respond and clarify this because he's working more heavily on it and I only know some of the bits...
> return a custom data type that represents the composed set of functions, and we can (for at least part of the pipeline) handle simplification and combination of that into a single set of functions to run over the `timevector`
I'm really having trouble wrapping my head around this idea. I guess it speaks volumes about the potential and power of PostgreSQL in general.
Has PostgreSQL ever been used like this in the past, or is this really thinking outside of the box?
We don't use TimeScale or have any intention of in the immediate future (Snowflake provides 99% of what we need atm), but I'm always interested and impressed to hear what they're doing.
Yeah, it's a bit mind-bending, I guess. I'll try to explain, but it might just need a longer format; we'll probably do a piece on this in the future.
Essentially, each function returns a special "PipelineElement" datatype that identifies the function, but not what it's acting on. When you string them together, it returns a "PipelineElement" that represents the composition of all the functions with the correct ordering etc. That can then be applied to the incoming `timevector`.
As far as I know, no one's done this before. It's not a completely new idea in spirit: Stonebraker thought about how you can treat code as data and considered bringing data and code closer together an important goal of the Postgres project, but this specific way of doing it is pretty new and out there, I think!
I think most people would assume that in order to do this you'd need to modify the parser or do something even more invasive.
(We do use one planner hook on the function call, the expression simplification one, which allows us to convert something like `timevector(ts, val) -> sort() -> delta() -> abs() -> sum()` into `timevector(ts, val) -> (sort() -> delta() -> abs() -> sum())` so that all the operations can be done in a single pass. But that's pretty minor. If we need to figure something out around common subexpression elimination we may need to hook into the planner a little more; we'll see.)
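To make that concrete, a full query using that pipeline might look something like this (a sketch based on the pipeline above and the `measurements` query elsewhere in the thread; exact function availability depends on the Toolkit version):

-- The whole composed element runs over each device's timevector in one pass.
SELECT device_id,
       timevector(ts, val) -> sort() -> delta() -> abs() -> sum() AS volatility
FROM measurements
WHERE ts >= now() - '1 day'::interval
GROUP BY device_id;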
Oh yeah, and building on this: it's also what makes it possible to actually store a pipeline element in the database, say in a column, and apply it later, as I mentioned in a comment below. That's one of the cool mind-bending things this allows.
The example I gave below built on the example in the post, where we had thermometers. Except now imagine you have thermocouples, each with its own calibration curve, and they're different types, so the calibration curves might even be defined by different kinds of equations.
You can now store the pipeline that does the correct adjustments and calculates the actual temperature separately for each `device_id` and JOIN to the "calibrations" table to apply the calculation.
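As a rough sketch of what that could look like (the table, column, and type names here are made up for illustration, not the Toolkit's actual API):

-- Hypothetical: one stored calibration pipeline per device; "pipeline_element"
-- stands in for whatever type the Toolkit actually uses for stored pipelines.
CREATE TABLE calibrations (
    device_id   bigint PRIMARY KEY,
    calibration pipeline_element
);

-- Apply each device's own calibration by JOINing and running its stored pipeline.
SELECT c.device_id,
       timevector(m.ts, m.raw_val) -> c.calibration AS calibrated_series
FROM measurements m
JOIN calibrations c ON c.device_id = m.device_id
GROUP BY c.device_id;  -- c.calibration is functionally dependent on the primary key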
It's kinda crazy, but really cool and I'm hoping to do a post on that whole thing at some point to encourage people to play around with it and see what sorts of things it allows...
From your description it sounds like you've implemented in PostgreSQL something akin to Haskell's IO, which is also a data structure representing operations, with ordering enforced by it being a monad. That sounds pretty cool.
I assume this special syntax allows for better optimizations: the SQL variant is much more flexible, which means you can make far fewer assumptions, and thus there's less room for optimization.
Not sure if they’re doing that (especially for the v1 version of what looks to be primarily a syntactic selling point, rather than a perf selling point). But that’s what I would do.
Yep! You got it exactly right! We're doing it for syntax first and then hopefully we'll go back and start doing interesting performance optimizations on top. There are definitely some things we're thinking about.
Kinda. It's close to that, except that in order to make it do that we had to actually make the functions return special types, so it's more equivalent to something like...
Am I right that functional pipelines substantially decrease the query planner's ability to consider different plans for execution? I am a long-term functional programmer and avid user of advanced PostgreSQL features, but I've always looked at this dichotomy the other way around: how do you bring the capability to choose implementations dynamically, based on actual statistics, to functional programming?
Would you say that the reduced flexibility of the query planner (if I am right about that) is not important to the target use cases of Timescale? I guess typical time-series data models are a well-understood subset of all relational data models, so the set of useful plans is smaller than usual.
The notable difference here is that this presents a lot more optimization potential, as the entire pipeline can conceivably be applied in one pass through the table.
This is immediately useful. The number of awkwardly nested subqueries that I write because I don't have this kind of functionality in BigQuery is a huge bummer.
I am curious, how many Haskell programmers do you all have? And more seriously, do you have any plans to help provide this functionality to non-Postgres SQLs, or at least help those trying to take inspiration from it?
Minor nitpick: Nested subqueries ARE awkward, which is why I would express
SELECT device_id, sum(abs_delta) as volatility
FROM (
    SELECT device_id, abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
    FROM measurements
    WHERE ts >= now() - '1 day'::interval) calc_delta
GROUP BY device_id;
this way:
WITH temperature_delta_past_day AS
(
SELECT device_id, abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
FROM measurements
WHERE ts >= now() - '1 day'::interval  -- edit: the ") calc_delta" that was here was a typo
)
SELECT device_id, sum(abs_delta) as volatility
FROM temperature_delta_past_day
GROUP BY device_id;
To me, it is a lot more natural to use SQL's CTE syntax to 'predefine' my projections before making use of them, instead of trying to define them inline: like the difference between defining a lambda directly inline vs. defining a separate function for it.
I don't know if trying to embed a DataFrame-esque api inside of SQL is a thing that would benefit me, but it is an interesting idea.
CTEs are wonderful for readability, but until recent versions of PostgreSQL they were always "materialized" which can have performance implications vs. subqueries. The NOT MATERIALIZED option to CTEs was added in PostgreSQL 12.
(Timescale Engineer)
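For illustration, the CTE above with that option applied (PostgreSQL 12 or later), asking the planner to inline the CTE rather than materialize it:

WITH temperature_delta_past_day AS NOT MATERIALIZED
(
    SELECT device_id, abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
    FROM measurements
    WHERE ts >= now() - '1 day'::interval
)
SELECT device_id, sum(abs_delta) as volatility
FROM temperature_delta_past_day
GROUP BY device_id;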
Came here to say the same thing. Some years back, to pass my CS databases course, we were required to write really complex queries in the computer lab. We had very limited time to do so.
Many people failed because their queries became complex monoliths, hard to debug or optimize when things went wrong.
That's because they limited themselves to SQL-92. We were using Oracle, so there was no reason not to use SQL:1999. I made heavy use of WITH, and it was quite effortless.
Note that your second example has a misplaced pair of parens:
WITH temperature_delta_past_day AS
(
SELECT device_id, abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
FROM measurements
WHERE ts >= now() - '1 day'::interval) calc_delta
)
SELECT device_id, sum(abs_delta) as volatility
FROM temperature_delta_past_day
GROUP BY device_id;
should probably be
WITH temperature_delta_past_day AS
(
SELECT device_id, abs(val - lag(val) OVER (PARTITION BY device_id ORDER BY ts)) as abs_delta
FROM measurements
WHERE ts >= now() - '1 day'::interval
)
SELECT device_id, sum(abs_delta) as volatility
FROM temperature_delta_past_day
GROUP BY device_id;
Yeah. CTEs definitely make it a bit easier to read, though some people get more confused by them, especially because they don't exist in all SQL variants.
And totally agree with that last bit! We want to see if it's useful for folks; it's released experimentally now and we'll see what people can do with it. One thing that's fun, and that we may do a post explaining a bit more, is that these pipelines are actually values, so the transforms that you run can also be stored in a column in the database.
And that starts offering some really mind-bending stuff. The example I used was building on the one in the post except now you have thermocouples with different calibration curves. You can actually store a polynomial or other calibration curve in a column and apply the correct calibration to each individual thermocouple with a JOIN...which is kinda crazy, but pretty awesome. So we want to figure out how to use these and what people can do with them and see where it takes us.
We don't have many Haskell programmers, we mostly work in Rust, C and Go, but we're always open to new things...
This is pretty Postgres specific. From the beginning Postgres has focused on extensibility and allowed this sort of stuff with custom functions/operators/types. Many other SQL variants don't have that. It's one of the main things that sets Postgres apart from other databases; see Stonebraker's great history of this, specifically the part on object-relational databases [1].
We're pretty focused on building on top of Postgres because of that functionality. We do have some other stuff to make Postgres more scalable, and you're welcome to try us out, but if there's something specific that BigQuery offers that you need, feel free to file a github issue around that too. But yeah, no plans to do things like this in other databases; they just don't have the infrastructure...
This is really cool! I wonder what the initial drive for this new feature was?
Is this meant to be a "short-cut" to express complicated SQL queries, or is it meant to add new semantics beyond SQL? While I like the idea of custom data types with dataflow-like syntax, implementing a whole new query processing engine for the new data type seems like a lot of engineering work. You also now have to handle many edge cases, such as very large time series: I wonder if you have efficient lookup mechanisms on timevectors yet, and support for the various timestamp and value types. If all this new syntax can actually be expressed using SQL, however complex, I think a "lazier" approach is to write a "translator" that rewrites the new syntax into good old SQL, or to add a translation step at the planning stage. That way you can take advantage of Postgres' optimizer and let it do the rest of the heavy lifting.
The initial drive for this was some of the work we did on Promscale, along with some observations I made when experimenting with potential optimizations for compression. We saw that a bunch of workloads involved exporting datasets to external tools such as Prometheus and pandas, which would materialize them in memory, and performing analyses there. Furthermore, the queries in the external tool would often be more readable than the equivalent SQL. I'd also observed earlier, while experimenting with decompression optimizations, that as long as the dataset was small enough, materializing it this way could be surprisingly efficient. This, along with some dissatisfaction with how difficult certain kinds of queries are to write, eventually led to this experiment.
> Is this meant to be a "short-cut" to express complicated SQL queries, or is it meant to add new semantics beyond SQL?
This is a surprisingly difficult question to answer; due to Postgres's extensibility, the line around "new semantics" is a bit blurry. I believe that everything we implemented could be desugared into SQL queries, though it might need some custom functions. However, we don't actually do this yet: right now the pipelines are implemented entirely with custom functions and operators, without touching the planner at all. There ended up being a 3-way tradeoff between experimentation speed, execution speed, and generality. While we could theoretically get the best performance on large datasets by adding a translation pass that expands timevector pipelines into subqueries, that ends up in some difficult and brittle code which isn't worth implementing at this stage of the experiment. In addition, it doesn't necessarily result in the best code for smaller datasets or less general workloads, since there are overheads to its generality. Since our hypothesis right now is that there's demand for tools to work with smaller datasets, we started out with the simpler implementation.
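For a sense of what "custom functions and operators without touching the planner" can look like at the SQL level, here is a rough hypothetical sketch; these are not the Toolkit's actual definitions, and the type and function names are made up:

-- Hypothetical shell type for a pipeline element; a real extension completes it
-- with input/output functions implemented in C or Rust.
CREATE TYPE pipeline_element;

-- A hypothetical "->" operator applying a pipeline element to a timevector;
-- run_pipeline_element is an assumed function the extension would provide.
CREATE OPERATOR -> (
    LEFTARG  = timevector,
    RIGHTARG = pipeline_element,
    FUNCTION = run_pipeline_element
);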
> Kusto has a similar data type "series" that is also created from aggregating over some columns
Cool! I'll have to look into it at some point. I think Amazon Timestream also has something similar.
Thanks for the background. I find it fascinating that the small-data scenarios in analytics are still kind of chaotic when it comes to tooling. Full-fledged SQL queries on relations seem heavy but closer to the raw data. The timevector custom data type is a middle ground: each timevector is essentially a pre-aggregated time series (maybe compressed as well), so this approach likely has a performance benefit when the task is to analyze many, many small time series. Although I still feel supporting 70+ new functions adds a lot of maintenance burden, and people cannot debug or extend this set of functions because they are not SQL. I am wondering if you often find that users just want an out-of-the-box solution, or whether they need the ability to tweak it or add their own domain-specific logic.
@ekzhu At this point, this is mostly about the "developer experience" for doing time-series analytics within SQL: as you point out, the ability to "short-cut" complicated SQL queries.
We actually heard something similar about SQL vs. PromQL - for the more limited domain that PromQL operated in, people really like how simplified it was. (We've also built an observability platform Promscale on top of TimescaleDB [0].) This is one take on bringing this type of simplicity & pipelined analytics to SQL, while remaining fully SQL compliant.
I also hope this isn't out of place, but if these types of problems interest you, we're always looking for great people (and lots of folks here have a research background). Shoot me a note:
Thanks for the response. I enjoy reading your blog. What you said reminds me of the post [0] in which you compared Timescale with InfluxDB and argued that SQL is better. Has your position changed due to new observations regarding usability?
I rediscovered the joy of working with databases while working on an Elixir application with a Postgres/Timescale DB.
Elixir's Ecto library (doesn't really fit the definition of an ORM) lets me compose queries in a functional syntax that then get compiled to SQL before execution. This both enables me to be more productive and results in more readable code that can be reasoned about and be tested easily. Sometimes I still have to dive into "raw" SQL to write e.g. a procedure or a view definition (like Timescale's continuous aggregates), and I definitely miss the pipeline operator when those expressions get complex. SQL is a powerful tool, but it gets clunky and has strange footguns. A feature like what this post describes is definitely a step in the right direction for Postgres. Looking forward to try it out.
Glad you liked it! Please do give us feedback, especially about how it is to use with your library… will be intrigued to see how they interact. Also, cool to hear about the library more generally; any particularly good syntax you think we should try to learn from?
This is really freaking cool. I'm not often impressed by clever hacks in SQL, but this solves a really annoying problem, and the specific way in which you solved it with just custom functions and operators is ingenious and reassuring, as opposed to some of the other crazy ideas I've seen for extending SQL.
I would love to see more about how you implemented the lambdas, as that's the one thing that is still non-obvious to me.
I guess mostly how the syntax works. The `$$` thing I mostly recognize as being a code block in custom functions, but I'm not sure how it works in the context of a lambda. Also, is `$value` just a special identifier for the input?
A separate question...is it possible to have tuple-valued timevectors?
The docs have a bit more on it [1]. The $$ is really just a dollar quoted string[2], you can use any sort of string in there, but dollar quoting just makes it a lot easier to read.
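To make the dollar-quoting point concrete, here's a rough sketch; the `map` element name is an assumption for illustration, but the shape is a quoted lambda body in which `$value` refers to the element being transformed:

-- The lambda body is just a dollar-quoted string; a regular quoted string would
-- work too, dollar quoting simply avoids escaping and reads better.
SELECT device_id,
       timevector(ts, val)
           -> map($$ $value * 9.0 / 5.0 + 32.0 $$)  -- hypothetical map element
           -> sum() AS total_fahrenheit
FROM measurements
GROUP BY device_id;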
On your other question: not right now, but if it's something you want, file a github issue. It's something we've thought a bit about; it gets hairy pretty quickly, but it also increases the power of these things dramatically. We'll probably do something with it, but it may be a little while and we may let this stuff bake for a bit first...
To do something similar without custom extensions, it's worth knowing about Postgres's LATERAL joins, which allow you to reuse columns from the left side of your join on the right side. It basically realizes the "enrich my results" promise that SQL should always have had, and makes incredibly flexible analytics queries possible without ever needing to indent more than once!
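A quick illustration of the idea (table and column names are just for the sketch):

-- For each device, grab its three most recent readings; the subquery to the
-- right of LATERAL can reference d.device_id from the left side of the join.
SELECT d.device_id, recent.ts, recent.val
FROM devices d
CROSS JOIN LATERAL (
    SELECT m.ts, m.val
    FROM measurements m
    WHERE m.device_id = d.device_id
    ORDER BY m.ts DESC
    LIMIT 3
) AS recent;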
I'm looking for a more efficient way of detecting errors in metrics that store running totals, such as Odometer. I use the delta of Odometer to compute Distance, but decreases in Odometer cause incorrect Distance calculations. Odometer should never decrease; it is an error if Odometer decreases. Would function pipelines be able to help me detect decreases in Odometer?
I think "may not be super straightforward" undersells the learning curve here. I took to SQL pretty naturally but gnarly partitioning is still the easiest way for me to get tripped up. Having taught a lot of data sciencey folks advanced SQL, even simple uses of partitioning tend to be a big barrier compared to what comes before.
Totally understand the concern. For now it's just for a particular datatype; all of the other operations happen normally inside your query and you can still use window functions...
And btw, I love window functions and will probably be doing a post on them, but they can be a little bit weird and have some really odd ordering properties with aggregates etc.
The point of this is really much more about providing choices for folks. But yeah, if we find that people are just getting really confused we could re-consider, that’s part of why we released it experimentally.
We're planning on providing various outputs, and JSON is definitely one format we'd want to support. Want to file a github issue with the layout you'd like so we can start getting some feedback before implementing it?
It appears that all these extra functions are written in Rust, but the core Timescale product is written in C. Why did you do it that way? Would you do it differently if you were starting from scratch today?
This is all true! We have a bit more info on that choice in this post [1].
But in general the Toolkit extension [2] is meant to have a bit of a lighter touch than the core TimescaleDB extension [3]: it's meant for moving faster and working almost entirely with the custom functions/types/operators approach, whereas the core extension is integrated much more deeply into the planner & executor and deals with a number of other hooks. It would be possible to do that in Rust, but because it's so deeply integrated with the main Postgres codebase there would just be a lot of translation going on and you'd get a lot less of the benefit of working in Rust. So C is more natural for much of the core DB extension.
PGX [4] is also a huge help for writing this sort of extension. The point of Toolkit is really to move much faster and try lots of things, mostly using the more SQL-facing API (CREATE FUNCTION, CREATE AGGREGATE and the like), and PGX helps manage all of that really nicely. The rest of Rust being available is also great and we like working in the language, but without PGX it'd be a lot harder and we couldn't move nearly as fast.
I think a big value add would be print-based debugging features for the functions (I'm not familiar with PostgreSQL custom functions, so apologies if that already exists). E.g. a very common scenario while debugging a complicated query is figuring out why the output is not as expected, and that means repeatedly taking subparts of the query until we zero in on the problem. CTEs make it somewhat better, but it's not ideal. If we had support for some sort of (conditional) print output, separate from the main output, it would help to be able to turn it on if/when needed (sort of like log levels). Just thinking aloud; I'm sure there are lots of practical/conceptual roadblocks.
What's the explain plan for a pipeline like this look like?
In my experience, one of the biggest things scaring devs away from moving more logic over the wire towards the db is the lubrication between version control and database state. If I end up defining tens or hundreds of custom pipeline operators in git, how do I keep that in sync with the db? And am I responsible for doing all that legwork myself?
Love how easily you can nest functions and use the output of one as input to the other. Do these functions currently only work with timevector inputs? What if I had some other NxM matrix I wanted to pass through a function pipeline?
Not everyone can use Datomic though, for many reasons (including cost). I'd love to use Datomic, but it's just not an option for me, while Postgres is (and DigitalOcean's hosted Postgres is quite affordable and supports TimescaleDB too).
Would that be the community version of Timescale then?
Looks like this is a whole new extension, does DO allow you to install your own, or do you think it'd be easy to make a case for them adding this one? (I'm not a DO user myself)
This is in the TimescaleDB Toolkit extension [1], which is licensed under our community license for now, and it's not available on DO. It is available fully managed on our cloud service. You can also install it and run it for free yourself.