Beware that simply adding a column-oriented storage engine to a row store like Postgres is not going to get you anywhere near the performance of a ground-up columnar system like Redshift or Snowflake. This paper explains why [1]. Short version: most of the benefits are in the column-oriented execution engine, which differs in every aspect of its implementation from a row-oriented execution engine.
Correct, though it depends whether you are CPU-bound or I/O-bound. We see the latter a lot more often for large data sets.
Columnar storage for PostgreSQL is especially relevant in cloud environments. Most database servers in the cloud use managed, network-attached disks because of durability, availability, and encryption-at-rest requirements. However, those do come with a performance penalty compared to local SSDs. The VMs also have IOPS and bandwidth limits, partly to manage capacity within the IaaS platform.
If you can reduce the data size by 10x, then you are effectively increasing your disk bandwidth by that much as well. Moreover, you can keep more data in memory, so you will read much less data from disk, plus you'll only read the columns used by the query. Hence, you're likely to see speed ups of more than 10x for some queries, even without column-oriented execution.
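To make that concrete, here is a minimal sketch of the kind of size comparison you can run yourself. It assumes Citus 10's columnar access method is installed; the table and column names are made up.

    -- Same data stored row-wise (heap) vs. columnar, so the on-disk sizes can be compared.
    CREATE TABLE events_row (
        id      bigint,
        ts      timestamptz,
        payload text
    );
    CREATE TABLE events_columnar (LIKE events_row) USING columnar;

    INSERT INTO events_row
    SELECT g, now() - (g || ' seconds')::interval, repeat('x', 100)
    FROM generate_series(1, 1000000) g;

    INSERT INTO events_columnar SELECT * FROM events_row;

    SELECT pg_size_pretty(pg_total_relation_size('events_row'))      AS row_size,
           pg_size_pretty(pg_total_relation_size('events_columnar')) AS columnar_size;

The exact ratio will depend heavily on how compressible your data actually is.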
That's exactly what we've seen. I don't know exactly how Citus stores data, but we found a difference of 30x between gzipped Parquet and a "generic Oracle table".
There's a huge difference between an analytical query scanning the full 30 GB and scanning 1 GB (or maybe even half that or less if you only need to scan some of the columns).
Wouldn't it be possible to create a new type of index in Postgres (or maybe it already exists) which would take all the data of a column and simply lay it out in columnar format in memory, with all the benefits of compression?
I agree that it's never going to be as effective as something built from the ground up as a column store. But the performance and storage improvements can still be immense over standard Postgres. And if you're already using Postgres, it means one less service to configure and maintain, and means you can easily combine standard relational data with columnar data. For a lot of folks, that's a huge win.
One of the gotchas of columnar storage (coming from Redshift) is that you lose all of the compression benefits if you have just one column that’s fat or hard to compress.
In Redshift, columns are stored in blocks. You want to fit roughly the same number of column values per block across all your columns. But if you have one column where only a small number of values fit in each block, the rest of the columns end up leaving most of their block space unused. The result is wasted disk space and poor query performance.
This Postgres extension has similar-sounding storage ideas with stripes, but it’s not clear to me if it suffers from the same issue.
My first test to vet this would be a table with 50 columns of ints and one column of md5 hashes stored as varchar.
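For what it's worth, here's a sketch of that exact test, assuming Citus 10's columnar access method; all table and column names are made up:

    -- 50 int columns plus one md5 varchar column, loaded into a heap table
    -- and a columnar copy so the on-disk sizes can be compared.
    DO $$
    DECLARE
        cols text;
    BEGIN
        SELECT string_agg(format('c%s int', i), ', ' ORDER BY i)
          INTO cols
          FROM generate_series(1, 50) i;

        EXECUTE format('CREATE TABLE vet_row (%s, hash varchar(32))', cols);
        EXECUTE 'CREATE TABLE vet_columnar (LIKE vet_row) USING columnar';

        -- Fill every int column with the row number and the last column with an md5;
        -- real data would of course be less uniform.
        EXECUTE format(
            'INSERT INTO vet_row SELECT %s, md5(g::text) FROM generate_series(1, 1000000) g',
            (SELECT string_agg('g', ', ') FROM generate_series(1, 50)));

        EXECUTE 'INSERT INTO vet_columnar SELECT * FROM vet_row';
    END $$;

    SELECT pg_size_pretty(pg_total_relation_size('vet_row'))      AS row_size,
           pg_size_pretty(pg_total_relation_size('vet_columnar')) AS columnar_size;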
Stripes aren't really "scanned". They are more of a logical concept that tracks where the physical data for each column is, and only fetches what it needs.
If I understand what you are asking, let me restate: "can you apply a predicate first on column A before reading column B, so that you can avoid reading column B if the predicate on A doesn't match?".
The answer is: "sometimes". Predicates match some rows and not others, so matching rows may be mixed in with non-matching rows, so it's not always possible to avoid the IO to read column B. However, if the matching and non-matching rows happen to be separated (e.g. if your table is naturally in time order and you have a time-based predicate), then it's able to do this optimization. Please see the section on "Chunk Group Filtering".
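A minimal sketch of the case being described, assuming the Citus 10 columnar access method (names are made up):

    -- Data written roughly in time order, queried with a time predicate.
    CREATE TABLE measurements (
        ts    timestamptz,
        value double precision
    ) USING columnar;

    INSERT INTO measurements
    SELECT now() - (g || ' seconds')::interval, random()
    FROM generate_series(1, 1000000) g;

    -- Chunk groups whose min/max ts fall entirely outside the predicate
    -- can be skipped without reading the value column at all.
    EXPLAIN ANALYZE
    SELECT avg(value)
    FROM measurements
    WHERE ts > now() - interval '1 hour';

The EXPLAIN ANALYZE output should report how many chunk groups were filtered out, though the exact wording depends on the Citus version.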
Do you have a source for this, or a code sample that can demonstrate it? This would be an extremely naive implementation of columnar storage. There are some truly hard cases around long variable-length strings, but any halfway decent columnar storage engine should be able to handle columns with different widths.
The compression actually works as advertised, but that document outlines why you don't want the sort key to skew relative to the other columns. The disk I/O balloons as it goes and fetches a larger number of blocks from the other columns to match what you're filtering on in the sort key.
They call this issue out in the docs specifically when the fat column is a sort/distkey. rsbb in the Intermix slack has studied this in the most detail I know of outside AWS but they haven’t published their book yet.
Several different compression algorithms would do very well for this, e.g. dictionary compression for low-cardinality wide columns. I'd recommend familiarizing yourself with a handful of different encodings for columnar data stores before you give up on the idea.
I'm curious to see how this compares in real life to TimescaleDB hypertables with compression, which to me reads as much the same thing. I'm wondering if Citus is possibly bringing a lower-level implementation of the idea?
The access method approach followed in Citus is indeed lower level and more generic, which means it can be used on both time series data and other types of data.
You can then use the columnar access method to compress old partitions (see the end of the doc), and use distributed tables to shard and parallelize queries and DML.
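A hedged sketch of that combination, using the function names from the Citus 10 docs (the table and partition names are made up):

    CREATE TABLE events (
        site_id    bigint,
        created_at timestamptz,
        payload    jsonb
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE events_2021_01 PARTITION OF events
        FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');
    CREATE TABLE events_2021_02 PARTITION OF events
        FOR VALUES FROM ('2021-02-01') TO ('2021-03-01');

    -- Shard and parallelize queries and DML across the cluster.
    SELECT create_distributed_table('events', 'site_id');

    -- Once a partition no longer receives writes, convert it to columnar to compress it.
    SELECT alter_table_set_access_method('events_2021_01', 'columnar');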
There are a lot of differences that need to be taken into account before making a comparison.
1. TimescaleDB implements the compression at a higher level; the underlying table storage/access method remains the same
2. TimescaleDB doesn't compress the latest data, allowing you to keep fast writes and edits for recent data, while still letting you benefit from compression on row-based data
3. Although not currently available, it is possible to have a TimescaleDB hypertable with a column based access method
4. Comparing would have to take into account the data model, access methods (types of queries), ingestion vs query comparison (batch vs real time), backfilling and editing, etc
I agree that this (Columnar) would be closer to Parquet.
TimescaleDB users have seen 98% (i.e., over 50x) compression rates in some real-world cases (e.g., for some IT monitoring datasets), but compression ratio will definitely vary by dataset. (For example, a dataset of just 0s will compress even better! But that's probably not a realistic dataset :-) )
The reality is that Citus and TimescaleDB [0][1] take very different approaches to columnar compression, which result in different usability and performance trade-offs. Ultimately, one should choose the right tool for their workload.
(As an aside, if you have time-series data, no one has spent more time developing an awesome time-series experience on Postgres than the TimescaleDB team has :-) )
Kudos to the Citus team for this launch! I love seeing how different members of the Postgres community keep pushing the state of the art.
Reassuring to see big new features like this coming out after the Microsoft acquisition, a healthy sign that this open source product continues to see serious ongoing investment.
Yes, but it's more like Vitess, which is basically sharded masters; most of the time this works way better than something like Galera or BDR.
cstore_fdw had the same limitations, so even if they say these limitations may not persist forever, I am not very hopeful they really want to solve this problem.
But they could solve it with just a little bit of work:
* Create a hidden bitmap that stores whether each "row" of the columnar table is still valid
* When updating/deleting values, only set the bitmap entry to zero to indicate the value is no longer valid
* Every updated "row" is added to a special chunk at the end of the table in uncompressed format
* When vacuuming the table, the old rows with bitmap=0 are deleted and the values from the special chunk are merged in
So you would have the same performance if update/delete is never done, since the complete bitmap is filled with 1s. And every update/delete would make it just a little bit slower; since old data is mostly stored in compressed format, not many uncompressed values would accumulate. And a VACUUM FULL would optimize these old tables again. A rough sketch of this bookkeeping is below.
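This is not how Citus implements anything today; it's just an emulation of the bookkeeping described above with plain tables and a view (all names are made up):

    -- Compressed "base" data.
    CREATE TABLE facts_compressed (
        row_id  bigint,
        payload text
    ) USING columnar;

    -- The "bitmap": which compressed rows are no longer valid.
    CREATE TABLE facts_deleted (row_id bigint PRIMARY KEY);

    -- The uncompressed "special chunk" holding updated rows.
    CREATE TABLE facts_delta (row_id bigint, payload text);

    CREATE VIEW facts AS
    SELECT c.row_id, c.payload
    FROM facts_compressed c
    WHERE NOT EXISTS (SELECT 1 FROM facts_deleted d WHERE d.row_id = c.row_id)
    UNION ALL
    SELECT row_id, payload FROM facts_delta;

    -- An "update" marks the old row invalid and appends the new version:
    INSERT INTO facts_deleted VALUES (42);
    INSERT INTO facts_delta VALUES (42, 'new payload');

A periodic maintenance job (the "vacuum" in the proposal) would rewrite facts_compressed without the deleted rows and merge facts_delta back in.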
Not the previous poster, but my use case is storing site analytics (i.e. "Google Analytics"-like stuff) and allowing people to retain pageviews only for n amount of days.
It sounds like you might be able to do that using range partitioning instead, similar to the example in the blog post. Then just delete entire partitions when they are older than x days.
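A minimal sketch of that approach (the table and partition names are made up):

    CREATE TABLE hits (
        site_id    bigint,
        created_at timestamptz,
        path       text
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE hits_2021_02 PARTITION OF hits
        FOR VALUES FROM ('2021-02-01') TO ('2021-03-01');
    CREATE TABLE hits_2021_03 PARTITION OF hits
        FOR VALUES FROM ('2021-03-01') TO ('2021-04-01');

    -- Retention becomes a metadata-only operation instead of a big DELETE:
    DROP TABLE hits_2021_02;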
Site 1 may choose to keep things for 30 days, site 2 for 50 days, and site 3 may choose to keep things infinitely.
The solution I use now is to just run a background job every day with "delete from hits where site_id = ? and created_at < now() - interval ?". It's pretty simple, but it works.
There are probably some clever things I can do: partition by site ID and then date range, or something like that. But this has a bunch of downsides too:
- I like to keep things as simple as possible to make it easier for people to self-host this, which is a big reason I use PostgreSQL in the first place. If it was just some SaaS I was running, then I wouldn't have a lot of problems turning up the complexity a bit if it gave me serious benefits, but when you distribute software that you expect other people to be able to run, it's a bit more of a trade-off.
- The entire thing also runs on SQLite, and I'd like to minimize the number of special PostgreSQL/SQLite branches when possible.
Ideally, what I'd like is that people can optionally plug in something like Citus if they need it (and many don't! Even SQLite actually works fine for many) and have standard SQL "just work" without sweeping architectural changes. They can switch it on/off as they please. I don't mind adding a few "if using_citus then ..." exceptions here or there, but there's a limit, especially since many people just don't need it (but would still be "stuck" with a much more complex table structure because of it).
This, for me anyway, is the appeal of things like Citus or TimescaleDB vs. more specialized solutions like Clickhouse or whatnot. I don't need to get the "fastest possible solution", and these solutions strike a good middle ground between complexity/ease of setup vs. speed.
There is also a second use case for UPDATE (the above is mostly for DELETE): right now I pre-compute some data and run most queries on that, rather than on the main append-only table with events, because this is rather faster in various cases. Because the data is aggregated by hour or day, it needs to update those rows when new events come in. The way I do that now is with a unique key and "insert [..] on conflict [..] do update [..]", but there are a bunch of other ways to do this (this is probably the easiest).
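Roughly what that upsert looks like; the schema is made up for illustration:

    CREATE TABLE hit_counts (
        site_id bigint,
        hour    timestamptz,
        count   bigint,
        PRIMARY KEY (site_id, hour)
    );

    INSERT INTO hit_counts (site_id, hour, count)
    VALUES (1, date_trunc('hour', now()), 1)
    ON CONFLICT (site_id, hour)
    DO UPDATE SET count = hit_counts.count + EXCLUDED.count;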
In principle those tables should become obsolete with a column store, but when I tried cstore_fdw last year it wasn't really fast enough (although this version may be; I haven't tried it yet). Even if it were fast enough, this still means I'd have to write all queries twice: once for the "normal" PostgreSQL/SQLite use case, and once for the Citus use case, which isn't especially appealing.
So, tl;dr, I want Citus to be an optional thing people can use, and maintain compatibility with mainstream PostgreSQL and SQLite (and, possibly, other engines in the future too).
Perhaps this is a rare use case? I don't know. But this is my use case.
I would just need random updates. In almost all cases I wouldn't update old records, but sometimes some records need to be rewritten or deleted because of GDPR requests.
Amazon really missed the boat on Citus; a few more great acquisitions like this and Azure is going to look great (yes, I know it can still be self-installed on AWS).
I recently started diving into PostgreSQL and it seems to me that there is a patchwork of HA solutions (with Patroni being the most feature-full), but no real multi-master solution for PostgreSQL released under an open source license.
There's BDR (bi-directional replication), but apparently 2ndQuadrant pulled it back under a proprietary license, am I right?
What's the current status of PostgreSQL multi-master HA?
I just yesterday set up pg_auto_failover, and it was such a breeze compared to Patroni/PAF/repmgr. Fewer moving parts and a much simpler, yet flexible, setup. They did a really good job there! It also forced me to finally implement multihost support in my Postgres.js driver.
> I recently started diving into PostgreSQL and it seems to me that there is a patchwork of HA solutions (with Patroni being the most feature-full), but no real multi-master solution for PostgreSQL released under an open source license.
True multi-master is barely needed, but there is Citus, which uses the GPL and can be run in multi-master.
And yes, Patroni is really, really awesome.
It is a custom format that was originally derived from ORC, but is very different at this point. For instance, all the metadata is kept in PostgreSQL catalog tables to make changes transactional.
Memory, I/O, computation, and DB connection concurrency: all four have to be kept in mind to achieve optimum latency across a wide range of complex queries.
It has been my observation that following 3NF design, table partitioning, and carefully crafted materialized views works 8 out of 10 times when the data size is huge.
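A minimal sketch of that pattern, with made-up names: a range-partitioned fact table plus a materialized view for the hot aggregate, refreshed periodically.

    CREATE TABLE orders (
        order_id   bigint,
        created_at timestamptz,
        amount     numeric
    ) PARTITION BY RANGE (created_at);

    CREATE TABLE orders_2021_03 PARTITION OF orders
        FOR VALUES FROM ('2021-03-01') TO ('2021-04-01');

    CREATE MATERIALIZED VIEW daily_revenue AS
    SELECT date_trunc('day', created_at) AS day, sum(amount) AS revenue
    FROM orders
    GROUP BY 1;

    -- A unique index is required for CONCURRENTLY, which avoids blocking readers.
    CREATE UNIQUE INDEX ON daily_revenue (day);
    REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;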
Definitely not intending to take credit away from any other teams. There are so many good Postgres extensions and forks in this ecosystem.
What we've done is add a Columnar storage feature into the Citus open source extension to Postgres, as part of Citus 10.
One way to think of it: Citus 10 now gives Postgres users (those who are also using the Citus extension) a columnar compression option, for use on a single node and/or a distributed Citus cluster.
Thanks. Regardless of who did what, columnar stores are non-trivial engineering and anyone who produces a production-ready one is worthy of admiration.
I was the PM for the software side of Greenplum DB for a while. I also don't know of an earlier example of a separate column oriented storage option on top of a Postgres base. Vertica and ParAccel (which became Redshift) both also started from a Postgres base (though Vertica claims otherwise in marketing material) but were a little later. In any case, "Column oriented storage for Postgres" is almost 20 years old. This also isn't the first open source implementation.
[1] https://stratos.seas.harvard.edu/files/stratos/files/columns...