You should try set track_io_timing = on before running the query. That will show you the actual amount of time spent on i/o. Additionally, if you are using RAID, and haven't tuned it yet, you should try increasing the effective_io_concurrency parameter.
RE: effective_io_concurrency --- any idea how this should be tuned for SSD or PCIe flash? The manual only gives an example of how to tune for RAID (by number of drives).
My understanding is that the effective_io_concurrency represents the number of data drives that partcipate in the i/o system excluding the parity drives.
From my experience, for each SSD drive you can set it between 5-8 (if there are in raid0).
BTW in 9.6 the effective_io_concurrency [0] can be different for each table space (especially useful if you have multiple drives of a varying quality [ssd, spinning] in one server).
It's somewhat related to the number of data drives, but as you point out it's a bit more complicated in practice - it's not just the number of drives. For example due to parallelism built into SSDs (each SSD is actually a stack of smaller drives). For rotational devices it's complicated by TCQ/NCQ, optimizations that require a queue of commands for each device to actually work.
On modern SSD, values 5-8 per device are definitely on the low side. See for example this thread from 2014, on (then released S3500 SSD from Intel):
That shows the performance increases up to effective_io_concurrency=256 (of course, in reality there may be multiple queries running on the system, sharing the I/O paths).
It's quite difficult (or impossible) to get a single "right" value, as it depends both on the workload and hardware (say, whether you have SSDs attached directly or through a RAID controller).
Some months ago I played with BRIN to determine whether we could use Postgres as a time series store. I started writing a letter to the Postgres mailinglist but never got around to sending it. I'll post it here in case anyone is interested:
1. Background
I've been playing a bit with large sets of time series data on PostgreSQL 9.6 beta 1. I have two tables containing about 5,000,000 "meters" and 120,000,000 hourly "readings". The 120,000,000 "readings" represent one day's worth of data and the active set of data would be 90 days.
CREATE TABLE meter (
id integer NOT NULL,
version integer,
name character varying(255) NOT NULL,
primary key(id)
);
CREATE TABLE reading (
instant timestamp without time zone NOT NULL,
value integer NOT NULL,
meter_id integer NOT NULL,
primary key(meter_id, instant)
);
I'm aware that there are many other ways to model time series data but I have a preference for this one because it's simple, it maps seamlessly with our development tools (Hibernate ORM), and the data is easy to process with analytical queries.
(In the end we settled on a different model because this solution was not performant)
2. Findings
The table "reading" is 5068 MB and the primary key constraint creates a multicolumn BTREE index which is an additional 3610 MB. Unfortunately no other index type is available for use as primary key. It would be nice to be able to use a slower but more space efficient (224 kB) BRIN index instead. Another nice option would be an index organized table.
Due to the large per-row overhead in PostgreSQL tables you should probably investigate other ways to store the table data first if you are trying to preserve space.
In one application I achieved a lot savings by transposing the time series, and meters, to columns (arrays also have some overhead, so go for columns) and manually partitioning along one axis to avoid storing one key. Perhaps not applicable in this case, but someone might find it useful, I had a more or less fixed set of meters and times.
Ex
CREATE TABLE reading_20160917 (
meter_pos int PRIMARY KEY,
"meter1 @ 00:00" smallint not null,
"meter1 @ 01:00" smallint not null,
...
"meter2 @ 00:00" boolean not null,
"meter2 @ 01:00" boolean not null,
...
"meter3 @ 00:00" pg_catalog.char not null,
"meter3 @ 01:00" pg_catalog.char not null,
...
);
```
Some meta-programming is required to make that manageable though but PostgreSQL is quite flexible in that regard, so with the proper set of views and functions you can have a more naïve view of the data for querying.
If that approach won't do I would look into foreign data wrappers, something like CitusDB perhaps, or just go for an actual time series store.
In the end I did what you are suggesting but implemented it using the PostgreSQL ARRAY datatype. It was easy support with Hibernate and the data was usable for analytical queries. Disk usage dropped down to roughly 1 GB.
Just don't assign a primary key. Of course, keep the ID field in the table. You can also declare composite primary keys, but you probably only want to index fields that'll you'll be filtering, joining or grouping on.
Having clustered tables seems to be a prerequisite for using this kind of index. How does this work in practice?
The CLUSTER command is a one-time reordering, and I assume a seriously expensive operation. But it doesn't affect any later insertions. So would one just run a CLUSTER periodically, or are there better ways to do this?
Right, it is a prerequisite, and there are some complexities in practice.
You're right -- CLUSTER is a seriously expensive operation and locks the entire table, to boot. So we didn't really do that. Instead, we used pg_repack which is a great tool to do online clustering without the locks.
Additionally, we use pg_partman (another great tool!) to partition our data by month. Once a given month is finished, we can ensure the partition table for that month is ordered optimally and then never worry about it again.
This particular table is mostly appended to, so it's not so bad. If you were updating or deleting records, you'd have to run pg_repack more aggressively (or take similar actions) to get good performance.
A lot of data are naturally clustered for various reasons (particularly data with ids and timestamps). Also, note that BRIN indexes don't have to use ranges in the summary--for instance, for enums or other domains of small cardinality, they could use bitmaps or bloom filters. I believe that BRIN indexes will become much more useful when those alternatives are implemented.
We are using these on materialized views with an explicit order by to do that for us. Have found these most helpful for those types of scenarios (summary / reporting data where we can control refresh/population) - but obviously that is our use case and it may not be very efficient for others.
Very interesting article. We had a database class last semester which hardly mentioned anything new, making me believe that I generally knew most things there were to know about databases. I'm sure there will be people here saying this doesn't even go that deep, but this reminded me how much more is going inside a database.
It also makes me think database technology is less dead than it seems. Hot topics these days are things like redis and mongodb, and even nosql is already old news and on its decline (when looking at typical HN front page topics anyway). Yet there is still so much tech put in good old relational databases.
As others have mentioned, if your workload is mostly appends and the data is appended in the correct order, you don't need to redo the clustering. In our case, the clustering was a one-off to get the data in the right shape.
(Although if you do need to redo clustering, look into pg_repack.)
Yeah, that's what we did in the part we said "Pretend this is where we stopped" :) We pre-computed a few major dimensions and choose which table to query based on which columns are being requested. BRIN indexes were still a win vs B-tree, though.
We keep being impressed by Postgres--we keep being able to flog it harder and harder instead of moving to something heavier duty.
They actually do work like that in Postgres, and in general for secondary indices. Like, I'm positive there are some btree variants that deal with duplicate data better, at the cost of slower writes or more locking (just an example), but we're talking about normal btrees.
In his book "An Introduction to Database Systems", DB2 Architect CJ Date describes an index design that compresses indexes of similar strings. So if entries in the index block start with the same first few letters, the first entry has the complete string, and subsequent entries use the format: <count-of-matching-characters> <remaining-characters-that-are-different-than-previous>. The algorithm burns some CPU to conserve I/O.
For example:
0 let
3 ter
6 s
4 uce
For the words: let, letter, letters, lettuce.
I can't speak to whether Postgres uses that index format; I haven't checked. Just saying an index as described above is a textbook idea that's had already been around for a long time when I learned it some 30 years ago.
I also want to share, Date's book is the best book I've ever read on information processing theory. It brought several foundational ideas into my skill set.
You should try set track_io_timing = on before running the query. That will show you the actual amount of time spent on i/o. Additionally, if you are using RAID, and haven't tuned it yet, you should try increasing the effective_io_concurrency parameter.