You should try set track_io_timing = on before running the query. That will show you the actual amount of time spent on i/o. Additionally, if you are using RAID, and haven't tuned it yet, you should try increasing the effective_io_concurrency parameter.
RE: effective_io_concurrency --- any idea how this should be tuned for SSD or PCIe flash? The manual only gives an example of how to tune for RAID (by number of drives).
My understanding is that the effective_io_concurrency represents the number of data drives that partcipate in the i/o system excluding the parity drives.
From my experience, for each SSD drive you can set it between 5-8 (if there are in raid0).
BTW in 9.6 the effective_io_concurrency [0] can be different for each table space (especially useful if you have multiple drives of a varying quality [ssd, spinning] in one server).
It's somewhat related to the number of data drives, but as you point out it's a bit more complicated in practice - it's not just the number of drives. For example due to parallelism built into SSDs (each SSD is actually a stack of smaller drives). For rotational devices it's complicated by TCQ/NCQ, optimizations that require a queue of commands for each device to actually work.
On modern SSD, values 5-8 per device are definitely on the low side. See for example this thread from 2014, on (then released S3500 SSD from Intel):
That shows the performance increases up to effective_io_concurrency=256 (of course, in reality there may be multiple queries running on the system, sharing the I/O paths).
It's quite difficult (or impossible) to get a single "right" value, as it depends both on the workload and hardware (say, whether you have SSDs attached directly or through a RAID controller).
You should try set track_io_timing = on before running the query. That will show you the actual amount of time spent on i/o. Additionally, if you are using RAID, and haven't tuned it yet, you should try increasing the effective_io_concurrency parameter.