Hacker News

Nice article. I love databases too for similar reasons but, as someone who designs database engines, some of the technical points are off the mark. I never really stop learning in this area; the technical range is incredibly deep and nuanced.

Some of the points that caught my eye as being quite off:

- Contrary to footnote 2, modern database designs bypass the OS file system cache and schedule their own I/O. This has an enormous performance impact versus the OS cache (2-3x is pretty typical) and is a good litmus test for the technical sophistication of the database implementation. It is the primary reason many open source database engines, even "write-oriented" ones, have relatively poor write performance to disk.
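To make the "schedule their own I/O" point concrete, here is a toy sketch (my own illustration, not any particular engine's code) of the core idea: the engine keeps its own buffer pool and eviction policy instead of trusting the OS page cache to guess what matters.

```python
from collections import OrderedDict

class BufferPool:
    """Toy LRU buffer pool: the engine, not the OS, decides what stays in RAM."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> page bytes, in LRU order

    def get(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # cache hit: mark most recently used
            return self.pages[page_id]
        data = read_from_disk(page_id)        # a real engine issues O_DIRECT/async I/O here
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)    # evict the least recently used page
        self.pages[page_id] = data
        return data
```

Real engines layer much more on top (dirty-page tracking, write scheduling, prefetch), but the eviction decision is the part the OS cache otherwise makes for you.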

- The three data retrieval models enumerated are textbook but if you were designing a new engine today you probably would not use any of them for a general purpose design. Modern spatial access methods (ex: Hyperdex, SpaceCurve, Laminar) are superior in almost every way, though the literature is much sparser. Also, a number of real-time analytical databases use bitmap-structured databases (ex: ParStream), which are incredibly fast for some types of query workloads.
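For what it's worth, the bitmap idea can be sketched in a few lines (a toy illustration of the general technique, not how ParStream actually works): keep one bitmap per distinct column value, and combine predicates with cheap bitwise AND/OR.

```python
from collections import defaultdict

class BitmapIndex:
    """Toy bitmap index: one bitmap (a Python int) per distinct value.
    Bit i of a value's bitmap is set iff row i holds that value."""

    def __init__(self):
        self.bitmaps = defaultdict(int)

    def insert(self, row_id, value):
        self.bitmaps[value] |= 1 << row_id

    def query(self, value):
        return self.bitmaps[value]

def rows(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]
```

Multi-predicate queries become single bitwise operations, which is why this shines for analytical scans; production systems use compressed bitmaps (e.g. roaring/WAH-style) rather than raw ints.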

- Many distributed database challenges are a side effect of "one server, one shard" type models. It is not necessary to do things this way, it is just simpler to implement; some distributed database systems have thousands of shards per server. The latter model is operationally more robust and better behaved under failure and load skew.
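A toy sketch of the many-shards-per-server model (shard count and server names are made up for illustration): keys hash to one of many small logical shards, and the shard-to-server map is the thing that moves under failure or load skew, so rebalancing never rehashes keys.

```python
import hashlib

N_SHARDS = 4096  # many small logical shards, far more than servers

def shard_of(key):
    # stable hash of the key -> logical shard id
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % N_SHARDS

def assign(shards, servers):
    """Spread shards round-robin across servers. Losing a server means
    reassigning only its shards; skew averages out over many small units."""
    return {s: servers[s % len(servers)] for s in shards}

shard_map = assign(range(N_SHARDS), ["db1", "db2", "db3"])

def server_for_key(key):
    return shard_map[shard_of(key)]
```

Contrast with one-server-one-shard: there, rebalancing means splitting a shard (expensive and disruptive), whereas here it is just updating a small routing table.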

- Tombstones are usually trivial if the database engine is properly designed. Complications are a side effect of poor architecture. The big challenge for tombstones is deciding when and how tombstoned records are garbage collected. It is outside the scope of the normal execution pathways but you also don't want a Java-like GC thread in the background.
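A minimal illustration of the garbage-collection decision (my own toy model, not any specific engine's design, and it assumes the compaction sees every run that could hold older versions of a key): during a merge of sorted runs, a tombstone can be dropped once its timestamp is older than the GC horizon.

```python
def compact(runs, gc_horizon):
    """Merge runs (ordered newest -> oldest) into one run.
    Records are (key, timestamp, value); value=None marks a tombstone.
    Tombstones older than gc_horizon are dropped, since this compaction
    consumed every older version they were shadowing."""
    latest = {}
    for run in runs:
        for key, ts, value in run:
            if key not in latest:          # first run seen is newest: it wins
                latest[key] = (ts, value)
    out = []
    for key, (ts, value) in sorted(latest.items()):
        if value is None and ts < gc_horizon:
            continue                       # tombstone is safe to garbage collect
        out.append((key, ts, value))
    return out
```

The hard part in practice is computing that horizon safely (open snapshots, replicas that haven't caught up), not the mechanics of dropping the record.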

Of course, any of these is a long blog post in itself. :-)




Oracle talks directly to NFS servers these days, bypassing the OS file access layer altogether: it just uses a socket, reading and writing from a huge chunk of memory that it grabs from the OS on startup and then manages itself.


Doesn't Postgres rely on OS caching? Would you say it's missing out on large performance gains based on that?


I believe that active data is stored twice (once in the Postgres buffer pool, and once in the FS cache). This is not ideal, because it is not making optimal use of RAM. To minimize this effect, PG recommends a relatively small buffer pool, which is not great if you believe the DB can do a better job than a generic OS.

I think this is quite fixable as well (I think I've even fixed it myself once) - just use O_DIRECT. Any PG maintainers able to tell me if PG still "double-buffers" and whether this patch would be useful if I could recreate it?
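For reference, the O_DIRECT idea looks roughly like this on Linux (a hedged sketch, not PostgreSQL's actual code; the path and fallback behavior are illustrative). The fiddly part is that direct I/O requires block-aligned, block-sized buffers, which is one reason it complicates an engine's buffer management.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # direct I/O must use multiples of the logical block size

# anonymous mmap memory is page-aligned, which O_DIRECT requires
buf = mmap.mmap(-1, BLOCK)
buf.write(b"\x42" * BLOCK)

path = os.path.join(tempfile.mkdtemp(), "direct.dat")
try:
    # O_DIRECT is Linux-only; it bypasses the kernel page cache entirely
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
except (AttributeError, OSError):
    # non-Linux platform, or a filesystem (e.g. tmpfs) that rejects
    # O_DIRECT: fall back to buffered I/O so the sketch still runs
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
try:
    os.write(fd, buf)
finally:
    os.close(fd)
```

Note that with O_DIRECT the application also gives up kernel readahead and write coalescing, which is exactly the scheduling work the comments below discuss.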


It can be stored twice, but I don't think that's the ordinary case. Pages that are hot in PG's buffer cache are likely to stay there, making the same page in the OS buffer cache cold (because there aren't many requests for it). That's not always true, because writes to hot pages will end up going through the OS buffer cache maybe a couple times per checkpoint cycle, but it's still not (on average) stored twice. I don't have empirical numbers here, unfortunately, so someone else can correct me and fill in details.

At least from the discussions I've seen, there isn't a lot of interest in using O_DIRECT or otherwise taking the I/O scheduling problem into postgres. It's not particularly exciting to me, because:

* Takes on the I/O scheduling problem, rather than the simpler current method of just handing pages to the kernel for lazy writing and fsync'ing at checkpoint time.

* Requires a lot of new code, tuning, configuration, etc. with a high maintenance cost.

* Not portable, so it's easy to make a tweak that helps on kernel and HW configuration X but hurts on kernel and HW configuration Y.

* Not very strategic: it helps some workloads with a lot of real I/O by some small constant factor, which doesn't necessarily open up new use cases or market opportunities.

In my opinion, it's much better to focus on innovative features, or more low-hanging performance gains (which exist in postgres), or scale-out features. All of those will be slowed down if the code becomes bulkier (making correct patches harder to write, especially for new developers) and the maintainers become distracted by I/O scheduling issues.

It seems more like something to do when innovation slows enough, otherwise it doesn't seem worth it.


Fair enough. Thank you for the very thorough answer - and for saving me the effort of putting together the patch!


PostgreSQL also uses a shared buffer cache, which seems to be what he's talking about?


By "bitmap-structured" do you mean bitmap indexes?

Are there any open source DBs that schedule their own I/O? I guess PostgreSQL, InnoDB?

What is your opinion on TokuDB's fractal trees?



