Hacker News

Nice article. I love databases too for similar reasons but, as someone who designs database engines, some of the technical points are off the mark. I never really stop learning in this area; the technical range is incredibly deep and nuanced.

Some of the points that caught my eye as being quite off:

- Contrary to footnote 2, modern database designs bypass the OS file system cache and schedule their own I/O. This has an enormous performance impact versus the OS cache (2-3x is pretty typical) and is a good litmus test for the technical sophistication of the database implementation. It is the primary reason many open source database engines, even "write-oriented" ones, have relatively poor write performance to disk.
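To make the "schedule their own I/O" point concrete, here is a toy sketch (my own illustration, not any particular engine's code) of the core idea: the engine keeps its own buffer pool and eviction policy instead of trusting the OS page cache to guess what matters.

```python
from collections import OrderedDict

class BufferPool:
    """Toy LRU buffer pool: the engine, not the OS, decides what stays in RAM."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> page bytes, in LRU order

    def get(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)   # cache hit: mark most recently used
            return self.pages[page_id]
        data = read_from_disk(page_id)        # a real engine issues O_DIRECT/async I/O here
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)    # evict the least recently used page
        self.pages[page_id] = data
        return data
```

Real engines layer much more on top (dirty-page tracking, write scheduling, prefetch), but the eviction decision is the part the OS cache otherwise makes for you.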

- The three data retrieval models enumerated are textbook but if you were designing a new engine today you probably would not use any of them for a general purpose design. Modern spatial access methods (ex: Hyperdex, SpaceCurve, Laminar) are superior in almost every way, though the literature is much sparser. Also, a number of real-time analytical databases use bitmap-structured databases (ex: ParStream), which are incredibly fast for some types of query workloads.
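For what it's worth, the bitmap idea can be sketched in a few lines (a toy illustration of the general technique, not how ParStream actually works): keep one bitmap per distinct column value, and combine predicates with cheap bitwise AND/OR.

```python
from collections import defaultdict

class BitmapIndex:
    """Toy bitmap index: one bitmap (a Python int) per distinct value.
    Bit i of a value's bitmap is set iff row i holds that value."""

    def __init__(self):
        self.bitmaps = defaultdict(int)

    def insert(self, row_id, value):
        self.bitmaps[value] |= 1 << row_id

    def query(self, value):
        return self.bitmaps[value]

def rows(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]
```

Multi-predicate queries become single bitwise operations, which is why this shines for analytical scans; production systems use compressed bitmaps (e.g. roaring/WAH-style) rather than raw ints.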

- Many distributed database challenges are a side effect of "one server, one shard" type models. It is not necessary to do things this way, it is just simpler to implement; some distributed database systems have thousands of shards per server. The latter model is operationally more robust and better behaved under failure and load skew.
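A toy sketch of the many-shards-per-server model (shard count and server names are made up for illustration): keys hash to one of many small logical shards, and the shard-to-server map is the thing that moves under failure or load skew, so rebalancing never rehashes keys.

```python
import hashlib

N_SHARDS = 4096  # many small logical shards, far more than servers

def shard_of(key):
    # stable hash of the key -> logical shard id
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % N_SHARDS

def assign(shards, servers):
    """Spread shards round-robin across servers. Losing a server means
    reassigning only its shards; skew averages out over many small units."""
    return {s: servers[s % len(servers)] for s in shards}

shard_map = assign(range(N_SHARDS), ["db1", "db2", "db3"])

def server_for_key(key):
    return shard_map[shard_of(key)]
```

Contrast with one-server-one-shard: there, rebalancing means splitting a shard (expensive and disruptive), whereas here it is just updating a small routing table.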

- Tombstones are usually trivial if the database engine is properly designed. Complications are a side effect of poor architecture. The big challenge for tombstones is deciding when and how tombstoned records are garbage collected. It is outside the scope of the normal execution pathways but you also don't want a Java-like GC thread in the background.
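A minimal illustration of the garbage-collection decision (my own toy model, not any specific engine's design, and it assumes the compaction sees every run that could hold older versions of a key): during a merge of sorted runs, a tombstone can be dropped once its timestamp is older than the GC horizon.

```python
def compact(runs, gc_horizon):
    """Merge runs (ordered newest -> oldest) into one run.
    Records are (key, timestamp, value); value=None marks a tombstone.
    Tombstones older than gc_horizon are dropped, since this compaction
    consumed every older version they were shadowing."""
    latest = {}
    for run in runs:
        for key, ts, value in run:
            if key not in latest:          # first run seen is newest: it wins
                latest[key] = (ts, value)
    out = []
    for key, (ts, value) in sorted(latest.items()):
        if value is None and ts < gc_horizon:
            continue                       # tombstone is safe to garbage collect
        out.append((key, ts, value))
    return out
```

The hard part in practice is computing that horizon safely (open snapshots, replicas that haven't caught up), not the mechanics of dropping the record.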

Of course, any of these is a long blog post in itself. :-)




Oracle talks directly to NFS servers these days, bypassing the OS file access layer altogether: it just uses a socket, reading and writing from a huge chunk of memory that it grabs from the OS on startup and then manages itself.


Doesn't Postgres rely on OS caching? Would you say it's missing out on large performance gains based on that?


I believe that active data is stored twice (once in the Postgres buffer pool, and once in the FS cache). This is not ideal, because it is not making optimal use of RAM. To minimize this effect, PG recommends a relatively small buffer pool, which is not great if you believe the DB can do a better job than a generic OS.

I think this is quite fixable as well (I think I've even fixed it myself once) - just use O_DIRECT. Any PG maintainers able to tell me if PG still "double-buffers" and whether this patch would be useful if I could recreate it?
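For reference, the O_DIRECT idea looks roughly like this on Linux (a hedged sketch, not PostgreSQL's actual code; the path and fallback behavior are illustrative). The fiddly part is that direct I/O requires block-aligned, block-sized buffers, which is one reason it complicates an engine's buffer management.

```python
import mmap
import os
import tempfile

BLOCK = 4096  # direct I/O must use multiples of the logical block size

# anonymous mmap memory is page-aligned, which O_DIRECT requires
buf = mmap.mmap(-1, BLOCK)
buf.write(b"\x42" * BLOCK)

path = os.path.join(tempfile.mkdtemp(), "direct.dat")
try:
    # O_DIRECT is Linux-only; it bypasses the kernel page cache entirely
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
except (AttributeError, OSError):
    # non-Linux platform, or a filesystem (e.g. tmpfs) that rejects
    # O_DIRECT: fall back to buffered I/O so the sketch still runs
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
try:
    os.write(fd, buf)
finally:
    os.close(fd)
```

Note that with O_DIRECT the application also gives up kernel readahead and write coalescing, which is exactly the scheduling work the comments below discuss.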


It can be stored twice, but I don't think that's the ordinary case. Pages that are hot in PG's buffer cache are likely to stay there, making the same page in the OS buffer cache cold (because there aren't many requests for it). That's not always true, because writes to hot pages will end up going through the OS buffer cache maybe a couple times per checkpoint cycle, but it's still not (on average) stored twice. I don't have empirical numbers here, unfortunately, so someone else can correct me and fill in details.

At least from the discussions I've seen, there isn't a lot of interest in using O_DIRECT or otherwise taking the I/O scheduling problem into postgres. It's not particularly exciting to me, because:

* Takes on the I/O scheduling problem, rather than the simpler current method of just handing pages to the kernel for lazy writing and fsync'ing at checkpoint time.

* Requires a lot of new code, tuning, configuration, etc. with a high maintenance cost.

* Not portable, so it's easy to make a tweak that helps on kernel and HW configuration X but hurts on kernel and HW configuration Y.

* Not very strategic: it helps some workloads with a lot of real I/O by some small constant factor, which doesn't necessarily open up new use cases or market opportunities.

In my opinion, it's much better to focus on innovative features, or more low-hanging performance gains (which exist in postgres), or scale-out features. All of those will be slowed down if the code becomes bulkier (making correct patches harder to write, especially for new developers) and the maintainers become distracted by I/O scheduling issues.

It seems more like something to do when innovation slows enough, otherwise it doesn't seem worth it.


Fair enough. Thank you for the very thorough answer - and for saving me the effort of putting together the patch!


PostgreSQL also uses a shared buffer cache, which seems to be what he's talking about?


By "bitmap-structured" do you mean bitmap indexes?

Are there any open source DBs that schedule their own I/O? I guess PostgreSQL, InnoDB?

What is your opinion on TokuDB's fractal trees?



