
Phill, parsing has little to do with what goes on at query execution time. Parsing is typically done once for a given parameterized query (and MemSQL automatically parameterizes queries with constants). After parsing comes compilation, which produces a query plan. The query plan is cached, and it is what actually evaluates the query when it is re-submitted. We are discussing the differences in how MemSQL executes compiled queries. Executing compiled queries is what a (well tuned) database spends most of its time doing. It is the hot code path. And if you cut the instruction count in half you can expect latency to decrease proportionally and throughput to increase.
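
To make the distinction concrete, here is a rough sketch of the difference between interpreting a query tree and running compiled code. It is illustrative only, not MemSQL's actual generated C++:

    // Illustrative sketch, not MemSQL's generated code. The point is
    // instruction count: an interpreter pays a virtual call per expression
    // node per row, while a compiled plan evaluates the predicate inline.
    struct Row { int a; int b; };

    struct ExprNode {                        // interpreted plan: tree walk per row
        virtual bool eval(const Row& r) const = 0;
        virtual ~ExprNode() = default;
    };

    // compiled plan: the predicate "a > 10 AND b < 5" emitted as plain C++
    inline bool compiled_predicate(const Row& r) {
        return r.a > 10 && r.b < 5;
    }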

You are making all these skeptical claims that are technically wrong. I suggest you download MemSQL, compile a query and read the generated C++ code. Then download MySQL source and look at that too. At that point you will be coming to your own conclusions and I'll be happy to resume this debate.




I've actually written a compiler from database queries to native code once (JIT'ed bytecode, strictly speaking, but that doesn't make a difference here).

Some queries in databases are indeed CPU bound, but the blanket statement that well-optimized databases turn out to be CPU bound is not correct as stated. It all depends on the use case.

If you're thinking of full table scans with complicated processing on a database that's entirely in memory, then yes, you'll be CPU bound. But many (most?) databases are not entirely in memory, and there's no reason they need to be. With reasonable memory and indexing, most queries boil down to only a few I/Os/disk seeks, so you're looking at <50ms, which is entirely fine for your average web request.
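
Back-of-envelope, assuming a spinning disk at roughly 10 ms per seek (the exact numbers will vary with hardware and cache hit rate):

    3 seeks x ~10 ms/seek    ~= 30 ms of I/O
    + parsing/planning/CPU   ~=  a few ms
    ----------------------------------------
    comfortably under 50 ms per request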

That's the case for the majority of "get by ID" style queries, which in my experience really are the lion's share of queries, and in particular are also what needs to be fast. A complicated reporting query can take a couple of seconds, but your "main page/item view" must be fast.

If you have random reads on a large database that cannot be kept in memory, this will be the majority of your workload. Those queries will spend all their time waiting for the disk, and more or less none interpreting your query tree, so optimizing the query tree processing part makes no sense at all.

TL;DR: in my experience, optimizing query tree interpretation does not yield significant benefits for the majority of queries, and IMHO it won't be a meaningful differentiator for MemSQL outside of marketing.


@Nitramp: Fortunately for all of us, this world is full of unsolved problems, many of which exist outside of the consumer web.

There are use cases where 50ms is way too long, and where the jitter associated with disk I/O is not acceptable. Capital markets, for example. You are looking for sub-millisecond latency to be in the intraday post-trade game, and single-digit milliseconds for quote-to-trade. The number of instructions in the hot code path (and cycles per instruction) actually starts to matter. Lock-free structures used for storage are also important. This combination gives MemSQL some rather unique low-latency properties that certain folks can exploit to their delight.

To your other point, random reads and key-value lookups are where MemSQL shines. Since you never wait on I/O and don't take locks, optimizing execution makes a lot of sense actually. All that's left is execution and network.


> Since you never wait on I/O and don't take locks, optimizing execution makes a lot of sense actually. All that's left is execution and network.

I'm not sure where you are getting this information. According to the MemSQL documentation, they have one and only one isolation level, READ COMMITTED. But READ COMMITTED takes write locks that block reads, even though it takes no read locks. Blocking behaviour still occurs (and it should!).

It sounds like you are looking for dirty read behaviour, but from the literature MemSQL doesn't support READ UNCOMMITTED.

Update: Read a later FAQ on the MemSQL website. They are using MVCC, so write locks are irrelevant. My apologies to the MemSQL team.


Chris, MemSQL utilizes versioning to implement READ COMMITTED. In this implementation readers are never blocked.

It is addressed in this FAQ: http://developers.memsql.com/docs/1b/faq.html#c1-q4.
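
A minimal sketch of the idea, assuming a per-row version chain (a generic MVCC illustration, not MemSQL's actual implementation):

    struct Version {
        long     commit_ts;   // timestamp of the transaction that wrote this version
        Version* prev;        // next-older version, or nullptr
        // ... row payload ...
    };

    // A reader at snapshot read_ts walks back to the newest version it is
    // allowed to see. It takes no locks, so a writer installing a newer
    // version at the head of the chain never blocks it.
    const Version* visible(const Version* head, long read_ts) {
        for (const Version* v = head; v != nullptr; v = v->prev)
            if (v->commit_ts <= read_ts)
                return v;
        return nullptr;
    }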


OK, didn't see that you are using multiversion concurrency control.

I didn't realise you had a new FAQ up on the website - I'll have a look.


@Chris: Isolation levels only specify semantics; a DBMS is free to implement them in different ways. Take a look at this VLDB 2012 paper about optimistic concurrency control in main-memory databases like MemSQL, which also describes how isolation levels are implemented: http://arxiv.org/pdf/1201.0228v1.pdf


Ah. Sorry, didn't realise that you were using MVCC. That puts a different light on things.


Get-by-ID queries are the majority when the database is used as a key-value store, and the only reason for such underuse of RDBMSes is precisely that under heavy load some of them have a hard time doing proper queries, with joins, subqueries and all the stuff databases are designed for.


It's not "underusing" an RDBMs, those are just the lions share of queries in many applications. Also note that a join say from some parent item by ID to children who have a parent_id makes essentially two lookups by ID in two indices, so the same reasoning applies; this is not just SELECT * FROM bla WHERE id = :x.


I don't know how to tell you this, but query parsing is never the bottleneck in any database. You're just plain wrong.

Compiling queries to C++ is great for some BS bullet point on a website with BS claims of performance gains, but it's only a niche feature at the end of the day. There are more important things that should be getting tackled first: multi-node distribution, better fsync handling, etc.



