
Phill, parsing has little to do with what goes on at query execution time. Parsing is typically done once for a given parameterized query (and MemSQL automatically parameterizes queries with constants). After parsing comes compilation, which produces a query plan. The query plan is cached, and it is what actually evaluates the query when it is re-submitted. We are discussing the differences in how MemSQL executes compiled queries. Executing compiled queries is what a (well tuned) database spends most of its time doing. It is the hot code path. And if you cut the instruction count in half you can expect latency to decrease proportionally and throughput to increase.
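
To make the distinction concrete, here is a rough sketch of the difference between interpreting a query tree and running compiled code. It is illustrative only, not MemSQL's actual generated C++:

    // Illustrative sketch, not MemSQL's generated code. The point is
    // instruction count: an interpreter pays a virtual call per expression
    // node per row, while a compiled plan evaluates the predicate inline.
    struct Row { int a; int b; };

    struct ExprNode {                        // interpreted plan: tree walk per row
        virtual bool eval(const Row& r) const = 0;
        virtual ~ExprNode() = default;
    };

    // compiled plan: the predicate "a > 10 AND b < 5" emitted as plain C++
    inline bool compiled_predicate(const Row& r) {
        return r.a > 10 && r.b < 5;
    }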

You are making all these skeptical claims that are technically wrong. I suggest you download MemSQL, compile a query and read the generated C++ code. Then download MySQL source and look at that too. At that point you will be coming to your own conclusions and I'll be happy to resume this debate.




I've actually written a compiler from database queries to native code once (JIT'ed bytecode, strictly speaking, but that doesn't make a difference here).

Some queries in databases are indeed CPU bound, but the blanket statement that well-optimized databases turn out to be CPU bound is not correct as stated. It all depends on the use case.

If you're thinking of full table scans with complicated processing on a database that's entirely in memory, then yes, you'll be CPU bound. But many (most?) databases are not entirely in memory, and there's no reason they need to be. With reasonable memory and indexing, most queries boil down to only a few I/Os/disk seeks, so you're looking at <50ms, which is entirely fine for your average web request.
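
Back-of-envelope, assuming a spinning disk at roughly 10 ms per seek (the exact numbers will vary with hardware and cache hit rate):

    3 seeks x ~10 ms/seek    ~= 30 ms of I/O
    + parsing/planning/CPU   ~=  a few ms
    ----------------------------------------
    comfortably under 50 ms per request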

That's the case for the majority of "get by ID" style queries, which in my experience really are the lion's share of queries, and in particular are also what needs to be fast. A complicated reporting query can take a couple of seconds, but your "main page/item view" must be fast.

If you have random reads on a large database that cannot be kept in memory, this will be the majority of your workload. Those queries will spend all their time waiting for the disk, and more or less none interpreting your query tree, so optimizing the query tree processing part makes no sense at all.

TL;DR: in my experience, optimizing query tree interpretation does not yield significant benefits for the majority of queries, and IMHO it won't be a meaningful differentiator for MemSQL outside of marketing.


@Nitramp: Fortunately for all of us, this world is full of unsolved problems, many of which exist outside of the consumer web.

There are use cases where 50ms is way too long, and where the jitter associated with disk I/O is not acceptable. Capital markets, for example. You are looking for sub-millisecond latency to be in the intraday post-trade game, and single-digit milliseconds for quote-to-trade. The number of instructions in the hot code path (and cycles per instruction) actually starts to matter. Lock-free structures used for storage are also important. This combination gives MemSQL some rather unique low-latency properties that certain folks can exploit to their delight.

To your other point, random reads and key-value lookups are where MemSQL shines. Since you never wait on I/O and don't take locks, optimizing execution makes a lot of sense actually. All that's left is execution and network.


> Since you never wait on I/O and don't take locks, optimizing execution makes a lot of sense actually. All that's left is execution and network.

I'm not sure where you are getting this information. According to the MemSQL documentation, they have one and only one isolation level, READ COMMITTED. But READ COMMITTED takes write locks that block reads, even though it takes no read locks. Blocking behaviour still occurs (and it should!).

It sounds like you are looking for dirty read behaviour, but from the literature MemSQL doesn't support READ UNCOMMITTED.

Update: Read a later FAQ on the MemSQL website. They are using MVCC, so write locks are irrelevant. My apologies to the MemSQL team.


Chris, MemSQL utilizes versioning to implement READ COMMITTED. In this implementation readers are never blocked.

It is addressed in this FAQ: http://developers.memsql.com/docs/1b/faq.html#c1-q4.
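
A minimal sketch of the idea, assuming a per-row version chain (a generic MVCC illustration, not MemSQL's actual implementation):

    struct Version {
        long     commit_ts;   // timestamp of the transaction that wrote this version
        Version* prev;        // next-older version, or nullptr
        // ... row payload ...
    };

    // A reader at snapshot read_ts walks back to the newest version it is
    // allowed to see. It takes no locks, so a writer installing a newer
    // version at the head of the chain never blocks it.
    const Version* visible(const Version* head, long read_ts) {
        for (const Version* v = head; v != nullptr; v = v->prev)
            if (v->commit_ts <= read_ts)
                return v;
        return nullptr;
    }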


OK, didn't see that you are using multiversion concurrency control.

I didn't realise you had a new FAQ up on the website - I'll have a look.


@Chris: Isolation levels only specify semantics; a DBMS is free to implement them in different ways. Take a look at this VLDB 2012 paper about optimistic concurrency control in main-memory databases like MemSQL, which also describes how isolation levels are implemented: http://arxiv.org/pdf/1201.0228v1.pdf


Ah. Sorry, didn't realise that you were using MVCC. That puts a different light on things.


Get-by-ID queries are the majority when the database is used as a key-value store, and the only reason for such underuse of RDBMSes is precisely that under heavy load some of them have a hard time doing proper queries, with joins, subqueries and all the stuff databases are designed for.


It's not "underusing" an RDBMs, those are just the lions share of queries in many applications. Also note that a join say from some parent item by ID to children who have a parent_id makes essentially two lookups by ID in two indices, so the same reasoning applies; this is not just SELECT * FROM bla WHERE id = :x.


I don't know how to tell you this, but query parsing is never the bottleneck in any database. You're just plain wrong.

Compiling queries to C++ is great for some BS bullet point on a website with BS claims of performance gains, but it's only a niche feature at the end of the day. There are more important things that should be getting tackled first: multi-node distribution, better fsync handling, etc.



