Tim, if you want to talk substance, you have to peer deeper under the covers. And if you do that, then you just can't overlook the fact that the query execution model used by MemSQL is significantly different from the one used by old-school relational databases, including MySQL. What's different is that MemSQL translates your SQL query into extremely efficient C++ code - code that is compiled and executed natively. MySQL, SQL Server, Postgres, Oracle - all of these products instead evaluate queries by interpreting their respective tree representations of your SQL queries. So which database do you expect to be faster: the one that runs native code, or the ones that interpret?
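To make the difference concrete, here is a toy sketch (my own illustration, not MemSQL's actual generated code). The first function evaluates a WHERE predicate by walking an expression tree, which is roughly what a classic interpreter does per row; the second is the same predicate once it has been baked into straight-line native code:

    #include <memory>
    #include <cstdint>

    // Toy expression tree for a predicate like: col > constant
    struct Expr {
        enum Kind { COLUMN, CONSTANT, GREATER } kind;
        int64_t value = 0;                 // for CONSTANT
        int column = 0;                    // for COLUMN
        std::unique_ptr<Expr> left, right; // for GREATER
    };

    // Interpreted evaluation: every row pays for the tree walk,
    // the switch dispatch and the pointer chasing.
    int64_t eval(const Expr& e, const int64_t* row) {
        switch (e.kind) {
            case Expr::COLUMN:   return row[e.column];
            case Expr::CONSTANT: return e.value;
            case Expr::GREATER:  return eval(*e.left, row) > eval(*e.right, row);
        }
        return 0;
    }

    // Compiled equivalent: the same predicate as straight-line code,
    // which is roughly what a code-generating engine hands to GCC.
    bool compiled_predicate(const int64_t* row) {
        return row[0] > 5;
    }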
This is a huge differentiator we are talking about here. For someone who has been around databases for quite some time (I spent about five years working on the SQL Server engine), this is very exciting. Is it not?
>And what's different is that MemSQL translates your SQL query into extremely efficient C++ code - code that is compiled and executed natively. MySQL, SQL Server, Postgres, Oracle - all of these products instead evaluate queries by interpreting their respective tree representations of your SQL queries.
This sounds like absurd cargo culting. I've never designed a database, but parsing the SQL cannot ever have been the bottleneck; the fact that you rarely write to disk is where any speed improvement is to be found.
You are asserting that databases are always I/O bound and never CPU bound. Your assertion is wrong. A properly tuned database will become CPU bound (whether it's MemSQL or MySQL). At that point hyper-efficient execution results in higher throughput and lower latency.
> At that point hyper-efficient execution results in higher throughput and lower latency.
I don't get excited about 1% decreases in latency. I'm willing to bet money that the performance penalty of parsing SQL queries is asymptotically a constant - or at least a tiny fraction of the time spent computing the data set.
I feel especially confident in this because MemSQL never brags about the specific increase in speed. This can't be hard to measure.
Phill, parsing has little to do with what goes on at query execution time. Parsing is typically done once for a given parameterized query (and MemSQL automatically parameterizes queries with constants). After parsing comes compilation, which produces a query plan. The query plan is cached, and that is the thing actually used to evaluate the query when it's re-submitted. We are discussing the differences in how MemSQL executes compiled queries. Executing compiled queries is what a (well tuned) database spends most of its time doing. It is the hot code path. And if you cut the instruction count in half, you can expect latency to decrease proportionally and throughput to increase.
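Here is a minimal sketch of that lifecycle, with invented names (not MemSQL's code). The parameterized SQL text is the cache key, so parsing and compilation are paid once and every re-submission goes straight to the cached plan:

    #include <string>
    #include <unordered_map>
    #include <memory>

    // A compiled plan. In MemSQL's case this would reference natively
    // compiled code; here it is just an opaque placeholder.
    struct CompiledPlan {};

    class PlanCache {
        std::unordered_map<std::string, std::shared_ptr<CompiledPlan>> cache_;
    public:
        std::shared_ptr<CompiledPlan> get(const std::string& parameterizedSql) {
            auto it = cache_.find(parameterizedSql);
            if (it != cache_.end())
                return it->second;          // hot path: no parse, no compile
            auto plan = parseAndCompile(parameterizedSql); // slow path, paid once
            cache_.emplace(parameterizedSql, plan);
            return plan;
        }
    private:
        static std::shared_ptr<CompiledPlan> parseAndCompile(const std::string&) {
            return std::make_shared<CompiledPlan>(); // stand-in for the real work
        }
    };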
You are making all these skeptical claims that are technically wrong. I suggest you download MemSQL, compile a query and read the generated C++ code. Then download MySQL source and look at that too. At that point you will be coming to your own conclusions and I'll be happy to resume this debate.
I've actually written a compiler from database queries to native code once (JIT'ed bytecode, actually, but that doesn't make a difference here).
Some queries in databases are indeed CPU bound, but the general statement that well-optimized databases turn out to be CPU bound is not correct as stated. It all depends on the use case.
If you're thinking full table scans with complicated processing on a database that's entirely in memory, then yes, you'll be CPU bound. But many (most?) databases are not entirely in memory, and there's no reason they need to be. With reasonable memory and indexing, most queries boil down to only a few IOs/disk seeks, so you're looking at <50ms, which is entirely fine for your average web request.
That's the case for the majority of "get by ID" style queries, which in my experience really are the lion's share of queries, and in particular are also what needs to be fast. A complicated reporting query can take a couple of seconds, but your "main page/item view" must be fast.
If you have random reads on a large database that cannot be kept in memory, this will be the majority of your workload. Those queries will spend all their time waiting for the disk, and more or less none interpreting your query tree, so optimizing the query tree processing part makes no sense at all.
TL;DR, in my experience optimizing query tree interpretation does not yield significant benefits for the majority of queries, and IMHO won't be a significant differentiator for MemSQL outside of marketing.
@Nitramp: Fortunately for all of us, this world is full of unsolved problems, many of which exist outside of consumer web.
There are use cases where 50ms is way too long, and where the jitter associated with disk I/O is not acceptable. Capital markets, for example. You are looking for sub-millisecond latency to be in the intraday post-trade game, and single-digit milliseconds for quote-to-trade. The number of instructions in the hot code path (and cycles per instruction) actually starts to matter. Lock-free structures used for storage are also important. This combination gives MemSQL some rather unique low-latency properties that certain folks can exploit to their delight.
To your other point, random reads and key-value lookups is where MemSQL shines. Since you never wait on I/O and don't take locks, optimizing execution makes a lot of sense actually. All that's left is execution and network.
> Since you never wait on I/O and don't take locks, optimizing execution makes a lot of sense actually. All that's left is execution and network.
I'm not sure where you are getting this information. According to the MemSQL documentation, they have one and only one isolation level, READ COMMITTED. But READ COMMITTED takes write locks that block reads, even though it doesn't take any read locks. Blocking behaviour still occurs (and should!).
It sounds like you are looking for dirty read behaviour, but from the literature MemSQL doesn't support READ UNCOMMITTED.
Update: Read a later FAQ on the MemSQL website. They are using MVCC, so write locks are irrelevant. My apologies to the MemSQL team.
@Criss
Isolation levels specify semantics only. A DBMS is free to implement them in different ways.
Take a look at this VLDB 2012 paper about optimistic concurrency control in main-memory databases like MemSQL, which also describes the implementation of isolation levels:
http://arxiv.org/pdf/1201.0228v1.pdf
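For intuition, here is a toy illustration of why MVCC makes read locks unnecessary (timestamp-ordered version chains; the paper above describes real implementations, this is just my sketch). A writer prepends a new version; a reader simply walks the chain for the newest version visible to its snapshot, so it never blocks:

    #include <cstdint>
    #include <atomic>

    // One version of a row. Writers prepend new versions; readers
    // never block because older versions stay reachable.
    struct Version {
        uint64_t commitTs;  // commit timestamp of the writing transaction
        int64_t  value;
        Version* next;      // older version, or nullptr
    };

    struct Row {
        std::atomic<Version*> head{nullptr};
    };

    // Snapshot read: the newest version committed at or before my
    // snapshot. No read lock is taken; concurrent writers just add
    // newer versions to the front of the chain.
    const Version* readVisible(const Row& row, uint64_t snapshotTs) {
        for (const Version* v = row.head.load(std::memory_order_acquire);
             v != nullptr; v = v->next) {
            if (v->commitTs <= snapshotTs)
                return v;
        }
        return nullptr; // the row did not exist in this snapshot
    }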
Get-by-ID queries are the majority when the database is used as a key-value store, and the only reason for such underuse of RDBMSes is precisely that under heavy load some of them have a hard time doing proper queries, with joins, subqueries and all the stuff databases are designed for.
It's not "underusing" an RDBMs, those are just the lions share of queries in many applications. Also note that a join say from some parent item by ID to children who have a parent_id makes essentially two lookups by ID in two indices, so the same reasoning applies; this is not just SELECT * FROM bla WHERE id = :x.
I don't know how to tell you this, but query parsing is never the bottleneck in any database. You're just plain wrong.
Compiling queries to C++ is great for some BS bullet point on a website with BS claims of performance gains, but it's only a niche feature at the end of the day. There are more important things that should be getting tackled first: multi-node distribution, better fsync handling, etc.
It is not about the cost of parsing SQL text. Think of it as intermediate bytecode for SQL that is interpreted as the query is being executed. Every SQL database does some variation of it.
What is unique about MemSQL is that this bytecode is actually compiled to native code using GCC.
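For a flavor of what generated code might look like, here is my own illustration (not actual MemSQL output). A query like SELECT COUNT(*) FROM t WHERE id > @ can come out as one fully typed loop, with the parameter supplied at run time and no interpreter dispatch anywhere inside:

    #include <cstdint>
    #include <cstddef>

    // Hypothetical codegen output for: SELECT COUNT(*) FROM t WHERE id > @p0
    // The column lives in a plain in-memory array, so the whole query is a
    // tight loop that GCC is free to unroll and vectorize.
    extern "C" int64_t query_42(const int64_t* id_column, size_t row_count,
                                int64_t p0) {
        int64_t count = 0;
        for (size_t i = 0; i < row_count; ++i)
            count += (id_column[i] > p0);
        return count;
    }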
I agree with what you have to say, but I don't see how you can determine that this query:
SELECT * FROM table WHERE id > 5 LIMIT 10;
is actually of the same form as this query:
SELECT * FROM table WHERE id > 10 LIMIT 5;
without actually parsing the two. You will incur a parsing overhead either way. What I think MemSQL is trying to optimize away is the actual execution of the query: rather than interpreting it and performing checks in every iteration, they compile the code. I don't know how much of a difference this would make without actually seeing the numbers.
That's why prepared statements are preferable: without them, those two statements have to be parsed as unique queries, even though they are actually the same query with different parameters.
As far as I know that's the same on just about all databases.
At least some databases I've used have parsed out constants in the client library code and submitted prepared statements to the server, passing those constants in. That meant that the server got better use out of its query cache; on the server side, those are the same query.
Not sure which ones if any do this today, the optimisation might no longer be relevant - I spotted DBs doing this 10 years ago when CGIs that didn't bother with prepared statements were rife.
Not all parsing is equal. MySQL and other databases produce a tree as they parse a query. If you just want to match parameters, you can avoid memory allocations and just replace parameters with ^ or @ signs in place in the query string.
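Something like this sketch, for instance (simplified: it ignores string literals, comments, negative numbers and digits inside identifiers). One pass over the query text collapses each numeric literal into an '@', so two queries of the same shape produce the same cache key without a tree ever being built:

    #include <string>
    #include <cctype>

    // Collapse numeric literals to '@' so that
    // "... WHERE id > 5 LIMIT 10" and "... WHERE id > 10 LIMIT 5"
    // normalize to the same plan-cache key.
    void maskConstants(std::string& sql) {
        size_t i = 0;
        while (i < sql.size()) {
            if (std::isdigit(static_cast<unsigned char>(sql[i]))) {
                size_t start = i;
                while (i < sql.size() &&
                       std::isdigit(static_cast<unsigned char>(sql[i])))
                    ++i;
                sql.replace(start, i - start, 1, '@'); // whole literal -> '@'
                i = start + 1;
            } else {
                ++i;
            }
        }
    }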
Actually, it can have a significant impact but it's a corner case. It's a moot point though - as an example, SQL Server stores plans in a memory cache. [1]
edit: incidentally, one of the corner cases I can think of is joining 10 tables, which gives you 10! combinations for the query processor to work through. This is irrelevant in MemSQL, because you can only join to another two tables, and you cannot do right outer or full outer joins.
The original DB2 for VSE and VM (not DB2 for z-series/390) which was the productized version of System R did exactly this - compiled to native code. Subsequent DB2 implementations chose not to go down this path - even for single OS systems like DB2/390 and DB2/400.
In any event, I'm skeptical if this is going to make very much of a difference for the following reasons:
1. The time spent by the runtime/interpreter in "evaluating" the operators or expressions is really small relative to the heavy lifting of the actual relational operators to say nothing of I/O etc.
2. Most serious implementations don't really interpret the plan using a giant switch statement. For instance, with PostgreSQL, the interpreter puts in function pointers on first iteration. Thereafter it's only slightly worse than compiled native code.
Re 2.: you can get significant improvements when compiling to native code. You still have to dereference those function pointers, which is essentially the same as a big switch statement (if not worse, your function pointers might be anywhere in the heap, not in your cache line).
When you compile to native code, you can essentially save the dynamic dispatch on query parts, plus all the other benefits you get from compilation.
E.g. if you do good type analysis, you can also compile to much more efficient data representations, which will again have benefits for memory access. The compiler can also inline code, do loop lifting, and all those goodies.
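To make the dispatch cost concrete, here is a toy comparison (my sketch, not taken from any real engine). The first loop pays an indirect call per row, which is what function-pointer interpretation amounts to; the second is the same arithmetic after "compilation", i.e. inlined into the loop:

    #include <cstdint>
    #include <cstddef>

    // Interpreter-style: one indirect call per row. The compiler cannot
    // inline across the pointer, and the target may be cold in cache.
    using BinOp = int64_t (*)(int64_t, int64_t);
    int64_t add(int64_t a, int64_t b) { return a + b; }

    int64_t sum_dispatched(const int64_t* col, size_t n, BinOp op) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i)
            acc = op(acc, col[i]);   // indirect call per row
        return acc;
    }

    // Compiled-style: the operator is inlined, leaving a loop the
    // compiler can unroll and vectorize.
    int64_t sum_inlined(const int64_t* col, size_t n) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i)
            acc += col[i];
        return acc;
    }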
But overall I strongly agree with 1), any time spent doing execution is usually dwarfed by IO.
"And what's different is that MemSQL translates you SQL query into extremely efficient C++ code."
What does this mean? Does the SQL get converted into actual C++ source code, then compiled with an internal C++ compiler? That seems like a weird thing to do. Or is it translated directly into an AST? If so, I can't imagine other DBs not doing it. I don't understand the claim being made here.
Yes, they compile SQL statements into Linux binaries using a bundled GNU compiler toolchain.
I am also curious if there is some way the developers have demonstrated the benefits of this approach vs. what other systems do. Most of the CPU time in a simple query in VoltDB is spent in networking, validating and suffering the cache-miss pain of walking index data structures. I can't see clearly how compiling to native code makes any of that much faster.
The in-memory storage layer is very efficient. There are no buffer pool pages to manage, nor cache misses that result in I/O. That puts the accent on what goes on between the storage engine and the network - that is, on query execution. Tight loops in native code are a good way to cut down on the CPU cycles required to evaluate a query.
Actually, no. You speak as if database engines have to parse a query every single time it's run. That's not true - any reasonable database engine has a plan cache. Just because something is compiled into object code doesn't make it that impressive.
In fact, making a plan completely stable can cause problems if the data distribution changes dramatically. I asked about this problem in a previous article about MemSQL, and the answer was that the plan won't expire unless you drop an index or add a new one. [1]