Tim, if you want to talk substance, you have to peer deeper under the covers. And if you do that, then you just can't overlook the fact that the query execution model used by MemSQL is significantly different from the one used by old-school relational databases, including MySQL. What's different is that MemSQL translates your SQL query into extremely efficient C++ code - code that is compiled and executed natively. MySQL, SQL Server, Postgres, Oracle - all of these products instead evaluate queries by interpreting their respective tree representations of your SQL queries. So which database do you expect to be faster: the one that runs native code, or the ones that interpret?
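To make the difference concrete, here is a toy sketch (my own illustration, not MemSQL's actual generated code). The first function evaluates a WHERE predicate by walking an expression tree, which is roughly what a classic interpreter does per row; the second is the same predicate once it has been baked into straight-line native code:

    #include <memory>
    #include <cstdint>

    // Toy expression tree for a predicate like: col > constant
    struct Expr {
        enum Kind { COLUMN, CONSTANT, GREATER } kind;
        int64_t value = 0;                 // for CONSTANT
        int column = 0;                    // for COLUMN
        std::unique_ptr<Expr> left, right; // for GREATER
    };

    // Interpreted evaluation: every row pays for the tree walk,
    // the switch dispatch and the pointer chasing.
    int64_t eval(const Expr& e, const int64_t* row) {
        switch (e.kind) {
            case Expr::COLUMN:   return row[e.column];
            case Expr::CONSTANT: return e.value;
            case Expr::GREATER:  return eval(*e.left, row) > eval(*e.right, row);
        }
        return 0;
    }

    // Compiled equivalent: the same predicate as straight-line code,
    // which is roughly what a code-generating engine hands to GCC.
    bool compiled_predicate(const int64_t* row) {
        return row[0] > 5;
    }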
This is a huge differentiator we are talking about here. For someone who has been around databases for quite some time (I spent about five years working on the SQL Server engine), this is very exciting. Is it not?
>And what's different is that MemSQL translates your SQL query into extremely efficient C++ code - code that is compiled and executed natively. MySQL, SQL Server, Postgres, Oracle - all of these products instead evaluate queries by interpreting their respective tree representations of your SQL queries.
This sounds like absurd cargo culting. I've never designed a database, but parsing the SQL cannot ever have been the bottleneck; the fact that you rarely write to disk is where any speed improvement is to be found.
You are asserting that databases are always I/O bound and never CPU bound. Your assertion is wrong. A properly tuned database will become CPU bound (whether it's MemSQL or MySQL). At that point hyper-efficient execution results in higher throughput and lower latency.
> At that point hyper-efficient execution results in higher throughput and lower latency.
I don't get excited about 1% decreases in latency. I'm willing to bet money that the performance penalty of parsing SQL queries is asymptotically a constant - or at least a tiny fraction of the time spent computing the data set.
I feel especially confident in this because MemSQL never brags about the specific increase in speed. This can't be hard to measure.
Phill, parsing has little to do with what goes on at query execution time. Parsing is typically done once for a given parameterized query (and MemSQL automatically parameterizes queries with constants). After parsing comes compilation, which produces a query plan. The query plan is cached, and that is the thing actually used to evaluate the query when it's re-submitted. We are discussing the differences in how MemSQL executes compiled queries. Executing compiled queries is what a (well tuned) database spends most of its time doing. It is the hot code path. And if you cut the instruction count in half, you can expect latency to decrease proportionally and throughput to increase.
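Here is a minimal sketch of that lifecycle, with invented names (not MemSQL's code). The parameterized SQL text is the cache key, so parsing and compilation are paid once and every re-submission goes straight to the cached plan:

    #include <string>
    #include <unordered_map>
    #include <memory>

    // A compiled plan. In MemSQL's case this would reference natively
    // compiled code; here it is just an opaque placeholder.
    struct CompiledPlan {};

    class PlanCache {
        std::unordered_map<std::string, std::shared_ptr<CompiledPlan>> cache_;
    public:
        std::shared_ptr<CompiledPlan> get(const std::string& parameterizedSql) {
            auto it = cache_.find(parameterizedSql);
            if (it != cache_.end())
                return it->second;          // hot path: no parse, no compile
            auto plan = parseAndCompile(parameterizedSql); // slow path, paid once
            cache_.emplace(parameterizedSql, plan);
            return plan;
        }
    private:
        static std::shared_ptr<CompiledPlan> parseAndCompile(const std::string&) {
            return std::make_shared<CompiledPlan>(); // stand-in for the real work
        }
    };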
You are making all these skeptical claims that are technically wrong. I suggest you download MemSQL, compile a query and read the generated C++ code. Then download MySQL source and look at that too. At that point you will be coming to your own conclusions and I'll be happy to resume this debate.
I've actually written a compiler from database queries to native code once (JIT'ed bytecode, actually, but that doesn't make a difference here).
Some queries in databases are indeed CPU bound, but the general statement that well-optimized databases turn out to be CPU bound is not correct as stated. It all depends on the use case.
If you're thinking full table scans with complicated processing on a database that's entirely in memory, then yes, you'll be CPU bound. But many (most?) databases are not entirely in memory, and there's no reason they need to be. With reasonable memory and indexing, most queries boil down to only a few IOs/disk seeks, so you're looking at <50ms, which is entirely fine for your average web request.
That's the case for the majority of "get by ID" style queries, which in my experience really are the lion's share of queries, and in particular are also what needs to be fast. A complicated reporting query can take a couple of seconds, but your "main page/item view" must be fast.
If you have random reads on a large database that cannot be kept in memory, this will be the majority of your workload. Those queries will spend all their time waiting for the disk, and more or less none interpreting your query tree, so optimizing the query tree processing part makes no sense at all.
TL;DR, in my experience optimizing query tree interpretation does not yield significant benefits for the majority of queries, and IMHO won't be a significant differentiator for MemSQL outside of marketing.
@Nitramp: Fortunately for all of us, this world is full of unsolved problems, many of which exist outside of consumer web.
There are use cases where 50ms is way too long, and where the jitter associated with disk I/O is not acceptable. Capital markets, for example. You are looking for sub-millisecond latency to be in the intraday post-trade game, and single-digit milliseconds for quote-to-trade. The number of instructions in the hot code path (and cycles per instruction) actually starts to matter. Lock-free structures used for storage are also important. This combination gives MemSQL some rather unique low-latency properties that certain folks can exploit to their delight.
To your other point, random reads and key-value lookups is where MemSQL shines. Since you never wait on I/O and don't take locks, optimizing execution makes a lot of sense actually. All that's left is execution and network.
> Since you never wait on I/O and don't take locks, optimizing execution makes a lot of sense actually. All that's left is execution and network.
I'm not sure where you are getting this information. According to the MemSQL documentation, they have one and only one isolation level, READ COMMITTED. But READ COMMITTED takes write locks that block reads, even though it doesn't take any read locks. Blocking behaviour still occurs (and should!).
It sounds like you are looking for dirty read behaviour, but from the literature MemSQL doesn't support READ UNCOMMITTED.
Update: Read a later FAQ on the MemSQL website. They are using MVCC, so write locks are irrelevant. My apologies to the MemSQL team.
@Criss
Isolation levels specify semantics only. A DBMS is free to implement them in different ways.
Take a look at this VLDB 2012 paper about optimistic concurrency control in main-memory databases like MemSQL, which also describes the implementation of isolation levels:
http://arxiv.org/pdf/1201.0228v1.pdf
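For intuition, here is a toy illustration of why MVCC makes read locks unnecessary (timestamp-ordered version chains; the paper above describes real implementations, this is just my sketch). A writer prepends a new version; a reader simply walks the chain for the newest version visible to its snapshot, so it never blocks:

    #include <cstdint>
    #include <atomic>

    // One version of a row. Writers prepend new versions; readers
    // never block because older versions stay reachable.
    struct Version {
        uint64_t commitTs;  // commit timestamp of the writing transaction
        int64_t  value;
        Version* next;      // older version, or nullptr
    };

    struct Row {
        std::atomic<Version*> head{nullptr};
    };

    // Snapshot read: the newest version committed at or before my
    // snapshot. No read lock is taken; concurrent writers just add
    // newer versions to the front of the chain.
    const Version* readVisible(const Row& row, uint64_t snapshotTs) {
        for (const Version* v = row.head.load(std::memory_order_acquire);
             v != nullptr; v = v->next) {
            if (v->commitTs <= snapshotTs)
                return v;
        }
        return nullptr; // the row did not exist in this snapshot
    }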
Get-by-ID queries are the majority when the database is used as a key-value store, and the only reason for such underuse of RDBMSes is precisely that under heavy load some of them have a hard time doing proper queries, with joins, subqueries and all the stuff databases are designed for.
It's not "underusing" an RDBMs, those are just the lions share of queries in many applications. Also note that a join say from some parent item by ID to children who have a parent_id makes essentially two lookups by ID in two indices, so the same reasoning applies; this is not just SELECT * FROM bla WHERE id = :x.
I don't know how to tell you this, but query parsing is never the bottleneck in any database. You're just plain wrong.
Compiling queries to C++ is great for some BS bullet point on a website with BS claims of performance gains, but it's only a niche feature at the end of the day. There are more important things that should be getting tackled first: multi-node distribution, better fsync handling, etc.
It is not about the cost of parsing SQL text. Think of it as intermediate bytecode for SQL that is interpreted as the query is being executed. Every SQL database does some variation of it.
What is unique about MemSQL is that this bytecode is actually compiled to native code using GCC.
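For a flavor of what generated code might look like, here is my own illustration (not actual MemSQL output). A query like SELECT COUNT(*) FROM t WHERE id > @ can come out as one fully typed loop, with the parameter supplied at run time and no interpreter dispatch anywhere inside:

    #include <cstdint>
    #include <cstddef>

    // Hypothetical codegen output for: SELECT COUNT(*) FROM t WHERE id > @p0
    // The column lives in a plain in-memory array, so the whole query is a
    // tight loop that GCC is free to unroll and vectorize.
    extern "C" int64_t query_42(const int64_t* id_column, size_t row_count,
                                int64_t p0) {
        int64_t count = 0;
        for (size_t i = 0; i < row_count; ++i)
            count += (id_column[i] > p0);
        return count;
    }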
I agree with what you have to say, but I don't see how you can determine that this query:
SELECT * FROM table WHERE id > 5 LIMIT 10;
is actually of the same form as this query:
SELECT * FROM table WHERE id > 10 LIMIT 5;
without actually parsing the two. You will incur a parsing overhead either way. What I think MemSQL is trying to optimize away is the actual execution of the query: rather than interpreting it and performing checks in every iteration, they compile the code. I don't know how much of a difference this would make without actually seeing the numbers.
That's why prepared statements are preferable: without them, those two statements have to be parsed as unique queries, even though they are actually the same query with different parameters.
As far as I know that's the same on just about all databases.
At least some databases I've used have parsed out constants in the client library code and submitted prepared statements to the server, passing those constants in. That meant that the server got better use out of its query cache; on the server side, those are the same query.
Not sure which ones if any do this today, the optimisation might no longer be relevant - I spotted DBs doing this 10 years ago when CGIs that didn't bother with prepared statements were rife.
Not all parsing is equal. MySQL and other databases produce a tree as they parse a query. If you just want to match parameters, you can avoid memory allocations and just replace parameters with ^ or @ signs in place in the query string.
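Something like this sketch, for instance (simplified: it ignores string literals, comments, negative numbers and digits inside identifiers). One pass over the query text collapses each numeric literal into an '@', so two queries of the same shape produce the same cache key without a tree ever being built:

    #include <string>
    #include <cctype>

    // Collapse numeric literals to '@' so that
    // "... WHERE id > 5 LIMIT 10" and "... WHERE id > 10 LIMIT 5"
    // normalize to the same plan-cache key.
    void maskConstants(std::string& sql) {
        size_t i = 0;
        while (i < sql.size()) {
            if (std::isdigit(static_cast<unsigned char>(sql[i]))) {
                size_t start = i;
                while (i < sql.size() &&
                       std::isdigit(static_cast<unsigned char>(sql[i])))
                    ++i;
                sql.replace(start, i - start, 1, '@'); // whole literal -> '@'
                i = start + 1;
            } else {
                ++i;
            }
        }
    }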
Actually, it can have a significant impact but it's a corner case. It's a moot point though - as an example, SQL Server stores plans in a memory cache. [1]
edit: incidentally, one of the corner cases I can think of is joining 10 tables, which gives you 10! combinations for the query processor to work through. This is irrelevant in MemSQL, because you can only join to another two tables, and you cannot do right outer or full outer joins.
The original DB2 for VSE and VM (not DB2 for z-series/390) which was the productized version of System R did exactly this - compiled to native code. Subsequent DB2 implementations chose not to go down this path - even for single OS systems like DB2/390 and DB2/400.
In any event, I'm skeptical if this is going to make very much of a difference for the following reasons:
1. The time spent by the runtime/interpreter in "evaluating" the operators or expressions is really small relative to the heavy lifting of the actual relational operators to say nothing of I/O etc.
2. Most serious implementations don't really interpret the plan using a giant switch statement. For instance, with PostgreSQL, the interpreter puts in function pointers on first iteration. Thereafter it's only slightly worse than compiled native code.
Re 2.: you can get significant improvements when compiling to native code. You still have to dereference those function pointers, which is essentially the same as a big switch statement (if not worse, your function pointers might be anywhere in the heap, not in your cache line).
When you compile to native code, you can essentially save the dynamic dispatch on query parts, plus all the other benefits you get from compilation.
E.g. if you do good type analysis, you can also compile to much more efficient data representations, which will again have benefits for memory access. The compiler can also inline code, do loop lifting, and all those goodies.
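To make the dispatch cost concrete, here is a toy comparison (my sketch, not taken from any real engine). The first loop pays an indirect call per row, which is what function-pointer interpretation amounts to; the second is the same arithmetic after "compilation", i.e. inlined into the loop:

    #include <cstdint>
    #include <cstddef>

    // Interpreter-style: one indirect call per row. The compiler cannot
    // inline across the pointer, and the target may be cold in cache.
    using BinOp = int64_t (*)(int64_t, int64_t);
    int64_t add(int64_t a, int64_t b) { return a + b; }

    int64_t sum_dispatched(const int64_t* col, size_t n, BinOp op) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i)
            acc = op(acc, col[i]);   // indirect call per row
        return acc;
    }

    // Compiled-style: the operator is inlined, leaving a loop the
    // compiler can unroll and vectorize.
    int64_t sum_inlined(const int64_t* col, size_t n) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; ++i)
            acc += col[i];
        return acc;
    }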
But overall I strongly agree with 1), any time spent doing execution is usually dwarfed by IO.
"And what's different is that MemSQL translates you SQL query into extremely efficient C++ code."
What does this mean? Does the SQL get converted into actual C++ source code, then compiled with an internal C++ compiler? That seems like a weird thing to do. Or is it translated directly into an AST? If so, I can't imagine other DBs not doing it. I don't understand the claim being made here.
Yes, they compile SQL statements into Linux binaries using a bundled GNU compiler toolchain.
I am also curious if there is some way the developers have demonstrated the benefits of this approach vs. what other systems do. Most of the CPU time in a simple query in VoltDB is spent in networking, validating and suffering the cache-miss pain of walking index data structures. I can't see clearly how compiling to native code makes any of that much faster.
The in-memory storage layer is very efficient. There are no buffer pool pages to manage, nor cache misses that result in I/O. That puts the accent on what goes on between the storage engine and the network - that is, on query execution. Tight loops in native code are a good way to cut down on the CPU cycles required to evaluate a query.
Actually, no. You speak as if database engines have to parse a query every single time it's run. That's not true - any reasonable database engine has a plan cache. Just because something is compiled into object code doesn't make it that impressive.
In fact, making a plan completely stable can cause problems if the data distribution changes dramatically. I asked about this problem in a previous article about MemSQL, and the answer was that the plan won't expire unless you drop an index or add a new one. [1]