We've been using it for a large dataset, and it has been fantastic. Compared to MySQL 5.7 and PostgreSQL, it has the advantage of supporting the TokuDB engine out of the box. My data is 3TB uncompressed; with it, we can fit everything in 300GB with all indexes.
Read Free Replication with TokuDB (https://github.com/percona/tokudb-engine/wiki/Read-Free-Repl...) also enables us to use a very cheap VPS as a slave.
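For reference, the slave-side setup looks roughly like this (variable names are from Percona's TokuDB docs, so double-check them; the slave also needs row-based replication and should be read-only):

    -- sketch of Read Free Replication settings on the slave (per the linked wiki)
    SET GLOBAL read_only = ON;
    SET GLOBAL tokudb_rpl_lookup_rows = OFF;    -- skip the read-before-write lookup on the slave
    SET GLOBAL tokudb_rpl_unique_checks = OFF;  -- skip uniqueness checks while applying the binlog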
How can one, by tuning vanilla PostgreSQL (on a per-instance, per-database, or per-table basis), get the same kind of advantages (and trade-offs) that MySQL/MariaDB gets by switching from InnoDB to TokuDB?
TokuDB is built for a single-process model with threads for connections. This is incompatible with PostgreSQL's multi-process model, and patching TokuDB to support such a model would be a large effort. Not impossible, but a lot of work.
That compression ratio is one I'd love to have. PG's built-in data compression is limited to large values in a single row - if you have a hundred billion 100-byte tuples, TOAST isn't going to help.
The closest thing I know of to achieve something comparable with PG would be to use a data directory mounted on something with transparent filesystem-level compression (meaning, practically speaking, ZFS). This gives you the less-than-ideal choice of a not-mainline-linux filesystem (for your database's data directory, which is worth being nervous about) or running an OpenSolaris descendant, which is a big departure for plenty of people who have only ever run production dbs on Linux servers.
On a tangent: I think it's rather sad that many nerds aren't prepared to try a different UNIX when we're talking about the lifeblood of their business: data.
ZFS is the only filesystem it's reasonable to trust critical data to right now (I can't think of any other OSS self-validating Merkle-tree filesystem that has been hammered in production for nearly a decade...), yet somehow minor differences between Unices trump that.
I'm going to make a bold claim I know will stir the nest, but I feel confident in making given all the bad shit most filesystems miss: if you're not running your OSS RDBMS on ZFS right now, and you don't have compelling specialized needs to explain why not, you shouldn't be let near a production DB due to plain negligence and/or incompetence.
It depends on how you slice it. If your expertise in this system is less than in your others, it could cause real problems down the line. If a mistake were made in the setup, maintenance, or recovery of the system because it's different, it would be just as easy to say it was negligent to use it instead of something that was well understood by your team. Trade-offs like this are made all the time; to claim negligence is to ignore the complexities of the situation.
This. I'm more than willing to try different UNIXes, and I have. But when we're talking about running a business-critical database, it's a hard requirement that both myself and my teammates are able to maintain, keep patched, and generally competently administer the system. Trading an OS that my team has decades of combined experience running in production for one where we collectively have zero production experience is a non-starter.
A good DBMS includes a bunch of integrity-preserving measures to deal with data corruption. No reason not to run your DBMS on ZFS as far as I'm aware (edit: at least if the workload does not trigger excessive fragmentation), but I'd reserve the word 'negligence' for, say, not using an appropriate backup solution.
Something funny about this: where are the OSS RDBMSs which do not rely on a general purpose filesystem? Because in the commercial RDBMS space that is a thing.
I've used PostgreSQL with BTRFS + LZO compression on Linux. After I set it up, our large queries (2 minutes or longer runtime) saw a speedup of 5-10x and our TPS went up by around 25%.
BTRFS has been in the stable kernel for over two years, so it might be worth checking out!
Mechanical or flash storage? Compression could give an advantage on mechanical disks due to the reduced time spent seeking, but I'm curious how it performs on SSDs, for which seek time is reduced to a minimum if not zero.
TokuDB is a very modern engine that uses a data structure called "Fractal Trees". Imagine something like a B-tree with buffers in the non-leaf nodes. It also heavily compresses the row data.
The main point of TokuDB is that, thanks to the buffers, it doesn't have to write everything to disk right away, and with the very efficient compression more data fits in memory and less disk space is used for big-data applications. (Please someone correct me if I'm wrong.)
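To give an idea of how the compression is picked in practice, here is a minimal sketch of creating a compressed TokuDB table (table and column names are made up; the ROW_FORMAT values are the ones TokuDB documents, with zlib being the default):

    CREATE TABLE events (
      id         BIGINT NOT NULL AUTO_INCREMENT,
      payload    VARBINARY(255),
      created_at DATETIME NOT NULL,
      PRIMARY KEY (id),
      KEY (created_at)
    ) ENGINE=TokuDB ROW_FORMAT=TOKUDB_LZMA;  -- heaviest compression; TOKUDB_ZLIB is the default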
In our tests TokuDB was vital for our startup. With it, we can use cheap dedicated servers and our performance is amazing. We tried PostgreSQL, MongoDB 3.0 and MySQL 5.7, and they couldn't fit our data on a 2TB disk or were too slow in our tests.
The GP is slightly incorrect. The buffers are on disk; the benefit comes because the buffers are laid out in transaction order, so they can be written sequentially under any workload; whereas a standard B-tree is laid out in sorted order, so can only be written sequentially under sequential workloads. The benefit is similar to that of log-structured databases.
The log is always written; what's not written immediately is the changes to the tree (changes go into the buffers first and then trickle down when the buffers fill up).
This is the critical insight. TokuDB uses a write-ahead log which is synced according to the configuration, and can be made as immediate as full fsync on commit. This provides the strongest durability available on a single machine.
Where TokuDB gets its speed boost is by delaying the random reads associated with updating the indexing structure (the Fractal Tree). The buffers are written to disk on checkpoint, but because they're buffers, the potentially random writes are localized to a smaller number of nodes high in the tree, which minimizes the number of disk seeks required. Since sequential I/O is cheaper than random, the sequential writes to the write-ahead log are very fast, so even in very strict durability configurations, TokuDB can easily outperform databases which use random writes to update the indexing structures, such as the B-trees used by InnoDB and most other RDBMSes.
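Concretely, that durability/throughput trade-off is exposed as a couple of server variables, roughly like this (names as I recall them from Percona's TokuDB docs, so treat this as a sketch):

    SET GLOBAL tokudb_commit_sync = ON;           -- fsync the recovery log on every commit (strictest)
    -- or group commits and sync the log periodically instead:
    -- SET GLOBAL tokudb_commit_sync = OFF;
    -- SET GLOBAL tokudb_fsync_log_period = 1000; -- sync the log roughly every 1000 ms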
I don't have a good understanding of TokuDB, but I'm not sure the buffers are necessarily in memory. The problem the buffers solve is the cost of rebuilding the index when you insert or delete data -- buffers allow you to delay the rebuild so as to batch many updates together. Even if the buffers were in memory, you'd still have the old (inconsistent) index on disk (presumably the index is marked inconsistent until the writes in the buffers are applied). You just need to rebuild it to match the new data.
Hasn't delaying index updates until more operations accumulate been standard for all DBs for 20 or more years? Without it, write operations would be painfully slow.
It's fine for people who use RDBMSs for data processing, like loading daily log data into tables and running one or more massive queries to yield averages, etc. etc.
It's fine for regular loads, too, so long as they do regular saves to disk and have a high availability setup. Plenty of setups out there using in-memory, checkpointed DB's and RAM drives on reliable hardware without serious problems.
Tokutek and some of its staff were common visitors to the Boston MySQL meetup. If I remember correctly, they were using "fractal trees" instead of B-trees, trading CPU cycles for disk I/O. It was interesting tech.
Quote:"The idea behind FT(fractal Trees) indexes is to maintain a B tree in which each internal node of the tree contains a buffer. When a data record is inserted into the tree, instead of traversing the entire tree the way a B tree would, we simply insert the Eventually the root buffer will fill up with new data records. At that point the FT index copies the inserted records down a level of the tree. Eventually the newly inserted records will reach the leaves, at which point they are simply stored in a leaf node as a B tree would store them. The data records descending through the buffers of the tree can be thought of as messages that say “insert this record”. FT indexes can use other kinds of messages, such as messages that delete a record, or messages that update a record."
Are there any comparisons between fractal trees and the database cracking/adaptive indices advocated by [1]? It seems like they're going after the same use pattern but cracking is relatively simpler and even faster for inserts.
Cracking is kind of similar in that it delays the "sorting work" done in the indexing structure. However, cracking is fairly heuristic and therefore hard to analyze without an intimate understanding of the workload. Fractal Trees do pretty much the same thing under all workloads (modulo optimizations for things like sequential and Zipfian workloads), so they're easier to analyze and prove bounds for.
An interesting new development is http://getkudu.io/ which applies some ideas common with Fractal Trees and other write-optimized data structures (like LSM-trees) to column stores.
At Tokutek, we had some designs for how to implement column store-like data models on Fractal Trees (we called them "matrix stores" because they had some of the benefits of both row stores and column stores), but we didn't get around to implementing them.
My favourite: TokuDB has online DDL, which is a welcome change for the MySQL ecosystem.
I am pretty sure the only other engines that support online DDL are Sybase and DB2.
Online DDL generally means lock free DDL operations.
Not even Oracle, MSSQL, or PostgreSQL do online DDL. But at least they have transactional DDL, even if it isn't entirely lock-free all the time.
Compared to InnoDB/XtraDB, TokuDB lacks foreign key support. Foreign keys were worked into InnoDB itself years ago because the MySQL server layer doesn't support any kind of constraints. I am hoping for MariaDB to start implementing more modern SQL features at the server layer, which would then be available in all supported storage engines.
That's not really online, it still rebuilds the entire table, it just does it in the background and shows the results once it's done. This is pretty similar to how TokuDB hot indexing works, but TokuDB's hot column add/rename/delete is truly online, the results are visible immediately and the actual disk work is done in a lazy fashion.
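As an illustration, a hot column add is just a normal ALTER (table and column names here are made up); the statement returns almost immediately and the physical rewrite happens lazily:

    ALTER TABLE big_table ADD COLUMN notes VARCHAR(255) NULL;  -- returns quickly, work is deferred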
For our workload (discussion forum, custom engine) the move to TokuDB was so effective that we ended up completely eliminating a number of front-end caching stages, because rendering the page from scratch every time ended up being very nearly as fast as pulling from the cache.
(The hot path was a single well-optimised query that retrieved all content for the page. The query didn't need to change.)
To be more precise, the cache was still faster than building on the fly, but given the quantity of content we were dealing with, the cache was a resource hog. Were it not for TokuDB our next move probably would have been to deploy a dedicated cache server with huge amounts of RAM.
As it stands, this content is now being handled by a single (and quite unremarkable) MariaDB slave and assembled on the fly like any other web page.
For anyone dealing with gigabytes of text content, I highly recommend giving it a go. You don't even need to screw with your master, just change the table type on the slave.
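Something along these lines on the slave is all it takes (table name is made up; the default zlib compression already goes a long way):

    ALTER TABLE posts ENGINE=TokuDB;  -- rebuilds the table under the TokuDB engine on the slave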
It has very high insert speeds, and amazing compression (up to 500% in my tests). It's very powerful if you're inserting terabytes of data every day on commodity hardware.
This is sorta misleading. A quote from the article, "The changes in the MariaDB source code leading to those impressive performance improvements are part of porting MariaDB to Power8."
Pretty frustrating. I was pretty excited since MariaDB is a drop in replacement for MySQL (more or less). But you have to use it on a totally different architecture which can't be justified without a lot of deliberation.
From my reading of the article, a lot of the performance improvements come from getting rid of the mutex in front of their hash-based metadata cache by using a lock-free hashing algo. These changes originally came from Oracle's MySQL 5.7 work[1]. Looks like MariaDB merged the lock-free hashing stuff into their version and applied it to the table definition hash, which also suffered from lock contention.
These are changes that all architectures can benefit from.
It's also a 2-socket, 20-core, 8-threads-per-core = 160-hardware-thread system. And it's not general queries, but just "point selects" from 20 tables of 1 million rows combined.
That's far from typical hardware to deploy MariaDB/MySQL on, and far from a realistic workload.
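For context, a "point select" here is the simplest possible primary-key lookup, something like this sysbench-style query (table and column names assumed):

    SELECT c FROM sbtest1 WHERE id = 4242;  -- single-row lookup by primary key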
Yeah, I want to drool but utilizing that many threads in "practical" scenarios seems difficult. It would probably be great for messaging systems though!
Not just that, but database performance drops quickly with writes.
1M transactions per second with 100% reads might as well mean 10K transactions per second with 95% reads / 5% writes ("the usual OLTP workload" per common wisdom) and 1K transactions per second at 50% reads / 50% writes. Or even less. Or much more. It's difficult to even guess.
"The data set is 1 million rows in 20 tables. Fewer tables can be used, but below 4 tables the performance drops somewhat due to a hot spot in the table definition cache."
Since version 10.1, MariaDB has merged the "classic" server with the Galera server. Galera promises easy-to-do scaling (see [1], [2] for details), which is nice to have.
Please read the linked info and decide for yourself if this feature sounds tempting enough to you.
Isn't scaling usually done by sharding? Whereas in this case, Galera is just replication (master-master sync), and you can only scale read access.
The TokuDB implementation of Fractal Trees uses a single-process, multi-threaded model, which is incompatible with PostgreSQL's multi-process model. In theory, one could patch TokuDB to be suitable, but this is a ton of work.
Definitely no foreign keys according to another post in this thread. Which is not something I want to go back to, however much I'd like this level of compression.