Does the presence of an HDD or SSD across the three architectures make a difference, or is this purely a CPU problem? I see the testing made sure to fit all tables in memory with a large enough InnoDB buffer pool, so was storage even a factor?
These specific tests were read-only with the working set fitting in memory, so SSD vs HDD doesn’t matter; they were CPU-bound tests meant to highlight the performance improvements, and storage isn’t a factor here. If the working set didn’t fit in memory, faster storage would help make the workload more CPU bound, so Dynimizer would make a bigger impact with SSD than with HDD.
It looks like it's taking measurements from a live process, so probably not -march=native. But most likely the runtime profile-guided optimisation would do a similar job.
Yes, it’s definitely being used in production. We’re starting to collect production use cases and will provide some on our website soon. Here’s an example of a website with growing traffic using MariaDB + Dynimizer with WordPress; they found Dynimizer very helpful: https://www.cgmagonline.com
In terms of inaccurate reads or corrupted writes, that would be a bug if it ever happens; it would not be part of normal operation and would not be expected. That said, all software, including MySQL, gcc, and Linux, is full of bugs, and Dynimizer is of course not immune to that. However, it has been stress tested thoroughly with MySQL, MariaDB, and Percona Server, up to MySQL 5.7 and MariaDB 10.2.
Very, very cool. Section 7 of the manual[0] gives some hints on how this black magic works:
> 7. Workload Requirements
> To obtain benefit from the current version of Dynimizer, all of the following workload conditions must be met:
> A small number of CPU intensive processes - On a given OS host where the workload is running, the workload must be comprised of one or a few CPU intensive processes. Optimizing a large number of processes at once is not recommended.
> Long running programs - The processes being optimized have long lifetimes, and their workloads are long running in order to amortize the warmup time associated with optimization.
> x86-64 - Optimized processes must be 64-bit, derived from x86-64 executables and shared libraries, which must comply with the x86-64 ABI and ELF-64 formats. Most statically compiled applications on Linux meet this requirement.
> Dynamically Linked - Target processes must be dynamically linked to their shared libraries. Statically linked processes are not yet supported. Most Linux programs are dynamically linked.
> No self modifying code - The target application must not be running its own Just-In-Time compiler such as those found in Java virtual machines. This therefore excludes Java Applications.
> Front-end CPU stalls - The workload wastes a lot of time in CPU instruction cache misses, instruction TLB misses, and to a lesser extent branch mispredictions.
> User mode execution - Much of that wasted time is spent in user mode execution (as opposed to kernel mode), as Dynimizer only optimizes user mode machine code.
> Because of these requirements, Dynimizer takes a whitelist approach when determining if programs are allowed to be optimized, with MySQL and its variants being the currently supported optimization targets on that list for this early beta release. Other programs are not currently supported, and while they can be used with Dynimizer, they should be very thoroughly tested by the user or system administrator before being deployed in a production environment.
> Future versions of Dynimizer may eliminate many of these workload requirements, broadening the variety of applicable scenarios as well as further increasing the performance delivered in previously beneficial cases.
The real important bits: "Front-end CPU stalls - The workload wastes a lot of time in CPU instruction cache misses, instruction TLB misses, and to a lesser extent branch mispredictions".
My educated guess is that it relocates the hot path of the text segment to better pack into the instruction cache. Cool.
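If that guess is right, it's essentially a runtime version of the hot/cold code layout a compiler can do statically when you annotate the source. A toy sketch of the static equivalent using GCC's hot/cold attributes (my own example, nothing to do with Dynimizer's internals):

    #include <stdio.h>

    /* Rarely taken error path: GCC places "cold" functions in a separate
       .text.unlikely section, away from the hot code. */
    __attribute__((cold, noinline))
    static void report_error(int code) {
        fprintf(stderr, "error %d\n", code);
    }

    /* Hot path: kept small and contiguous so it stays resident in the
       instruction cache while the loop is running. */
    __attribute__((hot))
    static long accumulate(const long *v, long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            if (__builtin_expect(v[i] < 0, 0)) {  /* hint: almost never taken */
                report_error(-1);
                continue;
            }
            sum += v[i];
        }
        return sum;
    }

    int main(void) {
        long data[4] = {1, 2, 3, 4};
        printf("%ld\n", accumulate(data, 4));
        return 0;
    }

A runtime optimizer can presumably make the same kind of layout decision from live perf samples instead of source annotations, which is why the front-end stall requirement matters so much.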
I wonder if similar techniques can be applied to PC games. Especially for older ones, considering they use fewer threads and certain CPU features were not available at the time.
Of course these projects were a major source of inspiration for Dynimizer. However they are not JIT compilers. They are more like virtual machines or binary translators. Today DynamoRIO and Mojo (which ended up as Intel PIN) are used for program introspection and analysis, not for application acceleration.
"Dynamic binary translation" is the term of art. Which of course VMWare and VirtualPC were doing 20 years ago in dynamically translating x86 ring-0 code to ring-3 code.
Dynimizer is translating x86-64 to faster x86-64, but the concept is the same.
DynamoRIO was actually talked about for application acceleration. There was at least a PoC that did dynamic function inlining.
Their product seems simpler to install and directed at specific software and workloads. It's really amazing to get 10% more TPS by just running a background process.
Maybe they should present themselves as "fire and forget TPS optimizer"
I would be very interested to hear a sampling of the war stories that came out of building this. I had a friend working on the IBM zPDT JIT at one point, and while I unfortunately can't remember many of the details at the moment, I remember boggling (in that sort of emergently satisfying way) at some of the 'oh shit' moments that came up.
Fair enough, this is valuable feedback. We have provided a more secure method on our home page and will provide package-manager downloads as well; however, the reality is that nothing is 100% secure. In the meantime, you can just download the script and inspect it (it’s pretty simple) and then do it all manually if you prefer.
I agree, it may be cliché, but I think the exact same thing whenever I see this kind of practice, too. Or that the developer has never had to manage a live system with users who know his phone number and his boss's phone number.
"Oh, this is just for a test mock up. Nobody is supposed to actually use this to install it for real."
Well, to experienced people it makes you look moderately stupid, and to inexperienced people it looks like an elegant solution. It's actively hostile to secure system planning.
It reminds me of the NPM left-pad debacle[0] and some of the criticism[1] that came up from that.
I’ve given up waiting for Node.js to become a reliable environment. Just recently the `is-even` package came to light and highlighted that things aren’t getting any better than when left-pad was a thing.
I can’t wait to see tc39’s response to the `is-even` shit show after they decided to just add leftpad to the stdlib.
The “best” part of it all is that apparently JS engines have an internal optimisation for `foo % 2 === 0`, because it’s such a common thing.
This clown was using a bitwise operation in `is-even` “because everyone already knows about `% 2 === 0`”, and thus was hurting performance (on top of whatever extra memory the module uses, the function call overhead, etc.).
Just pasting commands into a terminal is pretty insecure now too. There are proof-of-concepts showing that some control characters and other invisible characters will make it to the clipboard, and even someone pasting into a text editor won't see them.
Between that and delivering different responses to curl|sh vs a browser or regular curl [1] you’d think this kind of bullshittery would be abandoned, but no.
This looks like seriously impressive technology, and yields impressive results, but: I don't think I'd be comfortable with the idea of something rewriting my database in production. What I -would- probably be OK with is having Dynimizer analyze my workload in a staging/load-test environment, and then produce a new binary for me which I could then run through its paces.
Beyond actual errors being produced, I'm wondering what'd happen in weird scenarios, such as one where my primary gets heavily optimized for its write load and creeps up to, say, 80% CPU at peak. What happens then if my replica, which has been heavily optimized for its read load, gets promoted in a failure scenario and gets pegged at high CPU?
Final thought here is if this tech really is solid, when is AWS going to start shipping it with my VM?
This does not rewrite your database. It optimizes the live in-memory machine code of the mysqld (MySQL Server) process, and it must run on the same OS host as the process being optimized. So if you are using this on the master and not on the replica, the replica won’t be touched. Hope that makes sense.
I think the point was that profile guided optimization relies on the workload staying relatively fixed. If the workload suddenly changes (like promoting a read-only slave to be the writable master), the assumptions made by PGO may not be valid, and performance could be worse than if no modifications were made in the first place.
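For anyone who hasn't used static PGO, the classic gcc loop looks roughly like the sketch below (a generic toy example, not tied to MySQL). The point is that the profile is baked in at build time, so if production traffic later looks nothing like the training run, the layout and branch hints can be stale:

    /* Classic gcc PGO workflow (generic sketch):
     *
     *   gcc -O2 -fprofile-generate hot_loop.c -o hot_loop   (instrumented build)
     *   ./hot_loop < training_workload.txt                  (run a representative workload)
     *   gcc -O2 -fprofile-use hot_loop.c -o hot_loop        (rebuild using the recorded profile)
     *
     * The second build lays out code and predicts branches according to
     * whatever the training run did; a very different production workload
     * gets a stale layout.
     */
    #include <stdio.h>

    int main(void) {
        long sum = 0;
        int x;
        /* Branchy loop whose bias the profile run records. */
        while (scanf("%d", &x) == 1) {
            if (x > 0)
                sum += x;
            else
                sum -= x;
        }
        printf("%ld\n", sum);
        return 0;
    }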
I think you'd just have to measure scenarios before using it in production.
grogers explained my intent well, especially around the promotion of a replica to a primary workload (thank you). I think we're arguing semantics between optimizing machine code and rewriting mysqld. I want to emphasize that Dynimizer looks super awesome, and I do intend to try it, so no offense was intended. Well done!
Thanks! Sorry for terse tone... just responding quickly from a smartphone while travelling :-) Your concerns are definitely valid and we are working on improving the situation.
Dynimizer can detect a drastic workload change as you described and reoptimize in response. That’s the default operation which can be turned off. What has been observed is that if we optimized for say a write-heavy workload and then change it to read-only without reoptimizing (or vice versa), it will still show an improvement, just not as much. Hope that makes sense.
For a future update we will optionally cache optimizations to forgo the optimization/warmup period. You can then move those cached optimizations from test to production to reduce uncertainty. Coming soon. There are many advantages there over PGO, which is very difficult to use.
> ...It profiles applications using the Linux perf_events subsystem and interfaces with a target application's machine code through the Linux ptrace system call. When optimizing a program, it loads a code cache into the target program's address space...
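Those are all standard Linux primitives. Just to illustrate the attach-and-inspect half of that sentence, here's a stripped-down ptrace sketch (my own illustration, not Dynimizer's code; the pid and address are placeholders you'd get from perf sampling):

    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>
    #include <sys/types.h>

    int main(int argc, char **argv) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <pid> <hex-addr>\n", argv[0]);
            return 1;
        }
        pid_t pid = (pid_t)atoi(argv[1]);
        unsigned long addr = strtoul(argv[2], NULL, 16);

        /* Attach to the target; it is stopped while we poke around. */
        if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1) {
            perror("PTRACE_ATTACH");
            return 1;
        }
        waitpid(pid, NULL, 0);

        /* Read one word of the target's text segment (e.g. the entry of a
           hot function found via perf samples). A real optimizer would map
           a code cache into the target and patch jumps into it; this only reads. */
        errno = 0;
        long word = ptrace(PTRACE_PEEKTEXT, pid, (void *)addr, NULL);
        if (word == -1 && errno != 0)
            perror("PTRACE_PEEKTEXT");
        else
            printf("code at %#lx: %#lx\n", addr, (unsigned long)word);

        /* Resume the target. */
        ptrace(PTRACE_DETACH, pid, NULL, NULL);
        return 0;
    }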
@davidyeager: In the legend of the graph in Dynimizer System Overhead in https://dynimize.com/product, both series are labelled as "Without Dynimizer".
Couple of questions. Since this seems to be a very general technology, why the emphasis on MySQL (and DBs in general)? Marketing? Also, I found Dynimize vs Dynimizer confusing - is that company name vs product name?
Yes we will correct the legend in that graph, thanks for reporting it.
Dynimize is the company, Dynimizer the product. We may ditch the name Dynimizer and just go with Dynimize to avoid confusion. Thoughts?
It is a general purpose approach to optimization, and MySQL is just a starting point. It was chosen first because it has a broad user base and is relatively easy to support compared to many other Linux programs: it has a single-process architecture and long process lifetimes, it's statically compiled, and OLTP workloads are known to spend much of their CPU time in front-end stalls, which are effectively targeted with profile-guided compiler optimizations. We've tried it on MongoDB and seen similar benefit, but that's not supported yet. Coming soon. Windows will probably require some driver development for effective sample-based profiling and will happen later on. We will improve the effectiveness of our other optimizations that don't target front-end CPU stalls and better support multiprocess workloads with short process lifetimes, which will allow us to target many other types of programs in the future.
It isn't any different than running your production in a container, a VM, or the cloud, all of which can significantly affect what's actually going on.
Works with VMs or on a VPS. We have done a lot of testing on KVM, Xen, and a bit on VMware. Still need to do a bit of work to properly support containers.
I wonder, do the authors have a reverse engineering background? It seems like a lot of concepts from reverse engineering were applied to the JIT compiler, which I find incredibly cool.
The main infographic shows an impressive improvement in TPS, but what about single query execution time? I often see the CPU maxed out by complex queries against large tables.
Heyo, workloads that fit in RAM are not that interesting. I hope they have identified their market ahead of time, for their sake, because I don't think it's valuable.
That was used to highlight the maximum improvement expected. When the working set doesn’t fit into RAM then you will get some combination of a smaller amount of performance improvement plus a reduction in CPU usage. The faster the storage, the greater the increase in tps that you’ll see. Note that replication is often CPU bound. We will be applying this to non-database workloads as well in the near future.
https://dynimize.com/blog/discussions/dynimizer-mysql-cross-...
It seems to me the only system with SSDs (Ivy Bridge) sees higher transactions per second improvement than the other systems.
https://dynimize.com/performanceSpeedup