A recent example illustrates the power of this approach:
Cloudflare's WAF (web application firewall) basically generates
Lua code for the (highly non-linear) maze of firewall rules. An
incoming attack triggers certain rules and the corresponding paths
are turned into linearized traces. These can be heavily optimized
by LuaJIT, much more so than you could ever hope to do with a
static compiler.
When a different attack wave is started, it'll spawn different
traces -- i.e. the generated code dynamically adapts to the
specific attacks.
The result is a much higher firewall throughput than you might be
able to achieve with a data-driven or a static compilation
approach. This directly translates into money saved on machines
needed to handle the load.
On top of that, the WAF configuration (which is automatically generated Lua code) is different for each customer, so we rely heavily on caching and JIT compilation to get the performance we need.
The bottom line is that LuaJIT is very fast and very flexible.
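To make the idea concrete, here is a minimal sketch of what such generated rule code might look like. The rule logic, field names, and thresholds are all hypothetical (CloudFlare's generated WAF code is not public); the point is that each customer gets a different generated function, a given attack wave exercises only a few of its branches, and that is exactly the shape of code a trace compiler rewards.

    -- Hypothetical generated rule-checking function (illustrative only).
    -- For a given attack wave, only a few branches are taken repeatedly,
    -- so LuaJIT records hot, linear traces through exactly those branches
    -- and leaves the rest to the interpreter.
    local function check_request(req)
      if req.uri:find("../", 1, true) then              -- rule 1: path traversal
        return "block", 1
      end
      if req.headers["user-agent"] == "" then           -- rule 2: empty User-Agent
        return "challenge", 2
      end
      if #req.body > 8192 and req.body:find("<script", 1, true) then
        return "block", 3                               -- rule 3: oversized body with script tag
      end
      return "allow"
    end

    -- Simulate one "attack wave": repeated hits on the same path are what
    -- make that path hot enough for LuaJIT to compile it into a trace.
    for _ = 1, 100000 do
      check_request({ uri = "/a/../etc/passwd",
                      headers = { ["user-agent"] = "curl" },
                      body = "" })
    end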
Wow, I'd need some evidence to believe a JIT can do a better job than a 'static' compiler. What can a JIT do to optimize 'much more so'? A little more, maybe: if you know some common conditional-jump statistics you can do slightly better. But is there anything else?
A static compiler has to generate code that handles every possible path through the code. This tends to pessimize code downstream of a control flow merge point, because such code has to be compiled assuming it can be reached via multiple paths. A trace compiler can compile only the paths that are actually taken on a given workload, and compile the (dynamically) uncommon cases into exits to the interpreter.
The firewall rule example Mike Pall gives is a pretty good one. You have code that has many different paths through it, but only specific ones are triggered, dynamically, by a particular attack. Those can be compiled into traces that assume the other paths are not taken (except to guard the possible control flow splits with trace exits). This can allow an optimizer to make much more aggressive assumptions on the actually-taken paths.
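A toy way to watch this happen, assuming a stock LuaJIT install (which ships the jit.v and jit.dump modules), is to run a function with a skewed branch under luajit -jv or -jdump. The numbers and the "rare" condition below are made up; the expectation, hedged, is that LuaJIT records a trace through the common branch, treats the rare branch as a guarded exit back to the interpreter, and only attaches a side trace later if that exit itself becomes hot.

    -- hotpath.lua: run as   luajit -jv hotpath.lua   (or -jdump for IR/mcode)
    local function classify(n)
      if n % 100000 == 0 then     -- rare path: taken ~0.001% of the time
        return n * 3 + 1
      else                        -- common path: what the recorded trace assumes
        return n * 2
      end
    end

    local sum = 0
    for i = 1, 1e6 do
      sum = sum + classify(i)
    end
    print(sum)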
"Contrary to intuition, we demonstrate that it is possible to use a piece of software to improve the performance of a native, statically optimized program binary, while it is executing. Dynamo not only speeds up real application programs, its performance improvement is often quite significant."
This is a common misinterpretation of the Dynamo paper: they compiled their C code at the lowest optimization level and then ran the (suboptimal) machine code through Dynamo. So there was actually something left to optimize.
Think about it this way: a 20% difference isn't unrealistic if you compare -O1 vs. -O3.
But it's completely unrealistic to expect a 20% improvement if you tried this with the machine code generated by a modern C compiler at the highest optimization level.
Claiming that JIT compilers outperform static compilers, solely based on this paper, is an untenable position.
But, yes, JIT compilers can outperform static compilers under specific circumstances. This has more to do with e.g. better profiling feedback or extra specialization opportunities. But this is not what this paper demonstrates.
Many compiler optimizations have strong non-linear costs in terms of the number of control flow edges. A static compiler has to punt at a certain complexity. OTOH a JIT compiler is free to ignore many edges, since it may fall back to an interpreter for cold paths or attach new code anytime later if some edges become hot.
One interesting example is auto-vectorization (SIMDization) where static compilers have to generate code for all possible combinations of vector alignments in case the underlying alignment of the participating vectors is not statically known. This quickly gets very expensive in terms of code space. OTOH a JIT compiler can simply specialize to the observed vector alignment(s) at runtime, which show almost no variation in practice.
Thank you for your comment. Very interesting points.
Isn't it wrong to say "compiled their C code at the lowest optimization level" though? As I read the paper, they compiled their C code at -O2. It's hard to work backwards in time to know exactly what that meant when they published their paper, but I suspect it meant something very similar to the current gcc definition:
-O2 Perform nearly all supported optimizations that do not involve a space-speed tradeoff.
Dynamic binary translation can sometimes yield performance improvements even though the program was already statically compiled: at runtime, the DBT system decodes and re-encodes the binary instructions, which lets it do tracing, i.e. identify hot paths and lay those basic blocks out contiguously in its code cache.
At a guess: you have a very large number of options, so getting code locality among the ones that are actually applicable is important. Performance in large state machines is very much locality-dependent, and this way you get to achieve that locality dynamically rather than statically.
Interesting to see JIT compilation used for low-level and, I assume, high-performance workloads. NetBSD supports in-kernel Lua code; I wonder whether dynamic compilation could produce similarly expressive, flexible, and performant execution in drivers.
PS: I wonder if it's related; maybe CloudFlare uses NetBSD and pushed for the kernel embedding, or they just benefited from NetBSD's Lua love.
We are entirely Linux, not NetBSD, and are not currently using Lua in the kernel at all. We did, however, sponsor some of Mike Pall's work on LuaJIT based on our particular workload: http://luajit.org/sponsors.html#sponsorship_perf
We use Lua for the WAF (as Mike says), but also for all request processing. This is in part because we use Nginx and in part because we employ agentzh (http://agentzh.org/), who works on OpenResty. Fairly recently we moved all processing of requests through CloudFlare to a Lua-based system inside Nginx.
That's >5B page views per 24 hours going through LuaJIT-ed code.
I'll add a more serious question. This is a nice optimization, but wouldn't clever attackers respond in turn by eventually throwing highly varied traffic patterns to fluster the JIT?
Hundreds of servers in 23 locations worldwide, each server running multiple instances of Nginx: http://blog.cloudflare.com/a-tour-inside-cloudflares-latest-... The key fact is that this type of scale is possible with Nginx + Lua + hundreds (not thousands) of servers.
If that were to happen (and I think it's a big "if" because of the complexity of predicting what will throw it off given that the attacker doesn't know the code we are running) then we'd see the WAF latency increase and alarms would be generated immediately. That, in turn, would cause a bunch of other mechanisms to come into play.
Oh, I get that this is the key fact; I put a smiley in there because I was fishing for a more exact number that is unlikely to be given :)
Thanks for your answer. It's a very neat setup. (My intuition is that it's not a case of if but when, but you've probably won yourself quite a bit of time until that happens.)
Just as a side comment, optimisations to the attack paths are unlikely to make a substantial impact on overall performance, given that attack requests are only a fraction of all requests processed at any given time. (Excluding DoS, of course, but that's not an area where you'd expect a WAF to do much.) In a WAF, you want to heavily optimise the non-attack paths.
From: Mike Pall
...
[I refuse to contribute to stackoverflow anymore, due to some
recent incidents. I can't even edit *my own* answers to correct
them, without some anonymous fool rejecting my edits. Ok, so maybe
they don't know I wrote the damned thing. But then they really
shouldn't be allowed to moderate Lua-specific questions. This is
just plain unacceptable.]
--Mike
This really is a shame (edit: the SO situation, not Mike's decision. He's the one who should have mod powers).
Ignoring the issue of whether the edit should have been accepted or rejected, it was subject to moderation because he appears to have (inadvertently) created multiple separate accounts, and was using one to attempt to edit a post made by the other; I don't think self-edits require review.
I suspect it's due to SO's treatment of low-reputation users. For those who don't frequent SO: when you first sign up, you are severely limited in functionality. So-called "privileges" are unlocked as you gain reputation. Specifically, until his rep reaches 2000 (roughly 200 upvotes to his answers), his edits need to be approved by someone who has a high enough rep.
I guess the system works fairly well in preventing spam and gaming of the system. In this case, it seems to have backfired, leaving an expert feeling unwanted. BTW, his reputation on SO seems to be only around 200 [1] (around 20 upvotes altogether), which is very low.
In this case, the issue is probably two-fold (and serves to highlight some of the imbalances in the reputation system). Firstly, his area of expertise / activity is less popular than the mainstream ones on SO (lua / luajit have only about 5,000 tagged questions combined, around 100x fewer than C#, Java, etc.).
Secondly, as an expert, his answers probably cater for highly specific, difficult technical problems. These typically fly under the radar and get a handful of upvotes (unless they show something of interest to the rest of the community, in which case a perverse upvoting cycle can kick in via the popular questions list, with leading answers getting several hundred upvotes).
Disregarding confirmation bias w.r.t. reputation (i.e. how anything one of the SO superstars says is immediately upvoted), the most popular answers seem to be ones that involve a popular language, address a common misunderstanding / lack of knowledge (usually at a trivial or fairly elementary level), and have an element of surprise to them, which makes people share the answer. As an example, JavaScript's type coercion is always popular, even though it's been discussed dozens of times: [2]
These two issues, taken together, mean that you could spend an afternoon prototyping some guy's difficult problem, solving it, writing up a good answer, and getting only 3 upvotes for it.
A similar thing can be said for less populated time-zones; if you're living in one, most likely the questions you are answering are not on the "first page" by the time most of the SO world wakes up / has their morning coffee.
As an aside, I suspect most reputation-based systems suffer from these effects. Some of it can be seen right here on HN. Nevertheless, I guess they work slightly better than having no reputation system, as long as people aren't purposely exploiting it.
He's being a prima donna. Edits by a third party get reviewed; there is no way you could accept that edit (which came from a different account) without allowing other random people to edit his posts in the same way. You can't blame Stack Overflow for erring on the side of respecting the author's intent.
Somewhat tangential - the LuaJIT source [1] is an absolute masterpiece of software engineering in C. The detailed comments (especially on the optimizations - "luajit/src/lj_opt...") are incredibly interesting reading.
** [A recursive backpropagation algorithm with backtracking, employing
** skip-list lookup and round-robin caching, emitting stack operations
** on-the-fly for a stack-based interpreter -- and all of that in a meager
** kilobyte? Yep, compilers are a great treasure chest. Throw away your
** textbooks and read the codebase of a compiler today!]
What trace compilation buys you is: 1) elimination of method call boundaries in analysis; 2) elimination of data flow merges.
Dynamic languages benefit particularly from these characteristics because their semantics are replete with method calls and data flow merges (e.g. after any generic operation). But static languages can have these qualities too.
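As an illustration (a made-up micro-example, not anything from LuaJIT's own code): in the Lua snippet below, the method call in the hot loop is a dynamic dispatch and the '+' is a generic operation. Inside a recorded trace, the call is inlined and the '+' is specialized to number addition behind a guard, so the optimizer never has to account for the string or metamethod cases that the generic semantics would otherwise force it to merge in.

    -- Two sources of merges/abstraction that a trace erases:
    -- 1) rects[i]:area() is a dynamic method dispatch; in a trace it is inlined.
    -- 2) the '+' in the loop is generic; the trace guards that both operands
    --    are numbers and emits a plain numeric add.
    local Rect = {}
    Rect.__index = Rect

    function Rect.new(w, h)
      return setmetatable({ w = w, h = h }, Rect)
    end

    function Rect:area()
      return self.w * self.h
    end

    local rects = {}
    for i = 1, 1000 do
      rects[i] = Rect.new(i, i + 1)
    end

    local total = 0
    for _ = 1, 1000 do              -- hot loop: gets recorded as a trace
      for i = 1, #rects do
        total = total + rects[i]:area()
      end
    end
    print(total)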
To give credit, where credit is due: the original work on trace compilation is much, much older. The paper you cited is an application.
The fundamental papers to hunt for are Joseph A. Fisher's publications on trace scheduling (sadly, his PhD thesis from the 1970s is nowhere to be found online) and the Multiflow reports from the 1990s. The Dynamo paper built upon that foundation ten years later, in '99 (get the full HP report, not the short summary). A related research area is trace caches for use in CPUs, with various papers from the 1990s.
AFAIK there's no up-to-date comprehensive summary of the state of research on trace compilers. Most papers don't even scratch the surface of the challenges you'll face when building a production-quality trace compiler.
> AFAIK there's no up-to-date comprehensive summary of the state of research on trace compilers. Most papers don't even scratch the surface of the challenges you'll face when building a production-quality trace compiler.
In general, there are few good + comprehensive resources for advanced compilation techniques. I would gladly fork over $$$ if you wrote a textbook on tracing compilers aimed at people who have expert-level knowledge in more traditional compilation methods (aka, not JIT).