LuaJIT for ARM now in Git head: benchmarks (luajit.org)
74 points by xearl on June 3, 2011 | 21 comments



There are claims that ARM's "Jazelle RCT"[1] technology could improve JIT compilation (and not just for Java). Does anyone know if that's just marketing, or whether there could be worthwhile speedups from it on more recent cores?

[1]: http://wiki.tiprocessors.com/index.php/Cortex-A8_Features#Ja...


IMHO Jazelle and ThumbEE are just marketing gimmicks.

Jazelle was (is?) half-proprietary and is only useful for a really stupid Java bytecode interpreter. Nowadays, Java is JIT-compiled with much better performance.

And ThumbEE came way too late for it to be useful. It requires setup/teardown, doesn't work well with existing code and adds several other complications. It has very few features that may be of use for any VM that has already undergone some basic optimizations.

Believe me, you don't want to waste your time with this.


Too bad ARM11 is an ancient processor.

It would be interesting to see LuaJIT vs. Dalvik.


OK, I ran some quick benchmarks comparing LuaJIT git HEAD and Dalvik 1.2.0 (Android 2.2.1) on the MSM 7201A (528 MHz ARM11 soft-fp). I don't have access to a newer Dalvik VM for this device. The Dalvik JIT compiler is definitely enabled, because performance gets much worse when I disable it.

Here are SciMark scores (HIGHER numbers are better):

    ARM11    SciMark SMALL | FFT     SOR      MC    SPARSE    LU
    -----------------------+---------------------------------------
    Lua 5.1.4         0.60 |  0.50    0.92    0.36    0.55    0.69
    LuaJIT git        4.34*|  2.61*   6.70*   3.91*   3.13    5.36*
    Dalvik 1.2.0+JIT  3.35 |  2.35    5.65    1.09    3.39*   4.27

Those results are not too relevant, though, since it's a soft-float device. The maximum speedup is limited by the high cost of the soft-float operations (e.g. 62 cycles for a double-precision FP ADD).

And here are some simple integer benchmarks, run time in seconds (LOWER numbers are better):

    ARM11            fannkuch 10   nsieve-bits 8   binary-trees 11
    ---------------------------------------------------------------
    LuaJIT git           7.45*          3.04*           2.05*
    Dalvik 1.2.0+JIT    12.83          10.42            4.02

Note that binary-trees is a GC-intensive benchmark where LuaJIT usually loses against Java VMs, since they have a much better GC. Not so for Dalvik, it seems.

The winner for each benchmark is marked with a '*'. Looks good. ;-)
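
As an aside, binary-trees mostly stresses the allocator and the GC with lots of short-lived objects. A minimal Lua sketch of that kind of workload (my own illustration, not the actual benchmark source) looks like this:

    -- Sketch of a GC-heavy, binary-trees-style workload
    -- (illustration only, not the actual shootout benchmark).
    local function make(depth)
      if depth == 0 then return { false, false } end
      return { make(depth - 1), make(depth - 1) }  -- one table per node
    end

    local function check(node)
      if not node[1] then return 1 end
      return 1 + check(node[1]) + check(node[2])
    end

    for i = 1, 100 do
      local tree = make(11)     -- build a tree and immediately discard it
      assert(check(tree) > 0)   -- touch every node so the work isn't dead code
    end

Nearly all the time goes into allocating and collecting the short-lived tables, which is why the quality of the GC dominates here.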


After some improvements to the code generator, LuaJIT git HEAD now wins on all SciMark sub-benchmarks:

    ARM11    SciMark SMALL | FFT     SOR      MC    SPARSE    LU
    -----------------------+---------------------------------------
    LuaJIT git        4.48*|  2.73*   6.93*   3.96*   3.44*   5.36*
    Dalvik 1.2.0+JIT  3.35 |  2.35    5.65    1.09    3.39    4.27


How does LuaJIT decide whether to use int or float, given that Lua has a single number type with floating-point semantics? Do you notice that a variable happens to always contain an integer and then compile it as an int, with guards for overflow and non-integral division?


LuaJIT/ARM uses the dual-number VM mode, where numbers can be represented either as 32-bit integers or as doubles. It uses lazy normalization, so conversions happen as needed. This is invisible to the user, but internally there are two different number types.

So there's usually already a strong indicator whether a variable is an integer or a double just from looking at the internal type. The interpreter usually has two paths for all operations on numbers, and the integer path is the fast path. The JIT compiler adds guards to check for the proper types and emits specialized code.

Also, the JIT compiler proactively narrows doubles to integers whenever it's beneficial to do so. The logic is quite involved; you can take a look at the big comment block in lj_opt_narrow.c.
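
As a rough illustration (my own example, not taken from lj_opt_narrow.c), loop induction variables and table indices are prime candidates for narrowing, whereas a value that ever turns fractional stays a double:

    local t = {}
    for i = 1, 1000 do          -- 'i' is a good candidate for integer narrowing:
      t[i] = (t[i] or 0) + 1    -- it's only used as a table index and incremented by 1
    end

    local x = 0
    for i = 1, 1000 do
      x = x + 0.5               -- 'x' becomes fractional, so it stays a double
    end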

This optimization is also active on e.g. x86/x64, where there's only a single underlying number representation (a double). But I should mention that on x86/x64 it doesn't necessarily pay off to perform _all_ operations on integer types, since this would waste the extra FPU registers and the massive extra bandwidth of the FP units in these chips. The branch unit is already quite busy and you're effectively serializing the code with all of those overflow checks.

So the JIT-generated machine code for x86/x64 and for ARM may be quite different for the same inputs. And I'm not talking about the instruction set differences alone.

E.g. compare the output of these two commands:

    luajit -jdump -e "local x=0; for i=1,100 do x=x+1 end"
    luajit -jdump -e "local x=0.5; for i=1,100 do x=x+1 end"

The generated code for the inner loop on x86/x64 is the same in both cases:

    ->LOOP:
    addsd xmm7, xmm0
    add ebp, +0x01
    cmp ebp, +0x64
    jle ->LOOP

The generated code for ARM uses either integers + overflow checks or soft-float code:

    ->LOOP:
    mov   r10, r4
    adds  r4, r4, #1
    blvs  ->exit 2
    add   r11, r11, #1
    cmp   r11, #100
    ble   ->LOOP

    ->LOOP:
    movw  r3, #0
    movt  r3, #16368
    mov   r2, #0
    bl    ->softfp_add
    add   r11, r11, #1
    cmp   r11, #100
    ble   ->LOOP

Note: the mov r10, r4 is to preserve the value prior to the overflowing calculation in exit 2. Yes, one could avoid that for an addition by undoing the calculation. But this won't work in general, e.g. for a multiplication.


Is there any reason why this version of LuaJIT wouldn't perform well on an ARM Cortex A-series core? Has the microarchitecture changed enough that code optimized for a single-issue, limited out-of-order ARM11 would run badly on the dual-issue, speculative, out-of-order Cortex A9?

The most significant instruction set changes have been for SIMD/FP operations (which aren't applicable to most code), and Thumb, which might not be very useful for a JIT.


LuaJIT/ARM runs fine on a Cortex-A8 or Cortex-A9. The instruction scheduling done for the simpler in-order architectures is beneficial on the newer, out-of-order architectures, too.

And even though it needs ARMv5 at minimum, it does take advantage of some ARMv6 and ARMv7 instructions, if available. But this has a rather small impact on performance.

The only real issue is that this is a soft-float port for now. So it doesn't take advantage of VFP, even if the CPU has it. I'll add hardware floating-point support when/if I get a follow-up sponsorship for this.


What kind of $ range makes it worth your time? I work at a mobile game studio that makes heavy use of lua and I could see us putting some money into it.

Also, does LuaJIT differ at all from the stock Lua interpreter when it comes to GC performance? It's a continual source of pain for us to have to manually tweak GC parameters on a per-app basis depending on how heavily Lua is used.
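
(For reference, the per-app tweaking meant here is roughly the stock Lua 5.1 collectgarbage knobs; the values below are hypothetical and entirely workload-dependent:)

    -- Rough illustration of per-app GC tuning via the stock Lua 5.1 API;
    -- the values are made up and depend entirely on the workload.
    collectgarbage("setpause", 120)     -- start the next cycle sooner after a collection
    collectgarbage("setstepmul", 400)   -- do more GC work relative to allocation
    collectgarbage("step")              -- or drive the collector manually, e.g. once per frame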


Well, I don't have an estimate of the costs for this feature, yet. And I need to talk with the current sponsor of the ARM port first.

Yes, LuaJIT uses its own memory allocator, which improves performance for workloads with lots of allocations. And LuaJIT's memory footprint for most object types is lower, too.

However, LuaJIT 2.0 still has more or less the same GC engine as Lua 5.1. LuaJIT 2.1 will feature a new garbage collector; see the LuaJIT roadmap 2011: http://lua-users.org/lists/lua-l/2011-01/msg01238.html

In the meantime, your best bet is to avoid temporary allocations or to use manual memory management with the LuaJIT FFI; see http://luajit.org/ext_ffi.html
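
A minimal sketch of the FFI approach (the types, sizes and names are just for illustration):

    local ffi = require("ffi")

    -- One flat array of doubles instead of many short-lived Lua tables.
    -- ffi.new with a "?" length allocates a zero-filled, GC-managed array.
    local n = 1024
    local buf = ffi.new("double[?]", n)
    for i = 0, n - 1 do          -- note: FFI arrays are 0-based
      buf[i] = i * 0.5
    end

    -- For fully manual management, pair C malloc/free through the FFI:
    ffi.cdef[[
    void *malloc(size_t size);
    void free(void *ptr);
    ]]
    local p = ffi.cast("double *", ffi.C.malloc(n * ffi.sizeof("double")))
    p[0] = 42
    ffi.C.free(p)                -- freed explicitly, never touched by the GC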

LuaJIT is a drop-in replacement for Lua. Try it with your apps and report back if you encounter any problems. You need to use git HEAD to get the JIT compiler for ARM (disabled for iOS -- don't complain to me).


Glad to hear the minimum is ARMv5 - I have a project where I've been using plain Lua on an ARM926EJ-S, which could use a performance boost.


Approximately how much of a difference does instruction scheduling vs. no instruction scheduling make? How about on out-of-order architectures?


Sorry, I have no quantitative comparison for that. It's baked into the design of the register allocator to schedule loads early and to rematerialize constants far ahead of their use. And I'm contemplating adding an extra scheduling pass for some cases that are currently not covered well.

From experience in tuning the interpreter, instruction scheduling (especially load scheduling) can easily make a 20-50% difference on some tight loops for a simple in-order architecture. This is much less of an issue with an out-of-order architecture, of course. It's still beneficial to do instruction scheduling for all ARM architectures, though, because their out-of-order engines apparently lag far behind those of contemporary x86/x64 CPUs.


What are you talking about!?

Firstly, ARM11 is an instruction set and not a processor in itself. Secondly, the Nvidia Tegra[1] is an ARM11 processor, and the Qualcomm MSM7201A that LuaJIT targets seems to be part of Qualcomm's current generation of mobile chipsets[2].

Doesn't sound ancient to me at all. Sounds pretty current, actually. Sure, I'd like an ARM7 version of LuaJIT (specifically, for the Cortex-A9 MPCore), but this is still an interesting start. Also, now that there's an ARM11 port, it's not too much of a stretch to think that ARM7 might follow sometime in the nearish future.

[1] http://en.wikipedia.org/wiki/Nvidia_Tegra

[2] http://www.qualcomm.com/products_services/chipsets/mobile_pr...


The instruction sets have a v in them, like ARMv5, ARMv6, etc. The latest version is ARMv7. ARM11 is a core, and has since been obsoleted by the Cortex A8, which has in turn been obsoleted by the Cortex A9 (e.g. Tegra 2 uses the Cortex A9).

The MSM7201A is anything but current; it was used in the T-Mobile G1 from 2008.


Cortex A9 came out last year; that doesn't automatically make every ARM11-based processor, as you said, an "ancient processor".

2008 may not be the latest tech, but it's not exactly "ancient" either. The majority of PC owners use processors older than that. Hell, the computer I'm typing this on is running a 2006 Core 2. Definitely not current, but "ancient" is going a bit far, since it still runs quite competitively for all but the most demanding tasks.


It's not only an instruction set but an entire architecture specification. For example, the virtual memory subsystems differ between versions. The instruction set is part of the specification.


ARM11 is a core based on the ARMv6 instruction set, e.g. the ARM1136 and ARM1176. ARM7 is another core, a very old ARMv5 core. The Nvidia Tegra is based on the ARM Cortex-A9, which is an ARMv7 core.


I know, I know, Wikipedia as a citation, etc. But according to this page, the Nvidia Tegra is ARM11-based: http://en.wikipedia.org/wiki/Nvidia_Tegra#Tegra_APX_series

Tegra 2 is based on the Cortex-A9, which in turn is ARM7, but Tegra 2 is very new - that doesn't automatically make Tegra ancient, and so my statement is still not false.


Looks like it has been confirmed that LuaJIT works fine on Cortex A9! Nice :-P



