Luje 0.1 – a pure Lua JVM

jacques_chester · on Oct 6, 2013

David Given is one of those prolific thinkers and doers who seem to be overrepresented in the Lua community.

He's probably best known to HNers for his critical article "On Go" and Objective Lua.

http://cowlark.com/2009-11-15-go/ https://cowlark.com/objective-lua/index.html

seabrookmx · on Oct 7, 2013

Thanks for the link! That's a really interesting article.

Though out of date, I still think his point holds water. It's shocking how little difference there is between Algol-68 and modern Go or C. Go has changed quite a bit since his article, but not all of these changes have been well received (a great example being automatic semicolon insertion).

tinco · on Oct 6, 2013

The article makes a rather hefty claim, that it is comparable or even in some ways faster performancewise than Sun's hotspot JVM.

From what I've heard the JVM is one of the most heavily researched and optimized bytecode VM out there. Could it really be that some guy can write a competitor in a scripting language in his spare time?

Not saying it's impossible, just curious if there's anything to back that claim up, besides the very specific microbenchmarks.

edit: I was a bit too quick to comment, reading again he doesn't really claim to be faster at anything other than small loops. Still interested if it's fast at any real world things though.

pron · on Oct 6, 2013

HotSpot is indeed one of the most optimized runtimes around. It is, however, optimized for large, complex applications. It's quite possible to beat in a carefully designed microbenchmark. Beating it in a complex application is a whole other matter (in fact, it's pretty hard to beat even in C++).

haberman · on Oct 7, 2013

> in fact, it's pretty hard to beat even in C++

If there's one thing I've learned, it's that the wishful thinking will never stop.

pron · on Oct 7, 2013

Yes, but only if you call 15 years of experience in both C/C++ and Java systems wishful thinking. And I'm talking big-ass, several MLOC, some soft and some hard realtime systems.

But I guess some people must hold on to their beliefs, while some do extensive research (well, to be fair, it was government funded, so we had the means to write the same 2+ MLOC system twice, and we've had direct access to the Sun team working on their realtime Java project at the time). I guess you just don't have the data.

haberman · on Oct 8, 2013

> I guess you just don't have the data.

Basically every piece of performance-critical open-source infrastructure (ie. what we can actually all see and evaluate) is written in C or C++. That is empirical data.

You say that long-running servers that don't have startup cost constraints will perform better in Java, particularly when there is concurrency. Why then, is the entire web served by web servers written in C/C++ (Apache, IIS, nginx, LiteSpeed), mail servers written in C/C++ (Exim, Postfix, qmail, Sendmail, Microsoft), DNS servers written in C/C++ (BIND, PowerDNS), and database servers written in C/C++ (MySQL, PostgreSQL, Microsoft SQL Server, MongoDB, Riak)?

You attempt to explain all this away, as if OS's, VMs, web browsers, etc (which you admit are better off written in C or C++) are somehow different from the applications you work on, and that somehow the Java you write in your applications would be hard to beat in C++.

The only "data" you have offered is your own personal experience of working in the defense industry. A couple problems though. First of all you never write the "same" 2+ MLOC system twice, you'll always use the lessons gained from the first go at the second design. Something tells me you wrote the C++ version first. Your experience is also something that we can't see for ourselves, and involves lots of variables that we can't evaluate. It seems likely that your company didn't have a strong culture of best practices in C++, or maybe just weren't that good at it. Your C++ design might have had lots of opportunities for improvement that good C++ people would immediately recognize.

It's not hard to write good and high-performance C++ when you have a good set of guidelines like the Google C++ style guide (http://google-styleguide.googlecode.com/svn/trunk/cppguide.x...) and a solid set of core libraries, but without these it's easier to go astray. There's no doubt that Java's reasonable set of "batteries included" libraries makes this harder to screw up.

I can say from my brief stint in defense that I was not super impressed. Everything was based around extremely heavy infrastructure like CORBA. There are lots of decisions like this that can skew the results you were getting.

pron · on Oct 8, 2013

> Basically every piece of performance-critical open-source infrastructure (ie. what we can actually all see and evaluate) is written in C or C++. That is empirical data.

Beside the fact that this is an appeal to an irrelevant authority, it is also false.

MongoDB is about as far from being performant as you can be; Riak is not written in C++, but in Erlang; PostgreSQL and MySQL predate Java (and predate Java 1.4 – when it started getting performant – by many years), as do Exim, Postfix, qmail, Sendmail and BIND (don't know about PowerDNS); Microsoft has been an enemy of Java for political reasons.

On the other hand, recent Apache projects are all Java: Tomcat, ZooKeeper and the list is long. Cassandra and Voldemort are Java; Twitter Storm; most benchmarks show Java web-servers (of non-static content) to outperform or match C and C++ servers (http://www.techempower.com/benchmarks/#section=data-r6&hw=i7...).

> It's not hard to write good and high-performance C++ when you have a good set of guidelines like the Google C++ style guide (http://google-styleguide.googlecode.com/svn/trunk/cppguide.x...) and a solid set of core libraries, but without these it's easier to go astray.

This may have been true 5 or 6 years ago. But modern multicore hardware requires a high level of concurrency, and a high level of concurrency requires low contention, and low contention requires non-blocking data structures, and nonblocking data structures require either a GC, or some other method of automatic memory reclamation that's extremely hard to achieve in C++ (or any other language, for that matter). Even C++'s natural way of pointer sharing (reference counting) is a huge contention point and a performance killer. Not to mention the fact that until C++11, C++ didn't even have a memory model (which, BTW, was designed by Java people).

So while it's not too hard to make single-threaded C++ code run a little faster than Java, it's extremely hard to make multithreaded C++ code outperform Java, even for experts (and it was almost impossible prior to C++11).

haberman · on Oct 8, 2013

> most benchmarks show Java web-servers (of non-static content) to outperform or match C and C++ servers (http://www.techempower.com/benchmarks/#section=data-r6&hw=i7...).

Umm, there are no C or C++ servers in that benchmark, except for one extremely slow one I've never heard of.

> and a high level of concurrency requires low contention, and low contention requires non-blocking data structures

That really does not follow.

Lock-free data structures are not about reducing contention, they are about eliminating blocking and improving performance when contention does occur. Contention occurs whenever multiple threads attempt to access the same resource, regardless of whether access to that resource is blocking or non-blocking. Lock-free data structures can still experience contention, and while they do not block, their performance does degrade under contention. Indeed later in your message you say that reference counting is a "huge contention point" -- refcounting a la C++'s shared_ptr is a lock-free algorithm.

So while lock-free algorithms don't affect the amount of contention in the system, they do exhibit better performance under high contention, for the (small) subset of data structures that can be made lock-free.

But this is only an issue when contention exists to begin with. Many highly parallel systems exhibit very low contention; for example, MapReduce. Even in systems with inherent contention, there are many strategies for reducing it, like partitioning and affinity schemes.

> nonblocking data structures require either a GC, or some other method of automatic memory reclamation that's extremely hard to achieve in C++ (or any other language, for that matter)

I'm not sure what is so hard about Hazard Pointers (http://www.research.ibm.com/people/m/michael/ieeetpds-2004.p...). Maged Michael is a smart guy who figured this stuff out 10 years ago.

But you don't even have to do it yourself. Non-blocking algorithms are extremely subtle by their nature (I hope you don't implement them yourself in Java; that would be a waste of effort). Just use an existing library like Thread Building Blocks, Grand Central Dispatch, etc. that are specifically designed to achieve maximum multi-threaded performance in C++.

pron · on Oct 9, 2013

Contention is not binary, or, rather, its effects aren't binary. A lock that is held for 10 microseconds potentially causes a lot more contention than a CAS that takes a few nanos. OTOH, thousands of short contentions are as bad as, or worse, than few long contentions (there's a whole great talk by Doug Lea about how Java's fork/join tricks modern CPU out of crazy power-saving optimizations to reduce thread wakeup time).

Intel's TBB were modeled after java.util.concurrent, BTW. Later, the Java memory model inspired the C++11 memory model (and the Java people were called to help with that, too). When it comes to concurrency, Java is usually at least 5 years ahead of C++ (perhaps with the exception of fork/join, which was inspired by Cilk, but has since come to surpass it).

> I'm not sure what is so hard about Hazard Pointers

Hazard pointers are expensive without RCU (the hazards need to be visible to other threads, which means store-load barriers), and RCU requires kernel cooperation, and besides this entire class of techniques is either very limited or incredibly complicated (what do you do if you have thousands of threads?). A good GC vs hazard pointers with or without RCU is like driving down a highway vs stumbling in a swamp when it comes to lock-free data structures.

Here's Herb Sutter on the issue (source: http://herbsutter.com/2011/10/25/garbage-collection-synopsis...):

All existing C++ garbage collectors, notably the Boehm collector, aren’t enough for many interesting uses for several reasons, the main one being that they’re conservative, not accurate. See Hans Boehm’s page Advantages and Disadvantages of Conservative Garbage Collection for more discussion of tradeoffs.

As usual, there’s something to standardize iff we want to enable portability. So in this case the question is whether it’s interesting to enable portable programs that rely on GC — and note that typically if they rely on GC, they rely specifically on accurate GC, not conservative “maybe-GC.” For example, C++11 atomics enable writing many kinds of portable lock-free code, but not all; we need a standard for #2 if we want to enable people to write portable lock-free singly-linked lists, where the only known practical algorithms require automatic (non-refcount) GC and so they cannot be implemented in portable C++ today without some nonportable platform-specific code underneath.

Again, appeals to hazard pointers fail because the obscure convolutions and limitations of hazard pointers only expose the limits of even the most heroic efforts to work around the lack of true automatic GC. The main contribution of hazard pointers, unfortunately, is that they thoroughly motivate why direct support for automatic GC is needed.

haberman · on Oct 9, 2013

> A lock that is held for 10 microseconds potentially causes a lot more contention than a CAS that takes a few nanos.

See, if you're trying to compare those two things, you're comparing a lot more than locking vs lock-free; you're comparing two completely different designs. All you can do with a single CAS is a very primitive operation like queue enqueue/dequeue, stack push/pop, set add/remove, etc. If you're holding a lock for 10us you're doing a whole lot more than that; you're doing something that isn't possible to do lock-free, and your design is inherently less concurrency-friendly.

If your experience comes from taking a coarse-grained mutex-heavy design in C++ and reimplementing it in Java with a design that communicates between threads only with lock-free data structures, then your results aren't indicative of locking vs. lock-free or C++ vs. Java.

If you really want to compare locks vs. lock-free apples-to-apples, then use data structures that have the same API as your lock-free data structures, but are implemented by taking a lock for a very short amount of time -- just enough time for the operation you would have performed lock-free.

And if you want to compare C++ vs. Java, use an equally concurrency-friendly design with both. If you're using high-performance lock-free data structures in Java, use the same in C++, since they are available in both. The memory management complications can be confined to the container library; at the application level exclusive ownership of objects can be passed from thread to thread.

pron · on Oct 9, 2013

Obviously. But lock-free data structures in C++ are not as good as their Java counterparts (because of GC), and besides, most lock-free DS are often implemented in C++ several years after they're implemented in Java, if at all.

For example, Java's nonblocking hash map (by Cliff Click) is still not available for C++, 6 years after its release. There are a few very limited/half baked implementations, but nothing "canonical" (and nothing so thoroughly researched, optimized and used).

BTW, part of the reason is that the world's leading low-level concurrency researchers and implementers (Doug Lea, Nir Shavit and others) do most of their work in Java (most of my own work is designing non-blocking data structures, BTW; not that I'm comparing myself to Lea or Shavit in any way).

To get a sense for the quality of Java data structure implementations, I strongly suggest you watch this talk by Doug Lea: https://www.youtube.com/watch?v=sq0MX3fHkro

His work is incredible. Even if you don't use Java, there is so much to learn in the talk. One of the coolest things there (which I already hinted at), is how to spin properly on modern Intel processors.

haberman · on Oct 9, 2013

I appreciate the link to the talk, I learned a lot. It just reinforces my beliefs though; so much of the talk is about having to work around the idiosyncrasies of the JVM. Quotes from the talk:

4:15 "The main moral of this talk is that things are never what they seem, and they are increasingly not even close to what they seem as we have more and more layers and systems. So there's you on top of core libraries who live on top of JVMs, and the reason I know all the JVM folks so much is because I yell at them pretty much all the time."

10:30 "Oh my god, [because we derive from java.util.Collection] we have put more lines of code into supporting a method that nobody in their right mind would call for a concurrent queue, than almost everything else combined. Between iterators and interior remove, that's probably 80% of our concurrent queue code. The algorithms themselves, they're maybe 20 lines, they're cool but they're not a lot of code."

0:57:10 "How do you get a good CAS loop? CAS until success? Don't do while, do do/while! Why? Safepoints! Remember safepoints I mentioned long ago? Yeah, they're not your friend. [...] You know the great thing about JITs is that they work so well, and the bad thing about JITs is that, you know, sometimes they're idiots! [...] Boxing, oh my god I hate boxing. Classloading appearing at mysterious times."

Maybe to you the JVM gives enough benefits to be worth giving up all this control, but to me it it's just nuts that you'd put yourself on top of a platform that you have to work against so much.

And I still don't buy that GC makes for better lock-free data structures. Lock-free data structures may need to have a GC-like scheme internally to memory-manage their nodes, but I can't see any reason why this requires the entire program to be subject to GC.

pron · on Oct 9, 2013

> I can't see any reason why this requires the entire program to be subject to GC.

It doesn't. Lots of Java low-level programs (DBs, caches etc.) manage most of their heap manually. But an "internal" GC would have to be just as good as a "whole program" GC anyhow in order to make the non-blocking data structure robust (I generally don't use the term lock-free, as it's a very specific case of a non-blocking data structure; many interesting nonblocking DSs are only obstruction-free rather than lock-free).

Also, GCs have other advantages, such as a much higher memory allocation-reclamation throughput than any general manual scheme, especially (though not only) in the multi-threaded case. Many GCs may introduce pauses, but all the good ones would give you a much higher throughput than anything you can achieve manually.

> Maybe to you the JVM gives enough benefits to be worth giving up all this control, but to me it it's just nuts that you'd put yourself on top of a platform that you have to work against so much.

Yes, to anyone doing hardcore multicore programming, the JVM is probably the best choice. You are not giving up control, though, because C++ has far worse problems in those areas (forget about safepoints - C++ doesn't even have a memory model, or didn't until C++11, and most C++ programmers don't even know what ordering guarantees their particular compiler provides). Also, if you think you're not "giving up control" in C++ (or C; or Assembly), then you're sorely mistaken. You have little or no control over ILP, prefetching, branch-prediction, outstanding cache misses and all the magic modern CPUs do at runtime. You could think of the JVM as an extension of the CPU only in software – modern CPUs do so much crazy stuff to your code anyway, that you can't possibly know how exactly it will be executed even if you program directly in machine code. You can no longer get truly close to the metal, so being one step rather than two steps removed is not much of an advantage any more.

> "Oh my god, [because we derive from java.util.Collection] we have put more lines of code into supporting a method that nobody in their right mind would call for a concurrent queue, than almost everything else combined. Between iterators and interior remove, that's probably 80% of our concurrent queue code.

Sure. The code is often complex to support moot use-cases. Still, it's leaps and bounds beyond anything in pretty much any other language. Again, the reason is mostly coincidental. From its beginnings, more or less, Java (or Sun) has attracted the most prominent concurrency researchers. Java 1.5 was the first general purpose programming environment with a memory model, iron-level access to concurrency primitives (fences, CASs), and some world-class GCs, all requirement for state-of-the-art concurrency research. To this day, concurrency researchers prefer the JVM, despite its annoyances (and what language doesn't have those?) to any other platform out there. That's why most concurrent DSs get implemented in Java first, and their Java implementation is usually the most polished and robust.

But to quickly go back to where we started, a higher-level PL often brings about performance improvements. There are famous examples of sorting algorithms in C++ easily outperforming their C implementations. Why? Because of templates vs. function pointers. More precisely - due to inlining, as inlining is "the mother of all optimizations" (it opens the door to other optimizations). Now, of course you could achieve the same in C, by writing the sort function many times, but it's cumbersome and error prone.

The JVM is "the inlining magic machine". It inlines code that a C++ compiler never could. There are some areas where C++ is more performant, namely computations on arrays of complex numbers, but hopefully that will be addressed in Java 9. But this inlining capabilities of the JVM is why it often outperform C++ even in single-threaded code. Again, you could achieve the same in C++ by writing the same code several times (or by generating machine code at runtime, which is what the JVM does). Also, this effect is usually perceived in larger applications, because then you use virtual methods.

Anyway, that's pretty much what I have to say on the matter. I have a lot of respect for C++. I was a C++ programmer for many years. The JVM brings many interesting performance improvements over C++, as well as performance regressions in comparison. All in all, Java usually comes up ahead in large codebases, and falls behind in specific use cases (mostly numeric HPC). In addition, there is no other programming language/platform out there that even comes close to the JDK's power when it comes to hardcore concurrency; some successful implementations find their way to TBB a few years later, and some don't. If you're trying to push the envelope on concurrent software development (my day job), Java is pretty much the only sensible way to go.

Still, there are valid reasons to prefer C++ over Java in many circumstances, but sheer performance is seldom one of them.

haberman · on Oct 10, 2013

> Also, GCs have other advantages, such as a much higher memory allocation-reclamation throughput than any general manual scheme

A benefit of using C/C++ is that you are not limited to general schemes. If you have an allocation/deallocation hotspot, you can use more specialized schemes like arena allocation, which will beat a GC handily.

Also, latency and predictability are often more important than throughput. GCs fare poorly on these counts.

> forget about safepoints - C++ doesn't even have a memory model, or didn't until C++11

Aka it does have a memory model. And one of the points Doug Lea made in the talk you linked is that the Java memory model is actually broken and no one knows how to fix it, whereas C++'s is not (though it's broken in ways that apparently don't affect anyone).

> Also, if you think you're not "giving up control" in C++ (or C; or Assembly), then you're sorely mistaken. You have little or no control over ILP, prefetching, branch-prediction, outstanding cache misses and all the magic modern CPUs do at runtime.

This is a tired and weak argument.

A mispredicted branch costs 5ns. A main memory reference (cache miss) costs 100ns. And you actually have a lot of control over these things; you can make your branches more predictable and the working set of your tight loops smaller so it will fit in cache. And there are actually instructions that give control over prefetching.

A Java GC pause on the other hand is often on the order of 200ms. That's six orders of magnitude in difference, and can be even longer in degenerate GC situations. There is really no bound to the worst case.

The loss of control also comes from the fact that the effects are highly nonlocal. If your C/C++ code is in a tight loop, the microarchitectural effects are mostly a function of your own code and its behavior. But whether a safepoint triggers a GC pause in Java is a function of the entire program's behavior.

So you really can't equate these two things. Java's loss of control is many orders of magnitude larger and less local.

> But this inlining capabilities of the JVM is why it often outperform C++ even in single-threaded code.

I don't think you've made this case at all.

> All in all, Java usually comes up ahead in large codebases

Coincidentally it is large codebases where it is most difficult to compare apples to apples, because it is so unlikely that you can really get two large programs that do the same thing with the same design. I'm deeply skeptical of this claim.

> and falls behind in specific use cases (mostly numeric HPC).

Java falls behind it tons of cases. It loses in nearly all of the benchmarks game, only some of which are numeric: http://benchmarksgame.alioth.debian.org/u64q/java.php It loses very badly in this benchmark published a few years ago by Google, which is not numeric: http://static.googleusercontent.com/external_content/untrust...

Given comparable programs, C++ is nearly always faster. You really haven't offered any evidence that contradicts this basic fact. The only benchmark you cited didn't even have any C++ programs in it.

pron · on Oct 11, 2013

> A benefit of using C/C++ is that you are not limited to general schemes. If you have an allocation/deallocation hotspot, you can use more specialized schemes like arena allocation, which will beat a GC handily.

Absolutely. And you can do the same in Java with about the same amount of effort (lots of Java projects do that).

> it does have a memory model.

It does, and it probably fixes some mistakes in the JMM, but C++ got it, what, 10 years after Java? That's about how far behind C++ is in terms of concurrency. I know this because that's my business (concurrent data structures and high concurrency on many-core hardware in general - you really can't do C++ well on a many-core system; it doesn't scale unless it's embarrassingly parallel).

> A Java GC pause on the other hand is often on the order of 200ms.

It could actually be much worse. Fortunately, you could get a commercial GC that offers exactly 0 pauses (in fact, I suggest my customers do exactly that if they need 0 pause but fall short of true hard real-time; for hard real-time there's real-time Java with built in arena allocation with correctness guarantees of no pointers from the heap into the arena).

> Java falls behind it tons of cases.

I know I won't be able to convince you, but those benchmarks are completely irrelevant. I can easily make any single-threaded algorithm perform better in C/C++. But, as I've said in the beginning, Java comes out ahead when you have a large codebase with many team members, which requires software engineering that precludes good optimizations. This realization is a result of many years experience with both Java and C++. I don't have much hard evidence (some, but it's possibly anecdotal and also proprietary), but in our field, as in medicine, sometimes hard-earned first hand experience means a lot.

peterashford · on Oct 7, 2013

That's pretty snarky

haberman · on Oct 7, 2013

It's really irritating, as someone who works in C and C++ with good reason, to hear people continually deny the very real performance benefits of working at a lower level.

We (C/C++ guys) write the OS's, VMs, browsers, codecs, etc that power your software stacks, and then you (managed language fans) turn around and tell us we're wasting our time by using C and C++, while giving us stuff like Eclipse.

So yes, when someone says it's hard to beat Java with C++, it inspires some snark. Write the next JVM in Java if you think it's that easy.

com2kid · on Oct 7, 2013

> It's really irritating, as someone who works in C and C++ with good reason, to hear people continually deny the very real performance benefits of working at a lower level.

As someone who also works at a very low level, I also know the limits of precompiled optimizations!

In perfect theory land, a Sufficiently Smart JITTER will beat out a Sufficiently Smart Compiler, if for no other reason than the JITTER can always take advantage per CPU optimizations for CPUs newer than whenever precompiled code was compiled for! e.g. in theory code written ages ago gets free performance boosts.

JITTERs also have the benefit of knowing the state at run time. Doing things like only compiling code that is actually being used right now means in theory fitting more stuff throughout all layers of the memory subsystem, and we all know how important cache hit rates are to performance!

JITTERs also have access to the entire bytecode of a program, which lets it do even stranger optimizations if it so decides (again, sufficiently smart), where as a compiler cannot do much about libraries you link to dynamically (or even statically, doing optimizations on pure assembly is Not Fun)

Of course some compiler tools, such as Link Time Code Generation (also called Whole Program Optimization), and Profile Guided Optimizations can get you close to what a JITTER has by feeding the compiler a ton of additional data at compile time, but again all you have done is tried to give the compiler an approximation of what a JITTER already has available to it.

Now one thing C++ most certainly wins out on is that it is possible to create very thin lightweight wrappers around functionality, which will have huge perf gains in comparison to the multilevel abstractions that software engineers (myself included!) tend to enjoy creating when they get a hold of a VM based language (be it JVM, CLR, or pick your favorite bytecode).

illumen · on Oct 7, 2013

How can a JIT runtime be faster the first time code runs? How can it load faster if it has to load all the compilation code to run the process?

Yeah, each has their own benefits. AOT compilation is even theoretically faster in some situations, as well as being faster in practice.

vidarh · on Oct 7, 2013

> How can a JIT runtime be faster the first time code runs? How can it load faster if it has to load all the compilation code to run the process?

That depends on a lot of variables, such as where the code is loaded from, and IO capacity vs. CPU capacity, and code density. There's an interesting PhD thesis from ETH, back in 1994, on Semantic Dictionary Encoding (by now Dr. Michael Franz) that demonstrated a on-the-fly code generation system for Oberon where most or all of the code generation cost was covered by the reduced size of the executables on the then-current systems, which allowed loading the data from disk or network faster (the representation was in effect close to a compressed intermediate representation syntax tree, and was "uncompressed" by generating the code and reusing code fragments generated).

There's the difference between theory and practice though - I keep being disappointed every single time I try a Java based app. I don't know if it's the JVM or the compiler, or the language, or just the way Java developers write code, or if I'm just somehow fooling myself, but every single Java app I've used have felt horribly sluggish and bloated.

pron · on Oct 7, 2013

> every single Java app I've used have felt horribly sluggish and bloated.

It's the startup time (mostly compilation), compounded by the fact that Java loads classes lazily (so a class is loaded and compiled the first time you perform an action that uses it). Long-running Java server applications fly like the wind.

The JRE classes are, I believe, precompiled and saved in a cache. It is possible to add your own code to the cache to significantly reduce startup time.

BTW, the classes are not just compiled once. They're compiled, monitored, re-optimized and re-compiled and so on. It's quite possible for a Java app to take a couple of minutes before it reaches a steady state. Of course, loading more classes at runtime (or hot code swapping) can start the process again, as well as a change in the data that makes different optimizations preferable.

Sometimes, when going back to C, I'm amazed at how fast an application can start (I'm not used to that). But then I see performance after a few minutes and think, "damn, the current data dictates a certain execution path and all I have to rely on is the CPU's crappy branch prediction? where's my awesome JIT?"

rat87 · on Oct 7, 2013

Potentially it can save the jit results from the last execution.

pron · on Oct 7, 2013

There's a discussion in the Java community on how to do this best. Security is a problem. You need to make sure the compiled code matches the bytecode (that undergo security checks). But how do you compute a checksum for the compiled, cached code, if in order to test that checksum you need to re-generate it from the bytecode?

So Java caches the compiled code for the JRE classes only, and it's possible to add your own code to the cache (requires root permission, etc.)

kadaj · on Oct 7, 2013

Write JVM in Haskell, compile it and run natively.

pron · on Oct 7, 2013

I actually worked for years writing C/C++ software for the defense industry (including hard real-time and safety critical systems), and let me tell you that unless you write C++ in a very small team of experts, working months on micro-optimizations, your Java code will beat C++ nine times out of ten. We've since decided to switch to real-time Java even in our hard real-time systems and never looked back.

millstone · on Oct 7, 2013

My understanding is that real-time software optimizes for worst-case latency, even at the cost of making the overall program slower. That's pretty unusual, and most software prefers a different set of tradeoffs.

Consider a project like WebKit, which is definitely not comprised of a "very small team." Does anyone honestly believed that WebKit would be faster if it were written in Java?

pron · on Oct 7, 2013

No, but that's because WebKit has other requirements, like a short startup time as well as memory constraints. If WebKit were a long-running server process, a Java version could well be faster (although WebKit's most performance-critical code runs on the GPU anyway, where Java wouldn't have any advantage).

There are other issues as well (maybe I'll write a blog post about them): when it comes to throughput, Java is extremely hard to beat (i.e., it's certainly possible, but not without a lot of work); when it comes to latency, a Java project needs work, too; when it comes to a lot of concurrency, it's almost impossible to beat Java with C++, even with a lot of work (depending on the complexity of the concurrent code).

hershel · on Oct 7, 2013

If i'm not mistaken real-time java doesn't use the jit compiler(because it's pretty impossible to guarantee determinism with jit).

Is this true? And even with this constraint, java gets similar speed to c++ ?

pron · on Oct 7, 2013

Ah, the beauty of real-time Java is that it lets you mix hard real time code and soft- or non-realtime code in the same application, on different threads (and different classes).

Those classes designated as hard realtime will be compiled AOT, and will enjoy deterministic performance at the expense of sheer speed. Realtime threads are never preempted by JIT, or even by GC, as you run them within something similar to an arena memory region.

The idea is that only a small part of the application requires such hard deadline guarantees, and the rest should enjoy the full JVM treatment.

pjmlp · on Oct 7, 2013

> Write the next JVM in Java if you think it's that easy.

Jikes, http://jikes.sourceforge.net/

Oh and post Java 8, Hotspot might be replaced with Graal a new JIT done in Java,

https://wiki.openjdk.java.net/display/Graal/Publications+and...

haberman · on Oct 7, 2013

> Jikes, http://jikes.sourceforge.net/

That is not a JVM, that is a Java compiler. Surely you know the difference.

> Oh and post Java 8, Hotspot might be replaced with Graal a new JIT done in Java,

Good luck getting reasonable performance out of that.

But even if the JVM decides to go that way, a JIT is only one part of a VM. The interpreter and the GC will take particularly bad perf hits if you try to write them in Java.

lucian1900 · on Oct 7, 2013

I believe this is what was meant to be linked http://jikesrvm.org/

That's an entire JVM written in Java, with only a small bit of C to bootstrap the JIT.

pjmlp · on Oct 7, 2013

Correct, pasted the wrong link.

Thanks for pointing that out.

pjmlp · on Oct 7, 2013

A JVM does not require an interpreter, that is an implementation detail, not part of the language.

As for the GC, I can give the Squawk example which has a GC done in Java, targets embedded systems, with C/C++ only being used for hardware integration part.

http://www.sunspotworld.com/

https://java.net/projects/squawk/pages/Home

http://en.wikipedia.org/wiki/Squawk_virtual_machine

http://www.sunspotworld.com/docs/Yellow/javadoc/com/sun/squa...

http://dl.acm.org/citation.cfm?id=1094908

Developers should learn about compiler design and not mix languages with implementations.

haberman · on Oct 7, 2013

> As for the GC, I can give the Squawk example which has a GC done in Java, targets embedded systems, with C/C++ only being used for hardware integration part.

I didn't say it was impossible, just slow, and it is:

"My third reservation is Squawk's performance, which is roughly that of the J2ME-derived Java KVM introduced a few years ago, an interpreted JVM that is written in C. From everything I have learned about it, the developers are assuming that most applications for the SPOT platform will include processors of sufficient power and that large amounts of memory will be available for garbage collection, pointer safety, exception handling and a mature thread library for thread sleep, yield and synchronization. That is not true in many cases, and if SPOT is limited to only sensor applications that are not performance-constrained, the platform is interesting but not all that important in the long run."

http://www.embedded.com/electronics-blogs/cole-bin/4025677/S...

pjmlp · on Oct 7, 2013

It just needs to be fast enough to fulfill the required tasks, not a blazing thunder.

This is how C replaced Assembly in many use cases, C++ replaced C in many use cases, ...

vidarh · on Oct 7, 2013

A large part of the asm => C and C => C++ moves is that C and C++ very ,very strictly follow the principle that you pay only for what you use, and that pretty much all C and C++ environments makes inline assembler quite easy to do.

In other words: You can write C++ that is no slower than C, and you can write C that is only rarely slower than ASM, and in both cases, in the few situations where the performance difference is large enough to be noticeable and matter, you can still easily write asm.

This is also why C and C++ keeps being used in the face of so many higher level languages.

And I say this as someone who spends most of his time writing Ruby, and who haven't even bothered upgrading to MRI 2.x despite the performance improvements available.

In other words: I agree with your first line. But moving to C/C++ is a totally different tradeoff than moving to most higher level languages, most of which at the very least makes the performance characteristics much harder to predict.

pkolaczk · on Oct 7, 2013

> you pay only for what you use

This is only half of the equation. The other half is what are the costs of the things you do use. E.g. costs of exceptions, virtual calls, dynamic linking, dynamic memory allocation, concurrency, etc. Java is like an all-inclusive offer - you get more at a higher cost, probably with something you don't need, but often it is cheaper than to buy every service one by one.

pjmlp · on Oct 7, 2013

There are higher level languages like Modula-2 and Ada that offer type safety and once upon a time, had compilers with comparable performance to C, back when C was UNIX only.

Of course, 30 years of industry investment into C compiler backends changed this relation.

Unfortunately, thanks to Sun and Microsoft VM über alles attitude in the past decade, young developers tend to equate higher level languages with VMs and think C and C++ are the only languages with proper native code compilers.

derleth · on Oct 7, 2013

> you pay only for what you use

In assembly and C, but not C++, at least if you use exceptions:

http://yosefk.com/c++fqa/exceptions.html#fqa-17.1

Of course you can pare down C++ further and further, but then you have C with slightly different type-checking semantics and you might as well move to a language where 'new' isn't a reserved word.

swift · on Oct 7, 2013

I agree that the GC is a bad fit but I don't see any reason to believe that the interpreter is a particularly bad thing to write in Java. Can you elaborate on your reasoning?

haberman · on Oct 7, 2013

The usual suspects mostly (guards/bounds checks, excessive boxing/unboxing, hard to do clever things wrt. memory, etc), magnified because an interpreter is in the inner loop of an interpreted VM.

justincormack · on Oct 7, 2013

Interpreters are very hard to optimise automatically. A huge case statement for the whole language in effect. C compilers don't do that well, still often written in assembler.

vidarh · on Oct 7, 2013

But in this case they're talking about writing it in Java, and they have control over the Java compiler. In other words they have plenty of opportunities to ensure it optimises well:

- Focus on ensuring the overall pattern optimises well.

- Or recognising the special case of the JVM interpreter loop

- Or adding a pragma of some sort to trigger special optimisations.

Making it fast when compiled with an arbitrary compiler is another matter. But then even a lot of interpreters written in C often resorts to compiler specific hackery.

justincormack · on Oct 7, 2013

Recognising special cases is fragile. And the special optimisation may as well be "replace with this hand written code"... Compilers, well parts of them, are not really an example of the typical code you are trying to optimise.

pron · on Oct 7, 2013

First, nobody's telling you you're wasting time writing a kernel (or the JVM for that matter) in C++. Obviously, C/C++ has its place (kernel, drivers, VM, resource-constrained embedded systems). But a large Java application is usually more performant than the equivalent C++ code, especially if a lot of concurrency is involved.

Second, where it matters most, Java offers a similar level of access to the hardware (memory fences, CASs, etc.). There are areas where the JVM still needs improvement, though (mostly arrays of structs), but the cases where C++ is preferable to Java in large server-side application software are getting fewer by the day.

Third, there are performance benefits to working at a low level, but reaping them comes at a very high cost. Often you'll find that when working with a large team you need to make concessions on performance for the sake of maintainability. One example is virtual methods. They're good for maintainability and project structure, but bad for performance. Java solves this problem with its JIT. Another problem is memory allocation. Allocation schemes can get very complex in the face of concurrency (and damn hard if using lock-free DS). A C++ developer will usually prefer to tackle this with well defined allocation/deallocation policy and the possible use of mutexes; this harms performance. Java solves this problem with a good GC.

> while giving us stuff like Eclipse

Java on the client is far from optimal (although NetBeans beats most native IDEs I've seen, in terms of looks and performance). But you are completely unfamiliar with the entire Java landscape. Most Western defense systems (including weapon systems; including software that controls an aircraft carriers) are written in Java nowadays. Most banking systems and many high-frequency trading platforms, too.

* Unless one is working for a long time with a small team of experts on optimizing said C++ code.

hermanhermitage · on Oct 7, 2013

This is what happens when an important variable (the specific problem domain of performance comparison) is left as a free variable :-)

waps · on Oct 7, 2013

I think you'll find java's propensity to box everything to effectively mean that beating it is rather easy in pretty much any application that actually uses some memory.

As will java's insistence to do everything in UTF-16.

dman · on Oct 7, 2013

What?

luikore · on Oct 7, 2013

HotSpot is a large, complex application written in C++, it beats many metacircular VMs written in Java.

pjmlp · on Oct 7, 2013

Which might be replaced by Graal after Java 8, a new JIT done in Java.

zeckalpha · on Oct 6, 2013

LuaJIT is heavily researched and optimized.

http://article.gmane.org/gmane.comp.lang.lua.general/58908

lucian1900 · on Oct 6, 2013

Afaict this project isn't actually a JVM, it translates JVM bytecode to Lua and then runs that. It's perfectly reasonable to expect that to be very fast, assuming Lua(JIT) can compete with JVM in speed (which it does).

tinco · on Oct 6, 2013

There's usually a problem with that approach and that is when the virtual machine models don't match perfectly. This leads to any feature mismatches to be implemented in a roundabout way often leading to performance problems. An example of this is the projects that implement ruby virtual machines on the JVM and the CLR. Both VM's actually had features added later to better match Ruby's features.

If LuaJIT can really execute Java bytecode so efficiently that says a lot for the genericity of the compiler, which makes it pretty awesome tech indeed :)

lucian1900 · on Oct 7, 2013

I don't think those two cases are comparable.

Ruby is a fairly large and very generic language. It's possible to override pretty much anything, which makes running it efficiently on a runtime that doesn't is hard.

On the other hand, JVM bytecode is a tiny, concrete language, pretty much just saver assembly. Few operations are generic, and when they are they're generic on the type of something, not much else. Most languages that are already fast could probably run translated JVM bytecode fast.

(And of course LuaJIT is amazing, but for being that fast in the first place)

MrBuddyCasino · on Oct 6, 2013

Well he didn't write a fully jitted VM, this is just sort of translating the JVM opcodes into Lua scripts if I got that right, and then runs those with LuaJit 2.

LuaJIT 2 is known to be fast, but I too was impressed by the numbers, outperforming HotSpot is pretty good, even if its just a tiny microbenchmark.

copx · on Oct 7, 2013

Seems it could be even faster: http://www.freelists.org/post/luajit/ANN-luje-01

seiji · on Oct 6, 2013

It's using LuaJIT which regularly advances the state of the art in JIT'd language performance optimizations.

aaronblohowiak · on Oct 7, 2013

> In any non-trivial benchmark it usually manages about 50% to 75% of Hotspot.

hbbio · on Oct 7, 2013

It's an impressive project that shows the extraordinary work of a single developer, Mike Pall. LuaJIT is not only an incredibly beautiful JIT, the current project also demonstrates it can run generated code efficiently, which is probably the hardest thing to do for a language.

When implementing Opa in OCaml, we run into tons of problem since the OCaml runtime was unable to run generated code efficiently. One of its author, Pierre Weis told us at the time: "it's normal, OCaml is not meant to run generated code at all".

As a side note regarding Luje source code: Fossil must be great and all that but having the code on a publicly available Git repository would probably bolster contributions.

jsnell · on Oct 7, 2013

Luajit is of course awesome, but let's not get too carried away. It doesn't sound like this JVM has been used for running anything beyond extremely simple benchmarks. Of course you're not going to run into difficulties with the translation of a 15 line method.

I saw a talk Mike gave on Luajit last week, and one thing he credited Luajit's success on was that it was working on a rather high abstraction layer. Stuff like the loop optimizations and alias analysis work because the optimizer can essentially understand the code at a fairly high level, and is somewhat specific to Lua. When dealing with java translated to jvm bytecode naively translated to lua source you lose much of that context. (And at the time he didn't seem too convinced about using luajit as the backend for other languages.)

But even if you ignore the loss of context, I'm sure that various luajit limits would get hit with real life Java code. E.g. I'd guess there's a fundamental 65k limit for the number of constants and bytecode instructions in a single function.

sitkack · on Oct 7, 2013

.class files are a fairly simple mechanical translation from java into bytecodes making them easy to decompile. Many optimizations that a compiler would do for static code are not done as the JIT can decide at runtime.

The Java->OpenCL framework https://code.google.com/p/aparapi/ decompiles kernels written in Java class files back into algebra and reprojects them into OpenCL.

The peformance of a LuaJIT based JVM could actually be quite good depending on how much work is done in the decompilation stage.

silentbicycle · on Oct 7, 2013

Did anybody take notes from the Mike Pall's LuaJIT talk? I've been asking around on Twitter, but no luck so far. (My contact info is in my profile.)

justincormack · on Oct 7, 2013

JVM bytecode is quite nasty from this point of view; one of Mike's comments in the mailing list was to try Dalvik instead. It loses a lot of loop structure.

jonasb · on Oct 7, 2013

http://www.freelists.org/post/luajit/ANN-luje-01

beagle3 · on Oct 7, 2013

Is this talk available on the web in any form?

jsnell · on Oct 7, 2013

Sorry, it wasn't recorded.

doublec · on Oct 7, 2013

There's an interesting paper about a similar project that was done using Self [1]. "Design and Implementation of Pep, a Java Just-In-Time Translator written in Self"

It goes through details with benchmarks. The PEP code is available with the Self distribution. They were able to get faster than the Java implementation at the time. Back then Java was interpreted though.

[1] http://bluishcoder.co.nz/self/97-pep.pdf

pjmlp · on Oct 7, 2013

Languages are nor interpreted or compiled. It is all about implementations.

JulianMorrison · on Oct 7, 2013

Isn't this going to be intrinsically restricted to one numeric type? Lua numbers are all the same type (double by default).

justincormack · on Oct 7, 2013

No, LuaJIT has an ffi interface with support for any type, and you can just use them as if they are native types.

derleth · on Oct 7, 2013

Seems a bit like cheating if you do it that way, though.

justincormack · on Oct 7, 2013

Its not really, they are just native types, but defined with a C syntax so they are compatible with C. The code generation knows about them natively and optimises them.

derleth · on Oct 7, 2013

Eh, programs compiled to assembly language aren't limited to machine-word-sized integers and whatever the floating-point hardware (if any) natively supports. I'd be interested to know what its performance on integer code is, though.

swah · on Oct 7, 2013

Is LuaJIT bytecode a good target for toy languages?

taranka231 · on Oct 7, 2013

is there any java/lua interop?