WebAssembly for the Java Geek (javaadvent.com)
162 points by ingve on Dec 23, 2022 | 87 comments



While it is an informative article, I fail to be convinced that wasm is somehow superior (even for its current role).

Validation is faster due to the format being structured, but stack maps make it fast for class files as well. It is lower level, but is that really a good thing? It is quite trivial to just assign a huge ‘long’ array in JVM byte code for the same result. I will look at the linked GC proposal, because I haven’t kept up to date on that, but I think having a good GC is the single most important thing, as that is much harder to do than just providing a linear array. Which will in turn decimate the implementations (or at least mark a toy vs. production-ready divide).

Also, won't that structured nature make it a less-than-great compilation target? It means that certain, more niche languages will have to use ugly (and slow) hacks to get implemented. And since it is lowish level, it is much harder to optimize well (in that the exact semantics are already spelled out).


For many, language freedom/portability outweigh other concerns and this is where WASM has value. Running an algorithm written in Rust, Go, C, Zig, etc on the JVM with no JNI has value vs rewriting.

> It is lower level, but is that really a good thing? It is quite trivial to just assign a huge ‘long’ array to some jvm byte code for the same result

That's basically what the WASM implementations on the JVM currently do. But it is actually _higher_ level from a dev POV due to language choice (at a not-huge performance expense).


I'm considering building a Java compatible VM that only has static memory allocation to avoid GC.

Why hasn't this been done by anyone yet?

Also confused why this does not have a binary Windows release yet: https://github.com/bytecodealliance/wasm-micro-runtime

Edit: Epsilon is not the answer here you can stop mentioning that.


> Why hasn't this been done by anyone yet?

That depends what you mean by "this." The Java spec requires you to support `new A()` by returning a fresh object or throwing a VMError (like OutOfMemoryError). If by "this" you mean that allocations would succeed until some fixed amount of memory is exhausted, then, as others have pointed out, this has been done even in OpenJDK. If by "this" you mean that the allocation of some particular set of objects -- say, only those allocated during class initialisation -- would succeed regardless of memory consumption and all others would fail with an OutOfMemoryError, then I guess that it hasn't been done in that particular way because people haven't found it particularly useful as most Java programs would fail, but you can give it a try.

RTSJ, the specification for hard-realtime Java [1][2], actually goes further than that and supports both "static" memory (ImmortalMemory) and arenas (ScopedMemory). So if that's what you mean, then it has been done.

[1]: https://www.rtsj.org/specjavadoc/book_index.html

[2]: https://www.aicas.com/download/rtsj/rtsj_76.pdf


> Why hasn't this been done by anyone yet?

What would be the point? Java programs assume GC (or at least, they assume they can allocate memory and not worry about when it will be freed). If you don't have GC then you're not going to be compatible with extant Java programs, so what's the point in trying to be "Java compatible" at all?


Because 1) you don't want GC pauses, 2) you don't want GC code to bloat the VM, 3) you don't want memory leaks, 4) you don't want people not knowing what they are allocating, and 5) static allocation is enough to build anything.

6) int arrays do not have cache misses (but you might need to pad them to avoid cache invalidation) 7) parallel atomic multicore works on int arrays out of the box!


I assume that by "you" you mean "I"

From my perspective I am happy with the tradeoff of having GC pauses and not needing to manage memory manually.


1) Java has the very best GCs out of anything, to the point that I very much question anyone claiming to suffer from GC pauses on most workloads. Do you really have an application which suffers from that, or is that just the usual “GC bad” mantra? Your average C program may spend just as much time trying to malloc a given block in its fragmented heap as that.

2) Why exactly? 3) That’s why you have a GC. But if I try to understand what you actually mean (I assume you meant non-memory resources not being explicitly closed?), try-with-resources and cleaners solve the problem quite well.

4) Does it really matter, when the last 2 decades were spent optimizing allocation to the point where it is literally a pointer bump and like 3 basic, thread-local instructions? Deallocation can be amortized to practically zero cost (moving GC), at RAM’s expense. Where it really, really matters, you could always use ByteBuffers, the new MemorySegments, or just straight sun.misc.Unsafe pointer arithmetic. That string allocation will be escape-analyzed and stack-allocated either way.
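
For instance, the ByteBuffer route is as simple as this (a minimal sketch; the class name is just a placeholder):

  import java.nio.ByteBuffer;

  class OffHeapExample {
    public static void main(String[] args) {
      // off-heap storage: the GC never scans or moves this memory,
      // you decide the layout and the lifetime yourself
      ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024); // 64 KiB outside the Java heap
      buf.putInt(0, 42);                 // write an int at byte offset 0
      System.out.println(buf.getInt(0)); // read it back
    }
  }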

5) I don’t even know what you mean by static allocation. You mean like in embedded, having fixed-size arrays and exploding when the user enters a 32+1 letter text? I really don’t miss that. 6) Value classes are coming and will solve the issue. Though it raises the question, what’s the size of your average list? Also, how come it doesn’t matter for all the linked lists used extensively in C? Also, see point 4. 7) I don’t get what you mean here; do you mean not using atomic instructions, because Java doesn’t have out-of-thin-air values? There are rare programs that can get away with that, but I think Java has quite a good toolkit of synchronization primitives to build everything (most concurrency books use it for a reason).


Yes, I mean having limits on everything! Every input has a max value: players, message count, message length, weapons they can hold, etc. You can add more by recompiling, you should reuse them (for size but also to keep data contiguous) even if that is challenging under multicore utilization, and you can already add MUUUUUCH more than the CPU can handle to compute anyway, so this is not an issue.

The first and last bottleneck of computers is and will always be RAM: speed, size and energy. A 256GB RAM stick uses 80W!!!! Latency has been increasing since DDR3 (2007), and we have had caches to compensate for slow RAM since the 386 (1985) (3 always seems to be the last version, HL3 confirmed? >.<).

You need to cache-align everything perfectly: 1) all data has to be in an array (or a vector, which is a managed array, but I digress), 2) you need your types to be atomic so multiple cores can write to them at the same time without torn values (int/float), 3) you need your groups/objects/structs to perfectly fill (padded) 64 bytes, because then multiple cores cannot invalidate each other's cache lines unless they are writing to the same struct.

So SoA vs. AoS was never an argument! AoS where structures are exactly 64 bytes is the only thing all programmers must do for eternity! This is the law of both x86 and ARM.

So an array of float Mat4x4 is perfect and I suspect that is where the 64 bytes came from. But here is another struct just as an example:

  struct Node {
    int mesh, skin;   // 2 * 4 bytes  =  8
    Vec3 spot, pace;  // 2 * 12 bytes = 24 (3 floats each)
    Quat look, spin;  // 2 * 16 bytes = 32 (4 floats each)
  };                  // total: 64 bytes, one cache line


But things don’t fit into 64 bits all the time, and then you get tearing. This is observable, and now you have to pay for “proper” synchronization. Also, Apple’s M1’s reason for speed is pretty much bigger caches, so I don’t think it’s a good choice to go down this road.

Most applications have plenty of objects all around that are rarely used and are perfectly fine with being managed by the GC as is, and a tiny performance critical core where you might have to care a tiny bit about what gets allocated. This segment can be optimized other ways as well, without hurting the maintainability, speed of progress etc of the rest of the codebase.


64 bytes, 512 bits.

Cachelines are 64 bytes on all modern hardware.

They will probably never change this value ever.

Everything fits into 64 bytes if you make the effort.

And if it doesn't you have to use two Arrays of 64 byte Structures and pad the last.

This is non negotiable and I'm completely baffled nobody has mentioned this yet.

I call this law: Ao64bS (did I invent my first law?) :D


It may be a language barrier thingy, but then we are talking about different things. Also, that is architecture dependent.


Nope, on all modern CPUs (x86 and ARM) this is 64 bytes and has been for a looong time...


Cache lines are 128 bytes on M1.

But since it’s AMP and not SMP, sharing work across cores doesn’t necessarily work how you expect it to.


Can you ask the OS to give you a certain core type?

128 bytes is perfect, 2 x 64! So even if the risk of cache invalidation goes up, since two cores can now clash without writing to the exact same structure, the alignment still works!

Good job Apple!


There absolutely are modern systems that have e.g. 128 byte cache lines (M1).


> A 256GB RAM stick uses 80W!!!!

This seems very high to me, since most high capacity server sticks have no heatsink. Have you got a source?


None of that answers my question. Why would you want your VM to be Java compatible if it can't run Java programs? What are you even going to run on this VM, given that the only mainstream application languages without automatic memory management at runtime are C++ and Rust, to the extent that the latter qualifies as mainstream?


Javac is the point. To be able to code for my own engine without seg. faults. And to build something long term that improves slowly over time.


Why wouldn't you use C++?


1) You cannot deploy C++ across different architectures/OSes without recompile.

2) VMs avoid crashes upon failures that would cause a segmentation fault in native opcodes. With a VM you can keep the process from crashing AND get exact information about where and how the problem occurred, avoiding debug builds with symbols and reproducing the error on your local computer.

Right now I use this with my C++ code to get somewhere near the feedback I get from Java (but it requires you to compile with debug): http://move.rupy.se/file/stack.txt

The question you really should ask is why people are using C++ at all. Performance is only required in some parts of engines; having a VM without a GC on top should be the default by now (50 years after C and 25 years after Java).


Any reason not to do something like Rust + Wasm? Seems like it’d be a better fit.

I don’t have anything against Java, but it seems like you lose a lot of the benefits of using it when you take away allocation.


C + WASM maybe, but then I'm pretty sure the compiler will not be as good as javac at telling me about things that are wrong.

Java + WASM is probably what I'll end up with; I don't like the AoT step, but from a distance, we'll see.

I'm looking at all the RISC / stack-op / byte-code formats, and it's disheartening how many ways humans can copy the same thing differently.

C#, RISC-V, Java, WASM, ARM, 6502 ASM, uxn, lox the list goes on and on...


> Java programs assume GC

Only users of long-running programs assume GC; the program itself doesn't care.


The programmer most likely assumes short-lived objects are "free" (as they effectively are in Java), and can be allocated in loops and so on without filling up memory and without incurring any real penalty to performance (via TLAB or possibly elision).


Nobody did that because Java is a safe language by design. Manual allocation would make it an unsafe language. For the vast majority of applications correctness wins over performance.

Short lived objects are extremely cheap on the JVM with the right GC. Almost stack allocation cheap. So if you write code that avoids too many long lived objects there is almost no overhead to a GC.

For long lived objects GC based compaction can give you even some performance advantage over manual allocation due to better memory locality. But that heavily depends on the application of course.

In my experience people blame the GC way too early. Most often the application is just poorly written and some small tweaks can fix gc spikes.


Because there's no need? There's the 'Epsilon' no-op GC in OpenJDK: https://openjdk.org/jeps/318

If you keep your object allocation in check you can use this to guarantee no GC pauses.
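
For reference, Epsilon has to be unlocked as an experimental option; roughly like this (the class name is just a placeholder):

  java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC -Xmx2g MyApp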


Yes, but then you are downloading and running executables that contain the GC.


The Java "GC" also contains the memory allocator and handling for out of memory conditions, so if you want to use Java there needs to be 'something' there to call.


Isn't this already possible with -XX:+UseEpsilonGC? (disables the GC) And then you manage memory manually in big (primitive) arrays.


Beat me to this answer ;) - I presume you can just allocate objects but you have to keep those allocations in check to prevent the JVM from terminating when going out of memory.

Maybe object pooling (which helped performance in old JVMs in the 90s) will make a comeback? ;)


Object pooling, mutable objects, flyweight encoding[0], and being allocation-free in the steady state are all alive and well in latency-sensitive areas like financial trading, plenty of which is written in Java.

[0] https://github.com/real-logic/simple-binary-encoding/wiki/De...
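
For anyone unfamiliar, a pool in this sense is just something like this (a minimal sketch):

  import java.util.ArrayDeque;
  import java.util.function.Supplier;

  // fixed-size pool: everything is allocated once up front,
  // then acquired and released instead of new'd per use
  final class Pool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();

    Pool(Supplier<T> factory, int size) {
      for (int i = 0; i < size; i++) free.push(factory.get());
    }

    T acquire()       { return free.pop(); }  // throws if the pool is exhausted
    void release(T t) { free.push(t); }
  }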


I'm going to try and prevent heap allocation in runtime. But since I'm going to use javac it's going to be ugly.

Basically only static atomic arrays (int/float) in classes will be allowed, for cache and parallelism, and AoS up to 64 bytes encouraged to avoid parallel cache invalidation.
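
Something along these lines, as a sketch (names made up, and ignoring that Java gives no guarantee about the base alignment of the array itself):

  // "array of structures" packed into one static int array, each record
  // taking a 16-int (64-byte) stride so records never straddle a cache line
  class Nodes {
    static final int STRIDE = 16;                  // 16 * 4 bytes = 64 bytes
    static final int MAX = 4096;
    static final int[] data = new int[MAX * STRIDE];

    static int  mesh(int node)           { return data[node * STRIDE]; }
    static void setMesh(int node, int m) { data[node * STRIDE] = m; }
  }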

And I'm even considering dropping float, and only have integer fixed point... but then I'll need to convert those in shaders as GPUs are hardcoded to float.


Not to be too negative here but... why? You'd be creating something syntax-compatible with Java, but you wouldn't be able to use any existing Java code, and you'd lose most of the interesting features of Java (even string concatenation is handled by javac instantiating a new StringBuilder on your behalf). Aren't you just reinventing a worse C at that point?


See my other 2 responses.

On mobile here; I'd link them if I were on a PC.


What would be the point of that thing's relationship with Java though? Is it somehow important that the bytecode run by a tiny VM that is apparently distributed alongside could also run on a proper JVM?


Javac, I don't want to write a compiler.

The bytecode is only meant to run in my VM.


Not many people are motivated to take optimization this far in the Java world.

On the .NET side, the language is more suitable because you can define your own value types (coming soon in Java AFAIK), and there are a lot of people using C# for game dev, which is a big use case for GC optimization.

Unity has been integrating unmanaged allocators in their engine, which lets developers skip the GC much more easily (without having to manually preallocate and reuse memory, which isn’t anyone’s favorite workflow, and doesn’t save you much compared to a fast native allocator). I’ve also seen a couple of roughly equivalent projects, eg. Smmalloc-CSharp, github.com/alaisi/nalloc.


You can check Java Card to see how GC-less low-memory programming was sorted out there. https://en.m.wikipedia.org/wiki/Java_Card


> Why hasn't this been done by anyone yet?

I used to do Java apps which didn't allocate memory in the long run, and so did some banks for high-frequency trading, by reusing objects on the spot or using pools for objects with a life cycle. It imposes some programming style/patterns, in particular for API design, but it's perfectly doable, so I guess people prefer to just do that. In what context would that not be enough? For proof?


At that point, why even use Java?


See my other answers...


From another comment:

> Meaning you cannot type new in a method only in the class definition.

How would you handle the ~1000 java.* classes that even a minimal HelloWorld program uses? There are plenty of random allocations.


I'm not writing this for old code, I'm writing it for new code. Specifically 3D action MMO scripting.


Not sure how that would play out. Just not running the GC is an unsatisfactory answer. Otherwise what?


Basically you enforce static heap allocation, meaning you cannot type new in a method, only in the class definition. I don't even know if this can be enforced from the bytecode in the VM... just playing with it in my head.
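
Roughly what I have in mind, as a sketch (class and field names are just placeholders):

  class Bullets {
    static final int MAX = 256;
    // the only place 'new' appears: the class definition
    static final Bullet[] pool = new Bullet[MAX];
    static { for (int i = 0; i < MAX; i++) pool[i] = new Bullet(); }
    static int live;

    // methods only reuse what was allocated up front
    static Bullet spawn() { return live < MAX ? pool[live++] : null; }
  }

  class Bullet { float x, y, dx, dy; }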


That would violate the Java specification, i.e. it's not Java. On the other hand, what you could do is only allow allocations to succeed in class initialisation (Java's "static") and throw an error if done outside it. Note, however, that even adding an element to a Map outside of initialisers would fail when using OpenJDK's standard library even if both key and value have been preallocated in initialisers, as it may allocate an internal node and/or a new hash array. So you may want to change some of the standard library to make it more useful with your restrictions.

But you may want to ask yourself why you want to do that. OpenJDK's GCs have become really, really good in both throughput and latency, and the main thing you pay in exchange is memory footprint.


So the only place you really define what objects get created is in the class containing the main method? Seems fairly limited that you’d have to know upfront what your memory needs are? An app would have to allocate a large block of memory in the main class, and then handle memory management itself from this piece of memory?


Yes, you would have to overallocate and reuse.

This is the way I code C now already.


That would remove a lot of the benefits of using Java; you might as well use something like C. Garbage collection has been highly optimized in the JVM, and most would consider it a benefit of a JVM compared to managing memory yourself.


I’ve never understood it either. I’ve heard people doing similar things with Java in the low-latency trading area.

I mean Java isn’t the worst tool out there, but even C# outclasses it as a language. The biggest plus for Java is the massive, mature ecosystem which I imagine mostly evaporates when you’re only using a self restrictive subset of the language itself.


It doesn’t evaporate, more often than not only a small subset of these trading programs need that strict “no heap allocation” policy. The rest of the program is free to take advantage of the huge ecosystem.


I have a bit of a different idea along somewhat different lines. Tweeted at you.


Why do you want to avoid GC?


See above.


Just FYI - there is an open bug to target wasm for Graal Java compilation. https://github.com/oracle/graal/issues/3391

To a very large extent, it remains open until LLVM upstream is functionally complete enough to support GC.

Wasm doesn't support GC the way Java and Graal do, which is why Graal is a seriously impressive piece of tech. So it is not wasm vs. Java, because that is really apples versus oranges.

It is just a matter of time before Java gets compiled to wasm. That's what puzzles me about this article. Maybe it's really about Rust or Golang vs. Java.


Huh, author here. I didn't expect this to be on HN; please don't roast me too much :P


It is actually a pretty good article and I learnt quite a bit. Thanks for taking the time to write it. The only two things I would add are that, unlike Java, the industry has rallied unanimously behind it (in the 90s it was Java vs. .NET and Applets vs. ActiveX, etc.), and that Wasm comes pre-integrated in the browser, has a focus on security, and is a proper W3C standard. Everything else that came before it felt like some after-the-fact bolt-on (Flash, Applets, etc.), had big security issues, and often got out of sync with browser releases.


Thanks! Some of those considerations I made in the article I linked at the beginning of this post: "A History of WebAssembly" https://evacchi.github.io/posts/2022/11/23/a-history-of-weba...


I missed clicking on it, thanks for sharing!


There was no .NET in the 90's. It was Java vs. ActiveX vs. Flash. I agree that Java support in browsers was never as good as it could have been. Sun never had direct control over a mainstream browser, and the companies that did each had other technologies that they preferred (Microsoft - ActiveX, Netscape - Javascript).


I mean, Applets are just really old, security itself wasn’t considered all that “important” at the time. There is no real difference on a purely runtime level between wasm and the JVM to make one safer than the other. And regarding release strategy, that’s just politics.


Wasm could start out as a "function runner" box with no outside interface at all besides arguments and return value, with everything UI, storage or networking conveniently out of scope. Applets, in their day, had to be so much more than just the bytecode VM...


That was applets (used for a very different web than today's). The JVM format/VM itself can also work as a simple “function runner”; hell, without adding actual functionality you pretty much get this automatically from any runtime. Most brainfuck interpreters are trivially safe from deleting your hard drive; they simply don’t have fs access.


Security was considered just as important as it is today, in fact probably more because the default attitude towards the internet was not to trust anything. At this time the only relevant browser was IE, which was hostile to everything not MS and they would kneecap anything they could.


Come on, there was basically no concern for privacy for a very long time. The whole of IT was built upon the “no bad actors” assumption. It turned out to be a deeply flawed assumption and we're trying to fix it, but desktop OSs are still insanely unsafe.


There was enough concern for privacy that browsers were set up to warn for each and every cookie by default, and prompt the user before running any JavaScript on the page.


Not sure why, but some links in the article seem to be missing, for example one to the draft GC spec for WebAssembly (which should probably link to https://github.com/WebAssembly/gc).


You are right; I spotted a couple more broken links. They should be fixed now. Thanks!


For anyone who wants to experiment with WebAssembly in Java, we just shipped a Java Host SDK[0] for Extism[1], which makes it about as easy as possible to call wasm functions from your Java programs.

0: https://extism.org/docs/integrate-into-your-codebase/java-ho...

1: https://github.com/extism/extism


Author here, I just added a reference to Extism to the post! :)


hey thank you! appreciate you putting this article together.


Genuine question about GC'd languages on Wasm - how does the GC run?

When running a GC language in its own runtime, usually there is at least 1 GC thread, that runs concurrently with the app code and pauses/preempts the main thread.

AFAIK, in WASM your application only gets CPU time while you are inside a WASM function, and threading is incredibly hacky and may be disabled for security reasons related to side-channel attacks (read up on `SharedArrayBuffer` and cross-origin isolation).


Garbage collection isn't generally something that happens periodically (it isn't needed until you accumulate a bunch of new allocations), and classically it did not happen on some background thread, as that is both extremely difficult to pull off and requires threading support in the first place. The simpler way to think about garbage collection is: you establish a heap, set a size, and when you run out of space in that heap you run a garbage collection pass during the memory allocation that failed. If that fails to free enough memory, you either kill the program or request a larger heap.


This is the way it is described in the GC Handbook as well.

  New(): 
    ref <- allocate()
    if ref = null
      collect()
      ref <- allocate()
      if ref = null
        error "Out of memory"
    return ref


Currently there is no official support for GC in WASM, so any compilation of a GC'd language into WASM must provide its own GC, which becomes part of the WASM binary.


WASM is a virtual processor spec. There's no GC support in x64 assembly nor ARM assembly, and there's no GC support in WebAssembly, because they are all assembly languages. You get a hunk of memory and you manage it however you want. If you want GC, you implement GC the same way the JVM implements GC on various processors.


It runs alongside the application; basically you're back to using RC as the GC algorithm, or only collecting when memory needs demand a full GC.


Virgil compiles its tracing semi-space collector into the application. It's the same code as the native GC. The major difference is that the call stack is not visible, so the compiler must spill all live references on the stack into a shadow stack in memory. That last part was the trickiest and most inefficient part of porting.
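
Roughly the idea, as a toy sketch in Java form (not Virgil's actual code): each function pushes its live references onto an explicit array the collector can walk, since the real call stack is invisible to it.

  // toy sketch of a shadow stack for GC roots
  final class ShadowStack {
    static final Object[] roots = new Object[1 << 16];
    static int top;
    static void push(Object ref) { roots[top++] = ref; }
    static void pop(int n)       { while (n-- > 0) roots[--top] = null; }
  }

  // a compiled function's prologue/epilogue, conceptually:
  //   ShadowStack.push(a); ShadowStack.push(b);  // spill live refs where the GC can see them
  //   ... an allocation here may trigger a collection that scans 'roots' ...
  //   ShadowStack.pop(2);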


Yeah, but it doesn't run in parallel, does it?


No, you need an additional safepoint mechanism to do that.


Thanks!


Reference Counting, ok!



This seems a bit outdated. There was a recent blog post by intellij Raider that showed that you can just compile to wasm by adding the wasi nuget package, no extra steps required.


It's JetBrains Rider, not Raider... Plus the blog post I posted also talked about the out-of-the-box experience: installing one Wasi.Sharp package gets you WASI WebAssembly support for most scenarios... except it is still too big to put into Cloudflare Workers because it contains the entire Mono runtime...


+1 for TeaVM. Not only is there the Fermyon friendly fork, but WASM/WASI/debugging is seeing a large number of checkins in recent weeks in the main fork.

However, true "Java Geeks" should try the JavaScript support in TeaVM. It is mature and battle-tested. The resulting code runs great in a browser.

Live TeaVM-based game: https://frequal.com/wordii

Getting started with TeaVM: https://frequal.com/TeaVM/

Performance Comparison: https://frequal.com/java/TeaVmPerformance.html



