I'm curious to see what the performance impact of using 64-bit memory ends up being. WASM runtimes currently use a clever trick on 64-bit host platforms where they allocate a solid 4GB block of virtual memory up front and then rely on the MMU to catch out-of-bounds accesses by faulting on unmapped pages, so bounds checking is effectively free. That won't work with 64-bit WASM though, so runtimes will have to do old-fashioned explicit bounds checking on every single heap access.
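A minimal sketch of that reservation trick, assuming a Linux-flavoured mmap and glossing over how an engine commits pages on memory.grow and turns the resulting fault into a trap:

```c
#include <stdint.h>
#include <sys/mman.h>

#define WASM32_SPAN (4ULL * 1024 * 1024 * 1024)  /* the full 32-bit index space */

static uint8_t *reserve_linear_memory(void) {
    /* Reserve address space without committing physical memory; every
       page starts out inaccessible (PROT_NONE). */
    void *base = mmap(NULL, WASM32_SPAN, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return base == MAP_FAILED ? NULL : (uint8_t *)base;
}

/* The JIT can then lower every wasm32 load to a plain, unchecked access:
   any 32-bit index stays inside the reservation, so an out-of-bounds
   access can only hit an unmapped page and fault. */
static uint8_t load_u8(uint8_t *base, uint32_t index) {
    return base[(uint64_t)index];
}
```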
I also wonder what the perf overhead will be for programs that only need i32. I didn't dig deeply enough into the implementation, but these kinds of runtime-configurable options often cause perf regressions because of all the "ifs" and lookups of type configs/parameters, etc. I just imagine some inner loops where mem.get adds a few instructions from the "if u32/u64" check.
Unless it ends up being seamless / zero cost via loading different u32/u64 implementations.
I mostly agree with the old C++ mantra: no feature should have a runtime cost unless it is used.
I would hope it's set up such that the runtime always knows whether a memory access is 32-bit or 64-bit at JIT time, so it can generate unconditional native code. Runtimes that use an interpreter instead of a JIT might be hurt by having to branch every time, but interpreters being slow is already kind of a given.
Yes, the bytecode specifies whether a memory operation/address uses 32-bit or 64-bit addressing, so a runtime/JIT can determine this ahead of time during the static analysis phase.
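A toy illustration of why this doesn't need a per-access "if u32/u64" (all names here are hypothetical, not any real engine's API): the memory's index type is fixed in the module, so a JIT or compiling interpreter picks the code path once, at compile time.

```c
#include <stdio.h>

enum index_type { IDX32, IDX64 };

/* Stand-ins for real code emitters. */
static void emit_load_idx32(void) { puts("emit: zero-extend index + guarded load"); }
static void emit_load_idx64(void) { puts("emit: checked add + bounds check + load"); }

static void compile_load(enum index_type t) {
    /* Decided once during static analysis, not on every executed load. */
    (t == IDX32 ? emit_load_idx32 : emit_load_idx64)();
}

int main(void) {
    compile_load(IDX32);
    compile_load(IDX64);
    return 0;
}
```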
Can't the same trick be used for slightly more bits? 64-bit hardware uses 48 bits nowadays AFAIK, which is 256 TB, usually split 50/50 between user and kernel mode. If, say, WASM takes half of the user-mode address space, that's still 46 bits, which is 64 TB (ought to be enough for anybody?). Or maybe I'm way off; I don't really know the specifics of the trick you're referring to.
The trick takes advantage of 32-bit registers automatically being zero-extended to 64 bits.
It actually uses 8GB of allocated address space because WASM's address mode takes both a "pointer" and a 32-bit offset, making the effective index 33 bits. On x86-64, there are address modes that can add those and the base address in a single instruction.
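As a quick sanity check on that figure:

```c
#include <stdint.h>
#include <assert.h>

int main(void) {
    /* wasm32 effective index = 32-bit address + 32-bit constant offset,
       so the reservation has to cover just under 8 GB (plus guard space). */
    uint64_t max_ea = (uint64_t)UINT32_MAX + (uint64_t)UINT32_MAX;
    assert(max_ea == (1ULL << 33) - 2);  /* 8 GB minus 2 bytes */
    return 0;
}
```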
When that trick can't be used, I think the most efficient method would be to clamp the top of the address so that the max would land on a single guard page.
On x86-64 and ARM that would be done with a `cmp` and a conditional move. RISC-V (RVA22 profile and up) has a `max` instruction. That would typically be one additional cycle.
The new proposal is for using a 64-bit pointer and a 64-bit offset, which would create a 65-bit effective index.
So neither method above could be used. I think each address calculation would first have to add and check for overflow, then do the bounds check, and then add the base address of the linear memory.
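Roughly, that per-access sequence could look like the following C sketch, assuming the engine tracks the current memory length (names are hypothetical, and the overflow check uses a GCC/Clang builtin):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical wasm64 load path: a 64-bit address plus a 64-bit constant
   offset can produce a 65-bit effective index, so overflow is checked first. */
static bool wasm64_load_u8(const uint8_t *mem_base, uint64_t mem_len,
                           uint64_t addr, uint64_t offset, uint8_t *out) {
    uint64_t ea;
    /* 1. Overflow-checked add; wrapping around means out of bounds. */
    if (__builtin_add_overflow(addr, offset, &ea))
        return false;                       /* trap */
    /* 2. Explicit bounds check against the current memory size. */
    if (ea >= mem_len)
        return false;                       /* trap */
    /* 3. Only now add the linear-memory base and perform the load. */
    *out = mem_base[ea];
    return true;
}
```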
> RISC-V (RVA22 profile and up) has a `max` instruction.
You know, it's kind of insane that things like this are nigh impossible to learn from the official RISC-V site. Searching the site itself doesn't yield anything; you have to go to the "Technical > Specifications" page, which has PDFs on the basic ISA from 2019 but not on the newer frozen/ratified extensions, plus a link to the "RISC-V Technical Specifications" page on their Atlassian subdomain, where you can find a link to a Google doc called "RISC-V profiles", and there you will learn that, indeed, the Zbb extension is mandatory for the RVA22U64 profile (and that, apparently, there is no RVA22U32 profile).
And of course, you have to know that Zbb is the extension that has the max/min/minu/maxu instructions in it; grepping for "min" or "max" won't find you anything. Because why bother listing the mnemonics of the mandatorily supported instructions in a document about the sets of instructions with mandatory support? If someone's reading a RISC-V document, they've obviously already read all the other RISC-V docs that predate it; that's the target audience, after all.
> When that trick can't be used, I think the most efficient method would be to clamp the top of the address so that the max would land on a single guard page.
If you are already doing a cmp + cmov, wouldn't you be better off just doing a cmp + jmp (to an OOB handler)? The cmp + jmp can fuse, so it's probably strictly better in terms of execution cost, and it doesn't add to the critical data-dependent chain of the load address, which the cmov would lengthen by a couple of cycles.
Of course, it does require that you have these landing pads for the jmp.
They won't be mispredicted nor take predictor resources since the default prediction is "not taken" and these branches are never taken (except perhaps once immediately before an OOB crash if that occurs). So they are free in that sense.
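For concreteness, a sketch of the clamp-onto-a-guard-page variant being discussed (hypothetical names; the branchy alternative is essentially the explicit check shown a few comments up):

```c
#include <stdint.h>

/* mem_len points just past the committed memory, where a single guard
   page is left unmapped; clamping redirects every out-of-bounds effective
   address onto that guard page, and the fault becomes a trap.  Compilers
   typically lower the ternary to cmp+cmov on x86-64/ARM, or to min/max
   on RISC-V with Zbb. */
static uint8_t clamped_load_u8(const uint8_t *base, uint64_t mem_len,
                               uint64_t ea) {
    uint64_t idx = ea < mem_len ? ea : mem_len;
    return base[idx];
}
```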
A saturating add instruction could help do the same trick without checking for overflow first, although they seem fairly uncommon outside of SIMD instruction sets (AArch64 has UQADD, for example).
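Something along these lines, assuming an AArch64 target where the scalar NEON intrinsic vqaddd_u64 maps to UQADD (sketch only):

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* With a saturating add, addr + offset can't wrap: an overflowing sum
   saturates to UINT64_MAX, which the subsequent bounds check (or
   guard-page clamp) rejects, so no separate overflow branch is needed. */
static uint64_t effective_address(uint64_t addr, uint64_t offset) {
#if defined(__aarch64__)
    return vqaddd_u64(addr, offset);
#else
    /* Portable fallback with the same saturating behaviour. */
    uint64_t ea = addr + offset;
    return ea < addr ? UINT64_MAX : ea;
#endif
}
```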
The problem is you need the virtual memory allocation to span every possible address the untrusted WASM code might try to access, so that any possible OOB operation lands in an unmapped page and triggers a fault. That's only feasible if the WASM address space is a lot smaller than the host address space.
I suppose there might be a compromise where you cap a 64-bit WASM instance to (for example) an effective 35-bit address space, allocate 32GB of virtual memory, and then generate code which masks off the top bits of pointers so OOB operations safely wrap around without having to branch, but I'm not sure whether the spec allows that. IIRC illegal operations are required to throw an exception.
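That masking compromise would be something like this (sketch only; as noted, it changes trapping behaviour, so it may not be spec-conformant):

```c
#include <stdint.h>

/* Cap the instance at an effective 35-bit address space (32 GB reserved)
   and mask indices so out-of-range accesses wrap instead of branching. */
#define ADDR_BITS 35
#define ADDR_MASK ((1ULL << ADDR_BITS) - 1)

static uint8_t masked_load_u8(const uint8_t *base, uint64_t ea) {
    return base[ea & ADDR_MASK];   /* always lands inside the 32 GB reservation */
}
```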
32-bit memory addressing means you only have 4GB, not 32GB.
The company I currently work for makes radiotherapy software. Most of our native apps run in clinics under .NET.
There are some cases where we want users or employees to be able to access our radiotherapy tooling from a browser. Microsoft has a pretty phenomenal .NET -> WASM compiler. We can compile our .NET DICOM image viewer tooling to WASM, for example, and it is more performant and full-featured than Cornerstone.js (https://www.cornerstonejs.org/).
However, medical imagery is memory heavy. We frequently run into the 4GB limit, especially if the CT/MR image uses 32-bit voxels instead of 16-bit voxels. Or if the field of view was resized. Or if we accidentally introduce an extra image copy in memory.
I agree that wanting more than 32 bits of addressable space (not 32GB of memory) in a web app seems excessive.
However, the real win is the ability to use 64-bit types more easily, if for nothing else than that it simplifies making wasm ports of desktop libraries.
It goes beyond that. Many languages expect that you use types such as size_t or usize for things that are conceptually collection sizes, array offsets, and similar things. In some applications, it's common that the conceptual collection is larger than 2^32 while using relatively little memory. For example, you could have a sparse set of integers or a set of disjoint intervals in a universe of size 2^40. In a 64-bit environment, you can safely mix 64-bit types and idiomatic interfaces using size_t / usize. In a 32-bit environment, most things using those types (including the standard library) become footguns.
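A small C illustration of that footgun (hypothetical code, just to show the shape of the problem): the data structure itself is tiny, but the conceptual positions need 40 bits, and funnelling them through size_t silently truncates on a 32-bit target.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* A sparse structure over a 2^40 universe: tiny in memory, huge in index space. */
struct interval { uint64_t start, end; };

static size_t to_index(uint64_t position) {
    return (size_t)position;  /* fine with a 64-bit size_t, truncates with a 32-bit one */
}

int main(void) {
    struct interval iv = { (1ULL << 40) - 100, 1ULL << 40 };
    size_t idx = to_index(iv.start);
    /* With a 64-bit size_t this prints the full position; with a 32-bit
       size_t the top bits are silently dropped and lookups go wrong. */
    printf("%zu\n", idx);
    return 0;
}
```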
I work in bioinformatics. A couple of times a year I check if browsers finally support Memory64 by default. They don't, and I conclude that Wasm is still irrelevant to my work. I no longer remember how long I've been doing that. Cross-platform applications running in a browser would be convenient for many purposes, but the technology is not ready for that.
One could argue that size_t should be 64 bits on wasm32, since it's a hybrid 64/32-bit platform (and there are the ptrdiff types too, which would then depend on the pointer size), but I guess sizeof(size_t) != sizeof(void*) breaks too much existing code.
For example, in one project we needed to rely heavily on markdown and needed the exact same markdown renderer on both server and client. That alone made us choose node.js on the server side, so that we could use the same markdown module.
Today, I'd probably find a Rust/C/etc. markdown renderer and compile it to wasm, then use it on the server and client as is.
This is a silly example, but wasm being a universal runtime would make interfacing things a lot easier.
Ah, also: things like Cloudflare Workers let you run wasm binaries on their servers. You can write it in any language that can target wasm and you have a universal runtime. Neat.
You can embed a C/C++ program into arbitrary places using WASM as a runtime, so if you have any C++ program you want to automate, you can "lift and shift" it into WASM and then wrap it in something like TypeScript. This is surprisingly useful. WASM also removes sources of non-determinism, which may enable you to do things like aggressive caching of programs that would normally be slightly non-deterministic (imagine a program that uses a HashMap internally before dumping its output). I use this to run FPGA synthesis and place-and-route tools portably, on all operating systems, with 100% deterministic output: https://yowasp.org/
memory64 support will be very useful, because many non-trivial designs will use a lot more than 4GiB of RAM.
Some of us are trying to convince the Node team that pointer compression should be on by default. If you need more than 4G per isolate you're probably doing something wrong that you shouldn't be doing in Node. With compression it's not actually 4GB, it's k * 4GB.
Java pointer compression promises up to 32GB of heap with 32-bit pointers, for instance.
If some subset of pointers has a guaranteed alignment of 2^N bytes, then the least significant N bits are always zero and don't need to be stored explicitly (they are only restored when dereferencing the pointer).
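A minimal sketch of that alignment-based scheme (in the spirit of Java's compressed oops, simplified; names are hypothetical): with 8-byte-aligned objects, a 32-bit compressed pointer covers 2^32 x 8 = 32 GB of heap.

```c
#include <stdint.h>

#define ALIGN_SHIFT 3   /* objects are 8-byte aligned, so 3 low bits are always zero */

/* Store only the shifted offset from the heap base... */
static uint32_t compress_ptr(const uint8_t *heap_base, const void *p) {
    return (uint32_t)(((const uint8_t *)p - heap_base) >> ALIGN_SHIFT);
}

/* ...and restore the full pointer on dereference. */
static void *decompress_ptr(uint8_t *heap_base, uint32_t c) {
    return heap_base + ((uint64_t)c << ALIGN_SHIFT);
}
```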
Look son, the only way we're gonna get anything done is abstracting the abstractions so we can actually get some abstracted code running on the abstracted abstractions. That means we need 128 gallonbytes of abstracted neural silicon to fit all our abstracted abstractions written by our very best abstracted intelligences.
Then why didn't Java do better? Its tagline was write once, run everywhere.
I remember that back in the day setting up cross-compiling was horrendous, though, so I agree; I just don't think it's the only reason. These days all you do is set a flag and rerun "go build"; it's stupidly easy, as far as compiling goes.
The other two things that come to mind: first, on the web users expect things to look different, so the fact that your cross-compiled app looked/behaved like ass on at least one platform (unless you basically rewrote the front end to conform to each platform's user interface guidelines, aka write once, rewrite everywhere) meant that websites could look more like how the company making the website wanted them to look, and less like how Redmond- or Cupertino-based companies wanted them to look.
The real killer feature, though, imo, was upgrading of software. Customer support is a big expense that ends up sinking developer time, and if you got bug reports and fixed the problem, you'd keep getting bug reports for that issue, and the CS team would have to figure out which version the customer was on, buried three menus deep, before getting them to upgrade. The website, however, is basically always running the latest version, so no more wasting everyone's time on an old install on a customer's computer. And websites showed up in metrics for management to see.
> Then why didn't Java do better? Its tagline was write once, run everywhere.
Because Sun, and then Oracle, never thought the web would take off. Sun had a ball of gold in its hands with the HotJava browser, but they thought the web was a fad and abandoned it. They should have continued developing HotJava and pushed hard for the JVM inside the web browser as a public standard; then Java would have been the absolute dominant language on Planet Earth today, and the rest would be crying in a dark corner.
Another problem was the dot-com bubble bursting in 2001. That crash made a lot of senior executives and investors think the web was merely a hype technology and disinvest from most front-end efforts. Google proved them completely wrong later.
> Since there are many questions about the way the TIOBE index is assembled, a special page is devoted to its definition. Basically the calculation comes down to counting hits for the search query
>
> +"<language> programming"
I don't think the popularity of a programming language can be measured by how many hits it has on search engines. For example, it may well be that 50% of those hits are forum posts from people frustrated because the language sucks. In addition, the fact that a language is in use in a lot of existing systems says little about when that code was written and which options were available at that time.
A major factor was that early Java in the browser did not support .jar (really .zip) files. This meant every class (and every inner class) required a separate HTTP request (on the much slower HTTP of the day).
You used to have to put everything in one giant class to work around this.
I don't disagree that it came down to UI framework support. It came down to Qt and Wx, and neither was a clear winner. The problem was that nobody had a broader ecosystem incentive to make a toolkit that kicked ass on all platforms. Qt's vendor had a direct interest, but it was selling a toolkit and could not make it free/gratis, as selling it was their business model. Maybe Firefox with XUL, but they had a vested interest in promoting web sites.
Is there movement on allowing larger memory usage for JS in browsers too? It's pretty limiting in some cases, say if you want to open a large file locally.
You can open a large file using a File or Blob, which do not store all the data in memory at once, and then read slices from it. That even works synchronously, in Workers.