Show HN: I spent my vacation writing a modern JVM assembler (github.com/roscopeco)
218 points by noone_youknow on June 2, 2022 | 76 comments



Nice project! At Guardsquare, we have a similar project: https://github.com/Guardsquare/proguard-assembler

One of the things we use it for is testing: we can craft specific bytecode sequences that we want to test; for example, it's useful to test cases we've seen in the wild (e.g. obfuscated code) or to create a test that doesn't rely on a specific compiler/version.

An example of using the assembler from a ProGuard unit test https://github.com/Guardsquare/proguard/blob/master/base/src...


Awesome, that's a very interesting project! Thanks for the link :)


I had a lot of fun doing this, and along the way I learned a lot about writing Gradle and IntelliJ plugins, and I hope others might find this fun too :)


This looks great! I love that you kept the shape of Java classes and method definitions but with the low level signatures.

Over a decade ago, I used to maintain Jasmin and some other class file tooling :)


Thanks :) Small world, I currently help maintain Jasmin (via involvement in Code Shelter).


I implemented a .class disassembler a while back by literally reading the spec. It's surprisingly accessible.
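To give a feel for how approachable the format is, here is a minimal sketch that reads only the first eight bytes of a class file: the magic number and the two version fields. The class name `ClassHeader` is made up for illustration and is not taken from any project mentioned here; a real disassembler would go on to parse the constant pool, fields, methods and attributes.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only.
class ClassHeader {
    final int minor;
    final int major;

    private ClassHeader(int minor, int major) {
        this.minor = minor;
        this.major = major;
    }

    // Layout per the class file format: u4 magic, u2 minor_version, u2 major_version.
    static ClassHeader parse(byte[] classBytes) {
        ByteBuffer in = ByteBuffer.wrap(classBytes); // ByteBuffer is big-endian by default
        int magic = in.getInt();
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("Not a class file, magic was 0x" + Integer.toHexString(magic));
        }
        int minor = in.getShort() & 0xFFFF; // treat the u2 fields as unsigned
        int major = in.getShort() & 0xFFFF;
        return new ClassHeader(minor, major);
    }
}
```

For example, a class compiled for Java 8 carries major version 52.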


It really is. We've Sun to thank for that I think.


Yeah, I had to write a small bytecode patcher from scratch the other day and was amazed at the quality of the writing in the spec. Took me all of 2h to implement what I wanted, with no real prior knowledge of JVM bytecodes and class file structure.


Yeah, we had to create a Java compiler and a Java interpreter in the compiler class I took (of course, it covered just a small subset of the runtime).

Pretty fun project for school, but tedious.


Very cool! Reminds me of a little toy (and very unfinished) project I worked on almost a decade ago (jeez, can't believe it's been that long): https://github.com/dvx/jssembly

Even our syntax is somewhat similar, though you're using plugins for cleaner code which is an awesome idea.


Awesome, I'm gonna have to take a good look at your project, it's very interesting! Thanks :)


Bravo! You and I have very different ways to decompress during a vacation.


Haha yes, I'll admit there are probably better ways to decompress :D


I dunno about OP, but it's not that I hate programming; the stress comes from dealing with the people and sometimes working on draining problems. The more frustrated I get at work, the higher my S.O. reputation gets, because I find comfort in helping to answer impactful questions.


This is awesome.

I have been toying with writing my own little JVM language (for education purposes) and this could go a long way to making that a bunch easier.

Thanks for sharing!


For the code generation phase, it might be easier to bypass emitting a text-based assembly format and just use the ASM library directly. If you want something easier, I wrote a code generation library specifically for that purpose: https://github.com/cojen/Maker It also includes an example "Mini-C" language compiler.


Something else that could make your code generation for your JVM language easier: ProGuardCORE (https://github.com/Guardsquare/proguard-core). It can be used to read, generate and analyse Java bytecode.

Some examples for code generation: ProGuard where the project originated (https://github.com/Guardsquare/proguard), Brainf*ck compiler (https://github.com/mrjameshamilton/bf), Lox compiler (https://github.com/mrjameshamilton/klox)

Disclaimer: I work at Guardsquare on ProGuardCORE so may be biased :)


What's the main difference compared to the ASM library? When I was still doing JVM stuff, it was really at the right abstraction level for most things. (I was recently doing some .NET IL and gotta say that MS made it even easier to get started with DynamicMethods.)


ProGuardCORE is very similar to ASM; they have a similar feature set and a similar visitor-based API. ProGuardCORE came about because ProGuard itself never used ASM when the project was started 20 years ago.

We split out ProGuardCORE from ProGuard 2 years ago, to provide as a library the core underlying bytecode manipulation & analysis tools that ProGuard and the Android security solution DexGuard both use.

Much of our recent focus in ProGuardCORE has been on the code analysis components: we've recently added to ProGuardCORE some of the analyses that power our AppSweep mobile security scanning service (https://appsweep.guardsquare.com/). For example, taint analysis https://guardsquare.github.io/proguard-core/taintcpa.html


Awesome! I started this for educational purposes too, but if it turns out to be useful then I'll be very happy! Hope it turns out that way (and let me know if there's anything else you need to make it useful! :) )


If you haven't heard of it already, check out Truffle/GraalVM.


I have but Truffle was a bit further than I wanted to go for now. If I wanted to build a production ready language it would likely be my first choice though.


I used to use Jasmin to generate invalid code to test an old JVM. I assume this has the same relaxed attitude that will let you generate invalid code?


It sure will, as an assembler it only does the bare minimum of validation, so you should totally be able to generate "dodgy" code and see if it works on older JVMs :)


My question (if I may): when people say the bytecode is "platform independent", exactly what platform is it being freed from?

The compiler? The "cloud"?

And this goes back to my lack of understanding of what "publishing" a package means to Oracle and the ecosystem in general.

If I publish something on Maven, my guess is that it is transformed into bytecode directly, so that when I pull my package from a diff device the code can run. In this case if the entire "internet" collapses then the only way to run my code would be a flashdrive with the .java file run via a compiler... unless there is a machine that has the JVM installed WITHOUT the compiler... THEN the bytecode is useful... Am I correct?


It is independent of the processor architecture, and to a large extent the operating system. (As opposed to an ordinary binary executable, which depends on both.)

The compiler turns Java source into bytecode. The JVM runs the bytecode. The JVM doesn’t need the compiler to run.


Thanks for your response.

What I get from this is that the processor arch. is dictated by the compiler at compile time, and by writing to bytecode directly we are skipping this.

I thought the proc. arch. was established by the manufacturer and was unavoidably implemented by the JNI at runtime... unless there is an interface on top of the JNI to deal with the processor. But that seems absurd... why do this at compile time?


The JVM is a native-code application (i.e., running on the processor directly) that executes the bytecode by interpreting it (starts fast, but runs slowly) and/or converting it into native code and running that code on the processor directly (delayed by the conversion step, but then runs fast). The latter step, translating bytecode to native code, is performed by the just-in-time (JIT) compiler.

So somewhat confusingly, Java typically involves two compilers: one from source code to bytecode, the other from bytecode to native code. By nature, the first is processor-independent, the second is processor-dependent. The second happens at runtime on the target machine — thus, “write once, run anywhere”.


It compiles code to an "idealized" instruction set. The JVM then maps that instruction set to a real one at runtime, in some cases first through interpretation, but mostly by recompiling it (potentially better and better as more statistics come in).

A big advantage of this approach (beyond instruction set independence) is that the JVM knows exactly what the characteristics of the environment are (cache size, memory size, processor extensions, etc.) and can decide at runtime how best to compile. This is in contrast to, say, a C compiler, which gives users compile switches to pick a target environment (a choice that is only sometimes correct).


Compiling to a sort of intermediary representation has distinct advantages over either distributing pure source or pure machine code.

Pure source-based interpretation, where you build software to look directly at human-written text and run a program based on those instructions (eg, BASIC, Python, Lua), is inherently slowed by necessitating that the program be, well, parsed and validated on-the-fly. A lot of the processing time is eaten up by simply understanding each line of textual source code and making decisions about what code to run based on that. However, the advantage is that code written in one of these languages can be dropped onto another machine running an entirely different operating system or processor architecture and run just fine, since it's the language interpreter which is tasked with interpreting the human-readable instructions, and therefore only that interpreter needs to be ported across platforms.

Pure compiled code, eg C/C++, Rust, etc have the advantage of being very fast. When your final product is pure machine code, there is no translation necessary. You just feed your instructions straight to the CPU, and the CPU does its job. However, the tradeoff here is that you need to produce many different sets of instructions, one for each operating environment + processor architecture combination you wish to support. Essentially, the tradeoff of compilation versus interpretation is whether you can justify the up-front cost of maintaining platform-dependent builds of your software, or whether you can justify the downstream cost of slower operating performance.

But there is a middle-ground. A sort of combined approach between compilation and interpretation, which gains us some of the advantages of both, but also pulls in some of the drawbacks. In this "hybrid" approach, software is compiled to an intermediate representation, in this case Java bytecode. You can think of this bytecode as machine code for some abstract machine, the "Java Virtual Machine." So the target processor architecture for this compilation step is therefore the JVM, and our final shippable product is that compiled artifact. We then ship our compiled bytecode, and require end users to have a copy of the JVM on their systems to then execute our bytecode. This JVM is an interpreter of Java bytecode; it reads this bytecode, and makes decisions about which pathways in its code to execute for each given JVM instruction. This Java bytecode interpreter is therefore the platform-dependent component that needs to be maintained and ported across to different processor architectures/operating environments, and yes, you need a copy of this to run any compiled Java artifacts. However, being that Java bytecode is a more rigid format, optimized for simplicity and machine-readability rather than human-readability and expressiveness, the interpreter for Java bytecode is much, much simpler to develop and maintain, and the simplicity of each instruction combined with earlier decisions made by the compilation step allow us to apply optimizations to the bytecode while we're interpreting it, much like a real CPU would do. The advantages of this hybrid approach are thus that we maintain platform independence of interpreted code, but at a reduced ongoing cost as our interpreter component is much simpler; additionally, because of the optimizations we can apply both during the initial compilation step and during interpretation time, we are able to gain quite a fair bit of speed back when compared to standard interpretation. 
The drawbacks to this hybrid approach, however, are that you do need that external interpreter component in order to execute the final product, and that interpreter still needs to be maintained for each platform you want to run your code on; additionally, because we still use a compiler step along the line, there is still an upfront time cost of translating our human-readable code into bytecode.
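To make the "bytecode interpreter" idea concrete, here is a minimal sketch of a switch-based interpreter for a handful of int instructions. The opcode values are the real JVM ones, but the class `TinyInterpreter` is hypothetical and ignores almost everything an actual JVM deals with (frames, local variables, objects, exceptions, verification, JIT compilation).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical toy interpreter for illustration only.
class TinyInterpreter {
    // Real JVM opcode values for a few int instructions.
    static final int ICONST_1 = 0x04;
    static final int ICONST_2 = 0x05;
    static final int BIPUSH   = 0x10; // followed by one signed byte operand
    static final int IADD     = 0x60;
    static final int IMUL     = 0x68;
    static final int IRETURN  = 0xAC;

    // Executes the code array on an operand stack and returns the value left for ireturn.
    static int run(byte[] code) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0; // index of the next byte to read
        while (pc < code.length) {
            int opcode = code[pc++] & 0xFF;
            switch (opcode) {
                case ICONST_1: stack.push(1); break;
                case ICONST_2: stack.push(2); break;
                case BIPUSH:   stack.push((int) code[pc++]); break;
                case IADD:     stack.push(stack.pop() + stack.pop()); break;
                case IMUL:     stack.push(stack.pop() * stack.pop()); break;
                case IRETURN:  return stack.pop();
                default:
                    throw new IllegalArgumentException("Unsupported opcode 0x" + Integer.toHexString(opcode));
            }
        }
        throw new IllegalStateException("Reached the end of the code without ireturn");
    }
}
```

Feeding it the bytes for `iconst_2, bipush 20, iadd, iconst_2, imul, ireturn` computes (2 + 20) * 2. Only this loop would need porting to a new platform; the bytecode itself stays the same.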

This hybrid approach is actually used by some compiled languages. LLVM-based compiler tools, for instance, compile source code into an intermediate representation known as "LLVM IR." However, this intermediate code is not shipped to consumers; rather, it's simply used to apply further optimizations to the code now that it's in a simpler format.

So, longwinded explanations aside... yes. You compile Java code to bytecode, and ship that bytecode rather than either your raw source code or your completely compiled code. That bytecode should be runnable on any platform you're able to get it onto, provided you have a functioning JVM built for that target system. Should the "internet collapse," you will need some way to source a Java Runtime Environment in order to run any compiled Java programs you have, and you will need a Java Development Kit in order to compile and run any Java source code.


Thanks for your explanation, as is often the case for me when I encounter this level of insight, I need to take a break, relax... and digest everything said word for word, this involves opening some tabs and a lot of reading, but these answers are always appreciated, thanks.

BTW you made me lol when you followed my ridiculous "internet collapse" scenario.

I'll be definitely coming back to this answer.


Actually, LLVM IR is shipped to consumers in the Apple ecosystem, mainly on watchOS; the difference is that it isn't the same out-of-the-box LLVM IR used by the LLVM project, but rather a massaged version that Apple uses as a distribution format.


These are the type of posts that make some of us noobs feel like imposters, lol. (been coding webstack-ish for 10 years, fwiw, can't even begin to write an assembler though :shrug:)

Good on you though, we need inspiration too! And you definitely should feel great about pulling that off! I am imagining it must have been some really fantastic "flow" time! How did you feel while writing it? Did you have family or friend distractions? Did you feel super focused? What is that story?

What are some of your focus patterns that you have identified? What gets you motivated when you feel stuck?


> can't even begin to write an assembler

An assembler is a 1-1 mapping. If you can read some text, and you can write some bytes, then you can write an assembler. Everything in between is just a case of reading the manual! It's just some ifs - if the user's written this instruction name, then emit these bytes.
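A minimal sketch of that idea, assuming one operand-less instruction per line: look the name up in a table and emit its opcode byte. The opcode values are the real JVM ones; the class `TinyAssembler` is made up for illustration, and it only produces raw code bytes, not a loadable class file.

```java
import java.io.ByteArrayOutputStream;
import java.util.Map;

// Hypothetical one-to-one assembler for a few operand-less JVM instructions.
class TinyAssembler {
    static final Map<String, Integer> OPCODES = Map.of(
        "nop", 0x00,
        "iconst_0", 0x03,
        "iconst_1", 0x04,
        "aload_0", 0x2A,
        "iadd", 0x60,
        "ireturn", 0xAC,
        "areturn", 0xB0,
        "return", 0xB1);

    // Turns one mnemonic per line into one opcode byte per instruction.
    static byte[] assemble(String source) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String line : source.split("\n")) {
            String mnemonic = line.trim();
            if (mnemonic.isEmpty()) {
                continue; // skip blank lines
            }
            Integer opcode = OPCODES.get(mnemonic);
            if (opcode == null) {
                throw new IllegalArgumentException("Unknown instruction: " + mnemonic);
            }
            out.write(opcode);
        }
        return out.toByteArray();
    }
}
```

Instructions with operands, constant pool references and the surrounding class file structure add bookkeeping, but the core stays a table lookup.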

I work on compilers and I'm in awe of people who work on web apps because they genuinely seem far more complicated than what I do.


I mean, there is usually a pass to resolve labels, right? So it's not quite that simple. But not far from it.


That depends on how fancy you want to be, you can definitely implement an assembler with relative branches and no labels.
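For the label case, a rough two-pass sketch looks like this: the first pass records the byte offset of each label, the second emits code and fills in branch offsets. It assumes a made-up syntax (`name:` for labels, `goto name` for branches) and supports only a few instructions; `goto` really is opcode 0xA7 followed by a signed 16-bit offset relative to the address of the `goto` itself, but the class `LabelAssembler` is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-pass assembler sketch: labels plus goto.
class LabelAssembler {
    static final Map<String, Integer> SIMPLE = Map.of("nop", 0x00, "iconst_0", 0x03, "ireturn", 0xAC);
    static final int GOTO = 0xA7; // opcode + 2-byte signed branch offset = 3 bytes

    static byte[] assemble(String source) {
        String[] lines = source.split("\n");

        // Pass 1: work out the byte offset at which each label sits.
        Map<String, Integer> labels = new HashMap<>();
        int offset = 0;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            if (line.endsWith(":")) {
                labels.put(line.substring(0, line.length() - 1), offset);
            } else {
                offset += line.startsWith("goto ") ? 3 : 1;
            }
        }

        // Pass 2: emit bytes, resolving each goto against the label table.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.endsWith(":")) continue;
            if (line.startsWith("goto ")) {
                String target = line.substring(5).trim();
                Integer targetOffset = labels.get(target);
                if (targetOffset == null) {
                    throw new IllegalArgumentException("Undefined label: " + target);
                }
                int relative = targetOffset - out.size(); // relative to the goto opcode's address
                out.write(GOTO);
                out.write((relative >> 8) & 0xFF); // high byte first (big-endian)
                out.write(relative & 0xFF);
            } else {
                Integer opcode = SIMPLE.get(line);
                if (opcode == null) {
                    throw new IllegalArgumentException("Unknown instruction: " + line);
                }
                out.write(opcode);
            }
        }
        return out.toByteArray();
    }
}
```

Without labels, the user would write the relative offsets by hand and the first pass disappears.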


familiarity is not the same thing as simplicity


No I think we can look at simplicity/complexity absolutely to some degree.

A compiler is a nice pure function - one array of bytes to another array of bytes. You need to use some algorithms and data structures in there to do that, but it's fundamentally simple.

A web app is fundamentally complex - asynchronicity, interacting with other services, network connections, all kinds of chaos.


Thanks :)

I definitely got into the flow for the first couple of days when I was laying down the foundations, it was the weekend and there weren't many distractions. If it wasn't for that I'd likely not have gotten as far as I did during the rest of the time (during the week, when there were other things going on and I wanted to spend some time away from the computer).

I also found that having set myself a goal for the week (support all the instructions) along with a couple of stretch goals (gradle plugin, IntelliJ plugin - I only hit one of these within the time limit though) helped to keep me motivated, and kept me at it for at least an hour every day, even when there were other things going on.


If you had the skill and wherewithal to do something like this, would you want to spend vacation doing it? I think it's neat, but I'd probably want to get the hell away from my screen


I enjoyed reading about doing some things with the jvm which are not possible through the java language - can anyone recommend a page that goes further into that concept?


I'm not sure how many of them expose features that are literally not possible in Java per se, but there are approximately a gazillion different languages out there that emit Java bytecode. Reading up on some of them might be enlightening.

https://en.wikipedia.org/wiki/List_of_JVM_languages



Looks like a fun project. One thing I'd be interested in is whether it's slower than using javac, given that the JIT can optimize common idioms that the compiler spits out. For instance, for some JITs, in the past anyway, this code

    int myNumber = getANumber();
    String s = "" + myNumber;

was faster to run than

    int myNumber = getANumber();
    String s = String.valueOf(myNumber);

even though the bytecode for the second bit of code is a proper subset of the first. (Meaning the first section had all the byte code of the second, plus some extra stuff.)

And yet the first ran faster than the second. So unless you've got some spectacularly tricky algorithm that you can't naturally express in Java but can in jasm, I'm betting the javac output will be as fast or faster.

Again tho, i'm not criticizing the author for doing this project at all. Kudos!


The purpose of this thing isn't really performance but fun and exploration, if I'm reading the author's comments right. For examples like yours though, between the static analysis done by the compiler, whatever the JIT does and intrinsics, you can't readily predict what's going to run faster by looking at the Java code or the bytecode.

These days, a much broader range of things that are equivalent-looking end up performing the same once the JIT gets involved, including stuff that seems to be naively doing allocations in inner loops. There's a whole genre of SO questions and answers that boil down to throwing a micro-example like yours into JMH and finding out they have identical performance profiles.


Oh yeah, totally. As another reply mentions, this isn't about performance - by this point Javac is most likely doing the best thing possible (especially given the knowledge of hotspot and intrinsics that are built in to it). I kind of hint at this in the readme :)

That said, this doesn't _have_ to be any slower than Javac - in fact I recommend looking at the disassembly of classes compiled with Javac to see how things work under the hood :)


Groovy version of this, but based on Jasmin: https://github.com/renatoathaydes/Grasmin


Brilliant, much better than having to use heavy and clunky tools like recaf or jbytemod, or one-off java projects with ow2 ASM to create custom classes.


Why not Jasmin?

Congrats on publishing your application. If this was something more than self-education please sell the benefits over Jasmin in your README? Looking forward to having a play with it.


That's a great suggestion, I'll get the README updated with that. Thanks!


That syntax seems to be implying that the next step is to integrate this into the compiler, so you can write inline Asm in Java.


That's either the best or the worst idea I've ever heard. I can't decide which... :D


"Let's just get this out of the way, shall we?

{code here} "

Thanks for doing that; I can't stand it when I go to a site for something new and can't find what it actually looks like without digging around.


I couldn't agree more, I'm a big fan of "show me the code" myself.


"With my recent scaling-back of the rosco_m68k project that’s been getting all my free time for the past couple of years, I needed a fun project to do during my time off, and I ideally wanted to take a break from Motorola 68k assembly and electronics and do something different."

All of your free time is consumed in programming?

Isn't it a bit absurd to have your entire life revolve around programming?


I don't think so, if that's your passion and you have no other hobbies. Why is programming for fun worse than birdwatching or building model trains?

My joke is always: "Wait, you guys are getting paid for this?"


Yeah but birdwatching at least implies that the activity is varied.

What if someone was laying bricks for 9 hours and then came home ate dinner and then laid bricks for 3 more hours and did this for years and then took a vacation and laid bricks during the vacation as well?

The line is murky between passionate, dedicated, obsessed and insane.

I mean, do we understand that we get a single shot at life? This year we'll be as young as we'll ever be. Why do we behave as though we have endless days that we can just toss away and be done with?

Programming full-time and then also during free time and vacations is simply put not healthy no matter how much Elon Musk and the west glamourise it. It simply leaves no time whatsoever for taking care of health, spending time with family, travelling, photography, reading novels, meeting romantic partners etc.


> What if someone was laying bricks for 9 hours and then came home ate dinner and then laid bricks for 3 more hours and did this for years and then took a vacation and laid bricks during the vacation as well?

Well, then I would surmise that they must really enjoy laying bricks. And if that's the case, then more power to them :)

> I mean, do we understand that we get a single shot at life.

I'm pretty sure we do. And some of us are lucky enough that we get to choose how we spend it :)


Not nearly as absurd as thinking it’s okay to criticize strangers’ hobbies.


Good method of helping to manually obfuscate some calls, so that reverse engineering can be harder.


    ...
    checkcast java/lang/String
    areturn
Looks like a typo.


no, areturn is a jvm opcode.


ah. TIL.


areturn returns an object reference, as opposed to ireturn (which returns a primitive int), dreturn, lreturn, etc.
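As a rough sketch of how the return instruction follows from the method's return type, the helper below picks the opcode from a standard JVM method descriptor (for example `()Ljava/lang/String;`, which is not the same as jasm's own syntax). The opcode values are the real ones; the class `ReturnOpcodes` is made up for illustration.

```java
// Hypothetical helper for illustration only.
class ReturnOpcodes {
    // Picks the return opcode from the return type that follows ')' in a JVM method descriptor.
    static int forDescriptor(String methodDescriptor) {
        char returnType = methodDescriptor.charAt(methodDescriptor.indexOf(')') + 1);
        switch (returnType) {
            case 'V':
                return 0xB1; // return (void)
            case 'Z': case 'B': case 'C': case 'S': case 'I':
                return 0xAC; // ireturn: boolean, byte, char, short and int all use it
            case 'J':
                return 0xAD; // lreturn (long)
            case 'F':
                return 0xAE; // freturn (float)
            case 'D':
                return 0xAF; // dreturn (double)
            case 'L': case '[':
                return 0xB0; // areturn (object or array reference)
            default:
                throw new IllegalArgumentException("Bad return type in descriptor: " + methodDescriptor);
        }
    }
}
```

So a method returning a String ends with areturn, which is why it shows up after the checkcast in the snippet above.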


Can anyone recommend a good source (book, articles) to learn Java Bytecode?



There's a very old but still pretty effective book called Programming for the Java Virtual Machine. Byte code is remarkably stable.

(Disclaimer: author)



There's a "cookbook" document in the repo, but it's not very comprehensive - I'd be more than happy to take PRs :) https://github.com/roscopeco/jasm/blob/develop/docs/cookbook...


amazing!

I will definitely discuss this in my compilers class, which targets JVM bytecode as a back end!


Awesome, thanks! Would love to hear if it turns out to be interesting (or maybe even useful!)


public static main([java/lang/String)V {

Is this open square bracket a typo?


Nope, that’s the syntax for a one-dimensional array of strings. I wanted to keep the syntax as close to JVM internal type names as possible.
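For comparison, the JVM itself uses the same leading `[` for arrays, one per dimension. A small sketch using only the standard `Class.getName()` call shows the names the runtime reports; note that the JVM form wraps class names in `L...;`, which the jasm line quoted above appears to leave out. The class `ArrayNames` is made up for illustration.

```java
// Hypothetical helper for illustration only.
class ArrayNames {
    // Class.getName() uses dots; swapping them for slashes gives the slash-separated form
    // used inside class files.
    static String internalName(Class<?> type) {
        return type.getName().replace('.', '/');
    }
}
```

Here `ArrayNames.internalName(String[].class)` gives `[Ljava/lang/String;` and `ArrayNames.internalName(int[][].class)` gives `[[I`.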


Kotlin is modern now? I would not dare to use jasm in my hypothetical new language targeting the JVM. I'd rather write this anew in C++ or C.


Impressive


Thanks :)



