Implementing a Virtual Machine in C

TomNomNom · on June 23, 2015

A good read.

One thing I thought of as a little weird is this:

> But remember, we aren't decoding anything since our instructions are just given to us raw. So that means we only have to worry about fetch, and evaluate!

The 'decode' step is being done by the switch statement, it's not being skipped. I think the author may be confusing turning assembly into machine code (i.e. what an assembler does) with decoding an opcode.

I built a (very simple) VM in Go live on stage at a local mini-conference last year. I really enjoyed it. There's a video of it here: https://www.youtube.com/watch?v=GjGRhIl0xWs

Corun · on June 23, 2015

I'm not sure, but I think the author means that there is no extraction of fields from the instruction, which is one way of defining the decode step.

r4pha · on June 23, 2015

There's something really nice about these minimalistic C projects. You can quickly take a look at the source and immediatelly grasp what's happening. A simple and didactic introduction. Super cool.

tptacek · on June 23, 2015

Yes. Obligatory and tedious re-up for my favorite minimalistic C program of the last 5 years:

https://github.com/rswier/c4

Take 2 hours to grok this and you can skip compiler books and move straight to compiler papers.

Winterflow3r · on June 23, 2015

Challenge accepted.

tptacek · on June 23, 2015

An even better challenge: add structs to the compiler. :)

Winterflow3r · on June 23, 2015

haha okay! I don't have any formal CS training though (and the CPU between the ears isn't too great either) so it's going to take me more than 2 hours. I'm going through void next() right now. Are you open for questions on Github?

tptacek · on June 23, 2015

It's not my code, so, Github is a bad place to ask me questions. But you can ask them here, and I'll respond.

kyrre · on June 23, 2015

https://news.ycombinator.com/item?id=9516656

joe_the_user · on June 23, 2015

A nice simple demonstration.

Other things that can be added easily is Labels, Loop0 - loop while a variable is greater than zero and subX - pop stack and instruction pointer and go to location X.

jheriko · on June 23, 2015

why not use an array of structs as the program, with a function pointer and the parameters stored in the struct? the instruction pointer is then an index into that array... the functions implement the opcodes and the archetecture becomes more instructive, cleaner, faster, smaller and easier to read imo. executing the program can become a loop that just calls whatever function pointer is in the struct at the instruction pointer.

stevekemp · on June 23, 2015

Funny you mention that - I used that structure in my simple virtual machine (which includes a compiler, a decompiler, support for conditionals, and also an example of embedding):

https://github.com/skx/simple.vm

Rather than having an enum of opcodes I have a array of pointers to functions - one for each opcode I implement. It is a logical/clean way of implementing things, but i think it is not necessarily better.

(In my case I lose due to inline strings, and variable length opcodes which make complications.)

asQuirreL · on June 23, 2015

As already suggested, function pointers are a particularly memory inefficient way to represent instructions/opcodes.

But on top of that, there is another issue with this representation: Your interpreter will very likely be dog slow, because you have to make a function call on each iteration, and it cannot be inlined, because it is dynamic.

And I suspect there are implications for the I-Cache that should probably be considered as well.

EDIT By "considered", I mean, measured, and then considered.

pjc50 · on June 23, 2015

Not very memory-efficient and not very much like a real instruction set.

sklogic · on June 23, 2015

And if you want a better performance, you can replace this switch with computed gotos (supported at least by gcc and clang).

simias · on June 23, 2015

You should first check if the compiler doesn't do it for you. Otherwise you're just making your code harder to maintain and less portable for zero gain.

In my experience with similar code you can get GCC and LLVM to generate a proper jump table if you're careful to make your code as regular as possible in order for the optimizer to understand what you're doing.

Case in point: https://github.com/simias/psx-rs/blob/master/src/cpu/mod.rs#...

Here LLVM manages to inline all the functions and generate a proper jump table. IIRC I tried getting the function address in the "match" and calling it afterwards but then the compiler wouldn't optimize it properly anymore.

plq · on June 23, 2015

> You should first check if the compiler doesn't do it for you.

It probably doesn't, because otherwise Intel wouldn't have submitted this patch that adds computed goto support to CPython: http://permalink.gmane.org/gmane.comp.python.devel/153401

xigency · on June 23, 2015

The same performance can easily be achieved by simple compiler optimizations like inlining functions, unrolling loops, peephole optimizations, etc. The fact that your compiler doesn't support the same performance doesn't necessarily mean to re implement things in a confusing way.

Another way of achieving the exact same result would be to use an array of function pointers.

I think the more interesting dilemma is when you have a more complex VM possibly operating on an abstract syntax tree instead of just a bytecode stream, and the ways to optimize code (in C) while maintaining a proper stack.

After figuring out how to make a switch statement work, there are a ton of more pressing organizational concerns like how to make function trampolining work without any overhead.

Another way to make it faster would be to have it interpret * two * op-codes at once, but hey, that wouldn't do anything except save two cycles and explode the code size to a square of what it was.

sklogic · on June 23, 2015

Yes, that python patch is only halfway to a proper indirect threaded dispatch, it still got a lookup table. See the OCaml bytecode VM for a really high performance example of this approach.

> a more complex VM possibly operating on an abstract syntax tree

Don't do it.

thesz · on June 24, 2015

> Don't do it.

http://peterdn.com/files/A_JIT_Translator_for_Oberon.pdf

Oberon did it with relative success. Relative to other VMs at the time.

sklogic · on June 24, 2015

We're not talking about JITs here. Of course an tree-based representation is better for JITs, and if using a flat VM you may have to promote it to a higher level AST (or at least an SSA-based) form anyway.

But for an immediate execution, a flat stream of instructions (or, better, a threaded code) is much more efficient than any kind of tree. Yes, I'm aware of things like SISC, but I'm yet to see a proof that their approach is more efficient than a flat bytecode, even when JVM overhead is taken into consideration.

simias · on June 23, 2015

I don't know the specifics but from the linked thread at https://bugs.python.org/issue4753 :

Having a jump table in itself isn't special (the compiler does exactly that when compiling the switch() statement). What's special is that a dedicated indirect jump instruction at the end of each opcode helps the CPU make a separate prediction for which opcode follows the other one, which is not possible with a switch statement where the jump instruction is shared by all opcodes. I believe that's where most of the speedup comes from.

So I don't think you can conclude that the compiler "probably doesn't" generate the jump table, just that you might do a better job yourself if you know what you're doing.

duaneb · on June 23, 2015

Considering CPython is designed to be so portable, I don't think this means anything about popular compilers.

sklogic · on June 23, 2015

> You should first check if the compiler doesn't do it for you.

No, compiler cannot rewrite your bytecode into a threaded code.

> Otherwise you're just making your code harder to maintain and less portable for zero gain.

No. Threaded code is almost always faster, even if run only once.

> a proper jump table

I'm not talking about jump tables. I'm talking about replacing the bytecode indices with jump addresses once (yes, with a precalculated table) and then doing an indirect jump for each translated opcode from the input stream.

An example: https://github.com/ocaml/ocaml/blob/trunk/byterun/interp.c

(see the Next macro definition)

simias · on June 23, 2015

I don't follow you, what do computed gotos have to do with threaded code?

I'm not sure I understand the code you've linked but "goto (void )(jumptbl_base + *pc++)" looks very much like my big "match" statement except that I match only a subset of the 32 bit instruction.

I don't see why the compiler wouldn't be able to rewrite a bit switch/match statement as a jump table without having to spell it out directly.

> I'm not talking about jump tables. I'm talking about replacing the bytecode indices with jump addresses once (yes, with a precalculated table) and then doing an indirect jump for each translated opcode from the input stream.

Now you've lost me completely.

sklogic · on June 23, 2015

> I don't follow you, what do computed gotos have to do with threaded code?

1. Build a jump table

2. Read a bytecode

3. Replace all the instruction codes in the bytecode stream with values from the jump table (i.e., build a threaded code from your bytecode).

4. In a loop, take the next value at PC and jump to it - it's already an address. No table lookups (just once, when you load your bytecode). So it's much faster than any switch can be.

EDIT: probably you did not notice that for most of the targets jumptbl_base = 0, and an indirection is removed.

simias · on June 23, 2015

Oh I think I get it, thanks (although I still don't understand what "threaded" means in this context).

The only downside I can see is that you'd significantly increase the size of the code. On a 64bit architecture you trade each byte for a 64bit address, effectively multiplying by eight the footprint in the data cache. A 256 entry LUT on the other hand will fit snugly in cache and the lookup shouldn't be very costly.

Also if I understood you correctly what you're proposing doesn't have much to do with the "computed gotos" extension.

sklogic · on June 23, 2015

> I still don't understand what "threaded" means in this context

https://en.wikipedia.org/wiki/Threaded_code#Indirect_threadi...

> The only downside I can see is that you'd significantly increase the size of the code

This is why adding the jumptbl_base on 64-bit platforms. Pointers are still 32-bit, just with an added offset (and a tiny overhead).

> doesn't have much to do with the "computed gotos" extension

You cannot implement indirect threaded code without a computed goto. You can implement a direct threaded code, of course (generating jump or call instructions directly), but this is a totally different thing.

simias · on June 23, 2015

Ah, thank you for the link, that was the missing piece of the puzzle! :)

arcatek · on June 23, 2015

Even without inlining? Since the compiler cannot know what instruction will be called, it cannot inline it in the loop.

sklogic · on June 23, 2015

Why do you want any inlining? It's a threaded code. You know an address of each instruction handler, so you can replace an opcode with this address and eliminate a switch and any table lookups altogether.

arcatek · on June 23, 2015

Eh, I was reading your psx guide a moment ago :) Small world!