This reminds me of a technique used by a language I read about in the early nineties, called (I think) 'C@T' (don't bother Googling that: I found nothing relevant, and the results were mostly NSFW).
The technique was instruction interleaving. They supported several very different architectures from one platform-independent binary. I think they were x86, 68k, MIPS and SPARC. Each executable started with a boilerplate header: a bit of black magic which worked on all of the platforms (more on that below). When execution emerged at the end of the header, the instruction pointer was pointing at the entry point appropriate for the architecture it was executing on. Each entry point (one for each architecture) made sure that the instruction pointer would always jump over the parts meant for the other architectures, to the next applicable instruction for the correct architecture.
The boilerplate header worked something like this: the first few bytes, when interpreted on one particular architecture, would jump to the address immediately following the header, which was the starting point for that architecture's thread of control. All the other architectures interpreted those bytes as something that was effectively a no-op for the purpose of flow control (like maybe adding two registers into a third). Consequently, the next instruction would only be seen by the remaining architectures, and a similar process would peel them off one by one, directing each to the appropriate real entry point for its architecture. As I recall they did not do something simple like have monolithic blocks of code separated by architecture: if there was a loop, the body of the loop had interleaved instructions for all architectures. I'm sure this was in the days before instruction caches were a thing.
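To make the ordering constraint concrete, here is a minimal sketch in plain C (emphatically not real machine code; the architecture list, offsets, and run_header function are all made up): the header is a sequence of slots, and slot i has to decode as a jump on architecture i while decoding as something harmless on every architecture that hasn't been peeled off yet.

#include <stdio.h>

#define NARCH 4
static const char *arch_name[NARCH] = { "x86", "68k", "MIPS", "SPARC" };

/* Hypothetical file offsets of each architecture's real entry point. */
static const int entry[NARCH] = { 0x040, 0x180, 0x2c0, 0x400 };

/* Simulate executing the header on architecture a.  Slot i is encoded so
   that it is a jump for architecture i and a flow-control no-op for every
   architecture that comes after it in the peeling order. */
static int run_header(int a)
{
    for (int slot = 0; slot < NARCH; slot++) {
        if (slot == a)
            return entry[slot];   /* this slot is a jump for us: take it */
        /* for us this slot is a no-op, so fall through to the next one */
    }
    return -1;                    /* unreachable if a < NARCH */
}

int main(void)
{
    for (int a = 0; a < NARCH; a++)
        printf("%-5s lands at entry point 0x%03x\n", arch_name[a], run_header(a));
    return 0;
}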
I think they were able to get some space savings with this, at least by sharing constant data segments (maybe). At the time I read it I did not fully appreciate what they did; now I would, but I can't find the article I read, nor any papers about it. I probably read it in Computer Language or Dr. Dobb's or something like that, probably in the early nineties. If anyone else remembers this and can point me to a source, I'd love to revisit it. I think it might have been a project out of AT&T Labs.
I had the name wrong. The language is C+@ [1]. I dug up an article about it in Dr. Dobb's Journal, the October 1993 issue. This does not seem to be the article I am remembering, since it does not go into the instruction interleaving technique anywhere near as much as I remember, but they do mention it and say it was called "beading":
The binaries produced by the C+@ compiler are independent of the underlying machine architecture. Without recompiling, applications can be moved from SPARC to 68000 to Intel x86, and so on. C+@ is not interpretive--the binaries are encoded using a sophisticated 'beading' technique developed at Bell Labs. Because of the streamlined language design, the C+@ compiler produces these portable binaries with extraordinary speed, without the need for preprocessing or front ends.
This is from the article's introduction:
The C+@ programming language, an object-oriented language derived from AT&T Bell Lab's Calico programming language, was developed to provide programmers with a true object-based language and development environment. C+@ (pronounced "cat") has the syntax of C and the power of Smalltalk. Unlike C++, C+@ includes a mature class library with more than 350 classes used throughout the system. The C+@ compiler itself is written in C+@, and all of the source for the class libraries is included with development systems. The Calico project was started at AT&T Bell Labs in the early '80s, after the introduction of Smalltalk and at the same time as C++. Calico was originally used for rapid prototyping of telecommunication services; hence, its heavy emphasis on keeping the language syntax simple and showcasing the power of the graphical development environment.
> As I recall they did not do something simple like have monolithic blocks of code separated by architecture: if there was a loop, the body of the loop had interleaved instructions for all architectures. I'm sure this was in the days before instruction caches were a thing.
The first part of your post sounded pretty clever and workable, but this part sounds completely insane. Managing all of the side effects for all of those other architectures constantly would have been a never-ending nightmare. Keeping separate blocks and only sharing data segments seems like the only way to keep your sanity.
Well, there are never going to be side effects for the other architectures, because those parts simply can't be touched. Code for the i386 only jumps to other i386 code. Anything that is not i386 code might as well be donkey droppings. If we think about how sequences of NOP instructions are routinely added to machine code for cache-aligning the start of a block, or how non-code data such as string literals is woven into the middle of code, it doesn't seem far-fetched at all.
block
of
x86
code
je L43
more
x86
code
jmp L57
L33:
sparc
code
never
mind
L43:
x86
code
L37:
sparc
never
mind
L57:
x86
code
again
That kind of thing: blocks of code mashed together, but everything is correctly generated to jump to its own kind.
In regular compiler-generated machine code, we already have "foreign" blocks of stuff in the middle of the instructions, such as string (and other) literals, and computed branch tables. The generated code doesn't accidentally jump into these things. This is kind of the same: all the other architecture stuff is just a literal (that is not referenced). From the x86 POV, the stuff at L33 and L37 above is just data.
No, that's not the way I remember they did it. I think you'd have a run of instructions for one architecture, and the only thing that interrupted a run of them would be a branch or a jump. So for example, if you had something like:
for (i=0; i<10; i++) {
// body of loop
}
Then you'd have all of the x86 instructions for the loop in contiguous bytes; after that block you'd have the MIPS instructions, then the SPARC instructions, and so on.
I could be wrong about this, which is why I am now really determined to find the article I remember, so I can figure out what they really did. But I think there are three levels of granularity they could have used:
1. They could have had what was essentially a monolithic code block for each architecture, concatenated those blocks and had the little bit of magic at the beginning of the executable figure out which block to dispatch to, then execution would stay in that block. I don't think they did this.
2. At the other extreme, they could interleave individual instructions, like you just described. I don't think they did this either.
3. They could have runs of architecture specific instructions corresponding to basic blocks [1], which is what I tried to describe above. I think that's what they did, but I could easily be misremembering.
As I recall, there were no jumps or branches that there would not have been anyway in a single-architecture binary. Every time there was a jump or branch, it simply had to go further, since it would be skipping over unreachable binary code for the other architectures. The real complexity was all resolved up front, in that blob of magic at the entry point.
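If it really was option 3, the layout problem is mostly bookkeeping: emit each basic block once per architecture, remember where each copy landed, and patch every branch to target the copy for its own architecture. Here is a speculative sketch of that bookkeeping in C (the block sizes and architecture count are made-up numbers, and this is my reconstruction, not whatever Bell Labs actually did):

#include <stdio.h>

#define NARCH   3   /* say x86, 68k, SPARC */
#define NBLOCKS 2   /* basic blocks in the compiled function */

/* Size in bytes of each basic block's encoding on each architecture
   (invented numbers, just for illustration). */
static const int block_size[NBLOCKS][NARCH] = {
    { 12, 16, 20 },   /* block 0 */
    {  9, 14, 16 },   /* block 1 */
};

int main(void)
{
    int offset[NBLOCKS][NARCH];   /* where each (block, arch) copy lands */
    int pos = 0;

    /* Beading at basic-block granularity: for each block, lay down the
       copies for every architecture back to back, then move on. */
    for (int b = 0; b < NBLOCKS; b++)
        for (int a = 0; a < NARCH; a++) {
            offset[b][a] = pos;
            pos += block_size[b][a];
        }

    /* A branch from block 0 to block 1 on architecture a must target
       offset[1][a]: the same branch as in a single-architecture binary,
       it just has to go further to skip the other architectures' copies. */
    for (int a = 0; a < NARCH; a++)
        printf("arch %d: block 0 at %2d, its branch to block 1 targets %2d\n",
               a, offset[0][a], offset[1][a]);
    return 0;
}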
I was thinking to myself “why did these guys talk about inventing another Java, like it was a new thing?” then it hit me that the article was written in 1997. I mean Java existed back then but HotSpot JIT didn’t.
Yeah, this part:
"
We have implemented a system taking a wholly different approach that we call “slim binaries.” The object files in our system do not really contain native code for any particular processor at all, but a highly compact, architecture-neutral intermediate program representation
"
is basically the definition of Java bytecode. A Wikipedia search for the author seems to say he came up with it first, though:
"
His doctoral dissertation, entitled "Code Generation On-The-Fly: A Key To Portable Software"[11] proposed to make software portable among different target computer architectures by way of using on-the-fly compilation from a compressed intermediate data structure at load time. Two years later, the Java programming language and system were launched and took this idea mainstream, albeit using the term "just-in-time compilation" instead of the term "on-the-fly compilation" that Franz had used.
"
https://en.wikipedia.org/wiki/Michael_Franz
Except that the slim binaries concept is not based on byte code at all, but more like a Huffman compressed abstract syntax tree. Not only are the binaries smaller, they can potentially be optimised further because no semantic information is lost.
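To give a feel for what "compressed AST instead of bytecode" means, here is a toy sketch in C, with a hand-picked prefix code standing in for a real Huffman code and an invented four-kind node set (slim binaries encoded Oberon's full syntax tree, so this only conveys the flavour): frequent node kinds get shorter bit strings, and the tree is written out pre-order, so the structure survives rather than being flattened into a linear instruction stream.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical node kinds; a real system would cover the whole language. */
typedef enum { N_VAR, N_CONST, N_ADD, N_ASSIGN } Kind;

typedef struct Node {
    Kind kind;
    int value;                  /* variable index or literal value */
    struct Node *left, *right;  /* children, NULL if absent */
} Node;

/* A fixed prefix code standing in for a real Huffman code:
   frequent kinds get fewer bits. */
static const struct { unsigned bits; int len; } code[] = {
    [N_VAR]    = { 0x0, 1 },   /* 0   */
    [N_CONST]  = { 0x2, 2 },   /* 10  */
    [N_ADD]    = { 0x6, 3 },   /* 110 */
    [N_ASSIGN] = { 0x7, 3 },   /* 111 */
};

static uint8_t out[64];
static int bitpos;

static void put_bits(unsigned bits, int len)
{
    for (int i = len - 1; i >= 0; i--, bitpos++)
        if ((bits >> i) & 1)
            out[bitpos / 8] |= 1u << (7 - bitpos % 8);
}

/* Pre-order walk: kind code first, then an operand field if the node has
   one, then the children.  A decoder can rebuild exactly the same tree. */
static void encode(const Node *n)
{
    if (!n) return;
    put_bits(code[n->kind].bits, code[n->kind].len);
    if (n->kind == N_CONST || n->kind == N_VAR)
        put_bits((unsigned)n->value & 0xF, 4);   /* toy 4-bit operand */
    encode(n->left);
    encode(n->right);
}

int main(void)
{
    /* x := x + 1 */
    Node one = { N_CONST, 1, NULL, NULL };
    Node xl  = { N_VAR, 0, NULL, NULL };
    Node xr  = { N_VAR, 0, NULL, NULL };
    Node add = { N_ADD, 0, &xr, &one };
    Node asg = { N_ASSIGN, 0, &xl, &add };

    encode(&asg);
    printf("encoded x := x + 1 in %d bits\n", bitpos);
    return 0;
}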
The Self bytecode was pretty close to an abstract syntax tree, though not huffman-compressed. How is it that nobody has mentioned Self yet in this thread? Self was doing JIT compiling from a platform-independent bytecode in the late 80s and getting within a factor of 2 of C performance by the early 90s.
Yeah Michael's dissertation was out in '94 already, and he had native Oberon running in slim binaries (aka jit'ing from compressed AST) for quite a bit before that. Thomas did the Juice stuff (aka slim binaries on the web) around mid '96 IIRC, then did a continuously optimizing JIT for Oberon, then got sucked into binary translation at Transmeta.
There were a lot of fun overlaps. Robert Griesemer did his PhD at ETH around the time Michael did, on a vector extension to Oberon for a Cray. He went on to Animorphic working on Strongtalk, and later the JVM at Sun. IIRC the assembler from Strongtalk days has been extended and used in other projects like JavaScript compilers. Animorphic had Urs Hoelzle / Lars Bak / Gilad Bracha too; Urs was an undergrad at ETH with Michael IIRC, then worked on Self at Stanford, inventing a bunch of the optimizations that went into Strongtalk and later HotSpot for Java. Etc. Etc. Small world :-)
Absolutely, bytecode was already an old idea by the time this work took place. (And Wirth, Franz's advisor, wrote compilers and books using p-code.)
The novelty was in using trees for the representation, and recompiling the entire Oberon system not just a single app. Later work by his students (Vivek+Chris iirc) looked at using arithmetic encoding to further compress things. Also Michael noticed that it prevented malformed programs, as his interest in security grew.
Another branch of the family tree worked on more compiler optimization friendly IRs, such as typed SSA instead of bytecode, and applied it to Java. The idea being to be able to do more upfront optimization work, and still transmit a safe representation that was cheap to verify.
Microsoft Office used a vaguely similar VM-based technique in the 90s, called P-Code. Cold functions in Office were compiled to a space-efficient stack-based bytecode, minimizing how much native code had to be written and validated for each architecture. https://web.archive.org/web/20010222035711/http://msdn.micro...
Not similar at all, if you compare what each did. The p-code is practically assembly for a virtual CPU which uses more or less the same resources that assembly code for the "real" CPU would, only with fewer bytes per instruction.
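A toy illustration of where the density comes from, assuming nothing about Microsoft's actual p-code encoding (these opcodes and the interpreter are invented): operands live on an implicit stack, so most instructions fit in a single byte.

#include <stdio.h>

/* Hypothetical one-byte opcodes for a tiny stack machine. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_RET };

static int run(const unsigned char *code)
{
    int stack[16], sp = 0;
    for (int pc = 0; ; ) {
        switch (code[pc++]) {
        case OP_PUSH: stack[sp++] = code[pc++]; break;          /* 2 bytes */
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;  /* 1 byte  */
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;  /* 1 byte  */
        case OP_RET:  return stack[sp - 1];
        }
    }
}

int main(void)
{
    /* (2 + 3) * 4 in nine bytes of bytecode; no registers or addressing
       modes to spell out, which is where the compactness comes from. */
    const unsigned char code[] = {
        OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_RET
    };
    printf("%d\n", run(code));   /* prints 20 */
    return 0;
}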
In the early aughts I was kicking around an idea for minimization of binaries/libraries. The idea was to take a compiled binary plus debug information, and use disassembly to work out which code paths are reachable from a selected set of entry points (say the ones used by your application). Then code paths which are considered "dead" would be zeroed out, preparing a minimized binary.
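The core of that is just reachability over whatever call graph you can recover from the disassembly. A minimal sketch in C with a made-up graph and entry points (the hard part in practice -- indirect calls, function pointers, relocations -- is exactly what this toy skips):

#include <stdio.h>
#include <stdbool.h>

#define NFUNCS 6

/* Hypothetical call graph recovered from disassembly + debug info:
   calls[i][j] is true if function i contains a call to function j. */
static const bool calls[NFUNCS][NFUNCS] = {
    /* 0 */ { 0, 1, 0, 0, 0, 0 },
    /* 1 */ { 0, 0, 1, 0, 0, 0 },
    /* 2 */ { 0, 0, 0, 0, 0, 0 },
    /* 3 */ { 0, 0, 0, 0, 1, 0 },   /* 3 and 4 only call each other */
    /* 4 */ { 0, 0, 0, 1, 0, 0 },
    /* 5 */ { 0, 0, 1, 0, 0, 0 },
};

static bool reachable[NFUNCS];

static void visit(int f)
{
    if (reachable[f]) return;
    reachable[f] = true;
    for (int g = 0; g < NFUNCS; g++)
        if (calls[f][g]) visit(g);
}

int main(void)
{
    /* The entry points the application actually uses, e.g. 0 and 5. */
    const int entry[] = { 0, 5 };
    for (unsigned i = 0; i < sizeof entry / sizeof entry[0]; i++)
        visit(entry[i]);

    /* Everything not marked would be zeroed out in the minimized binary. */
    for (int f = 0; f < NFUNCS; f++)
        printf("function %d: %s\n", f, reachable[f] ? "keep" : "zero out");
    return 0;
}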
Speaking about C and C++ on Linux, my impression is that compilers can seldom decide that whole functions are unused, especially because most libraries and programs "export" "everything" by default, so what will actually be used can only be decided at run time -- the final linking step happens only when you start the program, performed by the loader.
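A concrete illustration of the "everything is exported" problem, using the GCC/Clang visibility attribute (the file and function names here are invented): with default visibility the toolchain has to assume any non-static function might be called through the dynamic linker, so it can never prove it dead; marking internals hidden gives the linker permission to garbage-collect them (with -ffunction-sections and --gc-sections, or LTO).

/* helper.c -- compiled into libhelper.so */

/* Exported by default: some other program might dlopen the library and
   call this, so it cannot be proven dead at link time. */
int public_api(int x) { return x + 1; }

/* Hidden: not part of the library's dynamic symbol table, so if nothing
   inside the library references it, the linker may discard it. */
__attribute__((visibility("hidden")))
int internal_helper(int x) { return x * 2; }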
The original comment pointed out "entry points", so I assumed that they were talking about libraries rather than standard dead code elimination (which has been a thing for quite some time).
> In the menu at the start, cursor-down works, but cursor-up doesn’t (he never hit cursor-up in menus during the test run). The small enemies at the start can hit you, but he didn’t get hit by any enemy shots, so in the released version of .kkrieger enemy shots deal no damage.
LLVM IR is relatively agnostic, but not so much so that you can use it to recompile for another architecture (unless you put in the effort to design an entire ABI to work around the platform-specific things that creep into it, as Apple did with arm64_32).
IIRC, LLVM IR has ABI, word length, and endianness baked in so you could generate IR that's portable to, say, any 32-bit LE processor. Basically that was Chrome PNaCl.
> And since each fat binary usually contains only a single code image per supported architecture, rather than separate versions for different implementations of each architecture, fat binaries still do not solve the problem of intra-architectural hardware variation.
$ lipo -info /System/Library/Frameworks/CoreFoundation.framework/CoreFoundation
Architectures in the fat file: /System/Library/Frameworks/CoreFoundation.framework/CoreFoundation are: x86_64 x86_64h
It works in Chrome, but most viewers will fix minor corruption in PDFs. qpdf/poppler will tell you if they detect any issues, and they both say it's fine.
So I suspect the issue is either your viewer, or the file somehow got corrupted on the path from the website to your viewer.