This reminds me of a technique used by a language I read about in the early nineties, called (I think) 'C@T' (don't bother Googling that: I found nothing relevant, and the results were mostly NSFW).
The technique was instruction interleaving. They supported several very different architectures from one platform-independent binary. I think they were x86, 68k, MIPS and SPARC. Each executable started with a boilerplate header: a bit of black magic which worked on all of the platforms (more on that below). When execution emerged at the end of the header, the instruction pointer was pointing at the entry point appropriate for the architecture it was executing on. Each entry point (one for each architecture) made sure that the instruction pointer would always jump over the parts meant for the other architectures, to the next applicable instruction for the correct architecture.
The boilerplate header worked something like this: the first few bytes, when interpreted on one particular architecture, would jump to the address immediately following the header, which was the starting point for that architecture's thread of control. All the other architectures interpreted those bytes as something that was effectively a no-op for the purpose of flow control (like maybe adding two registers into a third). Consequently, the next instruction would only be seen by the remaining architectures, and a similar process would peel them off one by one, directing each to the appropriate real entry point for its architecture. As I recall they did not do something simple like have monolithic blocks of code separated by architecture: if there was a loop, the body of the loop had interleaved instructions for all architectures. I'm sure this was in the days before instruction caches were a thing.
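To make the ordering constraint concrete, here is a minimal sketch in plain C (emphatically not real machine code; the architecture list, offsets, and run_header function are all made up): the header is a sequence of slots, and slot i has to decode as a jump on architecture i while decoding as something harmless on every architecture that hasn't been peeled off yet.

#include <stdio.h>

#define NARCH 4
static const char *arch_name[NARCH] = { "x86", "68k", "MIPS", "SPARC" };

/* Hypothetical file offsets of each architecture's real entry point. */
static const int entry[NARCH] = { 0x040, 0x180, 0x2c0, 0x400 };

/* Simulate executing the header on architecture a.  Slot i is encoded so
   that it is a jump for architecture i and a flow-control no-op for every
   architecture that comes after it in the peeling order. */
static int run_header(int a)
{
    for (int slot = 0; slot < NARCH; slot++) {
        if (slot == a)
            return entry[slot];   /* this slot is a jump for us: take it */
        /* for us this slot is a no-op, so fall through to the next one */
    }
    return -1;                    /* unreachable if a < NARCH */
}

int main(void)
{
    for (int a = 0; a < NARCH; a++)
        printf("%-5s lands at entry point 0x%03x\n", arch_name[a], run_header(a));
    return 0;
}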
I think they were able to get some space savings with this, at least by sharing constant data segments (maybe). At the time I read it I did not fully appreciate what they did; now I would, but I can't find the article I read, nor any papers about it. I probably read it in Computer Language or Dr. Dobb's or something like that, probably in the early nineties. If anyone else remembers this and can point me to a source, I'd love to revisit it. I think it might have been a project out of AT&T Labs.
I had the name wrong. The language is C+@ [1]. I dug up an article about it in Dr. Dobb's Journal, the October 1993 issue. This does not seem to be the article I am remembering, since it does not go into the instruction interleaving technique anywhere near as much as I remember, but they do mention it and say it was called "beading":
The binaries produced by the C+@ compiler are independent of the underlying machine architecture. Without recompiling, applications can be moved from SPARC to 68000 to Intel x86, and so on. C+@ is not interpretive--the binaries are encoded using a sophisticated 'beading' technique developed at Bell Labs. Because of the streamlined language design, the C+@ compiler produces these portable binaries with extraordinary speed, without the need for preprocessing or front ends.
This is from the article's introduction:
The C+@ programming language, an object-oriented language derived from AT&T Bell Lab's Calico programming language, was developed to provide programmers with a true object-based language and development environment. C+@ (pronounced "cat") has the syntax of C and the power of Smalltalk. Unlike C++, C+@ includes a mature class library with more than 350 classes used throughout the system. The C+@ compiler itself is written in C+@, and all of the source for the class libraries is included with development systems. The Calico project was started at AT&T Bell Labs in the early '80s, after the introduction of Smalltalk and at the same time as C++. Calico was originally used for rapid prototyping of telecommunication services; hence, its heavy emphasis on keeping the language syntax simple and showcasing the power of the graphical development environment.
> As I recall they did not do something simple like have monolithic blocks of code separated by architecture: if there was a loop, the body of the loop had interleaved instructions for all architectures. I'm sure this was in the days before instruction caches were a thing.
The first part of your post sounded pretty clever and workable, but this part sounds completely insane. Managing all of the side effects for all of those other architectures constantly would have been a never-ending nightmare. Keeping separate blocks and only sharing data segments seems like the only way to keep your sanity.
Well, there are never going to be side effects for the other architectures, because those parts simply can't be touched. Code for the i386 only jumps to other i386 code. Anything that is not i386 code might as well be donkey droppings. If we think about how sequences of NOP instructions are routinely added to machine code for cache-aligning the start of a block, or how non-code data such as string literals is woven into the middle of code, it doesn't seem far-fetched at all.
block
of
x86
code
je L43
more
x86
code
jmp L57
L33:
sparc
code
never
mind
L43:
x86
code
L37:
sparc
never
mind
L57:
x86
code
again
That kind of thing: blocks of code mashed together, but everything is correctly generated to jump to its own kind.
In regular compiler-generated machine code, we already have "foreign" blocks of stuff in the middle of the instructions, such as string (and other) literals, and computed branch tables. The generated code doesn't accidentally jump into these things. This is kind of the same: all the other architecture stuff is just a literal (that is not referenced). From the x86 POV, the stuff at L33 and L37 above is just data.
No, that's not the way I remember they did it. I think you'd have a run of instructions for one architecture, and the only thing that interrupted a run of them would be a branch or a jump. So for example, if you had something like:
for (i=0; i<10; i++) {
// body of loop
}
Then you'd have all of the x86 instructions for the loop in contiguous bytes; after that block you'd have the MIPS instructions, then the SPARC instructions, and so on.
I could be wrong about this, which is why I am now really determined to find the article I remember, so I can figure out what they really did. But I think there are three levels of granularity they could have used:
1. They could have had what was essentially a monolithic code block for each architecture, concatenated those blocks and had the little bit of magic at the beginning of the executable figure out which block to dispatch to, then execution would stay in that block. I don't think they did this.
2. At the other extreme, they could interleave individual instructions, like you just described. I don't think they did this either.
3. They could have runs of architecture specific instructions corresponding to basic blocks [1], which is what I tried to describe above. I think that's what they did, but I could easily be misremembering.
As I recall, there were no jumps or branches that there would not have been anyway in a single-architecture binary. Every time there was a jump or branch, it simply had to go further, since it would be skipping over unreachable binary code for the other architectures. The real complexity was all resolved up front, in that blob of magic at the entry point.
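If it really was option 3, the layout problem is mostly bookkeeping: emit each basic block once per architecture, remember where each copy landed, and patch every branch to target the copy for its own architecture. Here is a speculative sketch of that bookkeeping in C (the block sizes and architecture count are made-up numbers, and this is my reconstruction, not whatever Bell Labs actually did):

#include <stdio.h>

#define NARCH   3   /* say x86, 68k, SPARC */
#define NBLOCKS 2   /* basic blocks in the compiled function */

/* Size in bytes of each basic block's encoding on each architecture
   (invented numbers, just for illustration). */
static const int block_size[NBLOCKS][NARCH] = {
    { 12, 16, 20 },   /* block 0 */
    {  9, 14, 16 },   /* block 1 */
};

int main(void)
{
    int offset[NBLOCKS][NARCH];   /* where each (block, arch) copy lands */
    int pos = 0;

    /* Beading at basic-block granularity: for each block, lay down the
       copies for every architecture back to back, then move on. */
    for (int b = 0; b < NBLOCKS; b++)
        for (int a = 0; a < NARCH; a++) {
            offset[b][a] = pos;
            pos += block_size[b][a];
        }

    /* A branch from block 0 to block 1 on architecture a must target
       offset[1][a]: the same branch as in a single-architecture binary,
       it just has to go further to skip the other architectures' copies. */
    for (int a = 0; a < NARCH; a++)
        printf("arch %d: block 0 at %2d, its branch to block 1 targets %2d\n",
               a, offset[0][a], offset[1][a]);
    return 0;
}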
I was thinking to myself “why did these guys talk about inventing another Java, like it was a new thing?” then it hit me that the article was written in 1997. I mean Java existed back then but HotSpot JIT didn’t.
Yeah, this part:
"
We have implemented a system taking a wholly different approach that we call “slim binaries.” The object files in our system do not really contain native code for any particular processor at all, but a highly compact, architecture-neutral intermediate program representation
"
is basically the definition of Java bytecode. A Wikipedia search for the author seems to say he came up with it first, though:
"
His doctoral dissertation, entitled "Code Generation On-The-Fly: A Key To Portable Software"[11] proposed to make software portable among different target computer architectures by way of using on-the-fly compilation from a compressed intermediate data structure at load time. Two years later, the Java programming language and system were launched and took this idea mainstream, albeit using the term "just-in-time compilation" instead of the term "on-the-fly compilation" that Franz had used.
"
https://en.wikipedia.org/wiki/Michael_Franz
Except that the slim binaries concept is not based on byte code at all, but more like a Huffman compressed abstract syntax tree. Not only are the binaries smaller, they can potentially be optimised further because no semantic information is lost.
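To give a feel for what "compressed AST instead of bytecode" means, here is a toy sketch in C, with a hand-picked prefix code standing in for a real Huffman code and an invented four-kind node set (slim binaries encoded Oberon's full syntax tree, so this only conveys the flavour): frequent node kinds get shorter bit strings, and the tree is written out pre-order, so the structure survives rather than being flattened into a linear instruction stream.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical node kinds; a real system would cover the whole language. */
typedef enum { N_VAR, N_CONST, N_ADD, N_ASSIGN } Kind;

typedef struct Node {
    Kind kind;
    int value;                  /* variable index or literal value */
    struct Node *left, *right;  /* children, NULL if absent */
} Node;

/* A fixed prefix code standing in for a real Huffman code:
   frequent kinds get fewer bits. */
static const struct { unsigned bits; int len; } code[] = {
    [N_VAR]    = { 0x0, 1 },   /* 0   */
    [N_CONST]  = { 0x2, 2 },   /* 10  */
    [N_ADD]    = { 0x6, 3 },   /* 110 */
    [N_ASSIGN] = { 0x7, 3 },   /* 111 */
};

static uint8_t out[64];
static int bitpos;

static void put_bits(unsigned bits, int len)
{
    for (int i = len - 1; i >= 0; i--, bitpos++)
        if ((bits >> i) & 1)
            out[bitpos / 8] |= 1u << (7 - bitpos % 8);
}

/* Pre-order walk: kind code first, then an operand field if the node has
   one, then the children.  A decoder can rebuild exactly the same tree. */
static void encode(const Node *n)
{
    if (!n) return;
    put_bits(code[n->kind].bits, code[n->kind].len);
    if (n->kind == N_CONST || n->kind == N_VAR)
        put_bits((unsigned)n->value & 0xF, 4);   /* toy 4-bit operand */
    encode(n->left);
    encode(n->right);
}

int main(void)
{
    /* x := x + 1 */
    Node one = { N_CONST, 1, NULL, NULL };
    Node xl  = { N_VAR, 0, NULL, NULL };
    Node xr  = { N_VAR, 0, NULL, NULL };
    Node add = { N_ADD, 0, &xr, &one };
    Node asg = { N_ASSIGN, 0, &xl, &add };

    encode(&asg);
    printf("encoded x := x + 1 in %d bits\n", bitpos);
    return 0;
}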
The Self bytecode was pretty close to an abstract syntax tree, though not huffman-compressed. How is it that nobody has mentioned Self yet in this thread? Self was doing JIT compiling from a platform-independent bytecode in the late 80s and getting within a factor of 2 of C performance by the early 90s.
Yeah Michael's dissertation was out in '94 already, and he had native Oberon running in slim binaries (aka jit'ing from compressed AST) for quite a bit before that. Thomas did the Juice stuff (aka slim binaries on the web) around mid '96 IIRC, then did a continuously optimizing JIT for Oberon, then got sucked into binary translation at Transmeta.
There were a lot of fun overlaps. Robert Griesemer did his PhD at ETH around the time Michael did, on a vector extension to Oberon for a Cray. He went on to Animorphic working on Strongtalk, and later the JVM at Sun. IIRC the assembler from Strongtalk days has been extended and used in other projects like JavaScript compilers. Animorphic had Urs Hoelzle / Lars Bak / Gilad Bracha too; Urs was an undergrad at ETH with Michael IIRC, then worked on Self at Stanford, inventing a bunch of the optimizations that went into Strongtalk and later HotSpot for Java. Etc. Etc. Small world :-)
Absolutely, bytecode was already an old idea by the time this work took place. (And Wirth, Franz's advisor, wrote compilers and books using p-code.)
The novelty was in using trees for the representation, and recompiling the entire Oberon system not just a single app. Later work by his students (Vivek+Chris iirc) looked at using arithmetic encoding to further compress things. Also Michael noticed that it prevented malformed programs, as his interest in security grew.
Another branch of the family tree worked on more compiler optimization friendly IRs, such as typed SSA instead of bytecode, and applied it to Java. The idea being to be able to do more upfront optimization work, and still transmit a safe representation that was cheap to verify.
Microsoft Office used a vaguely similar VM-based technique in the 90s, called P-Code. Cold functions in Office were compiled to a space-efficient stack-based bytecode, minimizing how much native code had to be written and validated for each architecture. https://web.archive.org/web/20010222035711/http://msdn.micro...
Not similar at all, if you compare what each did. The p-code is practically assembly for a virtual CPU which uses more or less the same resources that assembly code for the "real" CPU would, only with fewer bytes per instruction.
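A toy illustration of where the density comes from, assuming nothing about Microsoft's actual p-code encoding (these opcodes and the interpreter are invented): operands live on an implicit stack, so most instructions fit in a single byte.

#include <stdio.h>

/* Hypothetical one-byte opcodes for a tiny stack machine. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_RET };

static int run(const unsigned char *code)
{
    int stack[16], sp = 0;
    for (int pc = 0; ; ) {
        switch (code[pc++]) {
        case OP_PUSH: stack[sp++] = code[pc++]; break;          /* 2 bytes */
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break;  /* 1 byte  */
        case OP_MUL:  sp--; stack[sp - 1] *= stack[sp]; break;  /* 1 byte  */
        case OP_RET:  return stack[sp - 1];
        }
    }
}

int main(void)
{
    /* (2 + 3) * 4 in nine bytes of bytecode; no registers or addressing
       modes to spell out, which is where the compactness comes from. */
    const unsigned char code[] = {
        OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_RET
    };
    printf("%d\n", run(code));   /* prints 20 */
    return 0;
}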
In the early aughts I was kicking around an idea for minimization of binaries/libraries. The idea was to take a compiled binary plus debug information, and use disassembly to work out which code paths are reachable from a selected set of entry points (say the ones used by your application). Then code paths which are considered "dead" would be zeroed out, preparing a minimized binary.
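The core of that is just reachability over whatever call graph you can recover from the disassembly. A minimal sketch in C with a made-up graph and entry points (the hard part in practice -- indirect calls, function pointers, relocations -- is exactly what this toy skips):

#include <stdio.h>
#include <stdbool.h>

#define NFUNCS 6

/* Hypothetical call graph recovered from disassembly + debug info:
   calls[i][j] is true if function i contains a call to function j. */
static const bool calls[NFUNCS][NFUNCS] = {
    /* 0 */ { 0, 1, 0, 0, 0, 0 },
    /* 1 */ { 0, 0, 1, 0, 0, 0 },
    /* 2 */ { 0, 0, 0, 0, 0, 0 },
    /* 3 */ { 0, 0, 0, 0, 1, 0 },   /* 3 and 4 only call each other */
    /* 4 */ { 0, 0, 0, 1, 0, 0 },
    /* 5 */ { 0, 0, 1, 0, 0, 0 },
};

static bool reachable[NFUNCS];

static void visit(int f)
{
    if (reachable[f]) return;
    reachable[f] = true;
    for (int g = 0; g < NFUNCS; g++)
        if (calls[f][g]) visit(g);
}

int main(void)
{
    /* The entry points the application actually uses, e.g. 0 and 5. */
    const int entry[] = { 0, 5 };
    for (unsigned i = 0; i < sizeof entry / sizeof entry[0]; i++)
        visit(entry[i]);

    /* Everything not marked would be zeroed out in the minimized binary. */
    for (int f = 0; f < NFUNCS; f++)
        printf("function %d: %s\n", f, reachable[f] ? "keep" : "zero out");
    return 0;
}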
Speaking about C and C++ on Linux, my impression is that compilers can seldom decide that whole functions are unused, especially because most libraries and programs "export" "everything" by default, so what will actually be used can only be decided at run time -- the final linking step happens only when you start the program, performed by the loader.
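A concrete illustration of the "everything is exported" problem, using the GCC/Clang visibility attribute (the file and function names here are invented): with default visibility the toolchain has to assume any non-static function might be called through the dynamic linker, so it can never prove it dead; marking internals hidden gives the linker permission to garbage-collect them (with -ffunction-sections and --gc-sections, or LTO).

/* helper.c -- compiled into libhelper.so */

/* Exported by default: some other program might dlopen the library and
   call this, so it cannot be proven dead at link time. */
int public_api(int x) { return x + 1; }

/* Hidden: not part of the library's dynamic symbol table, so if nothing
   inside the library references it, the linker may discard it. */
__attribute__((visibility("hidden")))
int internal_helper(int x) { return x * 2; }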
The original comment pointed out "entry points", so I assumed that they were talking about libraries rather than standard dead code elimination (which has been a thing for quite some time).
> In the menu at the start, cursor-down works, but cursor-up doesn’t (he never hit cursor-up in menus during the test run). The small enemies at the start can hit you, but he didn’t get hit by any enemy shots, so in the released version of .kkrieger enemy shots deal no damage.
LLVM IR is relatively agnostic, but not so much so that you can use it to recompile for another architecture (unless you put in the effort to design an entire ABI to work around the platform-specific things that creep into it, as Apple did with arm64_32).
IIRC, LLVM IR has ABI, word length, and endianness baked in so you could generate IR that's portable to, say, any 32-bit LE processor. Basically that was Chrome PNaCl.
> And since each fat binary usually contains only a single code image per supported architecture, rather than separate versions for different implementations of each architecture, fat binaries still do not solve the problem of intra-architectural hardware variation.
$ lipo -info /System/Library/Frameworks/CoreFoundation.framework/CoreFoundation
Architectures in the fat file: /System/Library/Frameworks/CoreFoundation.framework/CoreFoundation are: x86_64 x86_64h
It works in Chrome, but most viewers will fix minor corruption in PDFs. qpdf/poppler will tell you if they detect any issues, and they both say it's fine.
So I suspect the issue is either your viewer, or the file somehow got corrupted on the path from the website to your viewer.