C compiler with support for structs written in Assembly

mmastrac · on Aug 27, 2018

I started working on a similar VM-based bootstrap for getting from bare metal to C compiler. If anyone is interested in collaborating let me know.

The idea is that you implement a simple, ASCII-based VM for your platform and that's enough to bootstrap a number of different assemblers, to a basic C compiler, to a full-fledged C compiler and (very minimal) POSIX environment.

The goal is twofold: a base for trusted compilation like this one, and a way to guarantee long-term executability of various archiving programs (ie: guarantee that you can unrar a file in 20 years with minimal work).

EDIT: very rough repo is here - https://github.com/mmastrac/bootstrap

giomasce · on Aug 27, 2018

I am also doing something like that here:

https://gitlab.com/giomasce/asmc

My philosophy is of having a very simple and custom language (which I called G) for which it is easier to write a compiler in (x86) Assembly, then write a C compiler in G that is good enough to compile tcc. Then tcc is known to be able to compile gcc 4.7.4 (the last version which does not require a C++ compiler).

My C compiler is starting to have a shape, and in theory (if I find the time to work on it) it should not be far from supporting all that tcc needs.

The README in the linked page contains more information.

I'll look at your code too!

mmastrac · on Aug 27, 2018

Very cool. If I could port your G compiler to my VM that would basically get me where I wanted to be. I'll take a look at it!

giomasce · on Aug 27, 2018

In theory yes, but at the moment my C compiler is very tied to the x86 platform (there is no separated code generation module, emit calls are directly in the code). Since the compiler is very simple, though, it should not be difficult factor out the handful of machine code gadgets that need to be implemented and add support for other back ends, I think. Or you might also other consider other ways, like using the toolchain in OP's repository.

If you want to have some brainstorming on #bootstrappable at freenode, there are often interesting discussions on these things.

akkartik · on Aug 27, 2018

Wow, great to meet both of you. I've been chasing bootstrapping as well. My approach has been to make programming in raw machine code more ergonomic with some simple text-based tools (that should in principle be easy to write in machine code)

My repo is at https://github.com/akkartik/mu/tree/master/subx#readme

akavel · on Aug 27, 2018

Did you consider Oberon?

lboasso · on Aug 27, 2018

as mentioned in another comment, an example of bootstrapping compiler for Oberon on the JVM is: https://github.com/lboasso/oberonc

weberc2 · on Aug 27, 2018

> guarantee that you can unrar a file in 20 years with minimal work

Does this take into account a potential change in the predominant architecture? I.e., we move from x86 to fooarch? Presumably there's more work than "implement the VM in fooarch instructions"? You'd have to write a fooarch assembler as well, right? As well as fooarch C compiler backends?

mmastrac · on Aug 27, 2018

As the code is compiled to a well-specified, small VM it should only be a matter of writing the VM in whatever language is already available for that particular platform/architecture.

You can also choose to write the VM in raw assembly. While this isn't ideal, the VM itself is mainly just straightforward register operations and should map trivially to any hardware that has bit operations and hardware 32-bit multiply/divide.

If it comes down to it, you can implement the VM itself on bare metal, but you'll need to do some work implementing things like a filesystem (not terribly hard to get a basic, non-scalable one up-and-running).

I suppose there's an assumption that the platform provides 32-bit integers, but I _think_ that's a safe assumption.

pjmlp · on Aug 28, 2018

This is the genesis of P-Code design for Pascal.

Originally Wirth only planned to use it for bootstrapping purposes, then UCSD decided to make a whole platform on top of it.

ailideex · on Aug 27, 2018

Why not just cross compile!?

phoe-krk · on Aug 27, 2018

Bootstrapping isn't an issue of convenience, it is an issue of trust. You can't trust the compiler doing the cross-compilation. You literally have to start from the smallest chunks of assembly code and build your way up to a fully featured compiler through several stages, each of which is more complex and can in turn compile more complex code.

nickpsecurity · on Aug 27, 2018

That it's an issue of trust is the reason that's not true. You know certain groups are highly-unlikely to work together on a backdoor that works same for all of them. There's also folks like Wirth's group at ETH that's unlikely to be backdooring things at all. So, the easiest route is to write your bootstrap phase is several languages that use tools from very different people and countries. You can trust it once they produce the same output. Use that output to bootstrap the rest. Also, do it on different hardware and OS's if concerned about that level. Make sure the CPU's were done at different fabs.

Aside from trust, the other reason people are doing this is for fun challenge of building things from ground up. They also are learning about interpreters, compilers, assembly, etc. The author of this work talked like he is doing it the way he is mainly for the challenge.

haberman · on Aug 28, 2018

I think this is a very cool concept, but it doesn't seem to protect you from your environment. When you write the program in your editor, how do you know it's not inserting rogue code before it writes them to disk? When you run an assembly program stored on disk, how do you know the OS or even the hardware isn't patching it before it runs it?

weberc2 · on Aug 27, 2018

This still doesn't get you all the way, since you're ultimately trusting your chip manufacturer.

cellularmitosis · on Aug 27, 2018

If we are just talking about bootstrapping, you could also homebrew a CPU: https://news.ycombinator.com/item?id=13208516

Creating a backdoor distributed across a bunch of 7400-series logic chips would be pretty unlikely.

giomasce · on Aug 27, 2018

Because when you cross compile you have to trust your compiler. https://en.wikipedia.org/wiki/Backdoor_(computing)#Compiler_...

ailideex · on Aug 27, 2018

And when you compile regularly you don't?

giomasce · on Aug 27, 2018

Yes, but if you want to rebuild the trust chain you cannot restart from an untrusted compiler. The idea is to bootstrap a trusted compiler from nothing (i.e., from little enough so that you can check it directly), and then use the trusted compiler for everything else.

geezerjay · on Aug 27, 2018

When you compile regularly you don't, mainly because you're more interested in ensuring your code isn't broken and you have no intention to ship the binaries to anyone.

Filligree · on Aug 27, 2018

As for 'why bootstrap'...

Since Reflections on Trusting Trust has been linked already, I'm going to offer something else. Today's nightmare: https://www.teamten.com/lawrence/writings/coding-machines/

silisili · on Aug 28, 2018

Thanks for this, I'd never read it before and it was very interesting. But by the end I had more questions than answers. In this situation, my first thought would never be 'i need to write my own compiler', vs say, trying any combo of the handful of c compilers on other boxes. Also it's not clear to me about the ending. Was this a real story? I couldn't tell if the 'letter' was meant to be a dystopian what-if, real, or anything in between.

Filligree · on Aug 28, 2018

The entire page is fiction, presumably. Hopefully.

That being said, it's not professional fiction. It's possibly the first work of the writer, and works because of talent (probably) and a good premise (definitely), but it's sorely lacking in polish and editing. I wouldn't look too closely at the cracks. :)

lathiat · on Aug 28, 2018

I'm going to have nightmares for a decade after reading that...

jtjj222 · on Aug 28, 2018

Not related, but Coding Machines reminded me of the relatively unknown movie Pontypool, which has a similar premise (I won't spoil it).

Filligree · on Aug 28, 2018

I looked it up on Wikipedia, sorry. :)

It does look like an interesting movie. I think I'll take a look.

exikyut · on Aug 27, 2018

Vital details:

From http://git.savannah.nongnu.org/cgit/stage0.git/tree/README (which also declares GPL3):

> This is a set of manually created hex programs in a Cthulhu Path to madness fashion. Which only have the goal of creating a bootstrapping path to a C compiler capable of Compiling GCC, with only the explicit requirement of a single 1 KByte binary or less.

> Additionally, all code must be able to be understood by 70% of the population of programmers. If the code can not be understood by that volume, it needs to be altered until it satifies the above requirement.

From https://savannah.nongnu.org/projects/stage0/:

> A class of minimal bootstrap binaries that has a reproducible build on all platforms. Providing a verifiable base for defeating the trusting trust attack.

Cooooooool.

rasengan · on Aug 27, 2018

I am in the bottom 30 percentile of programmers. :(

shawn · on Aug 27, 2018

Everyone was. Work hard and force yourself to learn something alien each day.

todd8 · on Aug 27, 2018

Shawn’s perspective is great. Like chess players or golfers, there is (almost) always someone better, but practice improves our ability. Programming takes practice, but it gets more satisfying the more you do of it.

My first programming language was Fortran, but I couldn’t get my first program to compile! I never did. I switched to something easier, Basic. Then went back to Fortran successfully. I toyed with programming for years before deciding on it as a career.

Ten years from my first attempts I was a good programmer doing really interesting things.

In twenty years I was architecting distributed systems for IBM’s new AIX operating systems.

In thirty years I was chief scientist at a very successful software company.

Key to my success was to keep learning from people smarter than me by reading, taking classes, and practice doing hard things. Although it has now been fifty years, I still love programming.

brutal_chaos · on Aug 27, 2018

[A bit OT] Brilliantly said. I always enjoy reading encouraging statements like these; especially in response to self doubt.

giomasce · on Aug 27, 2018

I would not say that, in its current form, that code is understandable by 70% of programmers, unless you define that word "programmer" in a very restrictive sense.

gmueckl · on Aug 27, 2018

It would be kind of cool to start small with entirely verifiable hardware for the first bootstrapping stages.

I am phantasizing about a sort of ceremony in which the whole bootstrap process is done live in front of an audience starting with a discrete computer (using e.g. this board as a CPU https://monster6502.com), absolutely no electronic non-volatile memory and the first programs read into the computer from punch cards or punch tape. This would be used to create later stages for more powerful hardware and the end result (after maybe one or two hardware switches) is hopefully a minimal C compiler or similar that can be used to bootstrap production compilers like GCC. Ideally, this binary is shown to be completely identical to a binary built by a typical build process.

Even if such a ceremony is ultimately not very useful, it could still be seen as a kind of artistic performance.

pnathan · on Aug 27, 2018

Very, very nice. Not often you get to study non-trivial assembly programs.

Some context is this: https://bootstrapping.miraheze.org/wiki/Stage0

akkartik · on Aug 27, 2018

This is fantastic. I see it's mostly written by the author of the project. Why the heck isn't it in the repo's Readme?!

lboasso · on Aug 27, 2018

Bootstrapping a compiler is fun. I wrote a self-hosting compiler for the Oberon-07 language, targeting the JVM: https://github.com/lboasso/oberonc

Initially the project was written in Java, after enough features were working the bootstrap phase could start.

pome · on Aug 27, 2018

If someone is interesting about how to bootstrapping, this[0][1] tutorial is just awesome, tiny compiler from nothing (raw machine codes and hex!) to the self-hosting!

[0] https://web.archive.org/web/20120712111627/http://www.rano.o... [1] https://github.com/certik/bcompiler (Fork in GitHub)

Also worth check - https://www.t3x.org/t3x/book.html :-)

young_unixer · on Aug 27, 2018

Then how do they usually bootstrap C? Do they write a C compiler without structs, then program a C with structs compiler using C without structs?

garmaine · on Aug 27, 2018

They cross-compile on a machine that already has a C compiler.

fb03 · on Aug 27, 2018

But how was the first compiler bootstrapped? Serious question (how can a compiler of X language be written in X?)

coliveira · on Aug 27, 2018

They wrote in another language. I think the bootstrap C compiler was written in BCPL.

pieterr · on Aug 27, 2018

More info can be found here (straight from the source...)

https://www.bell-labs.com/usr/dmr/www/chist.html

https://www.bell-labs.com/usr/dmr/www/primevalC.html

garmaine · on Aug 27, 2018

And, for example, C++ was originally a C program that transpiled C++ -> C.

sigjuice · on Aug 28, 2018

No, it wasn’t. “Transpile” is not a real word. There is no need to desecrate the first ever C++ compiler.

garmaine · on Aug 28, 2018

https://en.m.wikipedia.org/wiki/Source-to-source_compiler

https://en.m.wikipedia.org/wiki/Cfront

sigjuice · on Aug 28, 2018

“Source-to-source” is redundant. You can just shorten this to “compiler”.

pawelmurias · on Aug 28, 2018

It's not redundant. A "source-to-source" compiler isn't a "source-to-machine code" compiler

sigjuice · on Aug 28, 2018

It is redundant. See page 1 of most compiler books.

Compilers: Principles, Techniques, and Tools

Simply stated, a compiler is a program that can read a program in one language - the source language - and translate it into an equivalent program in another language - the target language

Engineering a Compiler

Compilers are computer programs that translate a program written in one language into a program written in another language

chrisseaton · on Aug 28, 2018

The word ‘transpiler’ has been used in academic literature since at least the late 60s.

sigjuice · on Aug 28, 2018

It has been (mis)used in a very tiny amount of academic literature. It obviously did not catch on for what one might only hope were good reasons. There is no good reason to promote this now.

zadler · on Aug 27, 2018

Thsts awesome. Id love to see a lineage tree of early language development.

tempodox · on Aug 27, 2018

What assembler syntax is this? The comments don't say.

z29LiTp5qUC30n · on Aug 27, 2018

M0 macro assembly, it is implemented here: http://git.savannah.nongnu.org/cgit/stage0.git/tree/stage1/M... Which is Built in Hex2 which is implemented here: http://git.savannah.nongnu.org/cgit/stage0.git/tree/stage1/s... Which is built in Hex1 which is implemented here: http://git.savannah.nongnu.org/cgit/stage0.git/tree/stage1/s... Which is self-hosting and implemented here: http://git.savannah.nongnu.org/cgit/stage0.git/tree/stage1/s...

And was written using the bare metal text editor found here: http://git.savannah.nongnu.org/cgit/stage0.git/tree/stage1/S...

meffie · on Aug 27, 2018

Looks to be bytecode for a vm;

http://git.savannah.nongnu.org/cgit/stage0.git/tree/vm_instr...

userbinator · on Aug 27, 2018

I don't recognise the architecture despite it saying x86, but it reminds me of ARM.

Recursive descent seems to be the go-to parsing technique for compilers both big and small now. I like how all the repetitive functions for each level have been refactored into a "general_recursion" function, but if you want to make it even simpler and yet more extendable, table-driven precedence climbing would be ideal:

http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing

chrisseaton · on Aug 27, 2018

> Recursive descent seems to be the go-to parsing technique for compilers both big and small now ... if you want to make it even simpler and yet more extendable, table-driven precedence climbing would be ideal

I think the fact that everyone just writes recursive descent parers tells us in practice that there isn't sufficient value in using more techniques like table-driven variants and they don't make anything practically simpler.

userbinator · on Aug 27, 2018

I agree about table-driven LR and the like (LALR, LLR, SLR, etc) which were the "traditional YACC" generated parsers, but precedence climbing is just a simple refactoring of recursive descent that eliminates needing multiple almost-the-same functions for each precedence level. This is most applicable to languages like C which have many levels, which could be why it isn't so common for others.

dfox · on Aug 27, 2018

That depends on what you are parsing. For in-fix expressions shunting yard algorithm is significantly more clear (ie. you don't have to resort to left-right recursion tricks and invent labels like "product") and allows you to extend the set of supported operators by simply adding table entry (which you can do even while parsing and thus support user-defined operators)

chrisseaton · on Aug 27, 2018

> For in-fix expressions shunting yard algorithm is significantly more clear

If it was significantly more clear, people would use it in practice! This makes me think it is not in fact significantly more clear.

I did research work in parsers, and I work professionally in compilers now, and guess what when I need a parser for in-fix expressions I just write a recursive descent one manually, it's never an issue.

dfox · on Aug 27, 2018

It is significantly more clear if you only parse infix expressions which is why it is used for many introductory "lets write a calculator" examples.

In context of more complex language with small set of infix operators and their precedence classes it is probably not worthwhile unless you really want user-defined operators.

mncharity · on Aug 27, 2018

My very fuzzy recollection is there's long-ago work compiling user-defined mixfix operator precedence parsing down to recursive descent, but yes, opp implementation is pretty for large complex user-defined operator sets.

z29LiTp5qUC30n · on Aug 27, 2018

That is because it is compiling for x86. It is the Knight Architecture, designed for TTL implementation

pecg · on Aug 27, 2018

This program looks interesting. Maybe it could help in the bootstrap process of tcc, though I don't know if it is written in ANSI C. On a related note, I will try to test compiling lua's runtime environment and interpreter, because I'm sure it is written using standard C already.

nickpsecurity · on Aug 27, 2018

Part of the goal of these efforts is doing that. For example, here's a page tracking which tools can compile which others:

https://bootstrapping.miraheze.org/wiki/C_compilers

ailideex · on Aug 27, 2018

> Maybe it could help in the bootstrap process

Why would it be more useful than just cross compiling?

another-cuppa · on Aug 27, 2018

Just a few hours ago there was a conversation in #gentoo-chat about the Thompson attack. I wondered how difficult it would be to write a C compiler that can compile GCC. How far away is this?

chubot · on Aug 27, 2018

Isn't GCC written in C++ now? I think they are probably aiming for a very old version of GCC written in C (or that's all they can really hope for).

The now-defunct Aboriginal Linux project was doing a similar sort of bootstrapping, and the build dependencies for GCC was a big issue:

http://landley.net/aboriginal/

To work around this, it never used anything later than gcc-4.2.1 from 2007, while we're now on GCC 8.2.

http://gcc.gnu.org/releases.html

EDIT: Yes, it appears GCC uses C++: https://lwn.net/Articles/542457/

earenndil · on Aug 27, 2018

Yes, but you can use gcc 4.2.1 to compile a newer version which does use c++.

dleslie · on Aug 27, 2018

It just needs to compile enough C to compile a compiler that can compile GCC.

em3rgent0rdr · on Aug 27, 2018

Yes. And should emphasize that the bootstrapping compiler doesn't need to be an optimizing compiler, and doesn't need to compile fast, since it is only to be used once for compiling the next stage of compiler.

z29LiTp5qUC30n · on Aug 27, 2018

Well, there is work to have MesCC bootstrapped off this: https://gitlab.com/janneke/mes Which can build GCC.

So literally once that step is done, we would have a full path...

sam0x17 · on Aug 27, 2018

This is super impressive. What might be some compelling technical reasons to use this over existing C implementations? I just don't know so I thought I'd ask.

pavelbr · on Aug 27, 2018

If I wanted to actually run this compiler, how would I build it?

z29LiTp5qUC30n · on Aug 27, 2018

git clone 'https://git.savannah.gnu.org/git/stage0.git' make test

That should do the full bootstrap in about 1 minute

newnewpdro · on Aug 27, 2018

I'd appreciate a bare minimum C compiler capable of building gcc written in either assembly or x86 machine code that can be trivially audited in as-executed form.

jacquesm · on Aug 27, 2018

That's impressive.

I wished I had time today to put this through its paces.

z29LiTp5qUC30n · on Aug 27, 2018

The C version is alot easier to read: https://github.com/oriansj/M2-Planet

maxpert · on Aug 27, 2018

Mind blown