
Yeah, even older versions of GCC ought to work, though they don't come with ARM7 backends. (Normally GCC uses fork() to run different compiler passes, but DJGPP demonstrates that it can run without virtual memory and without extensive surgery.) C was developed on the PDP-11, where the per-process address space was 64 KiB — and though I think the larger PDP-11 models supported separate instruction and data spaces, I don't think the Unix environment (and C in particular) took advantage of them. And the BDS C compiler supported most of C under CP/M on the 8080. (It's free software now, but unfortunately it's written in 8080 assembly.)

Separate compilation was helpful not just for speeding up the edit-compile-run cycle but also for handling high-level languages in small memory spaces; if your compiler ran out of memory compiling a large source file, you could split it into two smaller source files and link them together. Getting Linux to compile that way would probably be more work than writing a new OS from scratch.
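(A contrived illustration, with made-up file names: the only thing the two halves have to share is a declaration, and the linker reassembles them as if they had been one translation unit.)

    /* part1.c -- compiled by itself with "cc -c part1.c" */
    int helper(int x);                  /* declaration shared by both halves */
    int compute(int x) { return helper(x) + 1; }

    /* part2.c -- compiled by itself with "cc -c part2.c"; linking
       part1.o and part2.o together then resolves the reference, and
       neither compilation ever needed the whole program in memory */
    int helper(int x) { return x * 2; }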

More inspiringly, though, Smalltalk-76 ran on an 8086 with 256 KiB of RAM, all kinds of MacOS software ran in 512 KiB of RAM, and the Ceres workstation built at ETH in 01987 to run Oberon had 2 MiB of RAM. So I'm confident that a fairly polished IDE experience is possible in 384 KiB of RAM and 1 MiB of fast Flash, especially if supplemented with larger off-chip fast Flash. It ought to be possible to do much better than a C-compiler-on-DOS kind of thing.

But you can clearly write a usable GUI environment using a C compiler on DOS or a Pascal compiler on a Mac SE.




C isn't difficult to parse, so you could probably automate splitting of C source files on e.g. function boundaries.
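(Something like this naive sketch, which just tracks brace depth to find the ends of top-level definitions; as the reply below points out, doing it properly takes more care because of typedefs, macros, comments, and string literals.)

    /* split.c: print candidate split points in a C file by tracking
       top-level brace depth.  Deliberately naive: it ignores comments,
       string literals, and the preprocessor, and it doesn't copy out
       the declarations each half would need. */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        FILE *in = argc > 1 ? fopen(argv[1], "r") : stdin;
        int c, depth = 0;

        if (!in) { perror("fopen"); return 1; }
        while ((c = getc(in)) != EOF) {
            if (c == '{')
                depth++;
            else if (c == '}' && --depth == 0)
                /* a top-level { ... } just closed: candidate boundary */
                printf("candidate split point at byte %ld\n", ftell(in));
        }
        return 0;
    }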

But then again, with a single-pass compiler, it shouldn't matter how large the input is, no? I know there are forks of tcc with ARM support...


You probably know all this, but it may interest other people:

Declarations change the parsing of C, so even a single-pass compiler needs to keep the declarations in memory somehow; cases like `foo * bar;` can be legally parsed either as a declaration of `bar` of type `foo*` or as a (useless but legal) void-context multiplication of `foo` and `bar`, depending on whether `foo` has been declared as a type with typedef. Plus, of course, preprocessor macros can do arbitrary things.

In PDP-11 C days it was common to put declarations of library functions directly into your code (with, of course, no argument types, since those didn't appear until ANSI C) instead of in a header file, and the header files were very small. Nowadays header files can be enormous, to the point that tokenizing them is often the bottleneck for (non-parallel) C compilation speed; often we even include extra, otherwise unnecessary header files just to facilitate sharing precompiled headers across C source files.
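(To make the `foo * bar;` case concrete, here are both readings in one file; the only difference is what `foo` means at that point.)

    typedef int foo;

    void parsed_as_declaration(void)
    {
        foo * bar;   /* foo names a type here, so this declares bar as an int* */
    }

    void parsed_as_multiplication(void)
    {
        int foo = 6, bar = 7;
        foo * bar;   /* foo is shadowed by a variable here, so this is an
                        expression statement: a multiplication whose result
                        is thrown away */
    }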

So I think it probably isn't straightforward to enable small computers to compile current C codebases like Linux.

tcc, however, would be an excellent thing to start with if you were going to try it.


I imagined a separate pass for the preprocessor, where the state would only be #defines and the current stack of #if blocks. Thus the compiler would only have to keep track of type declarations and globals (including functions). With some effort, it should be possible to encode this quite efficiently in memory, especially if strings are interned (or better yet, if the arch can do mmap, sliced directly from the mapped source). Looking at tcc, it's somewhat profligate in that compound types are described using pointer-based trees, so e.g. function declarations can blow up in size pretty fast.
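(To spell out how little state that preprocessor pass needs, a sketch with made-up field names; nothing else has to survive from line to line.)

    #include <stddef.h>
    #include <stdint.h>

    /* one entry per #define: the name and body can be slices of the
       mapped source rather than copies */
    struct macro {
        const char *name;  uint32_t name_len;
        const char *body;  uint32_t body_len;
        uint8_t nparams;              /* 0 for object-like macros */
    };

    /* conditional-inclusion state: one byte per nesting level is enough */
    enum cond_state { COND_ACTIVE, COND_SKIPPING, COND_DONE };

    struct cpp_state {
        struct macro *macros;         /* dynamic array or hash table */
        size_t nmacros;
        uint8_t if_stack[64];         /* enum cond_state, one per nested #if */
        int if_depth;
    };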


Yeah, definitely running the preprocessor as a separate process eases the memory pressure — I think Unix's pipeline structure was really key to getting so much functionality into a PDP-11, where each process was limited to 16 bits of address space.

Pointer-based trees seem like a natural way to handle compound types to me, but it's true that they can be bulky. Hash consing might keep that manageable. An alternative would be to represent types as some kind of stack bytecode: something like T_INT32 T_PTR T_ARRAY 32 T_PTR T_INT32 T_FN for the type of int f(int *(*)[32]) (assuming int is int32_t). That would be only 10 bytes, assuming the array size is 4 bytes, but kind of a pain in the ass to compute with.
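(Spelling that encoding out as actual bytes; the opcode values are arbitrary, and each opcode is one byte with the array length as a 4-byte little-endian operand.)

    #include <stdint.h>

    /* hypothetical one-byte type opcodes */
    enum { T_INT32, T_PTR, T_ARRAY, T_FN };

    /* int f(int *(*)[32]) -- built inside out: int32, pointer to it,
       array of 32 of those, pointer to the array (the parameter),
       then the int32 return type, then T_FN to combine them */
    static const uint8_t type_of_f[] = {
        T_INT32, T_PTR,
        T_ARRAY, 32, 0, 0, 0,      /* 32 as a 4-byte little-endian length */
        T_PTR,
        T_INT32,                   /* return type */
        T_FN                       /* pops return type + one parameter type */
    };
    /* sizeof type_of_f == 10 bytes, versus several heap nodes for a tree */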

Interned strings — pointers to a symbol object or indexes into a symbol array — can be bigger than the underlying byte data. Slices of an mmap can be even bigger. This is of course silly when you have one of them in isolation — you need a pointer to it anyway — but it can start to add up when you have a bunch of them concatenated, like in a macro definition where you have a mixture of literal text and parameter references.
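(Concretely, assuming a 32-bit target and made-up type names:)

    #include <stdint.h>

    /* a reference to an interned symbol: 2 bytes no matter how long the name is */
    typedef uint16_t sym_id;

    /* a slice of the mmapped source: 8 bytes on a 32-bit target, even when
       it refers to a one-character identifier */
    struct slice { const char *start; uint32_t len; };

    /* so a macro body stored as an array of slices pays 8 bytes per
       fragment of literal text or parameter reference, where storing the
       literal text inline plus 2-byte parameter references would often
       pay only for the characters themselves */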



