I'm getting pretty tired of this false advertising. This is not a C compiler. It lacks many crucial features required to compile most larger C programs. I do have to say it's impressive what they have squeezed into 512 bytes.
This is technically true! What is posted here is not a C compiler. It is an implementation of a subset of C.
I'd prefer that the title mentioned that honestly, e.g.: "A compiler for a subset of C that fits in the 512-byte boot sector."
It is still a remarkable feat. But honestly, when I read the original title I was in complete disbelief that someone could implement a whole C compiler in 512 bytes.
But with the new context that it is a subset of C (not the whole language), the initial surprise is gone. It is still very impressive, though.
Or even: an interpreter. It compiles and executes on the fly, in RAM, function by function. It doesn't even compile the whole input, just a bit at a time, immediately executes that bit before moving on to the next, and doesn't save the compilation result anywhere. To me, that's an interpreter.
So it's a C-subset interpreter.
And a very cool one. This is not a denigration or critique at all, just terminology.
I think it's perfectly fine for a bootstrapper to implement a drastic subset. Bootstrappers are already drastically limited in countless other ways anyway, like not knowing how to use any of the fancier hardware, networking, etc. A Forth bootloader is a full Turing-complete language that can eventually do anything, but initially it can do almost nothing itself besides use BIOS-provided features and start interpreting code, which then provides more functionality.
To be fair, the first sentence states that it supports a subset of C.
>SectorC is a C compiler written in x86-16 assembly that fits within the 512 byte boot sector of an x86 machine. It supports a subset of C that is large enough to write real and interesting programs.
The post title could include this, but perhaps it's a little verbose.
In any case, agreed it's impressive to fit it in 512 bytes!
Sorry to nitpick, but it's the second sentence that mentions it only supports a subset, not the first. And the first sentence calls it a "C compiler" without qualification.
Yes, it's a great topic but since it had significant attention in the last year, it counts as a dupe for now. This is in the FAQ: https://news.ycombinator.com/newsfaq.html.
When we talk about these sector$LANG implementations, I'm guessing we're talking about the boot sector that the BIOS recognizes, right?
Does the 512 byte limit for a boot sector exist in UEFI too? I don't know much about UEFI so if someone could educate me about how the boot sector and its size limit differs in UEFI, I'd love to know.
No, UEFI loads PE executables from a special partition called the EFI System Partition or ESP. There's no real size restriction there as far as I know.
Before the ESP is accessed, there is no standardized way to customize the boot process. You could put these kinds of sectorX toys into the firmware directly, which would come with more constraints, but it would be vendor-specific.
There is also a platform-independent VM for EFI Byte Code (EBC), which is part of the UEFI specification and allows you to extend the system with things like additional drivers, but those are also loaded from the ESP.
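For a concrete picture of the normal path: the firmware reads a FAT filesystem on the ESP and runs a PE executable from a well-known location. The x64 fallback path is standardized; the distro entry below is just a typical example:

  /EFI/BOOT/BOOTX64.EFI    <- standardized fallback loader path on x64
  /EFI/ubuntu/shimx64.efi  <- typical distro-installed loader (example)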
Thanks for the answer! I've got some more questions now. Sorry, but if anyone is willing to take a stab at these questions, it'd be helpful to me.
1. IIUC PE executables are Windows executables. So a Linux system that targets UEFI ends up writing a PE executable to the EFI System Partition?
2. I know that some UEFI firmwares (or is it all?) support BIOS boot sectors as a backward-compatibility feature. How does that work? If I write a "hello world" program in pure machine code in the first sector of the boot disk, would UEFI read and execute it? How would it even know whether what's in the first sector is valid code or garbage? By checking the magic 0x55 0xAA at the end of the boot sector?
> 1. IIUC PE executables are Windows executables. So a Linux system that targets UEFI ends up writing a PE executable to the EFI System Partition?
Short answer: Yes.
Longer answer: Yes, but it is not exactly a classic Windows desktop application. A PE file at its core is just a way to store data in a somewhat structured way: it has metadata (like imports/exports and the target architecture) and actual contents (like data and code). I think it is a good idea to reuse an existing, common executable file format for this use case. That the choice was PE instead of ELF or Mach-O is, IMO, the result of Microsoft being in the UEFI consortium; they are a big player and likely pushed for it. Whether ELF would have been a better choice, I cannot say, but at this point I don't think it matters.
1. Yes, that's exactly right. It installs a GRUB version compiled into a PE executable, typically.
2. Boot sector support requires UEFI CSM support, which most PCs have (but ARM based devices and Intel Macs don't). With the deprecation of 32-bit architectures in a lot of operating systems, some vendors also started dropping CSM support in some devices, but most devices still ship with CSM either enabled by default or as an option in the settings.
A classical PC master boot record does not actually have 512 bytes for code, as it also contains the partition table and a signature; you get 446 bytes for code. I'm not sure what exactly the BIOS validates; you might be able to get away with an invalid partition table. In general there is not really any limit unless you want to be compatible with something existing: you can define whatever disk layout you like. At worst you will have to load additional sectors yourself, because the BIOS has no clue where you put them. I no longer remember what a floppy boot sector looks like or how much room you have there.
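For reference, the classic MBR layout as a C struct (field names are mine; the offsets and sizes are the standard ones):

  #include <stdint.h>

  /* Classic PC Master Boot Record: exactly one 512-byte sector. */
  struct mbr {
      uint8_t code[446];         /* boot code: all the room you actually get */
      uint8_t partitions[4][16]; /* four 16-byte partition table entries */
      uint8_t signature[2];      /* 0x55, 0xAA - what the BIOS checks for */
  };

  /* compile-time check that the layout adds up: 446 + 64 + 2 = 512 */
  _Static_assert(sizeof(struct mbr) == 512, "MBR must be 512 bytes");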
It's been a long time since I've done ASM but do I understand it right that this implementation compiles each function and then executes it immediately? Or does it really compile the whole source code and then execute the binary generated?
And where is the compiled binary saved? Is it kept temporarily in memory itself for immediate execution? Or is the compiled binary saved back to the disk?
If someone could point me to the right sections of the code that answer these questions, it'd be of great help! Thanks!
Looks like a recursive-descent parser that emits instructions into memory as it parses, then executes them immediately (from sectorc.s):
  ;; done compiling, execute the binary
  execute:
    push es        ; push the codegen segment
    push word [bx] ; push the offset to "_start()"
    push 0x4000    ; load new segment for variable data
    pop ds
    retf           ; jump into it via "retf"
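In C terms, this is the classic emit-then-jump trick: generate machine code into a buffer, then call it through a function pointer. A rough hosted-OS sketch (SectorC does the 16-bit real-mode equivalent with a raw segment and retf; note that some OSes forbid writable+executable pages):

  #include <string.h>
  #include <sys/mman.h>

  int main(void) {
      /* x86-64 machine code for: mov eax, 42 ; ret */
      unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

      /* a page we can write to and then execute from */
      void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      memcpy(buf, code, sizeof code);

      /* the push/push/retf above is the 16-bit far-call version of this */
      int (*fn)(void) = (int (*)(void))buf;
      return fn(); /* exit status 42 */
  }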
Folks here are probably reading too much into the "boot sector" angle. This project, like many others on HN, is best understood as "doing something in a constrained way because it's a fun challenge." As to why 512 bytes, the honest answer is "because it's a round number with some vague retro connotations."
There was never a constraint on x86 that you had to fit any real functionality into a single disk sector. All the boot-sector code ever did was set up the registers and call the BIOS to load more data from disk - and the only reason this was done in stages was so that the BIOS wouldn't have to know the particulars of your OS, such as where to find the kernel or what address to load it at.
For what it's worth, the pre-boot BIOS environment was always quite full-featured, offering text and VGA graphics, disk and keyboard handling, and so forth. In fact, when you look at something like MS-DOS, it was actually the BIOS-side code doing a lot of the heavy lifting. MS-DOS was halfway between a shell and a "real" OS as it is understood today.
Nowadays, with UEFI, we essentially decided to make the initial code executed by the CPU an operating system in itself, with filesystem support and so forth - and the notion of a 512 byte first-stage bootloader is largely gone.
I'm not sure the int type actually does anything. It may just be typeless: there's no difference between an int and a pointer, and if you want to treat an int as a pointer, you just dereference it.
The code just ignores variable declarations. All variables are 16-bit words.
It just hashes the variable name and uses twice the hash value as the variable's address in memory.
For functions, it instead uses twice the hash to store and look up the function address at compile time, dedicating the whole compiler data segment (except, it seems, ds:0) to that table.
Note that in segmented x86-16, the code (plus constants) and stack are in dedicated segments, and string functions used to write the generated code write to yet another segment selected by es.
This seems to be the best strategy for compiler code size, although obviously it's vulnerable to hash collisions and only supports global non-array variables.
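Here's the scheme sketched in C (the hash below is hypothetical, not SectorC's actual one; the point is that name -> hash -> fixed address means no symbol table at all):

  #include <stdint.h>

  /* the variable "segment": every variable lives at ds:(2*hash) */
  static uint16_t data_seg[32768]; /* 64 KiB of 16-bit words */

  /* hypothetical token hash - SectorC's real one differs */
  static uint16_t hash_name(const char *name) {
      uint16_t h = 0;
      while (*name)
          h = (uint16_t)(h * 31 + (uint8_t)*name++);
      return h & 0x7FFF;
  }

  /* a variable reference compiles straight to a fixed slot */
  static uint16_t *var_slot(const char *name) {
      return &data_seg[hash_name(name)];
  }

So `*var_slot("x") = 5;` is all an assignment compiles to; two names that hash to the same value silently share a slot, which is the collision vulnerability mentioned above.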
Hard disagree with "nothing"! This could be a stage 1 compiler for running an entire OS from source. Tcc already demonstrated this running Linux, but it takes significantly more than 512 bytes.
Other than "look at this really cool thing I made", I'm thinking it might have some nice chain-of-trust use cases. A 512-byte binary isn't hard to verify by hand, whereas nobody can hope to verify that a modern multi-MB `cc` binary doesn't contain a backdoor-maker for someone less trustworthy than Ken Thompson.
How does an instruction set solve supply chain trust?
I suggest you reconsider [1] & [2], but apply the thought process in the context of a physical CPU. The problem is not source access or patents, the problem is you don't have an electron microscope to look at these nanometers-wide elements; even if you did, it's still an insanely difficult process.
Very simply: by not requiring Intel or AMD to produce the hardware that implements the ISA.
It's true, but irrelevant, that you can't trust a RISC-V CPU from Intel any more than an x86 CPU from Intel.
But I think the bulk of the x86 and amd64 patents from the '90s have already expired, so other suppliers could be designing and fabricating x86_64 chips royalty-free and open source today just as well; RISC-V is interesting at least as much for its simpler start-over design as for its unencumbered IP.
You'd also need to load the source code from disk. At the point where you're doing that, the 512-byte limit isn't really relevant anymore, and adding "loading code from disk" within the 512 bytes isn't exactly realistic.
Don’t you need to read a directory structure to know where the file’s sectors are, how many there are, etc, and thus need to understand the file system?
You could do something really quick-and-dirty: write the code into sequential sectors and put a sentinel value in the last sector to mark where the code ends. It wouldn't be at all good for in-place editing, but it would be easy to load and parse.
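A sketch of building such an image on the host side (the file names and sentinel byte are made up for illustration; it assumes the source contains no sentinel byte and isn't an exact multiple of 512 bytes):

  #include <stdio.h>
  #include <string.h>

  #define SECTOR   512
  #define SENTINEL 0x00 /* hypothetical end-of-source marker */

  int main(void) {
      FILE *img  = fopen("disk.img", "wb");  /* output image       */
      FILE *boot = fopen("boot.bin", "rb");  /* 512-byte stage     */
      FILE *src  = fopen("program.c", "rb"); /* source to append   */
      unsigned char sector[SECTOR];
      size_t n;

      /* sector 0: the boot sector itself */
      fread(sector, 1, SECTOR, boot);
      fwrite(sector, 1, SECTOR, img);

      /* sectors 1..N: raw source text, the final partial sector
         padded with the sentinel so the loader knows where to stop */
      while ((n = fread(sector, 1, SECTOR, src)) > 0) {
          memset(sector + n, SENTINEL, SECTOR - n);
          fwrite(sector, 1, SECTOR, img);
      }

      fclose(src); fclose(boot); fclose(img);
      return 0;
  }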
It would be far more useful if it supported function arguments. I wonder how many of the other features they'd have to give up to support those within 512 bytes.
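For what it's worth, the usual workaround when everything is global is to pass arguments through dedicated globals - hypothetical code in a SectorC-style subset:

  /* "arguments" are just globals the caller fills in first */
  int add_a; int add_b; int add_ret;

  void add() { add_ret = add_a + add_b; }

  void main() {
      add_a = 2; add_b = 3;
      add(); /* add_ret is now 5 */
  }

It's clunky and rules out recursion, but it costs the compiler nothing.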
This is scary. It hides in the boot sector and can compile tiny C apps to bootstrap malware. Wipe the system, rebuild, and the blackhat is soon back in; rinse and repeat. The end solution ... destroy the box.
I don't see what makes a compiler in the boot sector scarier in malware terms than any other program… would you elaborate?
Like, how does the malware benefit from shipping its own source code and a tiny compiler at boot time, over just booting directly into the compiled malware?
Some viruses intercept the interrupt handler, detect if you're trying to write to the boot sector, and then either fake the write or change the sector. I believe some forms of ParityBoot (B?) did this. So you need to be sure you're booting from a clean medium, which in the case of some of these boot viruses might not be that easy, since a lot of your disks and floppies might already be infected.
Some viruses also used extra space at the end of the partition table or the end of the disk to store themselves, so they wouldn't be constrained by the 512-byte limit (minus the metadata in the boot sector).