I'm getting pretty tired of this false advertising. This is not a C compiler. It lacks many crucial features required to compile most larger C programs. I do have to say it's impressive what they have squeezed into 512 bytes.
This is technically true! What is posted here is not a C compiler. It is an implementation of a subset of C.
I'd prefer that the title mentioned that honestly, e.g.: "A compiler for a subset of C that fits in the 512-byte boot sector."
It is still a remarkable feat. But honestly, when I read the original title I was in complete disbelief that someone could implement a whole C compiler in 512 bytes.
But with the new context that it is a subset of C (not the whole language), the initial surprise is gone. It is still very impressive, though.
Or even: an interpreter. It compiles and executes on the fly, in RAM, function by function. It doesn't even compile the whole input, just a bit at a time, immediately executes that bit before moving on to the next, and doesn't save the compilation result anywhere. To me, that's an interpreter.
So it's a C-subset interpreter.
And a very cool one. This is not a denigration or critique at all, just terminology.
I think it's perfectly fine for a bootstrapper to implement a drastic subset. Bootstrappers are already drastically limited in countless other ways anyway, like not knowing how to use any of the fancier hardware, networking, etc. A Forth bootloader is a full Turing-complete language that can eventually do anything, but initially it can do almost nothing itself besides use BIOS-provided features and start interpreting code, which then provides more functionality.
To be fair, the first sentence states that it supports a subset of C.
>SectorC is a C compiler written in x86-16 assembly that fits within the 512 byte boot sector of an x86 machine. It supports a subset of C that is large enough to write real and interesting programs.
The post title could include this, but perhaps it's a little verbose.
In any case, agreed it's impressive to fit it in 512 bytes!
Sorry to nitpick, but it's the second sentence that mentions it only supports a subset, not the first. And the first sentence calls it a "C compiler" without qualification.
Yes, it's a great topic but since it had significant attention in the last year, it counts as a dupe for now. This is in the FAQ: https://news.ycombinator.com/newsfaq.html.
When we talk about these sector$LANG implementations, I'm guessing we're talking about the boot sector that the BIOS recognizes, right?
Does the 512 byte limit for a boot sector exist in UEFI too? I don't know much about UEFI so if someone could educate me about how the boot sector and its size limit differs in UEFI, I'd love to know.
No, UEFI loads PE executables from a special partition called the EFI System Partition or ESP. There's no real size restriction there as far as I know.
Before the ESP is accessed, there is no standardized way to customize the boot process. You could put these kinds of sectorX toys into the firmware directly, which would come with more constraints, but it would be vendor-specific.
There is also a platform-independent VM for EFI Byte Code (EBC), which is part of the UEFI specification and allows you to extend the system with things like additional drivers, but those are also loaded from the ESP.
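For a concrete picture of the normal path: the firmware reads a FAT filesystem on the ESP and runs a PE executable from a well-known location. The x64 fallback path is standardized; the distro entry below is just a typical example:

  /EFI/BOOT/BOOTX64.EFI    <- standardized fallback loader path on x64
  /EFI/ubuntu/shimx64.efi  <- typical distro-installed loader (example)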
Thanks for the answer! I've got some more questions now. Sorry, but if anyone is willing to take a stab at these questions, it'd be helpful to me.
1. IIUC PE executables are Windows executables. So a Linux system that targets UEFI ends up writing a PE executable to the EFI System Partition?
2. I know that some UEFI firmwares (or is it all?) support BIOS boot sectors as a backward-compatibility feature. How does that work? If I write a "hello world" program in pure machine code in the first sector of the boot disk, would UEFI read and execute it? How would it even know whether what's in the first sector is valid code or garbage? By checking the magic 0x55 0xAA at the end of the boot sector?
> 1. IIUC PE executables are Windows executables. So a Linux system that targets UEFI ends up writing a PE executable to the EFI System Partition?
Short answer: Yes.
Longer answer: Yes, but it is not exactly a classic Windows desktop application. A PE file at its core is just a way to store data in a somewhat structured way: it has metadata (like imports/exports and the target architecture) and actual contents (like data and code). I think it is a good idea to reuse an existing, common executable file format for this use case. That the choice was PE instead of ELF or Mach-O is, IMO, the result of Microsoft being in the UEFI consortium; they are a big player and likely pushed for it. Whether ELF would have been a better choice, I cannot say, but at this point I don't think it matters.
1. Yes, that's exactly right. It installs a GRUB version compiled into a PE executable, typically.
2. Boot sector support requires UEFI CSM support, which most PCs have (but ARM based devices and Intel Macs don't). With the deprecation of 32-bit architectures in a lot of operating systems, some vendors also started dropping CSM support in some devices, but most devices still ship with CSM either enabled by default or as an option in the settings.
A classical PC master boot record does not actually have 512 bytes for code, as it also contains the partition table and a signature; you get 446 bytes for code. I'm not sure what exactly the BIOS validates; you might be able to get away with an invalid partition table. In general there is not really any limit unless you want to be compatible with something existing: you can define whatever disk layout you like. At worst you will have to load additional sectors yourself, because the BIOS has no clue where you put them. I no longer remember what a floppy boot sector looks like or how much room you have there.
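For reference, the classic MBR layout as a C struct (field names are mine; the offsets and sizes are the standard ones):

  #include <stdint.h>

  /* Classic PC Master Boot Record: exactly one 512-byte sector. */
  struct mbr {
      uint8_t code[446];         /* boot code: all the room you actually get */
      uint8_t partitions[4][16]; /* four 16-byte partition table entries */
      uint8_t signature[2];      /* 0x55, 0xAA - what the BIOS checks for */
  };

  /* compile-time check that the layout adds up: 446 + 64 + 2 = 512 */
  _Static_assert(sizeof(struct mbr) == 512, "MBR must be 512 bytes");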
It's been a long time since I've done ASM but do I understand it right that this implementation compiles each function and then executes it immediately? Or does it really compile the whole source code and then execute the binary generated?
And where is the compiled binary saved? Is it kept temporarily in memory itself for immediate execution? Or is the compiled binary saved back to the disk?
If someone could point me to the right sections of the code that answer these questions, it'd be of great help! Thanks!
Looks like a recursive-descent parser that emits instructions into memory as it parses, then executes them immediately (from sectorc.s):
  ;; done compiling, execute the binary
  execute:
    push es        ; push the codegen segment
    push word [bx] ; push the offset to "_start()"
    push 0x4000    ; load new segment for variable data
    pop ds
    retf           ; jump into it via "retf"
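In C terms, this is the classic emit-then-jump trick: generate machine code into a buffer, then call it through a function pointer. A rough hosted-OS sketch (SectorC does the 16-bit real-mode equivalent with a raw segment and retf; note that some OSes forbid writable+executable pages):

  #include <string.h>
  #include <sys/mman.h>

  int main(void) {
      /* x86-64 machine code for: mov eax, 42 ; ret */
      unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

      /* a page we can write to and then execute from */
      void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      memcpy(buf, code, sizeof code);

      /* the push/push/retf above is the 16-bit far-call version of this */
      int (*fn)(void) = (int (*)(void))buf;
      return fn(); /* exit status 42 */
  }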
Folks here are probably reading too much into the "boot sector" angle. This project, like many others on HN, is best understood as "doing something in a constrained way because it's a fun challenge." As to why 512 bytes, the honest answer is "because it's a round number with some vague retro connotations."
There was never a constraint on x86 that you had to fit any real functionality into a single disk sector. All the boot-sector code ever did was set up the registers and call the BIOS to load more data from disk - and the only reason this was done in stages was so that the BIOS wouldn't have to know the particulars of your OS, such as where to find the kernel or what address to load it at.
For what it's worth, the pre-boot BIOS environment was always quite full-featured, offering text and VGA graphics, disk and keyboard handling, and so forth. In fact, when you look at something like MS-DOS, it was actually the BIOS-side code doing a lot of the heavy lifting. MS-DOS was halfway between a shell and a "real" OS as it is understood today.
Nowadays, with UEFI, we essentially decided to make the initial code executed by the CPU an operating system in itself, with filesystem support and so forth - and the notion of a 512 byte first-stage bootloader is largely gone.
I'm not sure the int type actually does anything. It may just be typeless: there's no difference between an int and a pointer, and if you want to treat an int as a pointer, you just dereference it.
The code just ignores variable declarations. All variables are 16-bit words.
It just hashes the variable name and uses twice the hash value as the variable's address in memory.
For functions, it instead uses twice the hash to store and look up the function address at compile time, dedicating the whole compiler data segment (except, it seems, ds:0) to that table.
Note that in segmented x86-16, the code (plus constants) and stack are in dedicated segments, and string functions used to write the generated code write to yet another segment selected by es.
This seems to be the best strategy for compiler code size, although obviously it's vulnerable to hash collisions and only supports global non-array variables.
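Here's the scheme sketched in C (the hash below is hypothetical, not SectorC's actual one; the point is that name -> hash -> fixed address means no symbol table at all):

  #include <stdint.h>

  /* the variable "segment": every variable lives at ds:(2*hash) */
  static uint16_t data_seg[32768]; /* 64 KiB of 16-bit words */

  /* hypothetical token hash - SectorC's real one differs */
  static uint16_t hash_name(const char *name) {
      uint16_t h = 0;
      while (*name)
          h = (uint16_t)(h * 31 + (uint8_t)*name++);
      return h & 0x7FFF;
  }

  /* a variable reference compiles straight to a fixed slot */
  static uint16_t *var_slot(const char *name) {
      return &data_seg[hash_name(name)];
  }

So `*var_slot("x") = 5;` is all an assignment compiles to; two names that hash to the same value silently share a slot, which is the collision vulnerability mentioned above.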
Hard disagree with "nothing"! This could be a stage 1 compiler for running an entire OS from source. Tcc already demonstrated this running Linux, but it takes significantly more than 512 bytes.
Other than "look at this really cool thing I made", I'm thinking it might have some nice chain-of-trust use cases. A 512-byte binary isn't hard to verify by hand, whereas nobody can hope to verify that a modern multi-MB `cc` binary doesn't contain a backdoor-maker for someone less trustworthy than Ken Thompson.
How does an instruction set solve supply chain trust?
I suggest you reconsider [1] & [2], but apply the thought process in the context of a physical CPU. The problem is not source access or patents, the problem is you don't have an electron microscope to look at these nanometers-wide elements; even if you did, it's still an insanely difficult process.
Very simply: by not requiring Intel or AMD to produce the hardware that implements the ISA.
It's true, but irrelevant, that you can't trust a RISC-V CPU from Intel any more than an x86 CPU from Intel.
But I think the bulk of the x86 and amd64 patents from the '90s have already expired, so other suppliers could be designing and fabricating x86_64 chips royalty-free and open source today just as well; RISC-V is interesting at least as much for its simpler start-over design as for its unencumbered IP.
You'd also need to load the source code from disk. At the point where you're doing that, the 512-byte limit isn't really relevant anymore, and adding "loading code from disk" within the 512 bytes isn't exactly realistic.
Don’t you need to read a directory structure to know where the file’s sectors are, how many there are, etc, and thus need to understand the file system?
You could do something really quick-and-dirty: write the code into sequential sectors and put a sentinel value in the last sector to mark where the code ends. It wouldn't be at all good for in-place editing, but it would be easy to load and parse.
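A sketch of building such an image on the host side (the file names and sentinel byte are made up for illustration; it assumes the source contains no sentinel byte and isn't an exact multiple of 512 bytes):

  #include <stdio.h>
  #include <string.h>

  #define SECTOR   512
  #define SENTINEL 0x00 /* hypothetical end-of-source marker */

  int main(void) {
      FILE *img  = fopen("disk.img", "wb");  /* output image       */
      FILE *boot = fopen("boot.bin", "rb");  /* 512-byte stage     */
      FILE *src  = fopen("program.c", "rb"); /* source to append   */
      unsigned char sector[SECTOR];
      size_t n;

      /* sector 0: the boot sector itself */
      fread(sector, 1, SECTOR, boot);
      fwrite(sector, 1, SECTOR, img);

      /* sectors 1..N: raw source text, the final partial sector
         padded with the sentinel so the loader knows where to stop */
      while ((n = fread(sector, 1, SECTOR, src)) > 0) {
          memset(sector + n, SENTINEL, SECTOR - n);
          fwrite(sector, 1, SECTOR, img);
      }

      fclose(src); fclose(boot); fclose(img);
      return 0;
  }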
It would be far more useful if it supported function arguments. I wonder how many of the other features they'd have to give up to support those within 512 bytes.
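For what it's worth, the usual workaround when everything is global is to pass arguments through dedicated globals - hypothetical code in a SectorC-style subset:

  /* "arguments" are just globals the caller fills in first */
  int add_a; int add_b; int add_ret;

  void add() { add_ret = add_a + add_b; }

  void main() {
      add_a = 2; add_b = 3;
      add(); /* add_ret is now 5 */
  }

It's clunky and rules out recursion, but it costs the compiler nothing.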
This is scary. It hides in the boot sector and can compile tiny C apps to bootstrap malware. Wipe the system, rebuild, and the blackhat is soon back in; rinse and repeat. The end solution ... destroy the box.
I don't see what makes a compiler in the boot sector scarier in malware terms than any other program… would you elaborate?
Like, how does the malware benefit from shipping its own source code and a tiny compiler at boot time, over just booting directly into the compiled malware?
Some viruses intercept the interrupt handler, detect if you're trying to write to the boot sector, and then either fake the write or change the sector. I believe some forms of ParityBoot (B?) did this. So you need to be sure you're booting from a clean medium, which in the case of some of these boot viruses might not be that easy, since a lot of your disks and floppies might already be infected.
Some viruses also used extra space at the end of the partition table or the end of the disk to store themselves, so they wouldn't be constrained by the 512-byte limit (minus the metadata in the boot sector).