Pacman in 512 bytes of x86 boot sector machine code (github.com/nanochess)
229 points by nanochess on July 8, 2019 | 62 comments



Just to put into perspective how little 512 bytes is, here are 512 ASCII bytes (including 8 newlines):

   pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
   pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
   pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
   pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
   pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
   pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
   pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
   pacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacmanpacman
   pacmanpacmanpacmanpacman
That's it. That's 512 characters, i.e. 512 ASCII bytes.


It's actually in 510 bytes; the MBR requires a two-byte signature to indicate to the BIOS that it's bootable / that it is an MBR, so you lose the ability to use those two bytes¹.

You can see this in the original code on the line that reads,

          db 0x55,0xaa            ; Make it a bootable sector
So, 1% more impressive. (The original submission calls it "in 512 bytes" too.)

Now, what I'm curious about is the filler before the signature. You have to pad the binary to 510B prior to emitting the signature, s.t. the signature is in the right place. The author pads with 0x4f … but why that octet, and not, say, 0x00?
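(For reference, the usual NASM idiom for that padding plus signature is something like the following sketch; only the choice of fill byte varies:)

    times 510-($-$$) db 0x4f    ; pad the assembled code out to byte 510
    db 0x55,0xaa                ; the boot signature occupies the final two bytes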

¹Unless, I suppose, you realize you need a 0x55 0xaa in your code somewhere, and re-use the signature. But that is not the case here. But I wouldn't put it past some of these tiny code challenges, as the authors tend to be quite clever.


If you're going to be pedantic enough to mention the 2-byte signature, I think you need to be more precise than "1% more impressive" when it's less than a 0.5% length differential. Or are you suggesting a non-linear relationship between size and impressiveness? I suppose a 1-byte Pacman would be more than 100% more impressive, so that fits?


Everyone arguing this isn't "technically" 512 bytes... come on. That goes against the spirit of this post. I feel it was merely an illustration. I mean, at some point just enjoy life and stop being so "technical".


This is HN. I guess being so technical is what makes HN fun for many of us! :-)


People (me included) are just pointing out how bad of an illustration it is.

Technically, it is 512 bytes, but it does not give me any sense of scale. "There's a lot of pacmans in there, so 512 looks like a lot".

Granted, I looked at it on my phone, so that probably skews how long or short the text looks to me (I even had to scroll).


4 times as much RAM as the Atari 2600 has.


True, but let's not forget that on the Atari 2600 the game code usually lives in a much bigger ROM (and I forget whether bank-switching hardware and similar cartridge extensions were common there). Still mind-boggling how little RAM you had for game state. And you did not have a framebuffer at all, meaning you essentially had to draw each pixel (or scanline?) as it came up!


I'd say about 10% of games had some sort of bankswitching. I think about 2 or 3 games even had a 256-byte expansion RAM.

Not only was there no framebuffer and only 128 bytes of RAM, but the graphics chip didn't generate any interrupts (it could halt the CPU until the start of the next scanline though). You had to wait and spinlock on a timer for the proper amount of time after finishing your frame and processing game state to start drawing the next frame.

You also had to manually tell the graphics chip to output VBLANK lines, so you had to do that at the right time as well. If you got the timing wrong, the vertical hold on your analog TV would lose sync and the screen would jitter or roll.


Oh, interesting. Do you happen to know how that bank switching and, more so, the RAM expansion were implemented? I had heard this wasn't common because of a fatal omission in the cartridge connector. And indeed, looking at the pinout, it seems to lack a Write Enable line, or really anything else besides Address and Data lines! https://old.pinouts.ru/Motherboard/AtariCartridge_pinout.sht...

Are the Data lines still bidirectional? Or was this achieved by strobing the Address lines in a certain pattern?


All I know is the majority of carts with it had a few specific addresses that triggered the bankswitch when written, toward the top of memory right before the 6502 vectors. I'm unsure how they were implemented electronically.

At least one chip (the FE device for Decathlon) is sniffing the 6502 instruction stream because it triggers on JSR/RTS being executed.

http://blog.kevtris.org/blogfiles/Atari%202600%20Mappers.txt

http://www.classic-games.com/atari2600/bankswitch.html


There are a lot of games where you could represent state inside 10 bytes.

Imagine Candy Crush, for example: you just need a level counter (1 byte) and a bitmap of which candies have been collected (8 bytes).
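(As a sketch of how small that is in assembler data terms, with hypothetical names and assuming at most 64 candies per level:)

    level    db 0            ; current level number, one byte
    candies  times 8 db 0    ; 64-bit bitmap, one bit per collected candy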


The 2600 was such an interesting machine for its lack of video RAM. Developers "raced the beam" to draw things just in time for them to be displayed, changing colors and sprites partway through a line to make all the cool (for the time) effects.


Funny, my smallish ARM Cortex-M / VGA DIY video console also works like that, with less RAM than a full framebuffer would need. There are framebuffer modes, but with fewer pixels / lower bpp than the video signal natively outputs.


Still cool.


But it only used RAM for things that changed. It used the much bigger ROM for code and static data. Consoles of the era could get away with very small amounts of RAM because of that.


This is slightly misleading. The "pacmanpacman..." example is built from the alphabet "pacmn\n", which occupies 6 ASCII code points but carries only log2(6) ≈ 2.6 bits of entropy per character, so those 512 characters hold at most about 165 bytes of information. A more realistic visualization of 512 bytes would be something like this comment, which uses more of the range of ASCII-encoded bytes. An interesting comparison would be to write a text description of the Pillman game as implemented and see how many bytes it needs.


It's not misleading at all. It could have been 512 of the same character. It's not about how it can be stored more efficiently, it's just to visually demonstrate 512 bytes, which requires 512 characters (in straight ASCII anyway).


I'd counter that and say it is misleading. Repeating the word "pacman" int(512/len(word)) times does not give me any idea of how small 512 bytes is. It actually looked like a lot of text to me (whereas my one-line description of that comment fits in ~45 characters).

If the original code were 512 (well, 510 if you deduct the signature, 446 if you also leave room for the DOS partition table) bytes of asm source code, then the analogy might make sense.


Yet your whole refutation, which is quite a small comment of just a couple of lines, is 432 characters.

512 characters isn't a whole heap. It's more than some, but not a lot, and far less than the kind of binary most languages produce today.

A visual demonstration doesn't need to be nitpicked. (293)


As you say yourself, my comment, which says very little yet is almost 500 bytes, tells me more about how tiny 500 bytes is than the "visual demonstration" does.

The fact that this is so tells me the visual demonstration is off. I am not the only one who feels that way, and while any visual demonstration will fail for some readers, this one seems to fail for many.

That, to me, is a bad visualisation and defeats its own purpose: you may call it nitpicking, I just call it a bad use of a tool (visualisation and analogy). I even feel it confuses the scale instead of clarifying it, which means it should be called out.

Note that at no point was I claiming that 512 bytes is a lot, which you seem to also be arguing against. FWIW, I've written bootloaders in my high school days in 446 bytes, so I understand how little that is, even if one uses BIOS to drive the keyboard and the screen.


Reflection:

- Machine code is very dense. Really, this is not an accident, although it may not be entirely on purpose either; it was more just common sense at the time. There were many platforms using this CPU or its predecessors with very little RAM/ROM, and bytes still counted even on the IBM PC.

- Assembly code is still very readable. The trouble is not that machine code isn't readable; that probably never really was the trouble. It just doesn't offer ways to build abstraction beyond what the machine supports (routines, software interrupts, etc.). It only gets worse as hardware gets more complicated. Imagine avoiding VGA and instead attempting to support initializing a modern graphics card all the way through! It would probably be quite ridiculous even with documentation. This is part of what bugs me about a lot of non-x86 platforms, since that's the kind of environment you're dropped into on many of them.

- Speaking of compatibility, one must wonder how many new layers have been introduced on top of things like VGA since the original VGA.


> Machine code is very dense.

It's intriguingly dense. If you're writing an emulator or a bytecode VM and you can't just slap a JIT in (because you need to ensure that, e.g., data races resolve the same way they did on the original machine), it's very hard to come up with a data structure for "decoded" code that is more performant than just keeping the original code around in memory in its packed-stream-of-opcodes form.

One would think, for example, that it would make sense to do the "instruction decode" pass ahead of time, to end up with an array of pointers-to-instructions plus literal-values... but the resulting representation of the code is usually much larger in memory, and so less of it will fit in cache (and it'll also fight the VM interpreter itself for cache-lines.) You might gain from your instruction impls not having to trampoline back to the interpreter (https://en.wikipedia.org/wiki/Threaded_code, basically), but you'll lose in cache coherence.
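(For reference, the 16-bit x86 "NEXT" step that threaded code is built around is tiny; a sketch, with arbitrary register choices:)

    next:   lodsw             ; SI walks an array of handler addresses; fetch the next one into AX
            jmp ax            ; jump to that handler, which ends by jumping (or falling) back to next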

Really, the best you can do in such a situation is to translate the stream of opcodes to another stream of opcodes, just ones that you can execute more efficiently (i.e. create your own "microcode" ISA for your VM.)

Either way, "a loop that walks/jumps through an in-memory buffer of variable-length CISC opcodes using a byte-granular program-counter pointer register" seems to be an optimum somehow in design space.


> One would think, for example, that it would make sense to do the "instruction decode" pass ahead of time, to end up with an array of pointers-to-instructions plus literal-values... but the resulting representation of the code is usually much larger in memory, and so less of it will fit in cache (and it'll also fight the VM interpreter itself for cache-lines.) You might gain from your instruction impls not having to trampoline back to the interpreter (https://en.wikipedia.org/wiki/Threaded_code, basically), but you'll lose in cache coherence.

As someone currently writing a (subset of an) x86 VM, I feel this pain entirely too much. My subset greatly simplifies things by not using segment registers and (mostly) not having to implement the 16-bit version of ModR/M.

The biggest problem with x86 is the sheer number of ways to do addressing within a single opcode using a ModR/M operand. For instance, all of these lines of assembly can use the same primary opcode:

    push eax 
    push [eax] 
    push [1000] 
    push [1000 + eax] 
    push [0x11 + (eax * 2 + ecx)]
    push [0x11223344 + (eax * 8 + ecx)]
    push [(eax * 8 + ecx) - 10]

If not for the ModR/M and SIB operands, x86 would be a static-width instruction set.

I'm building an interpreter that goes the route of decoding ahead to build a pipeline (really a "basic block") and then executing the entire pipeline with minimal branching. I'm less afraid of excessive cache use than of the unpredictable-indirect-branch problem. The hope is that building a pipeline and executing it with a branchless unrolled loop will let me avoid that problem, while also greatly simplifying the implementation of each opcode, where the actual logic just receives a set of operands it can get or set.


> but you'll lose in cache coherence.

Did you mean cache locality?


x86 is very dense. Traditional RISCs like MIPS and SPARC are not. ARM is somewhere between the two.

> Imagine avoiding VGA and instead attempting to support initializing a modern graphics card all the way through!

If you just need a framebuffer, and not any kind of acceleration (which is sufficient for something like this), then VBE is a good "better than VGA" (i.e. "SuperVGA") standard you could use. I believe most OSs can fall back to VBE in the absence of GPU-specific drivers.

https://en.wikipedia.org/wiki/VESA_BIOS_Extensions#Other_com...
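(Setting a VBE mode from real mode is a single BIOS call; a rough sketch, using one of the old standard mode numbers as an example, though real code should query the mode list via functions 0x4F00/0x4F01 first:)

    mov ax, 0x4f02        ; VBE function 02h: set video mode
    mov bx, 0x4118        ; e.g. mode 0x118 (1024x768, 24bpp) with bit 14 set for a linear framebuffer
    int 0x10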

> Speaking of compatibility, one must wonder how many new layers have been introduced on top of things like VGA since the original VGA.

AFAIK a VGA is basically a framebuffer with a timing controller and some very basic fixed-function hardware to assist the CPU with changing pixel values. A modern GPU, as its name implies, is a complete processor (or more precisely, a massively parallel array of them) that has its own instruction set and runs its own firmware. I believe the basic VGA hardware (i.e. the timing controller, sequencer, DACs, etc.) still exists in somewhat extended form on a modern GPU.


VBE basically went away with UEFI[1], but on the plus side UEFI basically gives you a function pointer to do mode setup. You just need to jump through multiple other function pointers to find it, which makes everything more annoying...

(of course, if you're using UEFI you're also not writing a 512 byte boot sector, so the comparison isn't entirely fair)

One meaningful distinction between VBE and the UEFI GOP standard is that you can do VBE calls at OS runtime but can't call GOP after the OS has transitioned away from the boot environment, so modern OSs pretty much require native graphics modesetting for anything other than the case where you only ever need one mode and don't support suspend/resume.

[1] OK not strictly true - older UEFI systems would leave a VGA BIOS in the legacy location and let you call it anyway, and UEFI Windows<8 actually relied on this - Windows 8 was the first that would boot on legacy-free UEFI


The true spirit of my thought experiment was initializing a graphics card, an NVIDIA card for example, from scratch; sure, VBE works, but no real graphics driver actually implements the framebuffer this way. I'm not saying you'd bother with acceleration, just that you'd avoid the compatibility layers.


VGA was only dropped really recently. Not sure if NVIDIA already has; AMD did only 1 or 2 generations back. So it's surprisingly compatible ;)


I don't think anything has dropped support for VGA. Every modern graphics card will still happily show not only all VGA modes, but CGA and EGA as well. There are some edge cases that don't work right, which become apparent if, for some ungodly reason, you want to run 80s DOS games on bare metal on a new system.

Are you thinking of the analog VGA monitor output? Because I think that's been dropped from some recent cards, but that is different from the VGA graphics standard. You can output VGA graphics over HDMI or DVI or DisplayPort.



My older brother bought an 8086 kit in the 80s. He could afford 128 B (bytes) of RAM, which cost $50. It had a front panel with switches for toggling in a program (MITS 8800) and a serial display port. So I wrote a little game in 128 bytes that took one key of input and let you guide a little 'X' around the screen, trying to avoid the obstacles and not go off the screen.


When I was in college, we had something similar in lab. This was around the early 2000s, and we had Motorola 68K minicomputers with around 512K of RAM, IIRC. They were hand-built, with wire-wrapped connections between the various DIPs. To debug supervisor-mode assembler, we had a clock toggle switch that could switch from normal running at 1 MHz, I think, to a single-clock-cycle push button. We had a 2-character hex display that showed the 16-bit raw data on the data bus. I spent so much time staring at those 16-bit hex values that I was able to decode most of the instructions from memory (spent a LOT of time with the 68K ASM programmer's manual; I still have it on my shelf).

We couldn't even see the register state. You had to keep track of that in your head or on paper. I think we had some DIP switches to toggle memory values as well. That helped save time for trivial changes, because downloading a decently sized ASM program could take 15-20 minutes over a 1200 baud serial connection. We got in the habit of inserting NOPs in the code so we'd have room to insert a handful of instructions here and there. When I say decently sized, I mean a couple of tens of kilobytes of machine code.

We were working on a supervisor program to load user code, execute it in user mode, and debug it. Plus some other misc stuff, like commands to act as a calculator, sort arbitrary memory locations, etc.

It was a really fun class. Still can't believe how much we got done in 1 semester. Lots of late hours, sleeping on lab stools. Beer and sandwiches in lab while we worked. Made some new friends, strengthened bonds with others, lost a few friends that weren't pulling their weight.

The lot of us later built an MP3 player around an Atmel AVR chip (I forget the model) and a dedicated decoder chip, with some wire wrapping and soldering and a crap ton of wiring. For expediency, we streamed the data over a serial port from a PC. Also a fun project. We discovered that a lot of engineering stuff we'd been taught, but hadn't really understood the relevance of, was extremely important (like travelling-wave theory and the importance of terminating a bus properly).

It was a lot of fun building things in a constrained environment. Can't say I don't miss the college years, when you couldn't just "throw more hardware at it".


Oh of course it was an 8080 kit. 8-bit microprocessor. Typo!



Does anyone know what are the artifacts at the bottom left corner of the video? Is it just an issue with the video capture or is he doing something clever like storing data in these pixels?


To save 3 bytes I had to use the address in DI after drawing the maze (just below the last tile row); it contains the positions of the player and ghosts.


You say "save 3 bytes", we say "visualizing dynamic game state, for debugging and educational purposes, and highlighting how small the dynamic state is compared to the static art asserts"


Reminds me of CRAM dots on the Sega Genesis: https://segaretro.org/Sega_Mega_Drive/Palettes_and_CRAM#CRAM...



Storing data in pixels, so it only needs one memory segment and can write directly to video memory to do the graphics.


What's incredible is that there really aren't any special "tricks" here. The code is pretty easy to follow.


> It's compatible with 8088 (the original IBM PC). So you now have to look for a 8-bit compatible VGA card if you want to run it in original hardware ;)

I'd be interested in the thoughts behind this decision. CGA would have been more period correct and more compatible with the "real" hardware that's out there (I say this as an IBM 5150 owner). Not wanting to criticise the author of course - this is still a very cool project and, being open source, I'm sure we'll see new variations popping up before too long. I might even have a go myself.


The CGA 320x200 graphics mode is horrible to program for and would have made a 512 byte implementation much harder to accomplish. Even and odd rows of pixels on screen are separated in memory by 8k bytes, for instance.

VGA mode 0x13, on the other hand, is widely known as the easiest graphics mode to program for: you just throw bytes one after another into the 64,000-byte framebuffer at segment 0xA000 and you're immediately putting 256-color graphics on the screen.
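(The entire "driver" for mode 13h is roughly this; a sketch, with coordinates and color picked arbitrarily:)

    mov ax, 0x0013        ; BIOS int 10h: set video mode 13h (320x200, 256 colors)
    int 0x10
    mov ax, 0xa000        ; the linear framebuffer sits at segment A000h
    mov es, ax
    mov di, 100*320+160   ; offset = y*320 + x, so this is the pixel at (160,100)
    mov byte [es:di], 14  ; write a palette index; 14 is yellow in the default palette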

8-bit compatible VGA cards are somewhat of a rarity to find nowadays but it wasn't entirely uncommon to see 8088 or 8086 systems paired with VGA cards in the late 80's, although generally they were more common and useful with a 286 and up.


Standby for my upcoming 8088 programming book ;)


> It can be run as a COM file or put into a boot sector of a floppy disk to be run.

Isn't "COM file" something specific to DOS? How does it differ from a machine binary file, i.e. what GCC (or a linker) would spit out of the ASM code? (The makefile uses nasm, thought)


From https://en.m.wikipedia.org/wiki/COM_file

> Since it lacks relocation information, it is loaded by the operating system at a pre-set address, at offset 0100h immediately following the PSP, where it is executed (hence the limitation of the executable's size): the entry point is fixed at 0100h.


COM is a machine code file to be loaded at offset 0100h and then executed. There's no instruction difference, but you have to remove the headers. Even a binary produced by GCC has to be de-ELFed, with -nostdlib -ffreestanding.


-ffreestanding will still produce an ELF binary. You have to either use "objcopy -O binary" or set the output format in the linker script (or on the linker command line) to produce a flat binary without headers.


We used to refer to .com files as memory images, because they look the same on disk as they do in memory.

EXE, OVL and the like had relocatable segments, whereas the memory images assumed all segments were within the code segment: CS=DS=ES, and you just crammed the stack somewhere in that space, usually at the end of the literal code, but still very much inside the one segment you could access.


It's a plain 16-bit executable without headers or linking information. Specific to DOS.


Which is essentially a raw machine binary, except that the BIOS loads the boot sector at 0x7C00, while DOS loads a COM file at offset 0x0100. Of course the pure MBR version has no access to the DOS interrupt services for I/O and the like (interrupt 0x21).
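(The usual NASM trick for building the same source both ways is just to switch the origin; a sketch, with a hypothetical define name:)

    %ifdef COM_FILE
        org 0x0100        ; DOS loads a .com file at offset 0100h in its segment
    %else
        org 0x7c00        ; the BIOS loads a boot sector at linear 0x7C00
    %endif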


Interesting that, when you compare the MBR and COM versions side-by-side in a hex editor, the only changes (apart from the required few bytes at the end of the MBR version) are in what I assume are memory addresses: a 7D byte in the MBR version ends up being 02 in the COM version. One higher than 7C and 01 in the respective starting addresses.


You can also easily see this in the (really nice) code, where the only differences are the starting address set at https://github.com/nanochess/Pillman/blob/master/pillman.asm... and the MBR magic bytes at the very bottom.

If you now wonder what the difference between a .com and a .exe file is:

A .com file is a single sub-64K image loaded at that specific offset, with all addresses in the code assumed to be absolute. The .exe (MZ) format allows DOS to put the program anywhere in memory and rewrite address references (or rather segments) according to the relocation table in the program header. With MZ files, DOS also sets up space for the stack and sets the registers accordingly.


It's also the reason why we have to be careful with files named (for example) "google.com".


While still very impressive, this does make use of BIOS routines for VGA, I/O and timekeeping.


...as opposed to something much bigger that also eventually uses the same "library" of functionality.


I would be interested in seeing if an EFI version of this would also fit in 512 bytes. One may have to overlap some fields in the binary format.


Does EFI even have the same 512 byte restriction though?


No, it doesn't.


As for VGA, it only uses one BIOS call to set the graphics mode, then never again. After that, it just throws bytes directly at VGA memory.
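(The I/O and timekeeping side is similarly thin: typically just real-mode interrupt calls of this shape; a sketch of common services, though the game may use different functions:)

    mov ah, 0x01          ; int 16h: poll the keyboard without blocking (ZF set if no key waiting)
    int 0x16
    mov ah, 0x00          ; int 1Ah: read the 18.2 Hz BIOS tick counter into CX:DX
    int 0x1a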


The HN title is misleading. The GitHub author correctly calls it Pillman (yellow thing eats pills and is chased by ghosts). Even ignoring trademark issues, it's not Pac-Man: there are no power pills, no teleporter/torus holes, no ghost house, no fruit, no score, etc.



