One of my best hacks back in the day was an 8080 emulator, for the 8088/8086 back in the time when what you were really trying to do was run the 8080 code as fast as possible (because the first PCs were slow, and emulating 8080 code was much slower than running 8080 code on a native machine).
Invariably at the time the 8080 emulators would have a dispatch loop, fetching the next opcode and dispatching through a jump table to some code to emulate each of the 256 opcodes. One problem with this was that you couldn't dispatch this way without altering the CPU flags, so you'd also have to save and restore those with LAHF/SAHF (fast) or PUSHF/POPF (slow). In general you'd need about 10 8086 instructions to emulate 1 8080 instruction, and in the early 1980s this meant your emulated CP/M program would run much slower than on your old CP/M computer.
My emulator would emulate 1 8080 opcode with 4 8086 opcodes as follows; An 8080 instruction, say 0x94 = sub h would be emulated with code loaded at address 0x9494. That code would be;
sub bh,dl ;bh = bh-dl, emulate h with bh, a with dl
lodsb ;al = *si++, get next opcode, increment emulated pc
mov ah,al ;eg 0x94 -> 0x9494
jmp ax ;jump to next instruction
This is a classic example of trading space for time, your emulator is sparsely distributed through 64K of RAM. So you needed 128K to emulate your 64K 8080.
Shower thought the next day, OMG I got the subtraction round the wrong way! Should be (of course);
sub dl,bh ;dl = dl-bh, emulate A with dl, H with bh
Not all opcodes are one to one like this one. For example 0x3e = MVI A,imm (move an immediate value to the A register) would be;
lodsb ;al = *si++, get the immediate value
mov dl,al ;dl = al, A = imm
This time I've omitted the 3 instruction postscript (lodsb; mov ah,al; jmp ax) to all opcodes.
While I'm here adding extra information, note another strength of my approach; The si, al and ah registers are the only resources consumed for emulation machinery. The di, bx, cx, dl and f registers are available to emulate the DE, HL, BC, A and F registers without any memory accesse (the dh register was wasted sadly). Remember the host CPU was the 16 bit 8088/8086 machine which also had 'segment' registers to allow access to 20 bits of addressable memory rather than 16 (so 1M not 64K). The segment registers had a role in emulation, basically setting up a 64K 'code segment' to run the emulator and another 64K shared 'stack and 'data' segment to emulate 64K of memory for the emulated 8080. Amusingly, getting this emulator going today would involve running another emulator to emulate this first iteration of the now venerable x86 architecture.
I was wondering about the registers, why not use the "standard" mapping for BC/DE/HL => CX/DX/BX, and DI for A?
One disadvantage of this would be you have to "xchg ax,di" before and after every accumulator op, but that is a single byte opcode (and on the 8088, instruction fetch is the main bottleneck). Just from intuition, it seems like for common code sequences like "MOV D,H ! MOV E,L", having the byte registers directly addressable would be a win. But maybe you profiled how common the various instructions are in real code?
>Amusingly, getting this emulator going today would involve running another emulator to emulate this first iteration of the now venerable x86 architecture.
Not necessarily. Modern x86 CPUs running in 64-bit mode still support 16-bit code/data segments (the only thing not supported is 64 bit OS + V86 mode, and you don't need that for 8080 emulation). At least on Linux, this is made available to userspace via the modify_ldt syscall.
Of course, when you have 16-bit code and a CP/M emulator, might as well also make it run CP/M-86 programs (most of which do not use the kind of segment arithmetic that would require V86 mode). Then the 8080 CPU emulator would be just another code segment, and probably the easier part to write, compared to the code necessary to emulate the CP/M filesystem on Linux.
Good question and extra information, thanks. I wasn't aware of 16 bit code/data segments on 64 bit x86 CPUs.
I don't think I seriously considered using di for A (I've sort of started using a convention of lower case = 8086 registers, upper case = 8080 register so I may as well stick to it). Simply because of the many cases like the two I've used as examples, both of which are really simplified by having an 8086 8 bit register for A. As a 8080/Z80 programmer learning 8086 at the time, it took me a while to realise that al being off-limits for register A due to lodsb wasn't really a big problem, because any of the 8086 8 bit registers had equivalent expressability to the 8080 accumulator. At that point I stopped worrying about using the standard 8080 -> 8086 register mappings. It's more than possible that I was seduced by apparent simplicity and actual runtime performance could have been tweaked a little higher on average by something with a little more superficial complexity. I was in a hurry and looking to get my 8 bit tools running on my new box over a weekend or similar I think (I can't really remember exactly).
It was such a long time ago that I've forgotten a lot. Only after I wrote the "shower thought" post above did I remember that as well as dh, I had bp in reserve (basically I'd forgotten about bp). I suppose the extra resource (bp) meant even a normal dispatch loop emulator could also be written without needing to emulate any 8080 register with memory, contrary to what I wrote about above.
One optimisation I think I did do was actually to emulate MVI A,imm and other single byte immediate instructions with a lodsw rather than a lodsb, effectively prefetching the next opcode. I had the 256 opcode routines packed together initially, then I'd distribute them to their ultimate 0x0000, 0x0101, 0x0202 ... locations at runtime and append the lodsb; mov ah,al; jmp ax suffix then. So no problem if the code for MVI A,imm was lodsw; mov dl,al; mov al,ah; jmp ax; The appended suffix would still be there, it would just never execute.
My first implementation of this emulator took the form of a binary prefix that could be prepended to any CP/M 80 .com file with the Operating System's concatenate command (PIP ?), to make a CP/M 86 .cmd file. Later on I adapted it to MS-DOS with a similar deployment strategy, not a huge job because the basic system calls I needed were close cousins to MS-DOS equivalents. As you say, quite different for Linux.
In general the window of widespread usefulness for this tool was limited; basically the time when CPM/80 software was still competitively capable and PCs were still not at least an order of magnitude faster than their CP/M 80 predecessors (after that point conventional emulators were fine). I never took the chance to try and share my code with the world, that was harder then. So it never helped out anyone but me sadly.
Yes, pretty close so full translation is easy if you have the source code. If not, emulation is the normal way. My emulator was faster than native 8080 code on everything but vanilla 8088 PCs, so it's more than sufficient.
I address translation in depth in my Retro Sargon project, an exercise in getting a vintage Z80 (8080 superset) assembly language program working on modern PCs. https://github.com/billforsternz/retro-sargon See the "Yet More Details" section in the README.
I think the 8086 was designed such that 8080 assembly code could be assembled as 8086 machine code. So I'd guess the instruction emulation would probably be reasonably efficient.
They seem to be re-organizing the website structure [1] as well as updating the submitting system [2], and lately they were making some progress [3], but still need a lot of formatting work.
Seems like they are stretched a bit thin, and no doubt this is a hobby project.
This is a great steelman for where dogfooding does not apply - imagine if their website was actually an obfuscated C webserver with the content imbedded in it, and therefore impossible to update lol.
I've been waiting for the competition to open up again, I think I have an entry good enough. There does appear to be some activity on their GitHub repo, but I have no idea when the next one will be. Does anyone here have any gossip about when the next one might be? Wild rumor would also suffice.
The biggest factor I think is that the judging process cannot be easily parallelized after the initial filtering and traditionally needed multiple in-person judging sessions---there are several photos in the official Twitter account. (2020 judges didn't meet in person for the obvious reason, but still used conference calls.)
Related, 8086tiny, a 2013 IOCCC winner. About twice as much source code, and if not cheating then certainly leaning heavily on the rules, but very impressive as it's a full x86 emulator with graphics and several peripherals.
`a[b]` is semantically equivalent to `*(a + b)`, so it is unexpectedly commutative. Let's swap them all and add some spacing. Ah, also some assignment expressions have been reused as a part of other expressions, which I will also pull out:
X = *Y & 7,
L = X & 1,
o = X / 2 & 1,
O = l[32] = 0,
c = y,
t = c & 7,
a = c / 8 & 7,
T = Y[1] >> 6,
g = ~-T ? y : (n) y,
d = BX = y,
--l[64]
Now it should be much more clear that a large chunk of this code is instruction decoding. At this point we need to expand macros to make more sense out of it:
#define Z short
#define a(_)*(_*)&
#define y a(Z)Y[++O]
#define n char
It would need more reading to determine the exact meaning of `Y`, but given that its type is `unsigned char*`, it should be some sort of pointer, most likely the instruction pointer. The macro `y`, which expands to `*(short*)&Y[++O]`, increases `O` so it looks like reading from IP+0, IP+1, ... while not actually advancing the IP.
So the mysterious code can be now mostly decoded:
X = *Y & 7, // [IP+0] = ?????XXX
L = X & 1, // = ???????L
o = X / 2 & 1, // = ??????o?
O = l[32] = 0, // Reset `O` for further uses of `y`
c = y, // [IP+1] = yyyyyyyy
t = c & 7, // = ?????ttt
a = c / 8 & 7, // = ??aaa???
T = Y[1] >> 6, // = TT??????
g = ~-T ? y : (char) y, // Read [IP+2] and do something
d = BX = Y[++O], // [IP+3] = dddddddd
--l[64]
[IP+1] surely looks like the ModR/M byte, so you can guess that `g` is a displacement even if you don't know what the heck is `~-T` (equals to `T - 1` in twos' complement). [IP+0] is too generic to pick the exact instruction, but my guess is that this code is shared for a lot of similar opcodes to save the code footprint.
The description implies they found some clever way to trick iocccsize at the time. Using the current version of iocccsize, it weighs in at 3842, well over the max of 2053.
No, `iocccsize` has been updated throughout the history of the contest. The hints mention this, but the accepted secondary size for `cable3.c` was 1,977 tokens. (Later versions of `iocccsize` don't compute secondary sizes at all.)
As far as I can tell, the actual abuse here was the inclusion of instruction decoding table in the custom BIOS which didn't count towards the total size at all:
> This conversion, along with instruction lengths and how each instruction modifies the flags, is assisted by some lookup tables which form part of the BIOS binary.
The BIOS file doesn't just contain what you'd expect from a normal PC BIOS, but also contains several tables used by the emulator for decoding of x86 instructions. Normally those tables would be in code as they're necessary regardless of BIOS implementation; they're presumably in the BIOS to save space.
Is there a competition where you win by showing something both compact and extraordinary but not obfuscated ? Like zlib compression in 100 lines or some song recognition using FFT in X lines, or a working ssh client in Y lines ?
Some of the winners of the last few years have been extraordinary - complete arcade games, with physics, in just 10 lines of BASIC. This competition definitely highlights the issue we have with software development, which is that we, as a species, don't generally get really good at it for decades...
Wow, that sent me down a rabbit hole -- but this line from the homepage makes it sound like he is _not_ part of that family?
> If you're looking for information about the Mexican scientist Oscar Toledo Esteva, then please go to the History page of the Familia Toledo organization site.
The articles on the browser tech are horribily outdated, and ditto with TLS libraries. But if they used something like mbedtls/bearssl and Gemini/Gopher, they would be still useful.
The simplest emulator in the world would be like this one:
unsigned char memory[65536];
int reg[8];
int main(void)
{
int pc = 0;
/* In this point you read something into memory */
while (1) {
instruction = memory[pc];
if (instruction == 0x76) break; /* HLT */
pc = (pc + 1) & 0xffff;
if ((instruction & 0xc0) == 0x40) /* Handle MOV r,r (opcodes 0x40-0x7f) */
reg[(instruction >> 3) & 7] = reg[instruction & 7];
}
return 0;
}
What it does? It simply reads each instruction from memory (PC is the Program Counter), and increments PC continuously limiting it to 16-bit.
It does a comparison for each possible instruction, for this example I only handled the most basic one of 8080 (MOV r,r) and HLT, but this is already 64 cases of the 256 possible!
Then I added iteratively chunks of opcodes to emulate and used language C macros to reduce the code size. For example, there is #define d(e) to join two 8-bit registers into a 16-bit value, and #define Q to read the accumulator.
In the end, to stay inside the size limit of the IOCCC, I removed the tree of 'if' statements and replaced it with a tree of trinary operators ?: (a way of doing 'if' inside an expression) and this is what the macros C and S does.
The 8080 instruction set opcodes have logical bit patterns, so you can identify the operation and operands through bit masking instead of requiring a huge LUT. Beyond that you can simplify a lot by sacrificing timing accuracy.
you can also write one using a 256-byte switch, which ain't so big. i wrote an 8080 emulator in fortran77 on a dec10 that way, back in the 1980s. but of course size wasn't an issue. it did prove to be quite portable - to vax and ibm vm/cms. we used it for teaching assembler programming.
After simplifying and reducing the code as far as possible, it is rewritten in a compressed format using macros and without whitespace. Look at the #define statements.
Not sure how to use brew without touching the llvm or Xcode etc., I switch to lima (search it under this news site for more details). It works (after doing sudo ... build essentials etc.).
For the BASIC, not sure how to quit the environment after learning it in 1979! Seems Ctrl-Z give you a seg-fault and it quit ... :-) But not sure where is the core-dump generated ;-p
For the CPM, must do the chmod +w A B as said (BUT NOT chmod +w A then chmod +w B ?!!). Still cannot exit the CPM until I dir and find the program HALT.COM (wow .COM) and A>HALT ... all good. :-)
I have seen a program, I think for IOCCC, where the code was aligned in 4 character columns that seemed to naturally fit the code, so it would have code arranged like:
for( long i=0; i<n; i++)
{int a=1, b=2, c=3, d=4;
for the whole program.
I lost it though and have been completely unable to find it, despite searching very hard.
Has anyone seen something like this before? I would be forever grateful for any pointer towards it.
For the 8080 emulator, you need the c_basic.bin (renamed to just "C") and c_bios.bin to be in the same directory as the executable, as per the article. I tried this and it dumps me into a BASIC shell. So cool!
Invariably at the time the 8080 emulators would have a dispatch loop, fetching the next opcode and dispatching through a jump table to some code to emulate each of the 256 opcodes. One problem with this was that you couldn't dispatch this way without altering the CPU flags, so you'd also have to save and restore those with LAHF/SAHF (fast) or PUSHF/POPF (slow). In general you'd need about 10 8086 instructions to emulate 1 8080 instruction, and in the early 1980s this meant your emulated CP/M program would run much slower than on your old CP/M computer.
My emulator would emulate 1 8080 opcode with 4 8086 opcodes as follows; An 8080 instruction, say 0x94 = sub h would be emulated with code loaded at address 0x9494. That code would be;
This is a classic example of trading space for time, your emulator is sparsely distributed through 64K of RAM. So you needed 128K to emulate your 64K 8080.