A C89 compiler that produces executables that are also valid ASCII text files [pdf]

hannob · on Feb 5, 2018

There was actually a tool "com2txt" back in the DOS days. So you could convert an executable and put it into an email....

Update: I have my data well sorted enough that I found it :-) It even comes with code and is under a vague free license: https://github.com/hannob/com2txt

ptspts · on Feb 6, 2018

Shameless plug: I also wrote a tool similar to com2txt in 1996: https://github.com/pts/pts-xcom . A quick comparison:

* com2txt.exe is 7110 bytes long, xcom.com is only 401 bytes long.

* xcom.com can also convert back from text to binary.

* xcom.com can also convert to data text (without the self-decoder header).

skocznymroczny · on Feb 6, 2018

Can it also stop an alien invasion?

techolic · on Feb 6, 2018

Send some aliens and test it out!

Cyph0n · on Feb 5, 2018

Thanks for sharing!

Kind of uncanny to see a perfectly readable Makefile written around the year I was born...

CobrastanJorji · on Feb 5, 2018

Make's been around since the '70s. Bell Labs cranked it out right after inventing C.

softbuilder · on Feb 6, 2018

Back in the BBS/FidoNet days someone sent me a comic GIF that, when renamed with a .COM extension, would execute under DOS. It was a nifty little demo, although .COM files were going the way of the dinosaur around then.

NuSkooler · on Feb 6, 2018

Wish I could remember the trick, but in the DOS days I used to type in a few characters (5ish IIRC) at the start of text files that allowed them to be renamed and executed as COM files.

prewett · on Feb 6, 2018

If you read the paper I think you'll find it is something like "ZM~~_#____PRinty__C", where _ is a space, (HTML compresses spaces). See section 8, also the beginning of the paper.

TickleSteve · on Feb 6, 2018

.COM files were straight binary, no header and loaded at address 256 (0x100) into any 64KB segment. All indirections were local to that segment, hence the 64KB limit of .com files.

The characters you were typing are (probably) the code for a jump to an entry point somewhere else in the file.
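
A small sketch of the two facts above: file offsets in a .COM image map to memory at 0x100 plus the offset, and an x86 short jump is two bytes (opcode 0xEB plus a signed 8-bit displacement counted from the end of the instruction). The helper names are hypothetical, and this does not claim to reproduce the characters NuSkooler typed, which are not known here.

```c
#include <stdint.h>

/* A .COM image is loaded at offset 0x100 of its segment. */
enum { COM_LOAD_OFFSET = 0x100 };

/* In-memory offset of the byte stored at file_offset (hypothetical helper). */
uint16_t com_address(uint16_t file_offset)
{
    return (uint16_t)(COM_LOAD_OFFSET + file_offset);
}

/*
 * Encode "JMP rel8" (opcode 0xEB) for an instruction at file offset `from`
 * that should land on file offset `target`. The displacement is relative
 * to the byte after the two-byte instruction.
 * Returns 0 on success, -1 if the target is outside the rel8 range.
 */
int encode_short_jump(long from, long target, uint8_t out[2])
{
    long rel = target - (from + 2);
    if (rel < -128 || rel > 127)
        return -1;
    out[0] = 0xEB;
    out[1] = (uint8_t)(rel & 0xFF);
    return 0;
}
```

Note that 0xEB itself is not a printable ASCII byte, so a header typed on a keyboard would need a different encoding of the jump.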

softbuilder · on Feb 6, 2018

No, COM files had no header. [2] I think that's part of why they were replaced. ZM was for EXEs. I think that one was someone's initials. (Yes, looked it up. [1])

I think what parent is remembering were characters that effectively created a jump instruction at the beginning of the file.

[1] https://en.wikipedia.org/wiki/DOS_MZ_executable

edit: [2] Section 6 of the paper talks about that, just noticed.

pjmlp · on Feb 6, 2018

They were replaced because they could not handle more than 64KB.

Sniffnoy · on Feb 6, 2018

Does this output self-modifying code, or does it also try to avoid that like ABC does?

bluefox · on Feb 5, 2018

I also remember JauMing Tseng's XPACK...

amq · on Feb 5, 2018

Why not just use base64 (from 1992)?

jdmichal · on Feb 5, 2018

It didn't just create a text file. It created an executable text file.

First two paragraphs of the README:

Com2txt is a tool on MS-DOS which converts a com file to a text file. It's DOS generic. Unlike tools such as uuencode, the text file generated by com2txt works as a com file, exactly like the original com file does. Using com2txt, you can create a com file which can be sent through networks such as internet, and runs without any decoding.

Moreover, the text file got by com2txt consists only of ECHOable characters; it doesn't contain characters such as `<' or `|'. So, using ECHO command, you can easily generate the textized com file and use it in a batch file. For detail see section 4.
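
To make the "ECHOable" constraint concrete, here is a rough check in C. It is only a sketch: the README names `<` and `|` as excluded, and treating `>` the same way is an assumption on my part, as are the function names. The real com2txt character set may be narrower.

```c
#include <stddef.h>

/* Printable ASCII runs from space (0x20) through tilde (0x7E). */
int is_printable_ascii(unsigned char c)
{
    return c >= 0x20 && c <= 0x7E;
}

/*
 * Returns 1 if every byte is printable ASCII and none is a redirection
 * or pipe character. '<' and '|' come from the README; '>' is an
 * ASSUMPTION added here for illustration.
 */
int is_echo_safe(const unsigned char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if (!is_printable_ascii(c) || c == '<' || c == '>' || c == '|')
            return 0;
    }
    return 1;
}
```

A binary that passes a check like this can be written out line by line with ECHO in a batch file, which is the use case the README describes.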

sixothree · on Feb 6, 2018

I didn't glean that from the original comment either.

Crazy fun stuff though.

azendent · on Feb 5, 2018

There's a nice video that does a great job of walking through what is happening here: https://www.youtube.com/watch?v=LA_DrBwkiJA

bringtheaction · on Feb 5, 2018

That is demoscene level awesomeness! Kudos to the author. Excellent video as well. And not least, lovely ending <3

modeless · on Feb 5, 2018

On a related note, how about C source code that you can chmod +x and execute directly, even with execve: https://gist.github.com/jdarpinian/84a28a1ed8a36313a4e0cad8b...

khc · on Feb 5, 2018

it's actually much easier than that: https://github.com/kahing/bin/blob/master/cleancache.c

modeless · on Feb 5, 2018

True, but there are a few gotchas with that version. It won't work with execve, it doesn't cache the binary, it won't work if called with "source", it doesn't set argv[0] properly when the binary is called, and a few other things. It is nice and terse though.

khc · on Feb 6, 2018

execve and binary caching are relatively easy to implement, as is argv[0]. Do people actually care about making "source" work?

modeless · on Feb 6, 2018

execve compatibility is not easy at all, since it requires a shebang line which isn't valid C. However thanks to emmelaich's // idea I just figured out a way to do it using fewer lines than your #if 0 solution.

https://gist.github.com/jdarpinian/1952a58b823222627cc1a8b83...

MaxBarraclough · on Feb 6, 2018

I can't decide if that's cute, or eye-bleedingly criminal.

Neat though, and I haven't seen it before.

emmelaich · on Feb 6, 2018

Nice. Mine is short and sweet -- just one line. Dunno whether it is better. I've had to put it in gdocs because HN fouls up the asters.

https://docs.google.com/document/d/1sBzur57FeLqzgcoGMz8jwOgM...

modeless · on Feb 6, 2018

Ha, I'd seen the 3 line #if 0 version but not this! If you don't care about the fact that execve won't work, then this is pretty great.

emmelaich · on Feb 6, 2018

(execve)...

Challenge accepted. :-) Might be just a matter of a shebang at the beginning and an 'exec' before the final $p.

modeless · on Feb 6, 2018

Yeah, a shebang is required, and unfortunately it isn't valid C so you can no longer feed the file directly to the compiler. I just figured out how to whittle it down to 2 lines though! Thanks for the // idea. Here it is in both shebang (2 line) and non-shebang (1 line) versions: https://gist.github.com/jdarpinian/1952a58b823222627cc1a8b83...

emmelaich · on Feb 6, 2018

Thanks, good stuff. dolmen/Olivier's followup is good too.

I now remember why I used the complicated expr command and make; it's to ensure it works with C and C++.

And any language that happens to accept // as a comment and is supported by "make". (note that no actual Makefile is required)

modeless · on Feb 6, 2018

That is a cool feature! It's a bit annoying to have separate versions for C and C++.

riking · on Feb 6, 2018

Why go to all that trouble when you can just

    #!/usr/bin/tcc -run
    #include <stdio.h>
    int main() { puts("Hello, World!"); }

modeless · on Feb 6, 2018

Because most people don't have tcc installed, and the convenience is ruined if you have to install things for this to work. I really like that tcc has a flag for this; GCC and Clang really should copy it.

ogdoad · on Feb 8, 2018

When I read the source and started on the comments, I thought: won't be long until someone drops in TCC. But yes, TCC does limit you in this regard a bit.

v_lisivka · on Feb 6, 2018

It's much simpler:

    $ cat >test.c
    //usr/bin/gcc "$0" -o /tmp/out.exe || exit; exec /tmp/out.exe "$@"
    #include <stdio.h>
    int main(int argc, char **argv) {
        printf("Hello, world!\n");
        return 0;
    }
    ^D
    $ chmod a+x test.c
    $ ./test.c
    Hello, world!

modeless · on Feb 6, 2018

Even shorter, works unmodified with both C and C++, and doesn't recompile unless necessary:

  //usr/bin/make -s "${0%.*}"&&exec "${0%.*}" "$@";exit

raphlinus · on Feb 5, 2018

The phrase 'For example, on the popular and elegant X86 architecture, the single byte 0xF4 is the "HLT" instruction' slays me every time.

IgorPartola · on Feb 6, 2018

One of my favorite quotes about C:

> Dennis Ritchie invents a powerful gun that shoots both forward and backward simultaneously. Not satisfied with the number of deaths and permanent maimings from that invention he invents C and Unix.

schoen · on Feb 5, 2018

"An elegant weapon... for an age more naive about ISA design"?

metaobject · on Feb 6, 2018

Can you explain this? Is it the "popular and elegant" part?

raphlinus · on Feb 6, 2018

Absolutely. It's very subtle humor. Students of computer architecture consider x86 to be one of the least elegant architectures around. Its many warts include segment registers (originally a hacky workaround to stretch 64k of memory to 1M), and an extremely complex instruction encoding employing prefix bytes. Many of the legacy issues (such as not having enough registers) have been papered over, leaving traces behind. Many people felt that the complexity would doom the architecture, and that a cleaner, leaner RISC approach would win out.

However, Intel has used its advantage in process technology to spend massive numbers of transistors making up for the problems caused by all this complexity, and has done well. RISC has done well in the mobile space because those transistors tend to be power-hungry, but everywhere else x86 is today almost the only game in town.

One reason it's especially funny is that "HLT" is one of those legacy instructions that has pretty much no use in a modern system, yet takes up a whole slot in the byte encoding, while common operations like MOV or ADD often require extra prefix bytes to specify the size of the operands.

Hope that helps!

kazinator · on Feb 6, 2018

Segment registers did not evolve from the hacky address space expansion mechanisms. It may look that way if you look at nothing but the Intel history, but descriptor-style segment registers existed in mainframe architectures before the 8086/88 existed. The 8086 has trivial segment registers (which were just scaled offset addresses), which then morphed into mainframe-like descriptors in its successors (registers being indices into tables of segment descriptors). That could have been the plan all along, though.

https://en.wikipedia.org/wiki/Memory_segmentation#History
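
For anyone who hasn't seen the "scaled offset" scheme, this is the arithmetic: the 16-bit segment is shifted left by four bits and added to the 16-bit offset. The function names below are made up for illustration; the 20-bit mask models a CPU with 20 address lines such as the 8086.

```c
#include <stdint.h>

/* Real-mode address: segment * 16 + offset (can exceed 20 bits). */
uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Same calculation truncated to 20 address lines, so it wraps at 1 MB. */
uint32_t real_mode_linear_20bit(uint16_t segment, uint16_t offset)
{
    return real_mode_linear(segment, offset) & 0xFFFFFu;
}
```

Many segment:offset pairs name the same byte: 0x1234:0x0010 and 0x1235:0x0000 both come out as 0x12350.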

pjmlp · on Feb 6, 2018

> Students of computer architecture consider x86 to be one of the least elegant architectures around.

I guess it depends when and where one studied.

Having grown with Z80 and x86, it surely looked kind of alright to me.

I only missed the flat addressing from 68000, but given that I only had access to it on Amigas available at some dev meetings, it wasn't something I bothered much with.

Also I don't remember anyone jumping for joy during the MIPS assignments (using SPIM).

TickleSteve · on Feb 6, 2018

HLT is absolutely used in modern systems; ARM has the equivalent WFI and WFE (wait for interrupt, wait for event). They're essential for dropping into low-power modes, though admittedly HLT wasn't used for that back in the day.

CogitoCogito · on Feb 6, 2018

> One reason it's especially funny is that "HLT" is one of those legacy instructions that has pretty much no use in a modern system

Is HLT no longer used in OS idle loops? Are there now other instructions which are better to use instead?

fwsgonzo · on Feb 6, 2018

It's absolutely used. It's a completely normal and expected instruction to find on any CPU whether new or old.

Tobba_ · on Feb 6, 2018

It does have a slight advantage over ARM / most other RISC architectures in that the instructions are fairly small, meaning that you can get quite good decoding throughput without going wider. That advantage doesn't get entirely cancelled out by how badly allocated things are, since instructions can decode to multiple "actual" instructions (µops).

I'm still curious as to how Intel thought a mobile x86 chip could ever work.

FreeFull · on Feb 6, 2018

HLT is still used by kernels when they want to idle the processor until the next interrupt. There are many examples of instructions that aren't actually used much, if at all, nowadays, like POPAD and PUSHAD, or the binary-coded decimal instructions.

jlouis · on Feb 6, 2018

Also, F4 is the close-program function key in Windows....

So everything adds up. CYA!

Narishma · on Feb 6, 2018

That would be Alt+F4.

merraksh · on Feb 5, 2018

The histogram on the last page counts the occurrences of each character in the paper (all of them printable, of course). But because the histogram's counts are made of characters too, the author had to add a few extra numbers to make the histogram "converge". Brilliant.
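
Here is a toy version of the self-reference problem, restricted to the ten digit characters (a simplification; the paper's histogram covers every character). A printed histogram is consistent only if each count equals the occurrences in the body text plus the occurrences inside the printed counts themselves. All names are hypothetical.

```c
#define NDIGITS 10

/* Add the decimal digits of n to tally[]; zero prints as the single digit "0". */
void tally_digits(unsigned n, unsigned tally[NDIGITS])
{
    do {
        tally[n % 10]++;
        n /= 10;
    } while (n != 0);
}

/*
 * base[d]  : occurrences of digit d in the body text
 * shown[d] : the count the histogram prints for digit d
 * Returns 1 if shown[] accounts for the body text plus its own digits.
 */
int histogram_is_consistent(const unsigned base[NDIGITS],
                            const unsigned shown[NDIGITS])
{
    unsigned tally[NDIGITS] = {0};
    int d;

    for (d = 0; d < NDIGITS; d++)
        tally_digits(shown[d], tally);
    for (d = 0; d < NDIGITS; d++)
        if (shown[d] != base[d] + tally[d])
            return 0;
    return 1;
}
```

With an empty body text, the counts 6, 2, 1, 0, 0, 0, 1, 0, 0, 0 are consistent. Naively recomputing the counts over and over does not always settle on such a fixed point, which fits the observation that extra numbers had to be added.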

Y_Y · on Feb 5, 2018

This is the work of Tom7, well known for other projects like Learnfun & Playfun, ARST ARSW and running a marathon in hockey skates.

sarchertech · on Feb 5, 2018

Thanks for pointing that out. I wouldn't have noticed who that was otherwise.

His video on learnfun/playfun is both hilarious and amazing.

https://www.youtube.com/watch?v=xOCurBYI_gY

hprotagonist · on Feb 5, 2018

The good doctor murphy is a mad genius.

I laughed very hard indeed when I first read this and got to the last 30 seconds of the video.

theknarf · on Feb 6, 2018

Meta literate programs. Not only do you have the code and a descriptive document about the code in the same document, but you also have the executable!

merraksh · on Feb 5, 2018

Do I see a Sierpinski triangle on page 9? It's the code that, according to the description, changes the value of the AL register.

bonzini · on Feb 5, 2018

No, it is the _data_ that the compiler precomputes to help change the value of the AL register. It is unused here; in fact it is cropped to 160 columns and has a caption in the middle, so it isn't even correct. He included it just because it looks cool.

zmodem · on Feb 6, 2018

The paper/executable starts with "ZM", but shouldn't a DOS .exe file start with "MZ"? (http://www.delorie.com/djgpp/doc/exe/) What am I missing?

SideQuark · on Feb 6, 2018

MZ or ZM works

[1] https://en.wikipedia.org/wiki/DOS_MZ_executable
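
As a sketch of what a loader-side check accepting either byte order might look like (the function name is hypothetical, and this is not taken from any actual DOS source):

```c
#include <stddef.h>

/*
 * Returns 1 if the buffer begins with the usual "MZ" signature or the
 * reversed "ZM" form that the linked article says is also accepted.
 */
int has_dos_exe_signature(const unsigned char *buf, size_t len)
{
    if (len < 2)
        return 0;
    return (buf[0] == 'M' && buf[1] == 'Z') ||
           (buf[0] == 'Z' && buf[1] == 'M');
}
```

Both letters are printable ASCII, so either form is compatible with an all-text executable.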

zmodem · on Feb 6, 2018

TIL. I wonder if there’s any interesting reason for that.

zippzom · on Feb 5, 2018

I must be missing something, but I don't see how the actual text of the paper originates from the source code. Those C instructions actually compile into the sentences of the paper as well?

moefh · on Feb 6, 2018

From a quick look at the source[1], it seems the compiler will always generate an executable with the text from the paper (which is read from the "paper/" directory, and some bits hard-coded in the compiler source). Or something. I don't really know SML.

From what I can tell, the .exe file generated by the compiler must be really big anyway (since the relevant sizes in the header can't be small because they have to be printable). So there must be some text, it might as well be the paper.

[1] https://sourceforge.net/p/tom7misc/svn/HEAD/tree/trunk/abc/e...
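
The "can't be small" point follows from simple arithmetic: if both bytes of a 16-bit header field must be printable ASCII (0x20 to 0x7E), the field can never hold a value below 0x2020. A sketch, with made-up function names, that does not model which header fields are involved or how DOS interprets them:

```c
#include <stdint.h>

int is_printable_byte(uint8_t b)
{
    return b >= 0x20 && b <= 0x7E;
}

/* A 16-bit field is printable only if both of its bytes are;
   byte order does not matter for this check. */
int is_printable_word(uint16_t w)
{
    return is_printable_byte((uint8_t)(w & 0xFF)) &&
           is_printable_byte((uint8_t)(w >> 8));
}

/* Scan upward for the smallest 16-bit value made of two printable bytes. */
uint16_t smallest_printable_word(void)
{
    uint32_t w;
    for (w = 0; w <= 0xFFFF; w++)
        if (is_printable_word((uint16_t)w))
            return (uint16_t)w;
    return 0; /* not reached: 0x2020 qualifies */
}
```

The smallest such value is 0x2020, or 8224 in decimal, so any size or count stored in a field like that starts out in the thousands.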

zippzom · on Feb 6, 2018

Ah so all the x86 bytes that the actual text generates are basically just filler for the actually relevant sections of the paper (i.e. the jumble of bytes that appears)? They're never actually read or executed by the CPU?

_RPM · on Feb 5, 2018

Anyone have a copy of the program? I want to try and run it.

loeg · on Feb 5, 2018

The text file ( http://www.cs.cmu.edu/~tom7/abc/paper.txt ) is the program.

anonymfus · on Feb 5, 2018

http://www.cs.cmu.edu/~tom7/abc/paper.exe for convenience of not renaming

plogik · on Feb 6, 2018

If only I weren't too lazy, I'd buy a hat just to take it off to the author.