Dennis Ritchie’s first C compiler (c. 1972) (github.com/mortdeus)
376 points by jnord on March 11, 2021 | 157 comments



Brings back some memories because this was the first "large" program available to me for study when I was first learning to program, around 1978.

Having talked my way into using the local university EE dept's 11/45, I found the source tree was mounted (they had a 90MB CDC "washing machine" drive) and decided to print it out on the 132-col Tally dot-matrix printer.

Some minutes into this, the sysadmin bursts into the terminal room, angrily asking "who's running this big print job".

I timidly raise my hand. He asks what I'm printing. I tell him the C compiler source code because I want to figure out how it works. He responds "Oh, that's ok then, no problem, let me know if you need more paper loaded or a new ribbon".


I have to say, the way indentation and brackets were done here looks like it's just inviting subtle bugs. Take this for example:

    if (peekc) {
      c = peekc;
      peekc = 0;
     } else
      if (eof)
       return(0); else
       c = getchar();


If you judge the coding style of the early 1970s by modern standards, it isn't going to look great. 50 years is a loooong time.

dmr was one of the first C programmers, if not the first.


And he was almost certainly using ed(1) as his editor and a mechanical teletype at 7.5 or 10.0 characters per second as his terminal...

The C language (and all of Unix) was designed to be very terse as a consequence.


I try ed(1) once in a while and I find it pretty usable (and even rather efficient if you know what you are doing). cat(1) also works if you just want to type new text.

Speaking of terseness, I love the fact that C does not have 'fn'.


We used to speak of a great Unix systems programmer as someone who could write device drivers with cat and have them compile and run the first time.


Before I look up `man cat`, what can you do with `cat` other than just see what's in a file?


When not given a file, cat will just read from stdin, so you can use "cat > file.c", write some text, and send EOF with ^D when you're done.

Obviously, there's no way to go back and edit anything mid-stream; you have to write the whole thing out in one shot.


The backspace does work within the line.


If your terminal is in line-buffered mode.


You can join files together.

    $ cat foo bar > baz
will join the files foo and bar together into a single file called baz


That period lasted what, 10 years after Unix was created? And we'll be stuck with those decisions for decades if not centuries.

Similar story with the design of QDOS / MS-DOS / Windows and nowadays with Android. Both were designed for super underpowered machines that basically went away less than a decade after they were launched, yet both will be hobbled because of those early decisions for a long, long time.


We will be hobbled with these decisions for a long time precisely because the complete package of trade offs the designers made was so successful.

If they had gone for wart-free on properly powered hardware, they would be stuck back in Multics land or living with the GNU Hurd—cancelled for being over budget or moving so slowly that projects that actually accomplish what the users need overtake them.

Do I wish that C had fixed its operator precedence problem? Sure. But the trade offs as a total package make a lot of sense.


Some would say,

https://multicians.org/history.html

Instead we pile mitigations on top of mitigations, with hardware memory tagging being the last hope to fix it.


Is there an explanation on why C’s operator precedence is weird? Such as: why does the bitwise AND have higher precedence than logical AND?


There is, and it is amusing

“In retrospect it would have been better to go ahead and change the precedence of & to higher than ==, but it seemed safer just to split & and && without moving & past an existing operator. (After all, we had several hundred kilobytes of source code, and maybe 3 installations....)“

https://www.lysator.liu.se/c/dmr-on-or.html


> why does the bitwise AND have higher precedence than logical AND?

Why is this precedence weird? Bitwise AND tends to be used to transform data while a logical AND tends to be used for control flow.


I meant equals having a higher precedence than bitwise AND.

As in:

    if (x & 2 == 2)
...is actually parsed as:

    if (x & (2 == 2))
...which isn’t intuitive.
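
For illustration (a quick sketch, not from the original comment), a single pair of parentheses restores the intended meaning; assume x = 1:

    #include <stdio.h>

    int main(void)
    {
        int x = 1;
        /* 2 == 2 evaluates to 1, so "x & 2 == 2" really tests the low bit of x */
        printf("%d\n", x & 2 == 2);    /* prints 1: parsed as x & (2 == 2) */
        printf("%d\n", (x & 2) == 2);  /* prints 0: the intended comparison */
        return 0;
    }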


See the above example from dmr himself


> that will be hobbled because of those early decisions for a long, long time.

Perhaps this is why a programmer would want to rewrite a system & tout "funny success stories" about the effort & results?

https://news.ycombinator.com/item?id=25844428

> Why couldn't you just upgrade the dependencies once then set up the same CI/CD you're presumably using for Svelte so that you can them upgrade versions easily?

Because the existing system was painful & time/energy intensive to upgrade. It happens with tight coupling, dependency hell, unchecked incidental complexity, architecture churn, leaky abstractions, etc...

Maintenance hell & > 1 month major version upgrades tend to occur with large, encompassing, first-mover frameworks, often built on a brittle set of abstractions, as they "mature" & "adapt" to the competition. e.g. Rails, Angular...


Was it different for the designers of ALGOL/SIMULA/Pascal?


Yeah, those languages were IIRC designed to be edited offline (as a deck of punch cards) and submitted to a mainframe via high-speed card reader as a batch job.


Very interesting when you think about it. A language created in 2009 (Go) owes its syntax to a language from 1969 (B), and the latter looks like it does because it was designed during a short transition period between offline editing (1960s) and electronic terminals (1970s).

And there are people claiming that computer scientists are not conservative :)

To what extent this explanation is correct is another question... The article by Dennis Ritchie says "Other fiddles in the transition from BCPL to B were introduced as a matter of taste, and some remain controversial, for example the decision to use the single character = for assignment instead of :=".

It's a kind of butterfly effect :) Mr. Ritchie preferred "=" over ":=" and fifty years later a server crashes somewhere because somebody wrote a=b=c instead of a=b==c.


Actually the transition from "offline editing" to "electronic terminals" was not short at all. Teletypes (aka "typewriters which can receive content encoded in electricity, next to the user keyboard") date back way beyond computers, and were still in use in the 1980s (but eventually superseded by fax). Teletypes were cheaper, faster and more convenient than video terminals. Don't underestimate the value of having a printout of your session, especially when being online (i.e. connected to the mainframe or minicomputer) is something valuable and your terminal is "dumb" and has no storage (except paper).


My first usage of a computer was on a printed teletype. My last such use was probably around 1985. They were around for a long time.


And for a lot of people, the lightbulb goes off once they realize what 'tty' stands for...


B was a descendant of Bootstrap CPL, a language never intended to be used for anything other than making a CPL compiler, really a butterfly effect.


If you can only read it line by line:

    return(0); else
makes a bit of sense.


I don't think it would look that out of place if he was using the ternary operator, which is the same thing after all.

E.g.:

  if (peekc) {
    c = peekc;
    peekc = 0;
  } else
    eof ?
      return(0) :
      c = getchar();
The first else clause still looks weird, but the final part isn't nearly as out of place (well, I guess assigning in a ternary would be weird, but in terms of indentation), and it's not like we actually changed anything.


Can't return in a ternary.


Correct, return is not an expression. But then again, he could if he wanted to, as the language designer ;)


Wouldn't it be wonderful if we could write

    a = return b;


What would that even do? Return b, but then set a to b right before “deleting” a? That would serve no purpose.


Maybe it was not worth the extra work.


It would have made code review a nightmarish activity for the team.


Why nightmarish? A reviewer may explain that it makes it easy for another human to overlook the end of the conditional, and then go back to sleep as usual. I never understood the emotional component of blaming someone’s personal code style, as if it were a religion with sacrifices and blasphemy instead of just a practical way of coding in a heterogeneous team.

This triggers me because many people jump on “bad c0de” in the forums, but then you read some practical code on GitHub, and it is (a) not nearly as perfectly beautiful as they imagine, (b) still readable without the nightmares they promised, and (c) the algorithms and structure themselves require levels of perception and understanding from the programmer far beyond “it’s Monday morning so I feel like missing an end of statement in a hand-written scanner” anyway.


That team had Bell Labs researchers, Ken Thompson and Doug McIlroy being among them. Their brains could handle much harder things!


The difference between Dennis Ritchie and the average programmer of today is, Dennis Ritchie did not write bugs in the first place.


Well, some say (I disagree) that C is a bug.




And what does it mean there? (automatic storage duration)


It means it lives on the stack, so it "exists" until the function returns.



Do you mean automatic type deduction?

Edit: I know it’s also a storage specifier, but it also does type deduction here; hope I’m not confused about the terminology


It doesn't do type deduction there.

In original C you could declare variables without a type, and these variables with auto have no type.

So this "auto" is essentially saying "this variable without a type uses automatic storage". That's a completely different language feature from "deduce the type of this variable".


A variable without a type is 'int' so ```auto xx;``` means an int on the function local variable stack.


Thanks.


No, it doesn't do type deduction. Without specifying the type, it assumes a type of int. Everything in C revolves around ints.


here it means automatic storage duration


The original keywords - https://github.com/mortdeus/legacy-cc/blob/master/last1120c/...

Interestingly, long was commented out


It makes sense long would have needed a comment. It needs a comment because “long” and “double” are terrible names for data types. Long what? Double length what? Those type names could easily have opposite meanings and meant long floating point / double length integer. WORD/DWORD are nearly as bad - calling something a "word" incorrectly implies the data type has something to do with strings.

If you don't believe me, ask a non programmer friend what kind of thing an "integer" is in a computer program. Then ask them to guess what kind of thing a "long" is.

The only saving grace of these terms is they’re relatively easy to memorise. int_16/int_32/int_64 and float_32/float_64 (or i32/i64/f32/f64/...) are much better names, and I'm relieved that’s the direction most modern languages are taking.

(Edit: Oops I thought Microsoft came up with the names WORD / DWORD. Thanks for the correction!)


> ask a non programmer

Why should a non programmer understand programming terms? Words have different meanings in different contexts. That's how words work. There is no need to make these terms understandable to anyone. The layman does not need to understand the meaning of long or word in C source code.

Ask a non-golf player what is an eagle or ask a physicist, a mathematician and a politic the meaning of power.

Word and long may have been poor word choices, but asking a non-programmer is not a good way to test it.


> Ask a non-golf player what is an eagle or ask a physicist, a mathematician and a politic (sic) the meaning of power.

All those fields could do with less jargon. Especially since in many cases there is a common word.

Lawyers especially give me the impression that they use jargon to obscure their field from regular folks.

Our field, being new, should not make the same mistakes, but yet, here we are, where the default "file viewer" on Unix is cat, the pager is called "more", etc.

I don't see any reason why those types could not have had descriptive names from the start. Well, there was one, lack of experience, so ¯\_(ツ)_/¯


Dejargonification is great; I agree that jargon is a real barrier to entry. However, after having dealt with lawyers (and doctors of all stripes: MDs, PhDs, ...), I've come to see jargon for what it is: a weakly typed pointer. It's exactly analogous to a struct of function pointers with a user cookie in C.

Jargon lets you summarize whole concepts in one word. A lot of jargon is functional in the math/programming sense: you can pass 'arguments' to the jargon. The jargon adds levels of abstraction that let the users communicate faster, with less error, and higher precision.

For instance, the distinction between civil and criminal matters; or the distinction between malfeasance, misdemeanor, and felony.


> Lawyers especially give me the impression that they use jargon to obscure their field from regular folks.

Really? You think that lawyers (and physicians) use Latin/Greek words in order to confuse regular folks?

These fields are very old, and changing the meaning of something like Mens Rea or lateral malleolus is going to require A) an exact drop-in replacement which will probably be just as obscure, and B) retraining of an entire set of people.

Our field is similar in that we have a mountain of jargon that's largely inaccessible to regular folks. The point of those words is not to converse with regular folks but to convey information to others in the field with as little ambiguity as possible.


In England many Latin terms were replaced with English terms in 1998 with the adoption of the new Civil Procedure Rules in an effort to make justice "more accessible" (alongside other reforms). These arguably actually added confusion - for example it's much more obvious that a "writ" is a specialist term versus "claim" which replaced it.


Or ask a non-mathematician what a ring, a field, and a group are.


It's about cognitive load. The more jargon diverges from concepts you already have learned and used for years, the more difficult it is to adapt. As an experiment, try replacing all of your variable names and types with numbered variables: v1, v2, v3, etc. Then compare to replacing them with words: not completely random words, but with completely misleading words, like the complete opposite of what they represent, length/height, mass/speed, etc. You'll find the latter is maddeningly hard to deal with, because your brain keeps bringing a whole lot of context that is just wrong, so you are constantly fighting your own brain.

Intuitive things come from prior experience. They are a kind of inertia that you just have to work with.


Jargon that builds on intuition can be its own problem. Jargon, by its very definition, has explicit technical meaning in a specific domain. Intuition in words is based on vernacular usage. It is vanishingly unlikely that the vernacular usage aligns with the domain-specific usage of a given term. This leads to plenty of false assumptions, and forces people to disambiguate jargon-usage vs vernacular-usage, which may both be found in a single piece of writing.

See economics for a great example of jargon-vernacular crossover.


> Jargon that builds on intuition can be its own problem.

Sure it can, which is why you gotta be double-careful naming things and not try to take metaphors too far. A jargon term needs to crisply identify the crux of the concept and not confuse with irrelevant or misleading details.


> A jargon term needs to crisply identify the crux of the concept and not confuse with irrelevant or misleading details.

Agreed. I've never seen a vernacular term fill this role well.

If you need to learn the technical concepts either way to be effective, might as well give them a name that doesn't conflict with another definition most people know.


I think "binary tree" is a decent example. It's not only a visual depiction of how the data structure is laid out, but the metaphor of "leaves" does also transfer over. It is possible to take it too far, trying to fit "bark" into the concept, which is of course, silly. Calling it a "dubranchion" or some such would be a disaster, IMHO.


"Binary tree" is not a vernacular term.

In economics, "cost" is a good example. This is a distinct concept from "price". "Comparative advantage" is another term in economics; this is perhaps not used in vernacular conversation, but I can tell you from personal experience that it certainly doesn't convey to most people the definition understood by someone with an education in economics -- the vernacular reading doesn't imply the jargon definition.

It seems to me that the difference is how the jargon is used. I imagine that someone without a CS background would quickly realize, when overhearing a conversation about binary trees, that the subject is something other than a type of flora.

I can tell you with confidence borne from frustrating experience that using economics jargon, such as that I mentioned above, with a lay audience gives the audience no such impression that the terms mean anything other than what they perceive them to mean.


Good variable names (and type names) matter for legibility. They should be clear, unambiguous, short, memorable and suggestive. Unambiguous is usually more important than short and memorable.

The word 'long' is ambiguous and unmemorable. And the type "word" is actively misleading. If you called a variable or class 'long', it wouldn't pass code review. And for good reason.

'Power' is an excellent example of what good technical terms look like. "Power" has a specific technical meaning in each of those fields, but in each case the technical meaning is suggested by the everyday meaning of the word "power". Ask a non-physicist to guess what "power" means in a physics context and they'll probably get pretty close. Ask a non-programmer to guess what "integer" means in programming and they'll get close. Similarly computing words like "panic", "signal", "file", "output", "stream", "motherboard" etc are great terms because they all suggest their technical meaning. You don't have to understand anything about operating systems to have an intuition about what "the computer panicked" means.

Some technical terms you just have to learn - like "GPU" or "CRDT". But at least those terms aren't misleading. I have no problem with the term "double precision floating point" because at least it's specific.

"long", "double", "short" and "word" are bad terms because they sound like you should be able to guess what they mean, but that is a game you will lose. And to make matters worse, 'long', 'double' and 'short' are all adjectives in the english language, but used in C as nouns[1]. They're the worst.

[1] All well named types are named after a noun. This is true in every language.


"long" and "short" are adjectives in C. The types are "long int" and "short int" and the "int" is implied if it's not present. A declaration like

    auto long sum;
declares a variable named "sum" of type "long int" and of automatic storage duration.

The "double" comes from double-precision floating point. In the 1970s and 1980s anyone who came near a computer would know what that meant. Anyone who ever had to use a log table or slide rule (which was anyone doing math professionally) would know exactly what that meant.

There are good sound reasons for the well-chosen keywords. Just because one is ignorant of those reasons does not mean they were not good choices.


Well, you can certainly think of "long" and "short" as adjectives, but the grammar doesn't treat them that way.

"long", "short", "signed", "unsigned", "int", "float", and "double" are all type specifiers. The language specification includes an exhaustive list of the valid combinations of type specifiers, including which ones refer to the same type. The specifiers can legally appear in an order. For example, "short", "signed short", and "int short signed", among others, all refer to the same type. (They're referred to as "multisets" because "long" and "long long" are distinct.)


You really need to consider the context before making broad statements like this.

By your own standards, "word" and "long (word)" are actually excellent terms, because they're conventions used by the underlying hardware, and using the same conventions absolutely makes sense for a programming language that's close to the hardware:

https://en.wikipedia.org/wiki/Word_(computer_architecture)


> Microsoft’s WORD/DWORD are nearly as bad. Don’t call something a “word” if it doesn’t store characters.

"Word" as a term has been in the wide use since at least 50s-60s, you can't really blame MS for that

https://en.wikipedia.org/wiki/Word_(computer_architecture)


A 'word' on an x86 is 32-bit, but a WORD is 16 bits.

You can _absolutely_ blame Microsoft for using it to mean something that it isn't.


Yes, int_16/int_32 or something like that makes a lot more sense. Today, not when this compiler was written.

The PDP-7, PDP-9, and PDP-15 have 18-bit registers. The world had not settled on 16/32/64 bits at all.

Even the Intel 80286's far pointers map to 24-bit physical addresses.


Unsure why you’re being downvoted, IIRC the original C programmers reference spoke explicitly about how “int” meant the most efficient unit of storage on the target machine.

Admittedly I read that more than 30 years ago :-O


C’s type naming had real value when it was first designed, at a time when the industry hadn’t yet fully standardised on the 8-bit byte, 32/64-bit words, IEEE floating point, etc.

The fact that “int” could be 16-bits on a PDP-11, 32 on an IBM 370, 36 on a PDP-10 or Honeywell 6000 - that was a real aid for portability in those days.

But nowadays, that’s really historical baggage that causes more problems than it solves, yet we are stuck with it. I think if one was designing C today, one would probably use something like i8,i16,i32,i64,u8,u16,u32,u64,f32,f64,etc instead.

When I write C code, I use stdint.h a lot. I think that’s the best option.
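
For example (a minimal sketch, the variable names are made up), the fixed-width types and their printf macros:

    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t  flags = 0xFF;        /* exactly 8 bits  */
        int32_t  delta = -42;         /* exactly 32 bits */
        uint64_t big   = 1ULL << 40;  /* exactly 64 bits */
        printf("%" PRIu8 " %" PRId32 " %" PRIu64 "\n", flags, delta, big);
        return 0;
    }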


An int in C was 16 bits until about 1980 when Unix started being ported to larger machines. C and Unix were originally just for the PDP11.


As this paper [0] explains, the initial version of the C compiler for PDP-11 Unix was finished in 1972. And less than a year later (1973), people had ported the C compiler (but not Unix) to Honeywell 6000 series mainframes, and shortly thereafter to IBM 370 series mainframes as well. (Note the text of the paper says "IBM 310" in a couple of places – that's a typo/transcription error for "370".) Both were "larger machines" – the Honeywell 6000 had 36 bit integer arithmetic with 18 bit addressing; the IBM 370 had 32 bit integer arithmetic with 24 bit addressing.

Alan Snyder's 1974 masters thesis [1] describes the Honeywell 6000 GCOS port in some detail. In 1977, there were three different ports of Unix underway – Interdata 7/32 port at Wollongong University in Australia, Interdata 8/32 port at Bell Labs, and IBM 370 mainframe port at Princeton University – and those three had C compilers too.

[0] https://www.bell-labs.com/usr/dmr/www/portpap.pdf

[1] https://apps.dtic.mil/dtic/tr/fulltext/u2/a010218.pdf (his actual thesis was submitted to MIT in 1974; this PDF is a 1975 republication of his thesis as an MIT Project MAC technical report)


Here’s something cool: the source code to Snyder’s compiler: https://github.com/PDP-10/Snyder-C-compiler


Unix was originally written for an 18-bit machine.


That was before C which is what we are talking about.


If you need an 18-bit int it would be called int18 under this scheme. And encountering things like int18 in code, at least you would never wonder what kind of int it was.


It could not be clearer that you didn't read the page referenced by the comment you replied to.

"long" was commented out.


No, the names are fine and self evident after glancing through K&R for 15 min.

The real mistake in retrospect is that int and long are platform dependent. This is an amazing time sink when writing portable programs.

For some reason C programmers looked down on the exact width integer types for a long time.

The base types should have been exact width from the start, and the cool sounding names like int and long should have been typedefs.

In practice, I consider this a larger problem than the often cited NULL.


It made more sense in an era when computers hadn't settled on 8-bit bytes yet. A better idea (not mine) is to separate the variable type from the storage type. There should be only one integer variable type with the same width as a CPU register (e.g. always 64-bit on today's CPUs), and storage types should be more flexible and explicit (e.g. 8, 16, 32, 64 bits, or even any bit-width).


And that creat has no e on the end.


It's a perfectly good word in Romanian.


int is neither 32 nor 64. Its width corresponded to a platform’s register width, which is now even more blurred because today’s 64 bits is often too big for practical use and CPUs have separate instructions for 16/32/64-bit operations; we agreed that int should likely be 32 and long 64, but the latter depends on the platform ABI. So ints with modifiers may be 16, 32, 64 or even 128 on some future platform. The intN_t are different, fixed-width types (see also int_leastN_t, int_fastN_t, etc. in stdint.h; see also limits.h).

Also, don’t forget about short, it feels so sad and alone!
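
A trivial sketch to see what a given platform/ABI actually picked:

    #include <stdio.h>

    int main(void)
    {
        /* on a typical 64-bit Linux (LP64) ABI this prints 2 4 8 8;
           on 64-bit Windows (LLP64) long stays at 4 */
        printf("%zu %zu %zu %zu\n",
               sizeof(short), sizeof(int), sizeof(long), sizeof(long long));
        return 0;
    }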


> It makes sense long would have needed a comment.

I think you misunderstood. There's no explanatory comment. The "long" keyword is commented out, meaning that it was planned but not yet implemented.

    ...
            init("int", 0);
            init("char", 1);
            init("float", 2);
            init("double", 3);
    /*      init("long", 4);  */
            init("auto", 5);
            init("extern", 6);
            init("static", 7);
    ...


Hmm, given its position in that sequence, it looks like it means a floating-point type larger than a double.


(a) FORTRAN used ‘DOUBLE PRECISION’ since the '50s, so ‘double’ would be immediately obvious.

(b) Many important machines had word sizes that were not a multiple of 8.


Fortran has been using `int8`, `int16`, `int32`, `int64`, and `real32`, `real64`, `real128` kinds officially for at least two decades. `double precision` has long been declared obsolescent, although it is still supported by many compilers. Regarding types and kinds, (modern) Fortran is tremendously more accurate and explicit about the kinds of types and values.


In C, long is not the name of a data type, it is a modifier. It turns out that C's default type is int, so if you say long without another data type (such as double, for example), it means long int.
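
A small sketch of the same point:

    long a;           /* same as long int a;                 */
    long int b;       /* the explicit form                   */
    unsigned long c;  /* same as unsigned long int           */
    long double d;    /* here 'long' modifies double instead */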


That made me curious. From section 2.2 of the 2nd edition of K&R, 'long' is not a type but a qualifier that applies to integers (not floats though), so you can declare a 'long int' type if you prefer.


Float doesn't make sense either. What is floating?

(I know it's floating point, but it's the same as long/double).


String doesn’t make sense either. But that’s how words acquire new meanings.


It's short for "Hollerith string". Nobody wants to type out that man's name every time they want to deal with the data type. Also, most programmers know zero about computer science.


That seems to be inaccurate. According to Wikipedia, string constants in FORTRAN 66 were named in honor of Hollerith, and the actual wording in the standard is: "4.2.6 Hollerith Type. A Hollerith datum is a string of characters. This string may consist of any characters capable of representation in the processor. The blank character is a valid and significant character in a Hollerith datum." Apparently the term "string of characters" is assumed to be self-explanatory here, and independent of the "Hollerith" nomenclature. The connection to Hollerith is via punched cards, for which the _encoding_ of characters as bit patterns (hole patterns) was defined; but Hollerith doesn't seem to be directly related to the concept of character strings as such.

It is probably rather by chance that we ended up with the term "string (of characters)", as opposed to for example "sequence of characters". In a different universe we might be talking about charseqs (rhymes with parsecs) instead of strings.


> Also, most programmers know zero about computer science.

What made you say that?


I have known a very good programmer, I'd say one of the best I have met, and he had extremely, surprisingly little knowledge of computer science and math (not having a formal education may have been a contributing factor). He coded in JS and Ruby.


Well, 'character string' could, too, make a strange impression on the uninitiated.


There is no short or unsigned either. Maybe int modifiers were pending work?

for is missing too.


Very interesting. Does anyone have any idea how the C compiler was bootstrapped? Was it first written in B, or was it written in PDP-11 assembler?


It was written in B. And B was written in B. To trace the bootstrap up into the high level, you have to go further back than the PDP-11. To bootstrap B on the PDP-7 and GE-635 machines he had access to, Ritchie wrote the minimum subset required to compile a compiler in B, in a language known as TMG (in the vein of, and a big influence on, Yacc and similar). This minimal bootstrap was then used to compile a B compiler written in B, and further development was self-hosted.

Later, the language would be retargeted to the PDP-11 while on the PDP-7. Various changes, like byte-addressed rather than word-addressed memory, led to it morphing into C after it was moved. There was no clear line between B and C -- the language was self-hosting the whole time as it changed from B into C.

Mr. Ritchie wrote a history from his perspective published in 1993. I've mostly just summarized it above. It's available here: https://web.archive.org/web/20150611114355/https://www.bell-...


I like that the Bootstrappable Builds folks are working on a bootstrap process that goes from a small amount of machine code (~512 bytes) all the way up to a full Linux distro, including compilers/interpreters for modern languages.

https://bootstrappable.org/


Nice, I hadn't heard of this project before!

In particular BOOTSTRA [1] looks really fun. I have also toyed with the idea of using MS-DOS 3.30 as a guaranteed-ubiquitous build environment.

It comes with a filesystem, a text editor (EDLIN.EXE), an object file linker (yup!), a debugger/assembler (DEBUG.EXE is an amazing tool), a programming language with decent string handling (GWBASIC.EXE), and a command interpreter with batch file scripting to glue it all together.

Not sure I would write the assembler as batch files though, that is really hardcore. :)

That entire build environment would fit on a single 360KB floppy.

[1] https://github.com/mniip/BOOTSTRA


I don't think that BOOTSTRA is in any way involved in the process bootstrappable.org are working on, which starts with hex0 (the 512B machine code binary seed), proceeds through a ton of layers and eventually reaches non-interactive bash.

https://github.com/fosslinux/live-bootstrap/blob/master/part...


That repo is very cool. I want to see them build further on it, go through recursive gcc / clang versions, etc. all the way to a kernel.


What is "guaranteed-ubiquitous" about DOS 3.30?


There are (still!) a lot of copies floating around -- both online and offline -- and it runs on the most pervasive commodity platform that ever existed: 16 bit real mode x86.

Those binaries will boot on anything from an ancient IBM PC with an 8088 CPU all the way through to fairly recent x86 systems (if they still have legacy BIOS boot support).


You have a different meaning of "guaranteed" and "ubiquitous".

It wouldn't be defensible to use those descriptors for Windows XP. In a word, it would be wrong. Calling DOS 3.30 "guaranteed-ubiquitous" is even wronger than that.


When people say bootstrapping, is this how all computer languages came to be? Or there has been multiple attempts at bootstrapping? What I’m getting at is whether all languages have a single root.


Yes, bootstrapping with regard to languages is the trick of getting that first compiler running so you can compile the rest of the compiler.

It has been done many times. The first assemblers were written directly in machine language. The first compilers were written in assembly. Many implementations of FORTRAN, ALGOL, COBOL, etc. As late as the 1970s it was not unknown to write a new high level language directly in machine code. Steve Woz's BASIC for the Apple II was "hand-assembled", as he put it.

So, taken literally, certainly not. People still do it today as a hobby or educational project.

But if we take the question in a looser sense? Yes kind of, at least in the UNIXish world. No one has implemented a serious high level systems language, except in a high level systems language, for a long time now. Rust, for example, was initially implemented in Ocaml. And normally that's what would be called the first Rust bootstrap. But OCaml was implemented in C, probably compiled by Clang or GCC. GCC was written in C and... so on.

Often one finds it does lead back to DMR at a PDP-7. On the other hand, I strongly suspect something like IBM's COBOL compiler for their mainframes (still supported and maintained today!) would not have such a heritage.


IBM's COBOL was implemented in PL/X, which traces its heritage to PL/S. PL/S itself was developed as a replacement for using assembler.


> Steve Woz's BASIC for the Apple II was "hand-assembled", as he put it.

Very cool - so basically (haha) Woz's integer BASIC was bootstrapped by writing it in assembly language and translating it into machine code by hand.

I wonder if someone (Woz?) has written a 6502 assembler in integer BASIC to allow it to bootstrap itself?


I'm not an expert, but I guess "bootstrapping" only applies to compilers, and Apple II BASIC was an interpreter. An interpreter doesn't have to compile itself, only run the programs you give it. So on an 8-bit computer the priority is to have an implementation that (a, and most importantly) takes up as little space as possible and (b) is reasonably fast.


There are metacircular interpreters, like in Scheme (see SICP [1]). But this is rare, compared to bootstrapping compilers.

[1] https://mitpress.mit.edu/sites/default/files/sicp/full-text/...


The Apple ][ has a monitor in ROM that lets you enter machine code directly. Legend says that for the Apple I, Woz knew the complete Integer BASIC machine code and could type it live into the monitor.


Also, the monitor on pre-][+ machines with integer BASIC also had a mini-assembler, so short assembly programs could be typed in as such instead of keying in the bytes. Both versions of the monitor had a mini-disassembler.

On the other hand, I still remember at least a few 6502 hex opcodes, not that I have any use for that information anymore. The instruction set is small enough that it doesn't surprise me that Woz would have it memorized.


I’m not aware of any BASIC that is implemented in itself. There must be one or two somewhere, but typically for systems like the Apple II or Commodore PET, BASIC was always implemented in assembly.

So most weren’t exactly bootstrapped the way we are talking here.


> No one has implemented a serious high level systems language, except in a high level systems language, for a long time now.

LuaJIT is a prominent counterexample. The base interpreter is written in assembly for performance. It comes with its own tradeoffs, especially portability.


LuaJIT uses its DynASM library for both static and JIT code generation. DynASM is implemented in a mixture of Lua and C, and Lua is of course implemented using C. So LuaJIT actually depends on a C compiler to generate assembly. Plus, not all of LuaJIT is implemented using assembly, only some critical hotspots (e.g. the bytecode interpreter loop). The rest is just C.

DynASM is actually really cool.

I almost forgot: apropos bootstrapping, because LuaJIT requires Lua to build, it includes a single-file, stripped down version of PUC Lua 5.1 which it can build to bootstrap itself if the host lacks a Lua interpreter.


LuaJIT 1.x uses DynASM for JIT but LuaJIT 2.x doesn't and instead uses the IR to ASM compiler in https://github.com/LuaJIT/LuaJIT/blob/v2.1/src/lj_asm.c

A program using DynASM doesn't depend on Lua or a C compiler to generate assembly at runtime but you need both to build it.


Fascinating and inspiring, thank you for taking the time to distill it so clearly.


Interesting article on how the original CDC 6000 Pascal compilers were bootstrapped (including the scanned source code listings as PDFs!?!):

http://pascal.hansotten.com/niklaus-wirth/cdc-6000-pascal-co...

"The method used to create the first Pascal compiler, described by U. Ammann in The Zurich Implementation, was to write the compiler in a subset of unrevised Pascal itself, then hand-translate the code to SCALLOP, a CDC specific language. The compiler was then bootstrapped, or compiled using the Pascal source code and the SCALLOP based compiler, to get a working compiler that was written in Pascal itself. Then, the compiler was extended to accept the full unrevised Pascal language."


I don't know if I'm stating the obvious, but you only need to bootstrap if you're writing a compiler in the language it's compiling. So any language compiler written in a different language isn't bootstrapped.

Even if you're using a loose definition of "root" as in a parent language is any language involved in writing another language, I doubt there is one root. There certainly have been languages and computer architectures (with assembly languages) that are independent islands.


Some languages have multiple implementations in different languages. The Bootstrappable Builds folks are working on proper bootstrap for everything.

https://bootstrappable.org/ https://bootstrapping.miraheze.org/wiki/Main_Page


Chuck Moore is an interesting example of someone who has bootstrapped several languages without touching a C compiler. He has written systems directly in machine code, and has gone as far as designing his own chips to directly run Forth and ColorForth.


Bootstrapping can also get recursive, eg. GCC and binutils are bootstrapped by compiling their sources with the GCC you already have on your build system; the result of this is an intermediary compiler, which is then used to compile the sources of itself again. (I should look up the following bit, but IIRC, the second generation result compiles the sources a third time, and the last two outputs are compared -- they should be equal, barring any non-deterministic optimizations.)


The basic sequence was:

- Rewrite B compiler in B (generating threaded code)

- Extend B to a language Ritchie called NB (new B). This compiler generated PDP assembly. There is no version of the NB compiler known to exist.

- Continue extending NB until it became the very early versions of C.

You can read the longer version of this history here:

https://www.bell-labs.com/usr/dmr/www/chist.html


I think it was written in B as Ritchie extended B for the PDP-11. By that point the B compiler was written in B

https://sci-hub.do/10.1145/155360.155580


Real programmers use butterflies


A reference to https://xkcd.com/378/, for those wondering.


Flint knives and bear skins.


Abacus


So do I get this right: this is not the first C compiler (since that one would have been written in B), but rather the first C compiler written in C?


Thank you for pointing this out. It obviously can't be his first C compiler because it's written in C! :)

I'm not even sure it's the first C compiler written in C, though - it just says in the github description "the very first c compiler known to exist in the wild."

Regardless, if it's from 1972 it's a very early version.


> Thank you for pointing this out. It obviously can't be his first C compiler because it's written in C! :)

This isn't obvious to me.

I just assumed that the first iteration was compiled by hand to bootstrap a minimum workable version. Then the language would be extended slightly, and that version would be compiled with the compiler v. n-1 until a full-feature compiler is made.

It makes sense to write a compiler in a different language, but given the era, I could see hand-compilation still being a thing.


It's not uncommon for a compiler to be partially written in its own language, so therefore this may as well be the first C compiler.




Why not just:

char waste[however-many-bytes-are-needed];

?


That's allocated in a different segment.


It is now. Was it then? AFAICT from https://en.wikipedia.org/wiki/PDP-11 none of the PDP-X machines had a segmented architecture.


Even so, https://sourceware.org/binutils/docs/as/PDP_002d11_002dPseud....

  9.34.2 Assembler Directives
  The PDP-11 version of as has a few machine dependent
  assembler directives.
  
  .bss
  Switch to the bss section.

  ...


Can't say for sure if it's the reason in this particular situation, but that would require dynamic memory allocation, and while it may seem backwards to reserve more memory than potentially needed on old memory-constrained systems, it is a pattern you see very frequently because on those systems, you actually know exactly how much memory you have. As a result, you know exactly how much memory you can "waste".

Especially on single-tasking systems, it doesn't matter how much memory you waste, because no other program is affected by it.

Or in other words: If you know you are guaranteed to have x KB of memory available and no other program can steal it, and you know that you don't need more than y KB, then why allocate up to y KB dynamically when you can just straight-up allocate y KB statically, which is less work?
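
A minimal sketch of that pattern (the names and the size are made up):

    #define POOL_SIZE 4096           /* assumed fixed budget for this machine */

    static char pool[POOL_SIZE];     /* reserved statically, up front */
    static unsigned pool_used;

    /* hand out chunks of the static buffer; nothing is ever freed */
    char *pool_alloc(unsigned n)
    {
        if (n > POOL_SIZE - pool_used)
            return 0;
        pool_used += n;
        return &pool[pool_used - n];
    }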


IMO the way C programmers avoid dynamic memory allocation because it's so painful is probably the best feature of the language.


External arrays are static, they do not involve any dynamic allocation.


It's not clear from the question whether GP just wonders about the strange syntax used to reserve the space or about allocating a fixed amount of memory. The "however-many-bytes-are-needed" part sounded to me like they thought that a dynamic amount of memory should be allocated.


For the hobbyists and the curious using 2.11 BSD, its C compiler is the end of the line for development of the Ritchie C compiler, as far as I know.


I tried to find the last version that Dennis wrote: my findings are here:

https://github.com/dibyendumajumdar/C


It is a surreal feeling looking at this first code; not sure how to explain why that is the case.


I felt the same upon seeing an actual Dead Sea Scroll at the Jordan Museum in Amman.

The feeling dissipated somewhat when our guide explained that it’s a treasure map


This is gold. I like how the files are named 00 01 10 11 because long filenames are scary.

    hshsiz 100;
    hshlen 800; /* 8*hshsiz */
    hshtab[800];
These were at file scope. I assume they default to int, but when was the demand for = added?


For those who wonder, the extract comes from https://github.com/mortdeus/legacy-cc/blob/ccbe90a2803e2cc7e...

IMO, it is close to Assembly:

* you reserve space and possibly set an original value with "hshsiz DB 100" or "paraml DB ?"

* assignment is a different business which involves a runtime instruction (MOV)

Hence the same difference (no = sign for the first, passive, compile-time operation; an = sign for the second, active, runtime operation) in this proto-C.

(Of course, this doesn't answer your question of "when" the syntax of those two operations fused :-) )
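
Side by side, for illustration (the early form will not compile with a modern compiler):

    /* proto-C, as in the linked file:
           hshsiz 100;
       the type defaults to int and the initial value follows the name
       with no '=' sign */

    int hshsiz = 100;  /* the modern C equivalent */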


Long filenames are scary when your system has a 2.5MB disk.


It is very interesting how many hardcoded magic numbers it has. No headers and defines because they did not exist yet?

The full compiler is super compressed. Hundreds of lines per file. It would be great if there was an explanation of the general idea of the design somewhere.


I'm only seeing a few thousand lines of code, so I think it would fit in someone's head easily enough. Might be the numbers were on paper.


So how do you "bootstrap" compile this? I mean without cheating, say on a rebuilt computer after all computers were destroyed?


Hand-compile it for the initial bootstrap would be one way to go. That, or write a minimal C in assembler, then make it self-hosting, then expand it into the full implementation.

Edit: Elsewhere in-thread, retrac posted a detailed summary of how C developed from B.



