Continuous Unix commit history from 1970 until today (github.com/dspinellis)
289 points by FrankyHollywood on June 16, 2022 | 68 comments



You don't see this every day:

https://github.com/dspinellis/unix-history-repo/blob/Researc...

Is this B, or is it BCPL? What would have compiled this code back in the day?


It's B. BCPL has "LET MAIN() BE $(..." instead of "main $(...".

Running B was a challenge on the PDP-7 but easier on the PDP-11, apparently, because of the increase of memory size. The linked document has an interesting history about compiling B to threaded code, a form of interpreted code, and then to machine language. B never really made the jump to a full-fledged citizen because it quickly got replaced by C, although BCPL was popular for a long time.

https://www.bell-labs.com/usr/dmr/www/chist.html


Wikipedia's article on B says that BCPL used := for assignment and = for equality tests, whereas B used = for assignment and == for equality. Assuming that's correct, this must be B code.


I don’t know, but I love how clearly and concisely it expresses what would later become ubiquitous as do-while and continue.

That’s poetry. Nice find.


I love how thin the layer above assembly is: without knowing B, is my interpretation correct that this function effectively “inherits” the stack of the calling function? In other words, rather than passing function arguments and letting the compiler deal with it, you’re supposed to push the string you want to lcase onto the top of the stack?

Reminds me a lot of writing my own compiler/assembler in university, where it’s expected that all this happens automatically nowadays.


No, that's not correct. It reads the string from standard input. A C translation would look like this:

    main()
    {
        int ch;
        while ((ch = read()) != 4) {
            if (ch > 0100 && ch < 0133)
                ch = ch + 040;
            if (ch == 015) continue;
            if (ch == 014) continue;
            if (ch == 011) {
                ch = 040040;
                write(040040);
                write(040040);
            }
            write(ch);
        }
    }

A more modern C version would look like:

    #include <stdio.h>

    int
    main(void)
    {
        int ch;
        while ((ch = getchar()) != EOF) {
            if (ch > 0100 && ch < 0133)
                ch = ch + 040;
            if (ch == 015) continue;
            if (ch == 014) continue;
            // No need to handle tabstop specially
            putchar(ch);
        }
    }


Hmm, don’t think so. The function does not operate on a string, it seems to read a character using read() and write it back, transformed, using write(). Given that the function is named main, it’s probably the top level function anyway (from the programmer’s point of view, often the OS actually calls into a different function that is part of the language runtime, e.g. _start, which in turn calls main eventually, but that is usually hidden from the programmer).


This is the main function ... there is no calling function. Nor is a string on the stack being accessed.


Is that truly from 1970? For example, that commit's grandparent seems to have been specifically crafted to use "Date: Thu, 1 Jan 1970 00:00:00 +0000" https://github.com/dspinellis/unix-history-repo/commit/185f8....


That’s 0 in Unix epoch time (guess why!), so seems more like a missing timestamp than a crafted one. The fact that the linked file does not have a 0 timestamp, but a slightly later one, suggests it's valid, or at least intended to be valid.


I recall that in A Deepness in the Sky by Vernor Vinge, a space sci-fi set in the far future, they're still using Unix time underneath many many layers of abstractions, and with their cultural context they guess that humanity must have set it to start with the moment mankind first travelled into space to land on the Moon.


Hah, plausible. Not far off timewise, and yet totally wrong, but understandable how such a conclusion could be made.


They had "auto" vars in 1970. WG14, the ISO working group that maintains the C programming language specification, has just recently discussed acceptance of __auto_type.

EDIT: oops, the "auto" here means automatic allocation.


Yea, I have to say, to me, this is cool. Glad to see this sort of history being preserved.


So auto is used as a keyword here. Maybe C inherits this never-used auto from B?


auto stands for 'automatic', because such variables are automatically allocated for each function invocation. In C it became redundant because base types were added, and so the base type could start the definition (auto was still permitted with default base type of int until C99 I think). auto in B is a bit like 'let', it starts a declaration, along with 'extrn'.


very weird that two characters - $( and $) - were used before { and }

did old keyboards not have curly braces or what?


In C you can still use the digraphs <% and %> as an alternative to curly braces:

    #include <stdio.h>

    int main() <%
        printf("hello, world\n");
        return 0;
    %>


{} were added to the 1967 revision of ASCII, along with `|~ and lower case. (EBCDIC never got them in the base character set, only in alternate ‘code pages’.)


I remember in 1990 IBM sponsored a small 370 for our university. I fiddled for weeks to get curly braces to work correctly. We were all used to working with Sun workstations or at most VAXen at the time. It was unbelievable how complicated this was in the IBM world. They were still living in the world of full-time machine operators. My colleagues were glad I did it; my professor, who had not programmed for years and was moving in higher spheres, was not impressed that I had spent so much time on it when he learned about it later.


This repo has been super useful as I've been writing a book that teaches Rust by rewriting classic Unix utilities. I settled on using the 4.4 BSD source as a base but having the whole history available has been really interesting. Recently I came across a bug in the 4.4 version of cat that wasn't fixed until a few years later (in FreeBSD).


Gource Visualization video which points to https://www.youtube.com/watch?v=S7JB0mhrGCQ does not work anymore.

> Video unavailable > This video is no longer available because the YouTube account associated with this video has been terminated.


We need to solve this problem.

YouTube is free to delete any account, even just to cut costs.


I'm not sure what the problem to be solved here is. It doesn't seem reasonable to force YouTube (or any other free video host) to indefinitely store and host content.

If you want something to stay around on the internet it has to take up space on somebody's drive and bandwidth on somebody's network connection - and for sufficiently large content like video you're going to have to do that yourself or convince/pay someone you trust to do so on your behalf.


I am thinking of everyone hosting their own videos, and being able to comment on each other's. Is there a federated YouTube?

Something like Mastodon/Pleroma.


Peertube


Are you sure it was YT and not the creator who deleted the account?

Also there is a solution already, it's called "The Internet". Upload your content far and wide.


> the YouTube account associated with this video has been terminated.

"Terminated" is a pretty harsh wording for an account that was willingly deleted.

When you resign nobody says you got terminated. You are terminated when you are fired.


I assume Github, the host of the OP, can do the same. How many people have entrusted their life's work to it?


I sincerely hope none given how easy git is to mirror and the risk of Microsoft killing accounts


Found a video showing the history of Python: https://youtu.be/cNBtDstOTmA


You don't see this every day.....

But you do see it every year for the last number of years

Some previous discussion from 3 years ago:

https://news.ycombinator.com/item?id=19429249



I like how Github shows it as infinity commits


What's up with that? There only seem to be 4, on HEAD?


Check the other branches.


I always expected that the commit count was for that branch. I guess it is global?


I saw the other branches when I made the comment.

The commit count is — usually — the commit count from the currently selected ref.

E.g., on a sample repo, "master" displays as 29,474 commits. "master^" displays as 29,473.


Yeah, is that a bug? lol


Sounds like an overflow bug prevention mechanism.

There are an infinite number of infinities, so surely one of them is the maximum possible commits in github.


Git runs into problems with more than 2¹⁶⁰ commits in a repository.


I love Spinellis' work on teaching reading of code.


Diomidis Spinellis' "Code Reading: The Open Source Perspective" is a thing I've wanted but didn't know existed, browsing it now to hopefully recommend, thanks for the pointer.

I work with computer engineering students and often tell them that reading more code would be good for them but have never had a great generic but concrete suggestion for how to get there.

The second best programming class I took in college was a graduate elective and the _only_ code-reading-based course I took or knew of being offered: a guided safari in the Linux kernel sources where we had to make targeted changes for the assignments. FTR, the best programming class was set up as "new language in a different paradigm every few weeks, write one small program that suits it and one small program that doesn't," not incidentally taught by the same person ( https://en.wikipedia.org/wiki/Raphael_Finkel ).


We have all this commit data at scale, it really feels like there are interesting stories or lessons that could be extracted from them.

There's kind of the obvious operational stuff like: What are the properties of commits that introduce bugs compared to those that don't. Which type of commits are rarely changed and which are more likely to be changed over time. But what I'd find even more interesting is some insight into how we solve problems and how well we're able to solve them. I guess part of the puzzle is missing - the external requirements / environment that give rise to some number of the commits.


There is a conference series, MSR (Mining Software Repositories), with research papers looking at such questions. http://www.msrconf.org/ In fact, I presented this work at the 2015 MSR conference.


That's a lot of work!

A true labor of love.

Thanks!


How would you feel if your commits become publicly available for everyone to see forever?


That ship sailed nearly half a century ago. All of this source code was licensed to research universities starting in 1975. The earlier releases weren't under FLOSS licenses as we know them today, but they were distributed with the intent that researchers would read, learn from, and modify the code. And they did, creating the later BSD Unix releases, whose code was shared more widely under more permissive licenses.

Finally, the people who created this repo are some of the primary authors of the code. They wanted this to be in the open.


There was an interesting discussion in 2019 after a group of people started cracking the passwords of the original Unix developers that had been obtained from an old /etc/passwd file in this repo (https://github.com/dspinellis/unix-history-repo/blob/BSD-3-S...).

Rob Pike spoke out against the effort, calling it “distasteful.” https://inbox.vuxu.org/tuhs/CAKzdPgw0Vz8UFbK7c_Jr+RHGMssSxN=...

Nonetheless, in the end every password was cracked. Some highlights:

Steve Bourne: “bourne”

Dennis Ritchie: “dmac”

Kirk McKusick: “foobar”

Brian Kernighan: “/.,/.,”

Ken Thompson: “p/q2-q4!” (a chess move)

Bill Joy: cracked but not posted due to Rob Pike’s comments, but it contained a control character


Kernighan's is my favorite. The keyboard layout could be different, but I'm imagining him rapping his fingers against the three adjacent keys as if the motion itself were a secret handshake.


Isn't it cool? I mean, being in the history of a project like this... it could be around long after we are gone.


This is the point of GitHub. Also Unix was(/is) a masterwork of craftsmanship. Struggling to see a problem here.


Eh, I think the select and poll system calls are both kludges, the sockets API inferior to P9 dial, gethostbyname deeply problematic.

Then there are the ways threads interact badly with many classic functions, the way signal handlers play messily with everything else.

Don't even ask about X.


An important question:

Could I have thought of something at that time, with all the same constraints and without the benefit of hindsight, that would have been better?

For the vast -- and I mean vast -- majority of us, the answer to that question is a resounding No!


At what time? Many of the issues are the result of evolution and bolting things on vs. redesigning things as conditions changed.


X is fantastic and amazing, not just for the time, even now. Not switching.


Really proud to be a part of history.


I hope everyone is ok with cursing….


Proud.


Fine. You?


Is anyone able to fully bootstrap it now?


So what’s the oldest line of code currently active?

What’s the longest-lived line of code in the repo?


Who holds the canonical unix repo?


There is no canonical Unix repository.

Unix (1969) predates source version control (1972).


> IBM's OS/360 IEBUPDTE software update tool dates back to 1962, arguably a precursor to version control system tools. A full system designed for source code control was started in 1972, Source Code Control System for the same system (OS/360). Source Code Control System's introduction, having been published on December 4, 1975, historically implied it was the first deliberate revision control system.[4] RCS followed just after,[5] with its networked version Concurrent Versions System. The next generation after Concurrent Versions System was dominated by Subversion,[6] followed by the rise of distributed revision control tools such as Git.[7]

* https://en.wikipedia.org/wiki/Version_control#History


Who owns the modern unix copyright?


The Open Group (Intel, IBM, Fujitsu, Huawei, Philips etc)

https://www.opengroup.org/about-us/who-we-are

https://www.opengroup.org/trademarks

Together with IEEE they are the ones giving the POSIX certification http://get.posixcertified.ieee.org/certification_guide.html


No, they own the trademark.

I think the copyright of the old Unix code ended up with Novell in the 90s, so it would now be owned by Micro Focus.



