The Source History of Cat (twobithistory.org)
151 points by janvdberg on Nov 12, 2018 | 62 comments



I was surprised to read that some versions of cat (apparently BSD Net/2 and derivatives) have special code for sockets. What does cat do with sockets!?

Well, AF_UNIX sockets are sockets with paths in the filesystem. You must either connect() to them or bind() to them instead of open()ing them. Apparently, BSD-derived versions of cat will try to connect() to a file if open() fails.

With GNU cat, if you try to cat a socket, it will go like this:

    $ ls -l test.sock
    srwxr-xr-x 1 luke users 0 Nov 12 21:07 test.sock
    $ cat test.sock 
    cat: test.sock: No such device or address
but BSD-derived cats will successfully open the socket for reading. That behavior can be accomplished on other systems by using socat instead; BSD cat behaves somewhat like:

    $ socat UNIX:test.sock STDOUT
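
For illustration, here is a minimal Python sketch of that open-then-connect fallback. It is not the BSD source, and `open_or_connect` is a made-up name. It assumes open() on a socket fails with ENXIO (which matches the "No such device or address" message above) or with EOPNOTSUPP, which I believe is what the BSDs report.

```python
import errno
import socket


def open_or_connect(path):
    """Return a binary, readable file object for path.

    Try a plain open() first. If that fails in the way open() fails on a
    socket, fall back to connect()ing to path as an AF_UNIX stream socket.
    """
    try:
        return open(path, "rb")
    except OSError as exc:
        # Assumption: ENXIO (Linux) or EOPNOTSUPP (BSD) means "this is a socket".
        if exc.errno not in (errno.ENXIO, errno.EOPNOTSUPP):
            raise
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
        reader = sock.makefile("rb")
    except OSError:
        sock.close()
        raise
    # The underlying descriptor stays open until reader is closed.
    sock.close()
    return reader
```

Reading from the returned object until EOF and writing the bytes to stdout would then give roughly the socat behaviour shown above.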


Hah, I learned something about cat today. Thanks.

Amusingly, the BSD socket behavior can be disabled with the compiler macro -DNO_UDOM_SUPPORT, but as far as I can tell it has been neither documented nor hooked into the rest of the build system in any way since its introduction in 2001:

https://svnweb.freebsd.org/base?view=revision&revision=83482


> But, if you pull up the manual page for something like grep, you will see that it has not been updated since 2010 (at least on MacOS).

Well, GNU grep was last released 16 months ago, and the last change to its master branch was 4 weeks ago: http://git.savannah.gnu.org/cgit/grep.git

FreeBSD's grep was last updated back in August: https://github.com/freebsd/freebsd/tree/master/usr.bin/grep

OpenBSD's grep was last updated 11 months ago: http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/grep/

Oddly, it looks like the Darwin grep was last updated in 2012: https://opensource.apple.com/source/text_cmds/text_cmds-99/g...

Strange that Apple would be shipping such an ancient grep.


IIRC, since GPLv3 was attached to them, Apple has stopped updating all the GNU utilities but continued shipping them.


I don't believe that macOS grep was ever GNU grep. I believe that macOS always used a BSD variant of grep.


Using OS X 10.4.11 here, the grep file is dated Jan 2006, the end of the grep man pages says "2002/01/22".

  $ uname -v
  Darwin Kernel Version 8.11.1: Wed Oct 10 18:23:28 PDT 2007; root:xnu-792.25.20~1/RELEASE_I386
  $ grep --version
  grep (GNU grep) 2.5.1
Other man pages: ed says 1993, sed says BSD 2004, cat says 3rd Berkeley Distribution 1995.


Interesting. What does `type grep` say? Is it possible that it's /usr/local/bin/grep from homebrew/macports/…, and that /usr/bin/grep is BSD grep?

I found a comment claiming that prior to 10.8 (2012, Mountain Lion) it used GNU grep, but nothing I'd feel comfortable citing.


  $ type grep
  grep is hashed (/usr/bin/grep)
It does seem to be the original grep for this machine (it's a Mac Mini) - it has the same Jan 2006 date as most of the files in /usr/bin, and nothing has an earlier date. There's no other file called grep elsewhere.


For what it's worth, on a not-too-old Mac:

  $ uname -v
  Darwin Kernel Version 13.4.0: Mon Jan 11 18:17:34 PST 2016; root:xnu-2422.115.15~1/RELEASE_X86_64
  $ grep --version
  grep (BSD grep) 2.5.1-FreeBSD
I don't have historical information, but that's at least consistent.


> Strange that Apple would be shipping such an ancient grep.

I don't think it is that strange. Command line tools such as grep don't appear to be a development priority for Apple. Their focus appears to be on features visible to the average user, who uses the GUI instead of the command line.

Command line tools are mainly used by developers and power users, and the existing tools are generally good enough for most purposes, and people who want something better can always install the GNU versions using Homebrew/MacPorts/etc. There isn't much market demand for improvements in this area, so it makes sense Apple wouldn't invest in it.


Be aware if you are going to delve into history that grep is the source of much confusion, in part because exactly which program was grep on some systems has changed over the years. On FreeBSD, for example, some years ago grep was the GNU tool and the BSD tool was named "bsdgrep". They would both identify as the same version number.


> On FreeBSD, for example, some years ago grep was the GNU tool and the BSD tool was named "bsdgrep". They would both identify as the same version number.

Neither of these statements is true. grep on FreeBSD is still GNU grep, and it has a distinct version text from bsdgrep:

    $ grep -V
    grep (GNU grep) 2.5.1-FreeBSD
    
    Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions. There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
    
    $ bsdgrep -V
    bsdgrep (BSD grep) 2.6.0-FreeBSD

    $ uname -rK
    13.0-CURRENT 1300003


Tut-tut! So easily demonstrable otherwise.

MacOS:

* https://unix.stackexchange.com/questions/352977/

* https://unix.stackexchange.com/a/398249/5132

The very version of FreeBSD from some years ago:

   % bsdgrep --version
   bsdgrep (BSD grep) 2.5.1-FreeBSD
   % grep --version
   grep (GNU grep) 2.5.1-FreeBSD

   Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
   This is free software; see the source for copying conditions. There is NO
   warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
   
   %
More on that:

* https://unix.stackexchange.com/a/65609/5132

Kyle Evans and others on making bsdgrep into grep:

* https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201650


MacOS isn't FreeBSD. They're free to do whatever they want with the BSD-licensed software. Your comments made claims about FreeBSD that weren't factual. I would also emphasize that:

    strcmp("bsdgrep (BSD grep) 2.5.1-FreeBSD", "grep (GNU grep) 2.5.1-FreeBSD") != 0
Also, I'm personally in contact with Kyle Evans and am familiar with the general interest in making bsdgrep grep. But I also know that it hasn't happened yet.


> Strange that Apple would be shipping such an ancient grep.

Maybe they ceded this part of the OS to Homebrew? I know I never try to update anything stock in the OS. It's so much easier/faster to `brew install xxxxxx` than mess with the OS which might get overwritten with an official update anyway.



Agreed. But the lack of `{}` raises my blood pressure a few points...


Really nice history! I want to applaud the author on this loving treatment.

Also I want to point readers to the commentary of some of the Unix authors:

“Old programs have become encrusted with dubious features. Newer programs are not always written with attention to proper separation of function and design for interconnection.”

http://harmful.cat-v.org/cat-v/unix_prog_design.pdf

My point being: Unix (and derivatives) encompass a set of people who disagree about what constitutes Unix philosophy.


> My point being: Unix (and derivatives) encompass a set of people who disagree about what constitutes Unix philosophy.

That's certainly a Unix truism! It seems everyone has their own subjective beliefs about what Unix should be and decides their own beliefs constitute "the" Unix philosophy.


Interesting to think what a different conclusion the article would have arrived at if he'd chosen to look at GNU cat on Linux. A few sample points:

* 2002: 833 LoC (http://landley.net/aboriginal/history.html)

* 2013: 36kLoC, 2/3rds of them .h files (https://news.ycombinator.com/item?id=11340510#11341175)

* 2018: 37kLoC of .c file dependencies going into libcoreutils.a and some LoC of .h files (coreutils has 60kLoC of .h files)

The methodology for counting lines likely isn't consistent across those data points. But the trend is still unmistakeable. Maybe I'll tree-shake all the dead code out and come up with an accurate line count one of these days..


I just performed an ad hoc file-level tree-shaking for 'src/cat.c' in GNU coreutils 8.30, starting with `gcc src/cat.c` and gradually adding arguments until I got it to build. Here's the command I ended up with.

    gcc -I. -I./lib \
      src/version.c \
      lib/progname.c \
      lib/safe-read.c \
      lib/safe-write.c \
      lib/quotearg.c \
      lib/xmalloc.c \
      lib/localcharset.c \
      lib/c-strcasecmp.c \
      lib/mbrtowc.c \
      lib/xalloc-die.c \
      lib/c-ctype.c \
      lib/hard-locale.c \
      lib/exitfail.c \
      lib/closeout.c \
      lib/close-stream.c \
      lib/fclose.c \
      lib/fflush.c \
      lib/fseeko.c \
      lib/version-etc.c \
      lib/xbinary-io.c \
      lib/version-etc-fsf.c \
      lib/binary-io.c \
      lib/fadvise.c \
      lib/full-write.c \
      src/cat.c
Those .c files add up to 5021 lines.

The .c files include 44 header files:

    lib/binary-io.h
    lib/c-ctype.h
    lib/closeout.h
    lib/close-stream.h
    lib/config.h
    lib/c-strcaseeq.h
    lib/c-strcase.h
    lib/ctype.h
    lib/error.h
    lib/exitfail.h
    lib/fadvise.h
    lib/fcntl.h
    lib/fpending.h
    lib/freading.h
    lib/full-write.h
    lib/gettext.h
    lib/hard-locale.h
    lib/ignore-value.h
    lib/limits.h
    lib/localcharset.h
    lib/locale.h
    lib/minmax.h
    lib/progname.h
    lib/quotearg.h
    lib/quote.h
    lib/safe-read.h
    lib/stdio.h
    lib/stdio-impl.h
    lib/stdlib.h
    lib/string.h
    lib/sys/ioctl.h
    lib/sys-limits.h
    lib/sys/types.h
    lib/unistd.h
    lib/unused-parameter.h
    lib/verify.h
    lib/version-etc.h
    lib/wchar.h
    lib/wctype.h
    lib/xalloc.h
    lib/xbinary-io.h
    src/die.h
    src/ioblksize.h
    src/system.h
The header files add up to 19.7k lines.

So the total line count for files GNU cat actually needs to build is at least ~25k.

(I didn't bother checking for headers including other headers.)

Next step: do this for various versions of GNU coreutils.
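
A rough Python sketch of how the counting could be automated, including the headers-including-headers step skipped above. It is only an approximation: it ignores `#if` conditionals and `#include_next`, silently skips any include it cannot find under the given directories (so system headers are not counted), and the function names are my own.

```python
import os
import re

# Matches both #include "x.h" and #include <x.h>, with optional spaces after '#'.
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]', re.MULTILINE)


def resolve(name, including_file, search_dirs):
    """Return the first existing path for an included name, or None."""
    candidates = [os.path.join(os.path.dirname(including_file), name)]
    candidates += [os.path.join(d, name) for d in search_dirs]
    for candidate in candidates:
        if os.path.isfile(candidate):
            return os.path.normpath(candidate)
    return None


def include_closure(sources, search_dirs):
    """Return the set of in-tree files reachable from sources via #include."""
    seen = set()
    stack = [os.path.normpath(s) for s in sources]
    while stack:
        path = stack.pop()
        if path in seen:
            continue
        seen.add(path)
        with open(path, encoding="utf-8", errors="replace") as fh:
            text = fh.read()
        for name in INCLUDE_RE.findall(text):
            found = resolve(name, path, search_dirs)
            if found is not None and found not in seen:
                stack.append(found)
    return seen


def total_lines(paths):
    """Sum the physical line counts of the given files."""
    total = 0
    for path in paths:
        with open(path, "rb") as fh:
            total += sum(1 for _ in fh)
    return total
```

Run from the top of the coreutils tree, something like `total_lines(include_closure(c_files, [".", "lib"]))` would give one number per release, with `c_files` being the list from the gcc command above.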


Much more code for much less functionality than the BSD cat which can do sockets. Not surprised at all.


Thanks for taking the time to do this counting, very interesting result.


This is why I think the movement of the future will be about going back and stripping cruft out of old codebases. We've seen the weaknesses of the bazaar/many-eyes model, and the main one IMHO is code complexity, which is often easiest to measure in LoC.


Strangely, it seems that many versions of macOS on opensource.apple.com are missing grep. It used to be its own project until 10.7 Lion, after which it disappeared and then reappeared under text_cmds in 10.12 Sierra.


Apparently, the 10.7→10.8 update is when macOS switched from GNU grep to FreeBSD grep.


> My aunt and cousin thought of computer technology as a series of increasingly elaborate sand castles supplanting one another after each high tide clears the beach.

They are basically right though.

The counterexample of some Unix utilities means nothing. You're not getting a CS degree in order to develop the next version of cat, are you?

We have some things with a long history and they are easy to identify. It is just hindsight being 20/20.

For every one of those things, there are countless that can't be seen or felt. They aren't here; they got washed away.

Who uses the Michigan Terminal System?

Or a web framework from ten years ago?


> They are basically right though.

They are only right in the same way that a physics major is obsoleted by advances in physics: the LHC, the discovery of dark matter and dark energy, the accelerating expansion of the universe, etc.

A CS major isn't about learning the latest Angular framework derivative. A CS major is about learning fundamental aspects of computer science.


I am not sure that computer technology would have become as powerful, inexpensive, and ubiquitous as it has today were his aunt and his cousin correct.

The aunt and the cousin are thinking that 'computer technology' exists at the level of abstraction of the sandcastles in the metaphor. To some extent it does, but the vastly greater part of it is at the level of abstraction of the knowledge and theory of building sand castles, as gained over the course of many iterations.

One of the most common themes one hears, when reading what people write about computer science, is how few new ideas in computer science are actually involved in nearly anything anyone does on a computer (or teaches at the undergraduate level).


The people implementing those ideas often believe they are new, though.


When I first came here, this was all swamp. Everyone said I was daft to build an operating system on a swamp, but I built it all the same, just to show them. It sank into the swamp. So I built a second one. And that one sank into the swamp. So I built a third. That burned down, fell over, and then sank into the swamp. But the fourth one stayed up. And that’s what you’re going to get, Son, the strongest OS in all of England.


> Or a web framework from ten years ago?

Well, I still write ASP.NET Web Forms on a regular basis. Ten years is not that old, or is it? Though it is harder and harder to find developers for it, the young people simply don't start with Web Forms.


I think we have to distinguish computer science from engineering here. Computer science is a branch of mathematics, where theories are developed and results are obtained that in principle remain valid for ever. Think of theories of computation and complexity theory, but also logic, probability and so on.

Indeed, the observation that some Unix utilities have their roots in the seventies misses the point in this regard. I'd say this is a testament to the success of the unix approach or whatever you want to call it. It's not really about computer science.


If you like to have insights into how some UNIXes got built, these books are quite interesting.

"The Design and Implementation of the 4.4 BSD Operating System"

"The Design and Implementation of the FreeBSD Operating System"

"Mac OS X Internals: A Systems Approach"

"Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture"

"HP-UX 11i Internals"

"IA-64 Linux Kernel: Design and Implementation"


How could you omit Bach and Comer? (-:


I never read it, thus cannot express my opinion about its contents.

Xenix manuals and later Stevens' books were my introduction to the UNIX world.


Not it, them.

* http://jdebp.eu./FGA/operating-system-books.html

One of these days I shall get around to expressing my opinions, which as you can see are still missing. Indeed, the list itself is a decade out of date. (-:

I have some SCO UNIX manuals on the other side of the room as I type this.


Very nice list.


I just found a set of Solaris 4 manuals still in the wrapper yesterday.


I think that code bloat, especially in GNU, is a huge problem in our software because it makes programs difficult to maintain, to understand and modify. I feel like most people I interacted with online (present company excepted) don't care about it and don't see it as a problem. I can get that it doesn't affect them because they only use these projects as black boxes and don't maintain them, so it isn't relevant to their work.

I created a wiki page to measure the number of lines of code of various types of software https://softwarecrisis.miraheze.org/wiki/Linecount - LOC is a very very rough proxy for what I actually want to measure, but the results are so stunning that even an inaccurate indirect measurement tells a lot. You can see that for 2 projects that do essentially the same thing there might be a 1000x difference in LOC.

It's fascinating what can happen to such a simple program like 'cat'. The same effect is amplified further when you look at projects like gcc. I tried to ask the question on a couple sites like stackexchange and reddit why does gcc take half an hour to build instead of a fraction of a second but this question was not taken well. I got a lot of resistance to it, X-Y answers, deleted etc. I don't think that the common software engineer wants to take the idea seriously that the day to day tools we use have a million fold inefficiency built into them by accident. I also noticed that 'make' has no profiler, nobody has even really done a breakdown of what takes how long to build in the gcc tree.

There are a lot of brilliant engineers who understand this problem and want to solve it though. We see that in Alan Kay's STEPS project, aligrudi's work, musl, toybox, maybe sbase and many of the independent bootstrapping projects that have popped up. There's a lot of inertia and weight to the standard GNU toolkit to push back against but I believe these problems are all solvable and by solving them we can create programming languages and tools with leverage far beyond what currently exists. I just hope such projects can be integrated rather than be forgotten.


I worked with some version of Unix in 1984 that had a program called dog. It would silently wait for a <CR> to be pressed after each screen of output. I've never seen it anywhere else.



That looks like a different program named dog by coincidence. My dog had no bells and whistles. Do you happen to know where the source for this is?


> [...] but it seems that many people still get most excited about the six months of work he put into rewriting cat [...]

Is it me or does 6 months seem like an awfully long time for re-writing such a small and simple program?


I would guess that it wasn’t his sole project for those 6 months, but rather something he kept incrementally improving until there was nothing left to improve.


It was a different era, one in which computers were a lot slower, source control was a lot more primitive, a lot of basic stuff was still being invented, but … yeah, I feel a lot better about my own productivity now!


While I agree about productivity now (although rewriting a source for cat that is used decades later seems very productive), I think the above commenter has it correct that it was probably a side project that he worked on and released after 6 months and not so much the speed of the CPUs.


The cat utility is among the simplest, but once upon a time true was about the simplest possible Unix utility.

    #!/bin/sh

Yes, that's really it. Fire up the shell, get it to exit with 0, which is taken as success. That's all that's really necessary for its spec.
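
If you want to check that claim on your own machine, here is a small Python sketch. It assumes a POSIX shell at /bin/sh, and `run_empty_script` is just an illustrative name. It hands the file to the shell directly rather than relying on the execute bit.

```python
import os
import subprocess
import tempfile


def run_empty_script(shell="/bin/sh"):
    """Write a script containing only a shebang line, run it, return its exit status."""
    fd, path = tempfile.mkstemp(suffix=".sh")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write("#!/bin/sh\n")
        # With no commands to run, the shell reaches end of file and exits.
        return subprocess.run([shell, path]).returncode
    finally:
        os.unlink(path)
```

Calling `run_empty_script()` should return 0, the same status that true reports.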

GNU's is around 29 KiB compiled, and it uses some of that to support --version and --help flags. MacOS's is around 17 KiB compiled and ignores flags.


it used to be even simpler, a blank file


Cat is awesome :) There is also 'tac' (reverse of cat) installed on most systems


I came across something called bat recently. It's a rust clone of cat with a lot of nice features integrated. This seems to be a thing lately in the Rust community to put out vastly improved versions of tools we haven't really touched in ages. Loving it.


I'm a fan of exa as an ls-replacement :)

exa -l --git will list N/M git status flags in the output and:

exa --git-ignore will obey .gitignore when you're listing files :)

Works like a charm in my experience.


That capital C in the title weirds me out.


I always wondered where the name cat came from which the article doesn’t address. Any ideas?


It's short for conCATenate.

original man page: http://man.cat-v.org/unix-1st/1/cat


And that was because its function was/is to concatenate files:

cat f1 f2 f3 >f4
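
The core of that is tiny. A hedged Python sketch (not any real cat's source, and with none of the flags) that copies each named file to an output stream in order:

```python
import shutil


def cat(paths, out):
    """Copy each file in paths to the binary stream out, in the order given."""
    for path in paths:
        with open(path, "rb") as src:
            shutil.copyfileobj(src, out)
```

The shell line above would then correspond to `with open("f4", "wb") as out: cat(["f1", "f2", "f3"], out)`.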


Ah now I understand!


Actually catenate, which is a real word but less used. But you are still right.


In the earliest references, it was "concatenate". It wasn't until 7th edition UNIX (1979) that "catenate" was given.

References:

- 1971 draft (pre 1st edition) of the paper that would become the well-known 1974 CACM UNIX paper (earliest documentation on `cat` that I can find): https://www.tuhs.org/Archive/Distributions/Research/McIlroy_... (tune in on page 28)

- 6th edition cat(1) man page (1975): http://man.cat-v.org/unix-6th/1/cat

- 7th edition cat(1) man page (1979): http://man.cat-v.org/unix_7th/1/cat


Latin root word "catena", meaning "chain".


Latin root "con-" ("com-") meaning "with," or "together." As in, "concatenate" means something like, "chain together."

https://www.etymonline.com/word/com-

https://www.etymonline.com/word/concatenate


I only read the beginning and end, and I very much like the closing message here.

A tldr of the middle would be cool. Maybe there was a pattern.

I'd like to add another OS not mentioned that will hopefully become a well-appreciated artifact soon too, from Redox OS: https://gitlab.redox-os.org/redox-os/coreutils/blob/master/s...

I can't find it quickly now, but jackpot51 also has a very answer somewhere on Reddit about how their networking stack's DNS query command departs from a commonly deployed C program for Windows and Unix, iirc. fascinating



