Unix V5, OpenBSD, Plan 9, FreeBSD, and GNU implementations of echo.c (gist.github.com)
168 points by dchest on July 19, 2011 | 86 comments



Slightly related, but I also find it funny that most of the time we're running none of these...

  $ type echo
  echo is a shell builtin


Okay, so... from UNIX v5, OpenBSD added a -n flag that prevents a trailing newline, Plan 9 adds the -n flag and pushes the argv into a buffer (why?) before printing, FreeBSD does all that and also prevents a trailing newline if the last argument ends with "\c" (why?), and GNU does... something complicated.


"adds the argv into a buffer (why?)"

It's calling "write", which is a system call. Without the buffer, it would call write once for argv, and once for the newline if nflag is not set. Calling a system call twice would result in twice as many context switches, and thus be very slightly slower.


Oh, that makes sense.

Speaking of clever tricks, would it be faster or slower to use c99's variable-length arrays instead of malloc?


IIRC faster, because variable-length arrays in c99 just bump down the stack pointer, while malloc can be pretty expensive.

But a single malloc is nothing compared to the cost of a context switch into kernel mode.
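
For illustration, the VLA version might look like this (a hypothetical sketch; the catch is that too large a len simply overflows the stack with no error to handle):

    #include <stdio.h>

    /* A C99 VLA: allocation is just a stack-pointer bump, and the
       buffer vanishes when the frame is popped -- no malloc, no free. */
    static void demo(size_t len)
    {
        char buf[len];
        snprintf(buf, sizeof buf, "buffer of %zu bytes", len);
        puts(buf);
    }

    int main(void)
    {
        demo(64);
        return 0;
    }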


You can simply use alloca() as a replacement for malloc() to avoid depending on VLA support.


alloca sucks. The error handling is "haha, you overflowed your stack." (VLA sucks for the same reason.)


Or just declare a large static buffer.


"Nobody will ever echo this much data!"


It's actually not a bad suggestion:

    #define BUF_INITIAL 1024
    char buf[BUF_INITIAL];
    
    int main(int argc, char** argv)
    {
        char* p;
        ...
        p = len > BUF_INITIAL ? malloc(len) : buf;
        ...
        write(1, p, len);
        ...
    }


djb's allocator alloc() does this. He preallocates a 4 KB static buffer before hitting system malloc(). Avoiding the overhead of malloc() is pretty important for the performance of systems like qmail that fork many small processes.
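
Simplified, that pattern looks something like this (a sketch of the idea, not djb's actual alloc.c, which also handles alignment and freeing):

    #include <stdlib.h>

    /* Carve small allocations out of a static arena; fall back to
       malloc() only when the arena is exhausted. */
    #define ARENA_SIZE 4096

    static char arena[ARENA_SIZE];
    static size_t arena_used;

    void *alloc(size_t n)
    {
        if (arena_used + n <= ARENA_SIZE) {   /* fast path: no malloc */
            void *p = arena + arena_used;
            arena_used += n;
            return p;
        }
        return malloc(n);                     /* slow path */
    }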


That's ridiculous. The overhead of fork is about a bajillion times higher than malloc.


With a COW fork() I bet it's smaller. I smell a test coming on, but alas it's late here and I'm going to bed.

I'm also guessing that 4k was chosen because a malloc() of 1 page is faster than a malloc() of >1 page. Of course that's with the assumption that the systems use a 4k page size.


Are you sure? I'd bet it's greater, but I wouldn't bet that it's an order of magnitude greater; fork() has been optimized a lot more than brk().


Many small processes? How many?

4 KB * (many) could be frightening.


If your system has 32,000 processes and you allocate 4 KB to each of them... that's 128 MB. I'm not crapping my pants at that figure because even the oldest machine in my office, an old Thinkpad, has 2 GB of RAM.


Oh wait, they will.


If speed is important, why copy the data at all? Only 'slightly' non-portable:

  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main (int argc, const char * argv[])
  {
    for( int i = 1; i < argc; ++i)
    {
        /* the argv strings happen to be contiguous: overwrite the NUL
           terminating the previous argument with a separating space */
        ((char *)argv[i])[-1] = ' ';
    }
    size_t len = strlen(argv[argc - 1]);
    ((char *)argv[argc - 1])[len] = '\n';  /* NUL -> trailing newline */
    write(1, argv[1], argv[argc - 1] - argv[1] + len + 1);
    return EXIT_SUCCESS;
  }


It has to do with the way Plan 9 guarantees atomic writes of small buffers. A single write() call will correspond to a single read() call if the buffer is big enough to receive it.
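
POSIX pipes give a similar guarantee for writes of up to PIPE_BUF bytes; a rough illustration (Unix-like systems only):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* A message sent with one write() to a pipe comes back from a
       single read(), as long as it fits in the pipe's buffer. */
    int main(void)
    {
        int fd[2];
        char buf[128];
        const char *msg = "whole message in one buffer\n";

        if (pipe(fd) == -1)
            return 1;
        write(fd[1], msg, strlen(msg));           /* one write... */
        ssize_t n = read(fd[0], buf, sizeof buf); /* ...one read gets it all */
        printf("read %zd of %zu bytes\n", n, strlen(msg));
        return 0;
    }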


Isn't that what stdio is for (fopen()/fread()/fwrite()/etc)? It performs buffering for you.


Stdio would add unnecessary overhead. And why bother with stdio (or its Plan 9 equivalent) when it's so easy to do the buffering faster yourself?


Overhead such as...? The only one I can think of is the mutex lock/unlock, but there are the _unlocked variants for those.

"And why bother with stdio (or its Plan 9 equivalent) when it's so easy to do the buffering faster yourself?"

For the same reason you'd use any abstraction: to keep your program simpler. Even if you consider it "so easy" there are plenty of opportunities there for off-by-one errors and buffer overflows that would disappear if you used stdio instead.
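
For comparison, an stdio-based echo might look something like this (a hedged sketch, not any system's actual source; no -n handling):

    #include <stdio.h>

    /* stdio buffers internally, so the kernel still sees few writes,
       and the by-hand size bookkeeping disappears entirely. */
    int main(int argc, char *argv[])
    {
        for (int i = 1; i < argc; i++) {
            fputs(argv[i], stdout);
            if (i + 1 < argc)
                putchar(' ');
        }
        putchar('\n');
        return 0;
    }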


Remember that echo.c evolved long before UNIX had shared libraries. So avoiding stdio completely might mean the size of your binary drops from 50K to 5K. If you're on a PDP-11 with 256KB of RAM (shared by several users), this would make a big difference.

My earliest UNIX programming was in the mid-80s on a machine with a (luxurious!) 1.5MB of RAM. I definitely remember avoiding stdio when writing small utilities that I wanted to start up and run fast.

Also remember that back in those days echo was not a shell builtin. (Hell, back then the testing operator '[' wasn't even a builtin. Some UNIXes like OS X still have a vestigial /bin/[ executable!) Programs like echo that ran constantly from shell scripts had to be coded to start up as fast as possible.


This was actually the reason for FreeBSD's more complicated version. Revision 106835:

    Put echo on a diet, removing unnecessary use of stdio and getopt.
    
    Before...
    -r-xr-xr-x  1 root  wheel  58636 Oct 28 05:16 /bin/echo
    After...
    -rwxr-xr-x  1 root  wheel  12824 Nov 12 17:39 /usr/obj/usr/src/bin/echo/echo
http://svnweb.freebsd.org/base?view=revision&revision=10...


> Some UNIXes like OS X still have a vestigial /bin/[ executable!

This includes Linux, but it's in /usr/bin/[


Please point out the plentiful opportunities for off-by-one errors in Plan 9's echo.c. If you send me a copy of Plan 9's echo.c using stdio I will run it and benchmark it for you.


Actually, most of what the GNU version is doing is recognising escape sequences (including characters given in hex form).


It seems the "-n" flag appeared in UNIX v7: http://www.bsdlover.cn/study/UnixTree/V7/usr/src/cmd/echo.c....


Plan 9's echo is used to talk to drivers from shell scripts, some of which expect to receive messages in a single buffer. That's why they went out of their way to make echo use a single write.



V6's assembly version also does less. The reason GNU's is so complex is that it has a line-numbering feature (cat -n) not supported in V6 or V7, and it also tries to read and write in large chunks to avoid the overhead of calling stdio functions in a loop. It also tries to take advantage of non-portable extensions where possible, falling back to portable code where they aren't supported. Yeah, it looks a bit complex at first, but it's not really that bad if you actually take the time to read it.
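
The chunked-I/O part is the classic pattern (a sketch of the idea, not GNU cat's actual code):

    #include <unistd.h>

    /* Copy stdin to stdout in big read()/write() chunks rather than
       through a per-character stdio loop. */
    int main(void)
    {
        char buf[65536];
        ssize_t n;
        while ((n = read(0, buf, sizeof buf)) > 0) {
            ssize_t off = 0;
            while (off < n) {                 /* handle short writes */
                ssize_t w = write(1, buf + off, n - off);
                if (w < 0)
                    return 1;
                off += w;
            }
        }
        return 0;
    }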


Those features simply do not belong in a program whose purpose is to concatenate its input. If you want to number a file's lines, 'echo ,n | ed file | sed 1d' or awk '{ print NR " " $0 }' will do just fine. You could even wrap your ed or awk script in a shell script with a descriptive name like "lineno" rather than something silly like "cat -n". The reason GNU's is so complex is that it does many things and does them poorly. The V6 implementation does exactly what it says on the tin, does it well, and does nothing more: it catenates files.


I am confused by your definition of "poorly". Are you asserting that GNU cat is slow, or unportable, or uses too much memory, or some other actual noticeable problem?



But now you've spawned many processes when one could have been used, and you'll incur the wrath of people who think that 'grep foo file' is 1000x more efficient than 'cat file | grep foo'.


I liked your use of ed. You can avoid one pipe with the -s option: `echo ,n | ed -s file`. Also, another POSIX one-process option besides awk: `pr -tn file` (with different padding).


GNU's Hello World (version 2.7) example is 586 KB gzipped.

https://www.gnu.org/software/hello/


"The GNU Hello program produces a familiar, friendly greeting. Yes, this is another implementation of the classic program that prints “Hello, world!” when you run it.

However, unlike the minimal version often seen, GNU Hello processes its argument list to modify its behavior, supports greetings in many languages, and so on. The primary purpose of GNU Hello is to demonstrate how to write other programs that do these things; it serves as a model for GNU coding standards and GNU maintainer practices."


That must have been one of Ken Thompson’s more productive days.

(alluding to a quote from him that I can’t source, “One of my most productive days was throwing away 1000 lines of code.”)


Nah, the codebase just hadn't been touched by the FSF yet.

See also, "UNIX Style, or cat -v Considered Harmful" (http://harmful.cat-v.org/cat-v/).

It seems telling that the GNU echo's source is "derived from code echo.c in Bash."


I don't get why people are being snide about code that does more and so has more lines. A small amount of code is very nice and elegant, but if it doesn't do what people need then it's pointless.


I think their point is that you should separate it into different utilities/binaries, each of which would be very simple and have fewer bugs, and let users combine them as they wish.

For example, instead of cat -v, you'd have a second utility called 'nonprint' which would just translate non-printing chars, and you'd call it using

     cat file1 file2 | nonprint


"nonprint" existed as "vis" on some systems AFAIR.


The Unix Programming Environment has a discussion of vis on page 172. It notes that you can use 'sed -n l' to do the same thing.


yes


You are making the system more complicated for everyone because of features that only a few users know about. This is how code bloat starts its life cycle.

If you need more features from a basic utility like echo or cat you should create your own version, maybe with a slightly different name, and leave the original as it is.


That's exactly what happened.

Those crazy kids at Berkeley cooked up BSD, which was written to meet their needs and subsequently forked into a few variants. The GNU people made a GNU collection of core utilities that met their particular needs and desires.

The Unix nerds at my University felt as you did, and ran a UNIX System V variant into the late 90's.


Anyone can find those features in the man page. I guarantee that the number of people who use those features is much larger than the number of people who read the source before today.

The UNIX style promoted in the "cat -v Considered Harmful" paper may have made sense at one time, but it doesn't make sense anymore. For example:

It seems that UNIX has become the victim of cancerous growth at the hands of organizations such as UCB. 4.2BSD is an order of magnitude larger than Version 5, but, Pike claims, not ten times better.

This logic gives the same consideration to people who are digging in the source code for these utilities as to people who actually use them. When you consider the relative numbers, that's a very elitist attitude (for some value of "elite").

Also consider the explanation given in another comment for why "cat -n" is unnecessary:

If you want to number a file's lines, 'echo ,n | ed file | sed 1d' or awk '{ print NR " " $0 }' will do just fine.

Munging text like that is a pretty common skill for Unix users, but by no means universal. If the man page for cat is pretty simple and readable, and the feature doesn't bloat the code to the point of causing maintenance problems, and there's somebody willing to write the code, then enabling "cat -n" is a win for users.


Not a very good strategy for avoiding bloat... and that's without mentioning that you'd end up with 100 times the number of tools you have now. I would not call such a system simpler; indeed, it would be inferior in every possible respect.

In a real system, even basic utilities rarely look like a CS101 homework result. This is perfectly fine; in this case especially, the amount of features in GNU echo is perfectly reasonable, and the size of the executable likely depends more on various headers than on code size when the code is this small anyway.


You are making the system more complicated for everyone because of a program that only a few users know about. That is how code bloat starts its life cycle.

If you need a basic utility like echo or cat, you should create your own version and don't bother others with it.


/bin/echo on debian sid (x86):

  13 .text  000028dc  08048b90  08048b90  00000b90  2**4
             CONTENTS, ALLOC, LOAD, READONLY, CODE

/bin/cat on debian sid (x86):

  13 .text  0000775c  080491b0  080491b0  000011b0  2**4
             CONTENTS, ALLOC, LOAD, READONLY, CODE

That's roughly 10 KB and 30 KB of code, respectively. It may have been a lot back in the day, but it's plenty small enough on today's systems.


It's just good practice, in general, to keep individual programs simple. If you have a look at DMR's description of why the pipe was invented, it suddenly clicks.

All of these tools are intended to be composable, analogous to functions. 'cat -v' is like a function with too many arguments, one that does too much. If you need, for example, to allocate a block of zero'd memory, you don't add new flags to malloc(); you use memset() after allocation, you write a for loop, or you use calloc().
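
A tiny illustration of that point (hypothetical, just to make the analogy concrete):

    #include <stdlib.h>
    #include <string.h>

    /* Composable ways to get zeroed memory -- no new malloc() flag. */
    int main(void)
    {
        size_t n = 1024;

        char *a = malloc(n);       /* allocate, then zero it yourself */
        if (a != NULL)
            memset(a, 0, n);

        char *b = calloc(1, n);    /* or ask for zeroed memory directly */

        free(a);
        free(b);
        return 0;
    }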

Likewise, the basic tools available on Unix can and should be thought of as functions, which take some number of arguments, and the implicit argument of an input channel. They produce as their output an integer as a return value and two output channels, stdout and stderr. Making a function that does too much (and this is as subjective for functions as it is for command-line tools) is known to be bad practice, but for the shell, it is often misunderstood. To misunderstand this is to misunderstand the core principles of the Unix environment.

It has nothing to do with the typical non-programmer user. On, say, Linux or OSX, the user doesn't write functions or talk to the shell very often. They click buttons in a GUI that doesn't in a meaningful sense offer composable programs, and it's an inefficient but simple way to interact with the machine, a way that matches their habits and understanding. cat, echo, sed, and awk aren't for these users; they're for programmers, and the typical user does not know or care whether cat can show non-printing characters, but as a programmer, I certainly care about a clean design for my environment.


The "-n" option was added to UNIX around v6.

The "-n" special case opened the floodgates for many more options. And what if I actually wanted to print "-n"? There's no way to do it.


> And what if I actually wanted to print "-n"? There's no way to do it.

Good point. I first tried "echo \-n" and "echo -- -n". No luck. I get the correct visual effect with "echo - ^Hn" (^H generated with ^V, backspace), but the embedded backspace is still actually part of the output :P "echo - ^Hn | col" strips it. Seems like quite an oversight, really, and is a prime example of how bugs sneak into code via features.

edit: ilikejam solved it w/o leaning on another tool. As he said, "not pretty", but simpler than what I have: http://news.ycombinator.com/item?id=2781034


Use the source... ;-)

  % env POSIXLY_CORRECT=1 echo -n  
  -n
Also note that if your shell is Bash, then your echo is a built-in:

  % type echo
  echo is a shell builtin
That's why 'env' is necessary.


Your Linux is showing ;).

   kamloops$ uname -s
   NetBSD
   kamloops$ echo $0
   sh
   kamloops$ env POSIXLY_CORRECT=1 echo -n
   kamloops$


Sure you can:

  dave@cronus $ ./echo -n "-n
  dave@cronus > "
  -n
  dave@cronus $

Not pretty, though.


Cute, but does not work with bash builtin echo, for which two -n's equals one -n.

   bash-3.2$ echo -n -n foo
   foobash-3.2$ 
As always with echo -n, no matter what you try, it's not portable.


So...

  [dave@mini ~]$ echo -n "-n foo
  > "
  -n foo
  [dave@mini ~]$

Easy!


All respect for persistence, but that's not a solution, because the OP wanted to echo just "-n". I put the foo in to better show what was happening.

For similar reasons,

  echo "" -n
does not work, etc.


' echo -n "-n

" ' works fine with the bash built-in. It echo's "-n".


You're right. I should have said "no sane way to do it". :-)


Or echo -en "-n\n".


The printf utility works far better if you want detailed control of your output: printf "%s\n" -n


`echo -e -\\x6E` works in GNU echo and Bash builtin (and likely others).


You can see the original UNIX sources at http://minnie.tuhs.org/cgi-bin/utree.pl


One should avoid echo anyway due to portability issues. Use printf instead.


More *nix systems have a /bin/echo than a /bin/printf or /usr/bin/printf.


Yes, and they all do something different. Things like "-e" and treatment of "\c" are not uniform.

If a system has printf, and I think all recent Unix-like systems do, then it works at least 98% the same.


Here's Mac OS X's:

http://www.opensource.apple.com/source/shell_cmds/shell_cmds...

It's close to the FreeBSD implementation.


I just wish that in the early UNIX days they had reserved some of the flags to mean one thing only, and required all commands to support them (where it made sense). Like:

-r recursive (i.e., it should always be recursive mode if a command operates on files, and should always exist if it makes sense for that command)

-v verbose

-s sort

-i ignore case

-q quiet (suppress output)

If there were, say, 20 well-chosen standard flags (and they were enforced) it could have given the UNIX tools another level of nice regularity.


I do not have much experience reading C code. Is the use of gotos and labels in the GNU code common?


In C it's generally accepted that forward jumps using goto are okay when the version that avoids goto would be more complicated and confusing. They're often used for error handling and memory management.


Error handling because you want to make a beeline to the handler, and memory management because you need it to be quick and goto is cheap?


In both cases, to avoid duplicating code. If you malloc() something in your function and intend to free() it before returning, it's often considered best practice to write the return block (which includes the free()s) once, and use "goto returnblocklabel" if you need to return early.

So rather than this:

  void test()
  {
    int *x = malloc(1000);
    int *y = malloc(1000);
    for(int i = 0; i < 999; i++) {
      if(badThing) {
        free(x);
        free(y);
        return;
      }
      doStuff();
      if(otherBadThing) {
        free(x);
        free(y);
        return;
      }
    }
    free(x);
    free(y);
    return;
  }
You'd have:

  void test()
  {
    int *x = malloc(1000);
    int *y = malloc(1000);
    for(int i = 0; i < 999; i++) {
      if(badThing)
        goto ret;
      doStuff();
      if(otherBadThing)
        goto ret;
    }
  ret:
    free(x);
    free(y);
    return;
  }


I realize your example is contrived, but a simple 'break' statement (which in essence is a goto...) would work just as well. :) I somewhat vaguely recall a situation in my C class that I wanted to use goto to avoid duplicate code but the professor had previously threatened huge negative points if his scripts detected one. (That whole semester was just as much about conforming your code to his narrow specifics because "that's what happens in the real world." as learning C.)


Can you explain what you mean by use of gotos for memory management?


Here's an example:

    void do_stuff(char *name) {
        char *my_name = strdup(name);
        if (condition_signalling_no_work()) goto done;
        if (do_something_with(my_name)) goto done;
        
        ...
        
        done:
          free(my_name);
          return;
    }
If it helps, think of it as a finally block. The function does some things, and no matter what, it has to run that block at the end.


Ah, so for resource cleanup in general, not memory management in particular.


For error handling it is considered fine.

The way it is used in that code is pretty horrible, IMO, and splitting it into functions and using "return" in place of goto would have been far better.


A somewhat related by-the-way: if you grep for "goto " in the Linux kernel sources, you'll find several hundred occurrences.


touch.c is another good example, especially the one from OpenSolaris. ^_^


Following the general software trend, it just keeps growing bigger and slower while still doing nothing new. But is this used anywhere, really? At least Bash uses its builtin.


Bash's implementation[1] isn't especially fast. It gets its speed by not having to fork, which is expensive on systems with dynamic linking.

[1] http://git.savannah.gnu.org/cgit/bash.git/plain/builtins/ech... http://git.savannah.gnu.org/cgit/bash.git/plain/support/rech... http://git.savannah.gnu.org/cgit/bash.git/plain/support/zech...


  > while still doing nothing new
Really? So the GNU implementation is an exact match for the functionality of the SysV implementation? If so, I have a bridge I'd like to sell you...


wow :O





