Okay, so... starting from UNIX v5: OpenBSD adds a -n flag that suppresses the trailing newline, Plan 9 adds the -n flag and also pushes argv into a buffer (why?) before printing, FreeBSD does all that and additionally suppresses the trailing newline if the last argument ends with "\c" (why?), and GNU does... something complicated.
It's calling "write", which is a system call. Without the buffer, it would call write once for argv, and once for the newline if nflag is not set. Calling a system call twice would result in twice as many context switches, and thus be very slightly slower.
djb's allocator alloc() does this. He preallocates a 4 KB static buffer before hitting system malloc(). Avoiding the overhead of malloc() is pretty important for the performance of systems like qmail that fork many small processes.
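The shape of it is roughly this (a paraphrase from memory, not djb's actual alloc.c; names and sizes are approximate):

#include <stdlib.h>

#define SPACE 4096                  /* preallocated static space */

static char space[SPACE];
static unsigned int avail = SPACE;  /* bytes of static space still free */

/* Serve small allocations out of the static buffer; fall back to
   malloc() only when it runs out. A short-lived process may never
   call malloc() at all. */
char *alloc(unsigned int n)
{
    n = (n + 7) & ~7u;              /* keep results 8-byte aligned (hand-waved) */
    if (n <= avail) {
        avail -= n;
        return space + avail;       /* carve from the top of the buffer */
    }
    return malloc(n);
}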
With a COW fork() I bet it's smaller. I smell a test coming on, but alas it's late here and I'm going to bed.
I'm also guessing that 4k was chosen because a malloc() of 1 page is faster than a malloc() of >1 page. Of course that's with the assumption that the systems use a 4k page size.
If your system has 32,000 processes and you allocate 4 KB to each of them... that's 128 MB. I'm not crapping my pants at that figure because even the oldest machine in my office, an old Thinkpad, has 2 GB of RAM.
It has to do with the way plan9 guarantees atomic writes of small buffers. A single write() call will correspond to a single read() call if the buffer is big enough to receive it.
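POSIX pipes give you a weaker version of the same guarantee: a write() of at most PIPE_BUF bytes is atomic, so a message sent with one write() can come back out in one read(). A quick sketch (ordinary POSIX, not Plan 9):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd[2];
    char msg[] = "reset";               /* e.g. a control message to a driver */
    char buf[64];
    ssize_t n;

    if (pipe(fd) == -1)
        return 1;
    write(fd[1], msg, strlen(msg));     /* one write()... */
    n = read(fd[0], buf, sizeof buf);   /* ...comes out in one read() */
    printf("got %zd bytes: %.*s\n", n, (int)n, buf);
    return 0;
}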
Overhead such as...? The only one I can think of is the mutex lock/unlock, but there are the _unlocked variants for those.
"And why bother with stdio (or its Plan 9 equivalent) when it's so easy to do the buffering faster yourself?"
For the same reason you'd use any abstraction: to keep your program simpler. Even if you consider it "so easy" there are plenty of opportunities there for off-by-one errors and buffer overflows that would disappear if you used stdio instead.
Remember that echo.c evolved long before UNIX had shared libraries. So avoiding stdio completely might mean the size of your binary drops from 50K to 5K. If you're on a PDP-11 with 256KB of RAM (shared by several users), this makes a big difference.
My earliest UNIX programming was in the mid-80s on a machine with a (luxurious!) 1.5MB of RAM. I definitely remember avoiding stdio when writing small utilities that I wanted to start up and run fast.
Also remember that back in those days echo was not a shell builtin. (Hell, back then even the testing operator '[' wasn't a builtin. Some UNIXes, like OS X, still ship a vestigial /bin/[ executable!) Programs like echo, which were run constantly from shell scripts, had to be coded to start up as fast as possible.
This was actually the reason for FreeBSD's more complicated version. Revision 106835:
Put echo on a diet, removing unnecessary use of stdio and getopt.
Before...
-r-xr-xr-x 1 root wheel 58636 Oct 28 05:16 /bin/echo
After...
-rwxr-xr-x 1 root wheel 12824 Nov 12 17:39 /usr/obj/usr/src/bin/echo/echo
Please point out the plentiful opportunities for off-by-one errors in Plan 9's echo.c. If you send me a copy of Plan 9's echo.c using stdio I will run it and benchmark it for you.
Plan 9's echo is used to talk to drivers from shell scripts, some of which expect to receive messages in a single buffer. That's why they went out of their way to make echo use a single write.
V6's assembly version also does less. The reason GNU's is so complex is that it has a line-numbering feature (cat -n) not supported in V6 or V7, and it tries to read and write in large chunks to avoid the overhead of calling stdio functions in a loop. It also tries to take advantage of non-portable extensions where possible, falling back to portable code where they aren't supported. Yeah, it looks a bit complex at first, but it's not really that bad if you actually take the time to read it.
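The large-chunk part is just the classic copy loop; something like this (the buffer size here is arbitrary and mine, not GNU's):

#include <unistd.h>

/* Copy stdin to stdout in large chunks: one read()/write() pair per
   128 KB instead of one stdio call per character or per line. */
int main(void)
{
    static char buf[128 * 1024];
    ssize_t n;

    while ((n = read(0, buf, sizeof buf)) > 0)
        if (write(1, buf, n) != n)
            return 1;               /* short or failed write */
    return n < 0;                   /* nonzero exit if the last read failed */
}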
Those features simply do not belong in a program whose purpose is to concatenate its input. If you want to number a file's lines, `echo ,n | ed file | sed 1d` or `awk '{ print NR " " $0 }'` will do just fine. You could even wrap your ed or awk script in a shell script with a descriptive name like "lineno" rather than something silly like "cat -n". The reason GNU's is so complex is that it does many things and does them poorly. The V6 implementation does exactly what it says on the tin, does it well, and does nothing more: it catenates files.
I am confused by your definition of "poorly". Are you asserting that GNU cat is slow, or unportable, or uses too much memory, or some other actual noticeable problem?
But now you've spawned many processes when one could have been used, and you'll incur the wrath of people who think that 'grep foo file' is 1000x more efficient than 'cat file | grep foo'.
I liked your use of ed. You can avoid one pipe with the -s option: `echo ,n | ed -s file`. Also, another POSIX one-process option besides awk: `pr -tn file` (with different padding).
"The GNU Hello program produces a familiar, friendly greeting. Yes, this is another implementation of the classic program that prints “Hello, world!” when you run it.
However, unlike the minimal version often seen, GNU Hello processes its argument list to modify its behavior, supports greetings in many languages, and so on. The primary purpose of GNU Hello is to demonstrate how to write other programs that do these things; it serves as a model for GNU coding standards and GNU maintainer practices."
i don't get why people are being snide about code that does more and so has more lines. a small amount of code is very nice and elegant, but if it doesn't do what people need then it's pointless.
I think their point is that you should separate it into different utilities/binaries, each of which would be very simple and have fewer bugs, and let users combine them as they wish.
For example, instead of cat -v, you'd have a second utility called, say, 'nonprint', which would just translate non-printing chars, and you'd call it using something like `cat file | nonprint`.
You are making the system more complicated for everyone because of features that only a few users know about. This is how code bloat starts its life cycle.
If you need more features from a basic utility like echo or cat you should create your own version, maybe with a slightly different name, and leave the original as it is.
Those crazy kids at Berkeley cooked up BSD, which was written to meet their needs and subsequently forked into a few variants. The GNU people made a GNU collection of core utilities that met their particular needs and desires.
The Unix nerds at my University felt as you did, and ran a UNIX System V variant into the late 90's.
Anyone can find those features in the man page. I guarantee that the number of people who use those features is much larger than the number of people who read the source before today.
The UNIX style promoted in the "cat -v Considered Harmful" paper may have made sense at one time, but it doesn't make sense anymore. For example:
It seems that UNIX has become the victim of cancerous growth at the hands of organizations such as UCB. 4.2BSD is an order of magnitude larger than Version 5, but, Pike claims, not ten times better.
This logic gives the same consideration to people who are digging in the source code for these utilities as to people who actually use them. When you consider the relative numbers, that's a very elitist attitude (for some value of "elite").
Also consider the explanation given in another comment for why "cat -n" is unnecessary:
If you want to number a file's lines, `echo ,n | ed file | sed 1d` or `awk '{ print NR " " $0 }'` will do just fine.
Munging text like that is a pretty common skill for Unix users, but by no means universal. If the man page for cat is pretty simple and readable, and the feature doesn't bloat the code to the point of causing maintenance problems, and there's somebody who's willing to write the code, then enabling "cat -n" is a win for users.
Not a very good strategy for avoiding bloat... and that's not even mentioning that you'd end up with 100 times the number of tools you have now. I would not call such a system simpler; indeed, it would be inferior on every possible point.
In a real system, even basic utilities rarely look like the result of a CS101 homework assignment. This is perfectly fine; in this particular case the feature set of GNU echo is perfectly reasonable, and when the code is this small anyway, the size of the executable likely depends more on the various headers than on the code itself.
You are making the system more complicated for everyone because of a program that only a few users know about. That is how code bloat starts its life cycle.
If you need a basic utility like echo or cat, you should create your own version and not bother others with it.
It's just good practice, in general, to keep individual programs simple. If you have a look at DMR's description of why the pipe was invented, it suddenly clicks.
All of these tools are intended to be composable, analogous to functions. 'cat -v' is like a function with too many arguments, one that does too much. If you need, for example, to allocate a block of zero'd memory, you don't add new flags to malloc(); you use memset() after allocation, you write a for loop, or you use calloc().
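To make the analogy concrete, here are the two composable routes to zeroed memory (a trivial sketch):

#include <stdlib.h>
#include <string.h>

void example(void)
{
    /* Compose malloc() with memset()... */
    double *a = malloc(100 * sizeof *a);
    if (a != NULL)
        memset(a, 0, 100 * sizeof *a);

    /* ...or use the purpose-built function. Either way, malloc()
       itself never needed a "zero it for me" flag. */
    double *b = calloc(100, sizeof *b);

    free(a);
    free(b);
}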
Likewise, the basic tools available on Unix can and should be thought of as functions, which take some number of arguments plus the implicit argument of an input channel. They produce as their output an integer return value and two output channels, stdout and stderr. Making a function that does too much (and this is as subjective for functions as it is for command-line tools) is known to be bad practice, but for the shell, it is often misunderstood. To misunderstand this is to misunderstand the core principles of the Unix environment.
It has nothing to do with the typical non-programmer user. On, say, Linux or OS X, the user doesn't write functions or talk to the shell very often. They click buttons in a GUI that doesn't, in any meaningful sense, offer composable programs; it's an inefficient but simple way to interact with the machine, one that matches their habits and understanding. cat, echo, sed, and awk aren't for these users; they're for programmers. The typical user does not know or care whether cat can show non-printing characters, but as a programmer, I certainly care about a clean design for my environment.
> And what if I actually wanted to print "-n"? There's no way to do it.
Good point. I first tried "echo \-n", and "echo -- -n". No luck. I get the correct visual effect with "echo - ^Hn" (^H generated with ^V, backspace), but the embedded backspace is still actually part of the output :P "echo - ^Hn | col" strips it. Seems like quite an oversight, really, and it's a prime example of how bugs sneak into code via features.
I just wish that in the early UNIX days they reserved some of the flags to mean one thing only, and required all commands to have them (where it made sense). Like:
-r recursive (i.e., it should always be recursive mode if a command operates on files, and should always exist if it makes sense for that command)
-v verbose
-s sort
-i ignore case
-q quiet (suppress output)
If there were, say, 20 well-chosen standard flags (and they were enforced) it could have given the UNIX tools another level of nice regularity.
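getopt(3) at least makes the parsing uniform, even if the letter choices never were. A sketch using a few of those hypothetical standard flags:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int rflag = 0, vflag = 0, qflag = 0, c;

    /* -r recursive, -v verbose, -q quiet: the imagined standard set */
    while ((c = getopt(argc, argv, "rvq")) != -1) {
        switch (c) {
        case 'r': rflag = 1; break;
        case 'v': vflag = 1; break;
        case 'q': qflag = 1; break;
        default:
            fprintf(stderr, "usage: %s [-rvq] [file ...]\n", argv[0]);
            return 2;
        }
    }
    /* remaining operands start at argv[optind] */
    printf("recursive=%d verbose=%d quiet=%d\n", rflag, vflag, qflag);
    return 0;
}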
In C it's generally accepted that forward jumps using goto are okay because the version that avoids goto would be complicated and confusing. Often used for error handling and memory management.
In both cases, to avoid duplicating code. If you malloc() something in your function and intend to free() it before returning, it's often considered best practice to write the return block (which includes the free()s) once, and use "goto returnblocklabel" if you need to return early.
So rather than this:
void test()
{
    int *x = malloc(1000);
    int *y = malloc(1000);

    for (int i = 0; i < 999; i++) {
        if (badThing) {
            free(x);
            free(y);
            return;
        }
        doStuff();
        if (otherBadThing) {
            free(x);
            free(y);
            return;
        }
    }
    free(x);
    free(y);
    return;
}
You'd have:
void test()
{
    int *x = malloc(1000);
    int *y = malloc(1000);

    for (int i = 0; i < 999; i++) {
        if (badThing)
            goto ret;
        doStuff();
        if (otherBadThing)
            goto ret;
    }
ret:
    free(x);
    free(y);
    return;
}
I realize your example is contrived, but a simple 'break' statement (which in essence is a goto...) would work just as well. :) I somewhat vaguely recall a situation in my C class where I wanted to use goto to avoid duplicating code, but the professor had previously threatened huge negative points if his scripts detected one. (That whole semester was as much about conforming your code to his narrow specifics, because "that's what happens in the real world," as it was about learning C.)
The way it is used in that code is pretty horrible, IMO, and splitting it into functions and using "return" in place of goto would have been far better.
Following the general software trend, it just keeps growing bigger and slower while still doing nothing new. But is the standalone binary actually used anywhere, really? Bash, at least, uses its builtin.