
I'm probably nitpicking, but if you're using cat to pipe a single file into the stdin of another program, you most likely don't need the cat in the first place; you can just redirect the file to the process's stdin. Unless, of course, you're actually concatenating multiple files, or maybe a file and stdin together.
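
For example (a minimal sketch; foo.txt stands in for any single file):

    cat foo.txt | grep '^x'    # spawns an extra process just to feed stdin
    grep '^x' <foo.txt         # same result via redirection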

Disclaimer: I do cat-piping myself quite a bit out of habit, so I'm not trying to look down on the author or anything like that! :)




In fact, I don't like people optimizing shell scripts for performance. I mean, shell scripts are slow by design, and if you need something fast, you chose the wrong technology in the first place.

Instead, shell scripts should be optimized for readability and portability, and I think it is much easier to understand something like 'read | change >write' than 'change <read >write'. So I like to write pipelines like this:

  cat foo.txt \
    | grep '^x' \
    | sed 's/a/b/g' \
    | awk '{print $2}' \
    | wc -l >bar.txt
It might not be the most efficient processing method, but I think it is quite readable.

For those who disagree with me: You might find the pure-bash-bible [1] valuable. While I admire their passion for shell scripts, I think they are optimizing toward the wrong end. I would be more of a fan of something along the lines of a 'readable-POSIX-shell-bible' ;-)

[1]: https://github.com/dylanaraps/pure-bash-bible


IMHO, shell scripts are a minefield and if you want something readable and portable, this is also the wrong technology. They are convenient though. They are like the Excel macros of the UNIX world.

Now back to the topic of "cat", which is a great example of why shell scripts are minefields.

Replace "foo.txt" with a user supplied variable, let's call it "$F". It becomes cat $F | blah_blah... I mean cat "$F" | blah_blah, first trap, but everyone knows that.

Now, if F='-n', second trap. What you think is a file will be considered an option and cat will wait for user input, like when no file is given. Ok, so you need to do cat -- "$F" | blah_blah.
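
A quick sketch of that trap and the fix (F is the hypothetical user-supplied variable from above):

    F='-n'
    cat "$F" | blah_blah      # trap: cat parses -n as an option and waits on stdin
    cat -- "$F" | blah_blah   # safe: -- ends option parsing, so -n is treated as a filename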

That should be OK in every case now, but remember that "cat" is just another executable, or maybe a builtin. For some reason, on your system "cat --" may not work, or some asshat may have added "." in your PATH and you may be in a directory with a file named "cat". Or maybe some alias that decides to add color.

There are other things to consider, like your locale, which may mess up your output with commas instead of decimal points, and unicode characters. For that reason, you need to be very careful every time you call a command, and even more so if you pipe the output.
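
A common defensive pattern (a sketch; the file name is hypothetical) is to pin the locale for the commands whose output you parse:

    LC_ALL=C sort -n measurements.txt   # predictable decimal points and byte-order collation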

For that reason, I avoid using "cat" in scripts. It is an extra command call, with associated headaches I can do without.


> Now, if F='-n', second trap

You're not wrong, but I think it's worth pointing out that's a trap that comes up any time you exec another program, whether it's from shell or python. I can't reasonably expect `subprocess.run(["cat", somevar])` to work if `somevar = "-n"`.

(Now, obviously, I'm not going to "cat" from python, but I might "kubectl" or something else that requires care around the arguments)


> Replace "foo.txt" with a user supplied variable, let's call it "$F". It becomes cat $F | blah_blah... I mean cat "$F" | blah_blah, first trap, but everyone knows that.

I think that you forgot to edit the "I mean" to "echo $F" :)


I agree with the sentiment, but one critique applies so generally that it must be noted: if a command accepts a filename as a parameter, you should absolutely pass it as a parameter rather than `cat` it over stdin.

For example, you can write this pipeline as:

    grep '^x' foo.txt \
        | sed 's/a/b/g' \
        | awk '{print $2}' \
        | wc -l > bar.txt
This is by no means scientific, but I've got a LaTeX document open right now. A quick `time` says:

    $ time grep 'what' AoC.tex
    real    0m0.045s
    user    0m0.000s
    sys     0m0.000s

    $ time cat AoC.tex | grep what
    real    0m0.092s
    user    0m0.000s
    sys     0m0.047s
Anecdotally, I've witnessed small pipelines that absolutely make sense totally thrash a system because of inappropriate uses of `cat`. When you `cat` a file, the OS must (1) `fork` and `exec`, (2) copy the file to `cat`'s memory, (3) copy the contents of `cat`'s memory to the pipe, and (4) copy the contents of the pipe to `grep`'s memory. That's a whole lot of copying for large files -- especially when the first real command in the sequence (here, grep) usually performs some major reduction on the input data!
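
A rough way to see the difference yourself (a sketch; big.log is a hypothetical large file):

    time grep pattern big.log >/dev/null          # grep reads the file directly
    time cat big.log | grep pattern >/dev/null    # extra fork/exec, plus copies into and out of the pipe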


In my opinion, it's perfectly fine either way unless you're worried about performance. I personally tend to try to use the more performant option when there's a choice, but a lot of times it just doesn't matter.

That said, I suspect the example would be much faster if you didn't use the pipeline, because a single tool could do it all (I'm leaving in the substitution and column print that are actually unused in the result):

    awk '/^x/{gsub("a","b");print $2; count++}END{print count+0}' foo.txt


That syntax is very different from anything I've seen. I am also a fan of splitting pipelines with line breaks for readability; however, I put the pipe at the end of each line and omit the backslash. In Bash, a line that ends with a pipe always continues on the next line.
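
For example, the earlier pipeline in that style (no backslashes needed, since bash keeps reading after a trailing pipe):

    cat foo.txt |
      grep '^x' |
      sed 's/a/b/g' |
      awk '{print $2}' |
      wc -l >bar.txt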

In any case, it's probably just a matter of personal taste.


That's actually very readable. I now regret not seeing this about 3 months ago--I recently left a project with a large number of shell scripts I had written or maintained for my team. This probably would've made it much easier for the rest of the team to figure out what each command was doing.


If the order is your concern, you can also put the <read at the beginning of the line: `<file grep x` works the same as `cat file | grep x`.


I've been using unix for 25 years and I did not know that.


I dunno, you are bringing 5 cores to bear and there is no global interpreter lock, which is not a bad start.


I like 'collection pipeline' code written in this style regardless of language. If we took away the pipe symbols (or the dots) and just used indentation we'd have something that looked like asm but with flow between steps rather than common global state.

I periodically think it would be a good idea to organize a language around this.


awk can do all of that except sed, and I am not even sure about that exception. No need for wc (NR in awk, if I recall correctly), no need for grep; you have /match/ patterns, with regex too.


> except sed

Doesn't gsub(/a/, "b") do the same thing as s/a/b/g?
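
A quick check (sketch):

    $ echo banana | sed 's/a/b/g'
    bbnbnb
    $ echo banana | awk '{gsub(/a/, "b"); print}'
    bbnbnb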


Yes, I remembered it a few hours ago.


I find something like this:

   grep '^x' < input | sed 's/foo/bar/g' 
to be very readable, as the flow is still visually apparent based on punctuation.


I don't like this style at all. If you're following the pipeline, it starts in the middle with "input", goes to the left for the grep, then to the right (skipping over the middle part) to sed.

     cat input | grep '^x' | sed 's/foo/bar/g'
Is far more readable, in my opinion. In addition, it makes it trivial to change the input from a file to any kind of process.

I'm STRONGLY in favor of using "cat" for input. That "useless use of cat" article is pretty dumb, IMHO.


Note that '<input grep | foo' is also valid.


In this particular example, ‘unnecessary use of cat’ is accompanied by ‘unnecessary use of grep’.

    cat input | grep '^x' | sed 's/foo/bar/g'

    sed '/^x/s/foo/bar/g' <input


That's not the same thing. The sed output will still keep lines not starting with x (just not replacing foo with bar in those), whereas grep will filter those out.


Yeah, Muphry's law at work. Corrected version:

   sed -n '/^x/{s/foo/bar/g;p}' <input
This may be an inadvertent argument for the ‘connect simpler tools’ philosophy.


You can just remove the <


You can if input is a file. It might be a program with no arguments or something else.


In your original command, how can 'input' be a program with no arguments?


Oh, damn. You're exactly right.

OK, to save some of my face, this will work:

    grep 'foo' <(input) | sed 's/baz/bar/g'
... at least in zsh and probably bash.


I don’t like that at all. That creates a subshell and is also less readable than

    input | grep foo | sed ...


That specific example is less readable, but I do like being able to do this:

    diff <(prog1) <(prog2)
and get a sensible result.

And sometimes programs just refuse to read from stdin but do just fine with an unseekable file on the command line. True, you do have this:

    input | recalcitrant_program /dev/stdin
... but it's a bit of a tossup as to which one's more readable at this point. They're both relying on advanced shell functionality.


> That specific example is less readable, but I do like being able to do this:

> diff <(prog1) <(prog2)

> and get a sensible result.

That is called process substitution, and that is exactly the kind of use case it's designed for. So yes, process substitution does make sense there.

> input | recalcitrant_program /dev/stdin

> ... but it's a bit of a tossup as to which one's more readable at this point. They're both relying on advanced shell functionality.

There's no tossup at all. Process substitution is easily more readable than your second example because you're honouring the normal syntax of that particular command's parameters rather than kludging around its lack of STDIN support.

Also, I wouldn't say either example is using advanced shell functionality. Process substitution (your first example) is a pretty easy thing to learn, and your second example is just using regular anonymous pipes (/dev/stdin isn't a shell function; it's a proper pseudo-device like /dev/random and /dev/null), so the only thing the shell is doing is the same pipe described in this thread's article (with UNIX / Linux then doing the clever stuff outside of the shell).
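
You can see what process substitution actually passes to the command (a sketch; the exact fd number varies):

    $ echo <(true)
    /dev/fd/63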


This is a very silly way of writing it though. grep|sed can almost always be replaced with a simple awk: awk '/^x/ { sub("a", "b"); print $2; }' foo.txt. This way, the whole command fits on one line. If it doesn't, put your awk script in a separate file and simply call it with "awk -f myawkscript foo.txt".


I would disagree that their way of writing it is silly.

It is instantly plainly obvious to me what each step of their shell script is doing.

While I can absolutely understand what your shell script does after parsing it, its meaning doesn't leap out at me in the same way.

I would describe the prior shell script as more quickly readable than the one that you've listed.

So, perhaps it's not a question of one being more silly than the other—perhaps the author just has different priorities from you?


I use awk in exactly this way personally, but awk is not as commonly readable as grep and sed. (In fact, that use of grep and sed should be pretty comprehensible to someone who just knows regular expressions from some programming language and very briefly glances at the manpages, whereas it would be difficult to learn what that awk syntax means just from e.g. the GNU awk manpage.) So, just as you could write a Perl one-liner but shouldn't if you want other people to read the code, I'd probably advise against the awk one-liner too.


Not sure why you say grep and sed are more readable than awk! (Not sure what 'commonly readable' means, either.) Or even that that particular line of awk is harder to understand than the grep and sed man pages. The awk manpage even has examples, including print $2. The sed manpages must be the most impenetrable manpages known to 'man' if you don't already understand sed. (People might already know s///g because 99% of the time, that's all sed is used for.)


>sub("a", "b");

That should be gsub, shouldn't it? (sub only replaces the first occurrence)


Yes.


The "useless use of cat" was a repeated gripe on Usenet back in the day: http://porkmail.org/era/unix/award.html


I actually think that cat makes it more obvious what's happening in some cases.

I recently built a set of tools used primarily via pipes (tool-a | tool-b | tool-c), and it looks clearer when I mock one command for testing (cat results | tool-b | tool-c) instead of re-flowing the pipeline just to avoid cat and read the files directly.


People use cat to look at the file first, then hit up arrow, add a pipe, etc.


Yes, this. Quite often I start writing out complex pipelines using head/tail to test with a small dataset and then switch it out for cat when I am done to run it on the full thing. And it's often not worth refactoring these things later unless you are really trying to squeeze performance out of them.


I think it's also a grammatical wart of shell syntax. Things going into a command are usually on the left, but piping in a file goes on the right.


   <file command | command | command
is perfectly fine.


The arrow now points backwards.


Of course, if any of your commands prompt for input, you'll be disappointed; it's not always as easy as it appears on the surface.

Does anyone have a better way to do this kind of thing?


The standard tool is expect [1]. There are also libraries for many programming languages that perform a similar task, such as pexpect [2].

[1] https://core.tcl.tk/expect/index [2] https://pexpect.readthedocs.io/en/stable/
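
A minimal sketch of the expect approach (the program and the prompt text are hypothetical; assumes expect is installed):

    expect -c 'spawn some_tool --install; expect "Continue? (y/n)"; send "y\r"; expect eof'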


The better solution is to change the command so it accepts programmatic arguments / command-line parameters.

i.e.

prefer `apt-get install -y foo` over `yes | apt-get install foo`


I can see how it's redundant. But I use cat-pipes because I once mistyped the redirection and nuked my carefully created input file :)

(Similarly, the first thing I used to do on Windows was set my prompt to [$p] because many years ago I also accidentally nuked a part of Visual Studio when I copied and pasted a command line that was prefixed with "C:\...>". Whoops.)


Not nitpicking. Useless Use of Cat is an old thing: http://catb.org/jargon/html/U/UUOC.html


For interactive use, I would like to point out that even better than this use of cat is less. If you pipe less into something, it forgets its interactive behaviour and works like cat on a single file. So:

  $ less foo | bar
Is similar to:

  $ bar < foo
Except that less is typically more clever than that and might be more like:

  $ zcat foo | bar
Depending on the file type of foo.


I would be remiss if I did not point out that calling said program cat is a misnomer. Instead of 'string together in a series' (the actual dictionary definition, which, coincidentally, is what pipes actually do), it quickly became 'print whatever I type to the screen.'

Of course, the example @arendtio uses is correct, because they obviously care about such things.


Having separate commands for outputting the content of a single file and several files would, however, be an orthogonality violation. YMMV whether having a more descriptive name for the most common use of cat would be worth the drawback.


It would fit in the broader methodology of 'single purpose tools that do their job well' or 'small pieces, loosely joined', but yes, probably too annoying to bother with.


I usually replace cat with pv and get a nice progress bar and ETA :-)
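
For example (a sketch; the file name is hypothetical):

    pv big.dump | gzip >big.dump.gz   # same pipeline shape as with cat, plus a progress bar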



