I'm probably nitpicking, but if you're using cat to pipe a single file into the stdin of another program, you most likely don't need the cat in the first place: you can just redirect the file to the process's stdin. Unless, of course, you're actually concatenating multiple files, or a file and stdin together.
Disclaimer: I do cat-piping myself quite a bit out of habit, so I'm not trying to look down at the author or anything like that! :)
In fact, I don't like people optimizing shell scripts for performance. Shell scripts are slow by design, and if you need something fast, you chose the wrong technology in the first place.
Instead, shell scripts should be optimized for readability and portability, and I think it is much easier to understand something like 'read | change >write' than 'change <read >write'. So I like to write pipelines like this:
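Something like this (an illustrative sketch; the exact commands don't matter, only the shape does):

  cat input.txt \
    | grep '^x' \
    | sed 's/foo/bar/g' \
    | awk '{print $2}' \
    | wc -l

Each step reads top to bottom: take the input, filter it, transform it, pick a column, count.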
It might not be the most efficient processing method, but I think it is quite readable.
For those who disagree with me: you might find the pure-bash-bible [1] valuable. While I admire their passion for shell scripts, I think they are optimizing toward the wrong end. I would be more of a fan of something along the lines of a 'readable-POSIX-shell-bible' ;-)
IMHO, shell scripts are a minefield, and if you want something readable and portable, this is also the wrong technology. They are convenient, though. They are like the Excel macros of the UNIX world.
Now back to the topic of "cat", which is a great example of why shell scripts are minefields.
Replace "foo.txt" with a user supplied variable, let's call it "$F". It becomes cat $F | blah_blah... I mean cat "$F" | blah_blah, first trap, but everyone knows that.
Now, if F='-n', second trap: what you think is a file will be treated as an option, and cat will wait on stdin, just as when no file is given. OK, so you need to do cat -- "$F" | blah_blah.
That should be OK in every case now, but remember that "cat" is just another executable, or maybe a builtin. For some reason, "cat --" may not work on your system, or some asshat may have added "." to your PATH and you happen to be in a directory containing a file named "cat". Or maybe there's an alias that decides to add color.
There are other things to consider, like your locale, which may mess up your output with commas instead of decimal points, or with Unicode characters. For that reason, you need to be very careful every time you call a command, and even more so if you pipe the output.
For that reason, I avoid using "cat" in scripts. It is an extra command call and all the associated headaches I can do without.
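A minimal sketch of what I do instead (blah_blah is a stand-in name): a redirection never parses options, so even a hostile "$F" is just opened as a file:

  F='-n'
  blah_blah < "$F"   # the shell open()s the file literally named "-n"; no option parsing, no extra process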
You're not wrong, but I think it's worth pointing out that's a trap that comes up any time you exec another program, whether it's from shell or python. I can't reasonably expect `subprocess.run(["cat", somevar])` to work if `somevar = "-n"`.
(Now, obviously, I'm not going to "cat" from python, but I might "kubectl" or something else that requires care around the arguments)
> Replace "foo.txt" with a user supplied variable, let's call it "$F". It becomes cat $F | blah_blah... I mean cat "$F" | blah_blah, first trap, but everyone knows that.
I think that you forgot to edit the "I mean" to "echo $F" :)
I agree with the sentiment, but my critique applies so generally that it must be noted: if a command accepts a filename as a parameter, you should absolutely pass it as a parameter rather than `cat` it over stdin.
This is by no means scientific, but I've got a LaTeX document open right now. A quick `time` says:
$ time grep 'what' AoC.tex
real 0m0.045s
user 0m0.000s
sys 0m0.000s
$ time cat AoC.tex | grep what
real 0m0.092s
user 0m0.000s
sys 0m0.047s
Anecdotally, I've witnessed small pipelines that absolutely make sense totally thrash a system because of inappropriate uses of `cat`. When you `cat` a file into a pipeline, the OS must (1) `fork` and `exec`, (2) copy the file into `cat`'s memory, (3) copy the contents of `cat`'s memory into the pipe, and (4) copy the contents of the pipe into `grep`'s memory. That's a whole lot of copying for large files, especially when the first real command in the sequence (grep, here) usually performs a major reduction on the input data!
In my opinion, it's perfectly fine either way unless you're worried about performance. I personally tend to try to use the more performant option when there's a choice, but a lot of times it just doesn't matter.
That said, I suspect the example would be much faster if you didn't use the pipeline, because a single tool could do it all (I'm leaving in the substitution and column print that are actually unused in the result):
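Presumably something like this awk one-liner (a sketch; I'm guessing at the details, with the substitution and column selection kept in even though only the final count matters):

  awk '/^x/ { gsub(/foo/, "bar"); $0 = $2; n++ } END { print n+0 }' input.txt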
That syntax is unlike anything I've seen. I am also a fan of splitting pipelines with line breaks for readability; however, I put the pipe at the end of each line and omit the backslash. In Bash, a line that ends with a pipe always continues on the next line.
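For comparison, the same sort of pipeline in that style:

  grep '^x' input.txt |
    sed 's/foo/bar/g' |
    wc -l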
In any case, it's probably just a matter of personal taste.
That's actually very readable. I now regret not having seen this about 3 months ago: I recently left a project that had a large number of shell scripts I had written or maintained for my team. This probably would've made it much easier for the rest of the team to figure out what a given command was doing.
I like 'collection pipeline' code written in this style regardless of language. If we took away the pipe symbols (or the dots) and just used indentation, we'd have something that looked like asm, but with flow between steps rather than common global state.
I periodically think it would be a good idea to organize a language around this idea.
awk can do all of that except the sed part, and I am not even sure about that one. No need for wc (NR in awk, if I recall correctly), no need for grep: you have the /match/ pattern statement, with regexes too.
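For example, the grep and wc steps collapse into a pattern and an END block (a sketch):

  # roughly: grep '^x' input.txt | wc -l
  awk '/^x/ { n++ } END { print n+0 }' input.txt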
I don't like this style at all. If you're following the pipeline, it starts in the middle with "input", goes to the left for the grep, then to the right (skipping over the middle part) to sed.
cat input | grep '^x' | sed 's/foo/bar/g'
Is far more readable, in my opinion. In addition, it makes it trivial to change the input from a file to any kind of process.
I'm STRONGLY in favor of using "cat" for input. That "useless use of cat" article is pretty dumb, IMHO.
That's not the same thing. The sed output will still keep lines not starting with x (just without replacing foo with bar in them), whereas grep will filter those out.
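If you do want sed alone to behave like the grep|sed pair, it has to filter explicitly, e.g. (a sketch):

  sed -n '/^x/{s/foo/bar/g;p;}' input.txt

Here -n suppresses the default printing, and p prints only the lines that matched ^x.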
> That specific example is less readable, but I do like being able to do this:
> diff <(prog1) <(prog2)
> and get a sensible result.
That is called process substitution, and this is exactly the kind of use case it was designed for. So yes, process substitution does make sense there.
> input | recalcitrant_program /dev/stdin
> ... but it's a bit of a tossup as to which one's more readable at this point. They're both relying on advanced shell functionality.
There's no tossup at all. Process substitution is easily more readable than your second example, because you're honouring the normal syntax of that particular command's parameters rather than kludging around its lack of STDIN support.
I also wouldn't say that either example uses advanced shell functionality. Process substitution (your first example) is a pretty easy thing to learn, and your second example just uses a regular anonymous pipe (/dev/stdin isn't a shell feature; it's a proper pseudo-device like /dev/random and /dev/null). So the only thing the shell is doing there is the same pipe described in this thread's article, with UNIX / Linux doing the clever stuff outside the shell.
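Concretely, the process-substitution spelling of that second example would be something like:

  recalcitrant_program <(input)

The program still receives a filename argument (something like /dev/fd/63); it just happens to be attached to a pipe.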
This is a very silly way of writing it, though. grep|sed can almost always be replaced with a simple awk: awk '/^x/ { gsub(/foo/, "bar"); print $2 }' foo.txt. This way, the whole command fits on one line. If it doesn't, put your awk script in a separate file and simply call it with "awk -f myawkscript foo.txt".
I use awk in exactly this way personally, but, awk is not as commonly readable as grep and sed (in fact, that use of grep and sed should be pretty comprehensible to someone who just knows regular expressions from some programming languages and very briefly glances at the manpages, whereas it would be difficult to learn what that awk syntax means just from e.g. the GNU awk manpage). So, just as you could write a Perl one-liner but you shouldn't if you want other people to read the code, I'd probably advise against the awk one-liner too.
Not sure why you say grep and sed are more readable than awk! (Not sure what 'commonly readable' means.) Or why even that particular line in awk would be harder to understand than the grep and sed manpages. The awk manpage even has examples, including print $2. The sed manpages must be the most impenetrable manpages known to 'man' if you don't already understand sed. (People might already know s///g because 99% of the time, that's all sed is used for.)
I actually think that cat makes it more obvious what's happening in some cases.
I recently built a set of tools used primarily via pipes (tool-a | tool-b | tool-c), and it looks clearer to mock one command for testing (cat results | tool-b | tool-c) than to re-flow the pipeline just to avoid cat and read from files directly.
Yes, this. Quite often I start writing out complex pipelines using head/tail to test with a small dataset and then switch it out for cat when I am done to run it on the full thing. And it's often not worth refactoring these things later unless you are really trying to squeeze performance out of them.
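In other words (step1, step2, and big.log are stand-in names):

  head -n 1000 big.log | step1 | step2   # develop against a small sample
  cat big.log | step1 | step2            # then swap in cat for the full run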
I can see how it's redundant. But I use cat-pipes because I once mistyped the redirection and nuked my carefully created input file :)
(Similarly, the first thing I used to do on Windows was set my prompt to [$p] because many years ago I also accidentally nuked a part of Visual Studio when I copied and pasted a command line that was prefixed with "C:\...>". Whoops.)
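For anyone who hasn't been bitten by the redirection version yet, the failure mode is a single character (blah_blah and input.txt are stand-ins):

  blah_blah < input.txt   # intended
  blah_blah > input.txt   # one key off: the shell truncates input.txt before blah_blah even runs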
For interactive use, I would like to point out that even better than this use of cat is less. If you pipe less into something, it forgets its interactive behaviour and works like cat on a single file. So:
$ less foo | bar
Is similar to:
$ bar < foo
Except that less is typically more clever than that and might be more like:
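My guess at what was meant: less supports an input preprocessor via the LESSOPEN mechanism (lesspipe on many systems), which can transparently decompress or convert files first, so something along the lines of:

  $ less foo.gz | bar   # with a preprocessor configured, roughly: zcat foo.gz | bar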
I would be remiss if I did not point out that calling said program cat is a misnomer. Instead of 'string together in a series' (the actual dictionary definition, which, coincidentally, is what pipes actually do), it quickly became 'print whatever I type to the screen'.
Of course, the example @arendtio uses is correct, because they obviously care about such things.
Having separate commands for outputting the content of a single file and several files would, however, be an orthogonality violation. YMMV whether having a more descriptive name for the most common use of cat would be worth the drawback.
It would fit in the broader methodology of 'single purpose tools that do their job well' or 'small pieces, loosely joined', but yes, probably too annoying to bother with.