I'm probably nitpicking, but if you're using cat to pipe a single file into the stdin of another program, you most likely don't need the cat in the first place: you can just redirect the file to the process's stdin. Unless, of course, you're actually concatenating multiple files, or a file and stdin together.
Disclaimer: I do cat-piping myself quite a bit out of habit, so I'm not trying to look down at the author or anything like that! :)
In fact, I don't like people optimizing shell scripts for performance. Shell scripts are slow by design, and if you need something fast, you chose the wrong technology in the first place.
Instead, shell scripts should be optimized for readability and portability, and I think it is much easier to understand something like 'read | change >write' than 'change <read >write'. So I like to write pipelines like this:
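Something like this (an illustrative sketch; the exact commands don't matter, only the shape does):

  cat input.txt \
    | grep '^x' \
    | sed 's/foo/bar/g' \
    | awk '{print $2}' \
    | wc -l

Each step reads top to bottom: take the input, filter it, transform it, pick a column, count.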
It might not be the most efficient processing method, but I think it is quite readable.
For those who disagree with me: you might find the pure-bash-bible [1] valuable. While I admire their passion for shell scripts, I think they are optimizing toward the wrong end. I would be more of a fan of something along the lines of a 'readable-POSIX-shell-bible' ;-)
IMHO, shell scripts are a minefield, and if you want something readable and portable, this is also the wrong technology. They are convenient, though. They are like the Excel macros of the UNIX world.
Now back to the topic of "cat", which is a great example of why shell scripts are minefields.
Replace "foo.txt" with a user supplied variable, let's call it "$F". It becomes cat $F | blah_blah... I mean cat "$F" | blah_blah, first trap, but everyone knows that.
Now, if F='-n', second trap: what you think is a file will be treated as an option, and cat will wait on stdin, just as when no file is given. OK, so you need to do cat -- "$F" | blah_blah.
That should be OK in every case now, but remember that "cat" is just another executable, or maybe a builtin. For some reason, "cat --" may not work on your system, or some asshat may have added "." to your PATH and you happen to be in a directory containing a file named "cat". Or maybe there's an alias that decides to add color.
There are other things to consider, like your locale, which may mess up your output with commas instead of decimal points, or with Unicode characters. For that reason, you need to be very careful every time you call a command, and even more so if you pipe the output.
For that reason, I avoid using "cat" in scripts. It is an extra command call and all the associated headaches I can do without.
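A minimal sketch of what I do instead (blah_blah is a stand-in name): a redirection never parses options, so even a hostile "$F" is just opened as a file:

  F='-n'
  blah_blah < "$F"   # the shell open()s the file literally named "-n"; no option parsing, no extra process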
You're not wrong, but I think it's worth pointing out that's a trap that comes up any time you exec another program, whether it's from shell or python. I can't reasonably expect `subprocess.run(["cat", somevar])` to work if `somevar = "-n"`.
(Now, obviously, I'm not going to "cat" from python, but I might "kubectl" or something else that requires care around the arguments)
> Replace "foo.txt" with a user supplied variable, let's call it "$F". It becomes cat $F | blah_blah... I mean cat "$F" | blah_blah, first trap, but everyone knows that.
I think that you forgot to edit the "I mean" to "echo $F" :)
I agree with the sentiment, but my critique applies so generally that it must be noted: if a command accepts a filename as a parameter, you should absolutely pass it as a parameter rather than `cat` it over stdin.
This is by no means scientific, but I've got a LaTeX document open right now. A quick `time` says:
$ time grep 'what' AoC.tex
real 0m0.045s
user 0m0.000s
sys 0m0.000s
$ time cat AoC.tex | grep what
real 0m0.092s
user 0m0.000s
sys 0m0.047s
Anecdotally, I've witnessed small pipelines that absolutely make sense totally thrash a system because of inappropriate uses of `cat`. When you `cat` a file into a pipeline, the OS must (1) `fork` and `exec`, (2) copy the file into `cat`'s memory, (3) copy the contents of `cat`'s memory into the pipe, and (4) copy the contents of the pipe into `grep`'s memory. That's a whole lot of copying for large files, especially when the first real command in the sequence (grep, here) usually performs a major reduction on the input data!
In my opinion, it's perfectly fine either way unless you're worried about performance. I personally tend to try to use the more performant option when there's a choice, but a lot of times it just doesn't matter.
That said, I suspect the example would be much faster if you didn't use the pipeline, because a single tool could do it all (I'm leaving in the substitution and column print that are actually unused in the result):
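Presumably something like this awk one-liner (a sketch; I'm guessing at the details, with the substitution and column selection kept in even though only the final count matters):

  awk '/^x/ { gsub(/foo/, "bar"); $0 = $2; n++ } END { print n+0 }' input.txt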
That syntax is unlike anything I've seen. I am also a fan of splitting pipelines with line breaks for readability; however, I put the pipe at the end of each line and omit the backslash. In Bash, a line that ends with a pipe always continues on the next line.
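For comparison, the same sort of pipeline in that style:

  grep '^x' input.txt |
    sed 's/foo/bar/g' |
    wc -l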
In any case, it's probably just a matter of personal taste.
That's actually very readable. I now regret not having seen this about 3 months ago: I recently left a project that had a large number of shell scripts I had written or maintained for my team. This probably would've made it much easier for the rest of the team to figure out what a given command was doing.
I like 'collection pipeline' code written in this style regardless of language. If we took away the pipe symbols (or the dots) and just used indentation, we'd have something that looked like asm, but with flow between steps rather than common global state.
I periodically think it would be a good idea to organize a language around this idea.
awk can do all of that except the sed part, and I am not even sure about that one. No need for wc (NR in awk, if I recall correctly), no need for grep: you have the /match/ pattern statement, with regexes too.
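For example, the grep and wc steps collapse into a pattern and an END block (a sketch):

  # roughly: grep '^x' input.txt | wc -l
  awk '/^x/ { n++ } END { print n+0 }' input.txt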
I don't like this style at all. If you're following the pipeline, it starts in the middle with "input", goes to the left for the grep, then to the right (skipping over the middle part) to sed.
cat input | grep '^x' | sed 's/foo/bar/g'
Is far more readable, in my opinion. In addition, it makes it trivial to change the input from a file to any kind of process.
I'm STRONGLY in favor of using "cat" for input. That "useless use of cat" article is pretty dumb, IMHO.
That's not the same thing. The sed output will still keep lines not starting with x (just without replacing foo with bar in them), whereas grep will filter those out.
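If you do want sed alone to behave like the grep|sed pair, it has to filter explicitly, e.g. (a sketch):

  sed -n '/^x/{s/foo/bar/g;p;}' input.txt

Here -n suppresses the default printing, and p prints only the lines that matched ^x.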
> That specific example is less readable, but I do like being able to do this:
> diff <(prog1) <(prog2)
> and get a sensible result.
That is called process substitution, and this is exactly the kind of use case it was designed for. So yes, process substitution does make sense there.
> input | recalcitrant_program /dev/stdin
> ... but it's a bit of a tossup as to which one's more readable at this point. They're both relying on advanced shell functionality.
There's no tossup at all. Process substitution is easily more readable than your second example, because you're honouring the normal syntax of that particular command's parameters rather than kludging around its lack of STDIN support.
I also wouldn't say that either example uses advanced shell functionality. Process substitution (your first example) is a pretty easy thing to learn, and your second example just uses a regular anonymous pipe (/dev/stdin isn't a shell feature; it's a proper pseudo-device like /dev/random and /dev/null). So the only thing the shell is doing there is the same pipe described in this thread's article, with UNIX / Linux doing the clever stuff outside the shell.
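Concretely, the process-substitution spelling of that second example would be something like:

  recalcitrant_program <(input)

The program still receives a filename argument (something like /dev/fd/63); it just happens to be attached to a pipe.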
This is a very silly way of writing it, though. grep|sed can almost always be replaced with a simple awk: awk '/^x/ { gsub(/foo/, "bar"); print $2 }' foo.txt. This way, the whole command fits on one line. If it doesn't, put your awk script in a separate file and simply call it with "awk -f myawkscript foo.txt".
I use awk in exactly this way personally, but, awk is not as commonly readable as grep and sed (in fact, that use of grep and sed should be pretty comprehensible to someone who just knows regular expressions from some programming languages and very briefly glances at the manpages, whereas it would be difficult to learn what that awk syntax means just from e.g. the GNU awk manpage). So, just as you could write a Perl one-liner but you shouldn't if you want other people to read the code, I'd probably advise against the awk one-liner too.
Not sure why you say grep and sed are more readable than awk! (Not sure what 'commonly readable' means.) Or why even that particular line in awk would be harder to understand than the grep and sed manpages. The awk manpage even has examples, including print $2. The sed manpages must be the most impenetrable manpages known to 'man' if you don't already understand sed. (People might already know s///g because 99% of the time, that's all sed is used for.)
I actually think that cat makes it more obvious what's happening in some cases.
I recently built a set of tools used primarily via pipes (tool-a | tool-b | tool-c), and it looks clearer to mock one command for testing (cat results | tool-b | tool-c) than to re-flow the pipeline just to avoid cat and read from files directly.
Yes, this. Quite often I start writing out complex pipelines using head/tail to test with a small dataset and then switch it out for cat when I am done to run it on the full thing. And it's often not worth refactoring these things later unless you are really trying to squeeze performance out of them.
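In other words (step1, step2, and big.log are stand-in names):

  head -n 1000 big.log | step1 | step2   # develop against a small sample
  cat big.log | step1 | step2            # then swap in cat for the full run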
I can see how it's redundant. But I use cat-pipes because I once mistyped the redirection and nuked my carefully created input file :)
(Similarly, the first thing I used to do on Windows was set my prompt to [$p] because many years ago I also accidentally nuked a part of Visual Studio when I copied and pasted a command line that was prefixed with "C:\...>". Whoops.)
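For anyone who hasn't been bitten by the redirection version yet, the failure mode is a single character (blah_blah and input.txt are stand-ins):

  blah_blah < input.txt   # intended
  blah_blah > input.txt   # one key off: the shell truncates input.txt before blah_blah even runs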
For interactive use, I would like to point out that even better than this use of cat is less. If you pipe less into something, it forgets its interactive behaviour and works like cat on a single file. So:
$ less foo | bar
Is similar to:
$ bar < foo
Except that less is typically more clever than that and might be more like:
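My guess at what was meant: less supports an input preprocessor via the LESSOPEN mechanism (lesspipe on many systems), which can transparently decompress or convert files first, so something along the lines of:

  $ less foo.gz | bar   # with a preprocessor configured, roughly: zcat foo.gz | bar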
I would be remiss if I did not point out that calling said program cat is a misnomer. Instead of 'string together in a series' (the actual dictionary definition, which, coincidentally, is what pipes actually do), it quickly became 'print whatever I type to the screen'.
Of course, the example @arendtio uses is correct, because they obviously care about such things.
Having separate commands for outputting the content of a single file and several files would, however, be an orthogonality violation. YMMV whether having a more descriptive name for the most common use of cat would be worth the drawback.
It would fit in the broader methodology of 'single purpose tools that do their job well' or 'small pieces, loosely joined', but yes, probably too annoying to bother with.