I only discovered process substitution a few months ago but it's already become a frequently used tool in my kit.
One thing that I find a little annoying about unix commands sometimes is how hard it can be to google for them. '<()', nope, "command as file argument to other command unix," nope. The first couple of times I tried to use it, I knew it existed but struggled to find any documentation. "Damnit, I know it's something like that, how does it work again?..."
Unless you know to look for "Process Substitution" it can be hard to find information on these things. And that's once you even know these things exist....
Anyone know a good resource I should be using when I find myself in a situation like that?
Be aware that process substitution (and named pipes) can bite you in the arse in some situations --- for example, if the program expects to be able to seek in the file. Pipes don't support this and the program will see it as an I/O error. This'd be fine if programs just errored out cleanly but they frequently don't check that seeking succeeds. unzip treats a named pipe as a corrupt zipfile, for example:
$ unzip <(cat z)
Archive: /dev/fd/63
End-of-central-directory signature not found. Either this file is not
a zipfile, or it constitutes one disk of a multi-part archive. In the
latter case the central directory and zipfile comment will be found on
the last disk(s) of this archive.
unzip: cannot find zipfile directory in one of /dev/fd/63 or
/dev/fd/63.zip, and cannot find /dev/fd/63.ZIP, period.
Aside: If I recall correctly, with the zip file format, the index is at the end of the file. A (named) pipe works fine with, for instance, a bzipped tarball.
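For instance, something like this works fine, because tar only ever reads the archive front to back (foo.tar.bz2 is just a placeholder name):
$ tar -tjf <(cat foo.tar.bz2)   # list the contents; no seeking required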
I wouldn't be surprised if the ZIP file format has its origins outside the Unix world, given its pipe-unfriendliness.
I've seen the ABS guide criticized for being obsolete and recommending wrong or obsolete best practices. A recommended replacement is The Bash Hacker's Wiki: http://wiki.bash-hackers.org/doku.php
For those wondering, man, which uses less as a pager, has vi-like key bindings.
"/<\(" starts a regex-based search for "<(" (you must escape the open paren).
This is the origin of the regex literal syntax in most programming languages that have them. It was first introduced by Ken Thompson in the "ed" text editor.
If you're interested in stock POSIX shell rather than bashisms, the dash man page is a whole lot shorter and easier to follow, and makes a great concise reference.
They have; I have complained loudly about this[1], never heard anything back (this is SOP, I understand), but I have seen improvements over the last year.
Double quotes around part of a query mean "make sure this part is actually matched in the index." (I think they still annoy me by including sites that are linked to using this phrase[2], but that is understandable.)
Then there is the "verbatim" setting that you can activate under search tools > "All results" dropdown.
[1]: And the reason they annoyed me was that they would still fuzz my queries despite me double-quoting and choosing verbatim.
[2]: To verify this you could open the cached version and on top of the page you'd see something along the lines of: "the following terms exist only in links pointing to this page."
Because if you want to ignore punctuation and case in normal situations, you leave them out of the search index. And then you can't query the same search index for punctuation and/or case-sensitive queries.
So they'd have to create a second index for probably less than 0.01% of their queries, and that second index would be larger and harder to compress.
As much as I'd love to see a strict search, from a business perspective I don't think it makes sense to provide one.
I wish they'd supply that too, but they do seem to have gotten better at interpreting literally when it makes sense in context. I've been learning C# and have found, for example, that searches with the term "C#" return the appropriate resources when in the past I'd have probably seen results for C.
Google handles some constructs with punctuation as atomic tokens as special cases. C# and C++ are examples. A# through G# also return appropriate results, for the musical notes. H# and onward through the alphabet do not.
.NET is another example. Google will ignore a prepended dot on most words, but .NET is handled specially as an atomic token. I would bet this is a product of human curation, not of algorithms that have somehow identified .NET as a semantic token.
Searching for punctuation in a general case is hard, though. You wouldn't want a search for Lisp to fail to match pages with (Lisp). We often forget that the pages are tokenized and indexed, that Google and the other search engines aren't a byte-for-byte scan across the entire web.
I was recently trying to understand the difference between the <%# and <%= server tags in ASP.NET. Google couldn't even interpret those as tokens to search for. It took me a long time to figure out the former's true name as the data-bind operator in order to search for that and find the MS docs.
Occasionally it's useful to spell out the names of the characters, both when searching and when writing documentation, blog posts, and SO Q&A. That way, searching for "asp.net less than percent hash" might tell you it's the data-bind operator.
That's actually one of the things that I really dislike about bash: it doesn't read the whole script before executing it. I've been bitten by it before: I write some long-running script, then e.g. add a comment at the top of it while it's running, and when bash looks for the next command it's shifted a bit and I get (at best) a syntax error and have to re-run :-(
There are several ways to get Bash to read the whole thing before executing.
My preferred method is to write a main() function, and call main "$@" at the very end of the script.
Another trick, useful for shorter scripts, is to just wrap the body of the script in {}, which causes the script to be a giant compound command that is parsed before any of it is executed; instead of a list of commands that is executed as read.
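A minimal sketch of the main() pattern (function names are purely illustrative):
#!/usr/bin/env bash

do_the_work() {
    # stand-in for the long-running part of the script
    sleep 300
    echo "finished"
}

main() {
    do_the_work
}

# Up to here bash has only parsed function definitions, each a single
# compound command. The only command left in the file is the line below,
# so editing the script while it runs can no longer shift what comes next.
main "$@"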
Fun fact: That's how goto worked in the earliest versions of Unix, before the Bourne shell was invented: goto was an external command that would seek() in the shell-script until it found the label it was looking for, and when control returned to the shell it would just continue executing from the new location.
To this day, when the shell launches your program, you can find the shell-script it's executing as file-descriptor 255, just in case you want to play any flow-control shenanigans.
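On Linux you can check the fd 255 claim with something like this (the exact number is a bash implementation detail, so treat the output as an observation rather than a guarantee):
$ cat > fd-demo.sh <<'EOF'
#!/bin/bash
# list this script's own open file descriptors; on my machine 255 points
# back at fd-demo.sh itself
ls -l /proc/$$/fd/
EOF
$ bash fd-demo.sh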
Pipes are probably the original instantiation of dataflow processing (dating back to the 1960s). I gave a tech talk on some of the frameworks:
https://www.youtube.com/watch?v=3oaelUXh7sE
Vince Buffalo is the author of the best book on bioinformatics: Bioinformatics Data Skills (O'Reilly). It's worth a read for learning unix/bash-style data science of any flavour.
Even if you think you know unix/bash and data, there are new and unexpected snippets every few pages that surprise you.
In zsh, =(cmd) will create a temporary file, <(cmd) will create a named pipe, and $(cmd) is ordinary command substitution (the command's output is spliced into the command line). There are also fancy redirection tricks that use MULTIOS.
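A sketch of the kind of thing that enables (archive.zip and the file names are made up; the multios option is on by default in zsh):
# =(cmd) hands the program a real temporary file, so tools that need to
# seek (like unzip, per the complaint above) are happy where <(cmd) is not:
$ unzip -l =(cat archive.zip)

# MULTIOS: with the multios option set, one stream can be redirected to
# several targets at once; zsh does the tee-ing for you:
$ echo "hello" > file1 > file2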
If you like pipes, then you will love lazy evaluation. It is unfortunate, though, that Unix doesn't support that (a write only blocks once the pipe buffer is full, not as soon as nobody is reading).
If nobody is reading, you will eventually fill the pipe buffer (64 KiB by default on modern Linux; it was 4 KiB on older systems), and the writing will stop. It's a bigger queue than most of us would expect when compared to generator expressions, but it can and does create back pressure while making reads efficient.
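A crude way to watch that back pressure happen (assumes a GNU userland; the 64 KiB figure is the usual Linux default):
# The writer prints a dot to the terminal for every 4 KiB block it pushes
# into the pipe; the reader sleeps before draining. The dots stall once the
# pipe buffer fills, then finish in a burst when the reader wakes up.
$ { for i in $(seq 1 100); do head -c 4096 /dev/zero; printf . >&2; done; } \
    | { sleep 3; cat > /dev/null; }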
Lazy evaluation with pipes would be problematic because alerts/echoes would be dropped by default, since they aren't part of the stdin/stdout chain whose output actually gets demanded.
For example, this section of code
let a = String::from("some_file_name");
println!("Opening: {}", a);
let path = std::path::Path::new(&a);
let mut fd = std::fs::File::open(path).unwrap();
would get optimized to
let mut fd = std::fs::File::open(std::path::Path::new("some_file_name")).unwrap();
under consistently lazy evaluation. The user feedback is removed, which is a big part of shell scripting.
I guess you found another issue with Unix. The user does not care in general how something is performed, just that it is performed correctly and with good performance.
I guess an OS should be functional at its interface to the user, and only imperative deep down to keep things running efficiently.
However, note that this hypothetical functional layer on top also would ensure efficiency, as it enables lazy evaluation. This type of efficiency could under certain circumstances be even more valuable than the bare-metal performance of system programming languages.
>The user does not care in general how something is performed, just that it is performed correctly and with good performance.
This is the crux of the matter. With bash scripting the user does care how a task is performed, as that task may be system administration, may involve sensitive system components, or may touch sensitive data.
Lazy evaluation is great for binary/CPU-level optimization. But passing system administration tasks through the same process is scary, as you lose the 1:1 mapping you previously had.
Well, in any case, the problem can be resolved by adding a kernel-level api function that allows one to wait (block) until results are requested from the other end of the pipe.
The opposite is already true and has the same effect.
Each stage of the pipeline executes when it has data to operate on. So ultimately the main blocking event is IO (normally the first stage in a pipeline). Every other process is effectively blocked until its stdin is populated by the output of the stage before it. Once its work is done it re-checks stdin, and if nothing is present it blocks itself again.
So the execution of each task is controlled by the process whose data that task needs in order to operate.
In your system why would you want to block the previous step? This would just interfere with the previous+1 step, and you'd have to populate that message further up the chain. This seems needlessly complicated. As you have to add extra IPC.
Why: consider the generation of a stream of random numbers; assume each random number requires a lot of CPU-intensive work; obviously, you don't want to put unnecessary load on the CPU, and hence it is better to not fill any buffer ahead of time (before the random numbers are being requested).
AFAIK process substitution is a bash-ism (not part of POSIX spec for /bin/sh). I recently had to go with the slightly less wieldy named pipes in a dash environment and put the pipe setup, command execution and teardown in a script.
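For the curious, the dash-friendly version looks roughly like this (producer and consumer are stand-ins for whatever you are actually running):
#!/bin/sh
# POSIX-ish stand-in for: consumer <(producer)
fifo=${TMPDIR:-/tmp}/myfifo.$$       # pick a (hopefully) unused name
mkfifo "$fifo" || exit 1
trap 'rm -f "$fifo"' EXIT INT TERM   # teardown, even on failure

producer > "$fifo" &                 # the writer blocks until someone opens the fifo
consumer "$fifo"                     # the reader treats the fifo like a file
wait                                 # reap the background writer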
Named pipes have been rare for me, but simple process substitution is every day.
Very often I do something like this in quick succession. Command line editing makes this trivial.
$ find . -name "*blarg*.cpp"
# Some output that looks like what I'm looking for.
# Run the same find again in a process, and grep for something.
$ grep -i "blooey" $(find . -name "*blarg*.cpp")
# Yep, those are the files I'm looking for, so dig in.
# Note the additional -l in grep, and the nested processes.
$ vim $(grep -il "blooey" $(find . -name "*blarg*.cpp"))
Granted, you can only append new arguments and using the other ! commands will often be less practical than editing. Still, it's amazing how frequently this is sufficient.
I've always thought it'd be nice if there was a `set` option or something similar that would make bash record command lines and cache output automatically in implicit variables, so that it doesn't re-run the commands. The semantics are definitely different and you wouldn't want this enabled at all times, but for certain kinds of sessions it would be very handy.
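In the meantime, the manual version of that caching is simple enough (same hypothetical blarg/blooey names as above; breaks on filenames with spaces, like the originals):
# run the expensive find once, keep the result in a variable, reuse it
$ files=$(find . -name "*blarg*.cpp")
$ grep -i "blooey" $files
$ vim $(grep -il "blooey" $files)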
I used to do something like that, but I personally find the way I described easier to understand, and evolved into it. I also like what I'm ultimately doing (vim/vi in this case) to be over on the left: edit whatever this mess on the right produces.
When you connect two processes in a pipe such as ...
a | b
you connect stdout (fd #1) of a to stdin (fd #0) of b. Technically, the shell process will create a pipe, which is two filedescriptors connected back to back. It then will fork two times (create two copies of itself) where it replaces standard output (filedescriptor 1) of the first copy by one end of the pipe and replaces standard input (filedescriptor 0) of the second copy by the other end of the pipe. Then the first copy will replace itself (exec) by a, the second copy will replace itself (exec) by b. Everything that a writes to stdout will appear on stdin of b.
But nothing prevents the shell from replacing any other filedescriptor by pipes. And when you create a subprocess by writing ">(c)" in your commandline, it's just one additional fork for the shell, and one additional filedescriptor pair to be created. One side, as in the simple case, will replace stdin (fd #0) of "c"... and because the other end of this pipe doesn't correspond to any predefined filedescriptor of "a" (stdout is already taken by "|b"), the shell will somehow have to tell "a" what filedescriptor the pipe uses. Under Linux one can refer to opened filedescriptors as "/dev/fd/<FDNUM>" (a symlink to /proc/self/fd/<FDNUM>, which itself is a symlink to /proc/<PID>/fd/<FDNUM>), so that's what's passed as a "name" to refer to the substituted process on "a"'s command line:
Try this:
$ echo $$
12345 # <--- PID of your shell
$ tee >( sort ) >( sort ) >( sort ) otherfile | sort
and in a second terminal
$ pstree 12345 # <--- PID of your shell
zsh,301
├─sort,3600 # <-- this one reads from the other end of the shell's fd #14
├─sort,3601 # <-- this one reads from the other end of the shell's fd #15
├─sort,3602 # <-- this one reads from the other end of the shell's fd #16
├─sort,3604 # <-- this one reads from stdout of tee
└─tee,3603 /proc/self/fd/14 /proc/self/fd/15 /proc/self/fd/16 otherfile
If your system doesn't support the convenient /proc/self/fd/<NUM> shortcut, the shell might decide not to create a pipe, but rather create temporary fifos in /tmp and use those to connect the filedescriptors.
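If you just want to see the generated name without the pstree tour, this is enough (bash on Linux; the exact fd number will vary):
$ echo <(true)
/dev/fd/63
$ ls -l <(true)   # shows a symlink pointing at an anonymous pipe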
Anybody know of a way to increase the buffer size of pipes? I've experienced cases where piping a really fast program to a slow one caused them both to go slower as the OS pauses first program writing when pipe buffer is full. This seemed to ruin the caching for the first program and caused them both to be slower even though normally pipes are faster as you're not touching disk.
Seems entirely appropriate, given that his blog post, and others like it on his site as well as the book he wrote, are clearly aimed at people interested in learning bioinformatics.
moreutils [1] has some really cool programs for pipe handling.
pee: tee standard input to pipes
sponge: soak up standard input and write to a file
ts: timestamp standard input
vipe: insert a text editor into a pipe
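sponge in particular fixes the classic foot-gun where the shell truncates the output file before the pipeline has read it:
# "sort file > file" empties file before sort ever gets to read it;
# sponge soaks up all of stdin first, then writes:
$ sort file | sponge file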
Pipes are very cool and useful, but it's hard for me to understand this common worship of something like that. Yes, it's useful and elegant, but is it really the best thing since Jesus Christ?
It really is a question that I've been having for a long time. I didn't just say that to piss people off. I guess that's the risk of how you come across when you insert yourself into a conversation where the other participants have already agreed on a set of shared opinions - this is great - and you question that common assumption/opinion.
I have honestly been questioning my own understanding of pipe, since I've failed to see the significance before; first I thought it was just `a | b` as in "first do a, then b". So then it just seemed like a notational way of composing programs. Then I thought, uh, ok say what? Composing things is the oldest trick in the conceptual book. But then I read more about it and saw that it had this underlying wiring of standard input and output and forking processes that made me realize that I had underestimated it. So, given that, I was wondering if there is even more that I've been missing.
I have for that matter read about named pipes before and tried it out a bit. It's definitely a cool concept.
I don't think it's that you have a differing opinion. Those are great and most people would be OK with that. I really believe you got the down votes because what you said came off as condescending.
If you had said what you just said in your follow up, I think you'd actually have gotten some up votes.