Since the blog author is commenting here: you have this statement partway down your blog:
> That is, grep doesn't support an analogous -0 flag.
However, the GNU grep variant does have an analogous flag:
-z, --null-data
Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. Like the -Z or --null option, this option can be used with commands like sort -z to process arbitrary file names.
Ah cool, I didn't know that! I'll update the blog post. (What a cacophony of flags)
Edit: It seems that grep -0 isn't taken for something else and they should have used it for consistency? The man page says it's meant to be used with find -print0, xargs -0, perl -0, and sort -z (another inconsistency)
It is taken in grep, just poorly documented; grep -5 means grep -C 5, and grep -0 means grep -C 0. It's not taken in sort, though, so I don't know why they didn't use -0 for sort.
Takeaways: (1) There is no consistency in flag names, even --long ones (2) impressively many tools do support it! Note that some affect only input or only output. (3) All do NUL-terminated, not NUL-separated. That's fortunate — matches \n usage, and gives distinct representations for [] vs [""].
It's best to give up on any kind of consistency between command options. Any project is free to do anything it wants, and they all do. Someone is eventually going to come up with standard N+1[1] which does things consistently, but they are going to have to either recreate a bazillion tools or create some sort of huge translation framework configuration on top of existing tools to get there. And even then it'll take literally decades before people migrate away from the current tools. Basically, the sad truth is this isn't going to happen.
I use this exact pattern a lot. One thing to consider is that in the process substitution version, do_something can't modify the enclosing variables. The vast majority of the time I want to modify variables in the loop body and not the generating process, but it's worth keeping in mind.
One common pattern I use this for is running a bunch of checks/tests, e.g.
EXIT_CODE=0
while read -r F
do
    do_check "$F" || EXIT_CODE=1
done < <(find ./tests -type f)
exit "$EXIT_CODE"
This is a more complicated alternative to the following:
find ./tests -type f | while read -r F
do
    do_check "$F" || exit 1
done
The simpler version will abort on the first error, whilst the first version will always run all of the checks (exiting with an error afterwards if any of them failed).
I usually write zsh scripts and I think there’s a shell option in zsh that allows the loop at the end of the pipe to modify variables in the enclosing body: I remember at least one occasion where I was surprised about this discrepancy between shells.
>Different shells exhibit different behaviors in this situation:
>- BourneShell creates a subshell when the input or output of anything (loops, case etc..) but a simple command is redirected, either by using a pipeline or by a redirection operator ('<', '>').
>- BASH, Yash and PDKsh-derived shells create a new process only if the loop is part of a pipeline.
>- KornShell and Zsh creates it only if the loop is part of a pipeline, but not if the loop is the last part of it. The read example above actually works in ksh88, ksh93, zsh! (but not MKsh or other PDKsh-derived shells)
>- POSIX specifies the bash behaviour, but as an extension allows any or all of the parts of the pipeline to run without a subshell (thus permitting the KornShell behaviour, as well).
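A quick way to check which camp a given shell falls into (a throwaway snippet; run it in each shell you want to compare):

    count=0
    seq 1 3 | while read -r line; do count=$((count + 1)); done
    echo "$count"    # bash/dash print 0 (the loop ran in a subshell); zsh and ksh93 print 3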
No, process substitution must be provided by the kernel/syslibs; it is not a feature of bash. For example, there is bash on AIX, but process substitution is not possible because the OS does not support it.
ksh93 depends exclusively on the kernel implementation of /dev/fd devices. I just checked `cat <(ls)` a moment ago on both Linux and AIX 7.2--the latter fails in ksh93t+.
Bash uses /dev/fd when available, but also appears to have an internal implementation which silently creates named pipes and cleans them up. In Bash 5.0.18 on AIX, fake process substitution works just fine, in my testing.
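If you're curious which flavor you're getting, you can just look at what the substitution expands to (bash/ksh/zsh only):

    echo <(true)    # typically /dev/fd/63 or /proc/self/fd/63; bash's fallback prints a named-pipe path instead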
Also for the `while` enthusiasts, here's how you zip the output of two processes in bash:
paste -d \\n <(do_something1) <(do_something2) | while read -r var1 && read -r var2; do
    ... # var1 comes from do_something1, var2 comes from do_something2
done
For thousands of arguments this solution is much slower (high CPU usage) than xargs, because it either implements the logic as a shell script (slow) or runs an external program for each argument (slow).
One thing parallel can do better than xargs is collect output.
If you use `xargs -P`, all processes share the same stdout and output may be mixed arbitrarily between them. (If the program being executed uses line buffering, lines usually won't be mixed together from multiple invocations, but they can be if they're long enough).
In contrast, `parallel` by default doesn't mix together output from different commands at all, instead buffering the entire output until the command exits and then printing it.
With `--line-buffer` the unit of atomicity can be weakened from an entire command output to individual lines of output, reducing latency.
Alternately, with `--keep-order`, `parallel` can ensure the outputs are printed in the same order as the corresponding inputs, which makes the output deterministic if the program is deterministic. Without that you'll get results in an arbitrary order.
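A sketch of those modes side by side (./slow_task.sh is just a placeholder for whatever you're running):

    seq 1 8 | xargs -P 4 -n 1 ./slow_task.sh          # output from the 4 workers may interleave arbitrarily
    seq 1 8 | parallel ./slow_task.sh                  # each task's output printed whole, as the task finishes
    seq 1 8 | parallel --line-buffer ./slow_task.sh    # whole lines only, but lower latency
    seq 1 8 | parallel --keep-order ./slow_task.sh     # outputs printed in input order (1..8)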
These aren't technically things that xargs and shell can't do; you could reimplement the same behavior by hand with the shell. But by the same token, there isn't anything xargs can do that the shell can't do alone; you could always use the shell to manually split up the input and invoke subprocesses. It's just a question of how much you want to reimplement by hand.
OK thanks, looks like there are several features of GNU parallel that users like.
For the output interleaving issue, what I do is use the $0 Dispatch Pattern and write a shell function that redirects to a file:
do_one() {
    task_with_stdout > "$dir/$task_id.txt"
}
So if there are 10,000 tasks then I get 10,000 files, and I can check the progress with "ls", and I can also see what tasks failed and possibly restart them.
You even have some notion of progress by checking the file size with ls -l.
I tend to use a pattern where each task also outputs a metadata file: the exit status, along with the data from "time" (rusage, etc.)
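For context, here is a minimal sketch of how I read that pattern fitting together (task_with_stdout, generate_task_ids, the _out directory and the tasks.sh name are all placeholders, and the metadata file is left out):

    #!/usr/bin/env bash
    # Sketch of the $0 Dispatch Pattern with one output file per task.
    set -euo pipefail
    dir=_out

    do_one() {
        local task_id=$1
        task_with_stdout "$task_id" > "$dir/$task_id.txt"
    }

    all() {
        mkdir -p "$dir"
        # xargs re-invokes this very script ("$0") with 'do_one' plus one task id,
        # running up to 8 tasks at a time.
        generate_task_ids | xargs -P 8 -n 1 "$0" do_one
    }

    "$@"    # e.g.  ./tasks.sh all   or   ./tasks.sh do_one 42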
But I admit that this is annoying to rewrite in every script that uses xargs! It does make sense to have this functionality in a tool.
But I think that tool should be a LANGUAGE like Oil, not a weirdo interface like GNU parallel :)
But thanks for the explanation (and thanks to everyone in this subthread) -- I learned a bunch and this is why I write blog posts :)
Thank you for writing this, it really crystallized for me why I feel the way I do about Oil. I hate it. When I want a language, I want a real language like Python, not a weirdo jumped-up shell (see what I did there?). What I want in a shell is a super small, fast, universally understood thing for basic tasks and easy expandability through tools like parallel and Python.
For what it's worth, I consider Oil to be closer to a unixy PowerShell rather than a more powerful bash. Note that this is not a slight; PowerShell is sweet for what it is. It (Oil) really takes a hard left from the POSIX philosophy of focusing on one thing and doing it well. I'm also bitter that, if it's going to veer so far away from POSIX, it didn't go the whole hundred and become a functional language with comprehensions and such.
For what it's worth, everything you mentioned above about your approach can be done with parallel.
The point of Oil is that there are really basic things like safe quoting that shells should do well, yet none of the POSIX shells do!
Functional: there are interesting shells like Elvish. But it really goes PowerShell by adding internal rich data pipelines that don't have a unixy stream-of-bytes representation.
Oil does NOT go that way; it works on stuff like QSN to make pure Unix interconnects more robust.
* does not buffer stderr
* does not check if the disk is full for a period of time during a task (thus risking incomplete output)
* does not clean up, if killed
* does not work correctly if task_with_stdout is a composed command
Given that GNU Parallel is a drop-in replacement for xargs, I am curious why you find it a 'weirdo interface'.
A lot of this comes down to familiarity. I tend to use "make -j 100" for what you're describing. If I write the Makefile carefully [1], it will handle resuming a half-finished job. I just looked and GNU parallel has a --resume argument which probably does something similar, and maybe with less hassle. But I don't do this often enough—and/or GNU parallel isn't "better enough"—that I'm likely to ever invest the time to learn GNU parallel.
BTW, Oil looks very cool. I hate how many footguns are in common shells.
[1] e.g. writing to a tempfile and atomically renaming into place: "task_with_stdout > $dir/task_id.txt.tmp && mv $dir/task_id.txt{.tmp,}"
Restart capability and remote execution make GNU parallel the tool of choice for HPC. For example, you might very well use GNU parallel to run thousands of CPU-hours of numerical simulation using patterns such as these ones,
Not to be pedantic, but that's a bit of a non-argument. _Of course_ you can do it with xargs and shell, but imho parallel is generally more convenient, especially for remote execution. It provides a higher level of abstraction for such tasks.
> What does it do that xargs and shell can't? (honest question)
For me, an essential feature of GNU parallel is that it is semantically equivalent to "sh". Imagine that you write a file that contains a long list of commands. You can pipe that file to "sh" to run the commands, or pipe it to "parallel" to do the same, but faster. If you are building the list of commands on the fly, then you can use xargs with a slightly different syntax. But somehow using "sh" or "parallel" gives a certain peace of mind due to its straightforward semantics. I never used any argument of GNU parallel apart from -j
My usage pattern is to build the list of commands explicitly and then run it (possibly teeing the list into a temporary file to inspect it):
for i in one two three; do
    printf 'echo %s\n' "$i"
done | sh    # or | parallel
I use GNU Parallel for long-running jobs for its --eta option. If a job will take days or longer it's useful to know that early in the process. You might want to cancel it and try something else, and if you want to proceed with the long job you can make plans around when your data will be ready.
GNU Parallel can be sourced into a bash session from a plain text file and used as a function. I've used it to get around overly-restrictive build environments. (overly restrictive because the team that manages the build image wasn't open to modifying their image for my use case)
Veering off course here, after experiencing how incredibly long it took to install Sqitch, I will go out of my way to avoid anything that is more than a single script, certainly anything requiring CPAN too. I don’t think there’s anything technically wrong with these programs or with Perl, they’re just presented in ways that are unique hassles in this day and age.
I remember when Perl was the coolest thing ever. I even wrote numerical simulations in it, just to try. Only with the invention of Python and Ruby did we realize how much better things could be. Of course that the Perl inventor was an IOCCC winner should've been a red flag.
If you need more visibility into long running processes, pueue is another alternative. You can of course use `xargs -P1 pueue add ./process_file.sh` to add the jobs in the first place. Sends a job to pueued, returns immediately. Great for re-encoding dozens of videos. For jobs that aren’t already multi-core, set the queue parallelism with pueue, after you’ve seen your cpu is under-utilised.
The obvious downside to the visibility and dynamism is that it redirects stdout. You can read it back later, in order. But it's not there for continued processing immediately.
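Roughly, the workflow looks like this (subcommand names from memory, so check them against pueue's help; process_file.sh is the placeholder from above):

    find . -name '*.mkv' -print0 | xargs -0 -n 1 pueue add -- ./process_file.sh
    pueue parallel 4    # run four queued jobs at a time
    pueue status        # see what's queued, running, or failed
    pueue log           # read back each job's captured stdout later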
(author here) Hm I don't see either of these points because:
GNU xargs has --verbose which logs every command. Does that not do what you want? (Maybe I should mention its existence in the post)
xargs -P can do everything GNU parallel does, which I mention in the post. Any counterexamples? GNU parallel is a very ugly DSL IMO, and I don't see what it adds.
--
edit: Logging can also be done by recursively invoking shell functions that log with the $0 Dispatch Pattern, explained in the post. I don't see a need for another tool; this is the Unix philosophy and the compositionality of shell at work :)
Parallel's killer feature is how it spools subprocess output, ensuring that it doesn't get jumbled together. xargs can't do that. I use parallel for things like shelling out to 10000 hosts and getting some statistics. If I use xargs the output stomps all over itself.
As far as I'm aware, xargs still has the problem of multiple jobs being able to write to stdout at the same time, potentially causing their output streams to be intermingled. Compare this with parallel's --group.
Also, parallel can run some of those jobs on remote machines. I don't believe xargs has an equivalent job management function.
In your examples you fail to put 'xargs -P' in the middle of a pipeline: You only put it at the end.
In other words:
some command | xargs -P other command | third command
This is useful if 'other command' is slow. If you buffer on disk, you need to clean up after each task: Maybe there is not enough free disk space to buffer the output of all tasks.
UNIX is great in that you can pipe commands together, but due to the interleaving issue 'xargs -P' fails here. It does not live up to the UNIX philosophy. Which is probably why you unconsciously only use it at the end of a pipeline.
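Concretely, the shape being described (command names are placeholders):

    some_command | xargs -P 8 -n 1 other_command | third_command    # workers' output may interleave
    some_command | parallel other_command | third_command           # each task's output arrives as one contiguous block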
parallel doesn't either, it just nags. I agree about how silly and annoying it is. Imagine if every time the parallel author opened Firefox he got a message reminding him to personally thank me if he uses his web browser for research, or if every time his research program calls malloc he has to acknowledge and cite Ulrich Drepper. Very very silly.
Parallel is the better tool but the nagware impairs its reputation.
I'm surprised the links don't mention find. The -print0 flag makes it safe for crazy filenames, which pairs with the xargs -0 flag, or the perl -0 flag, etc. And you have -maxdepth if you don't want it to trawl.
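For example (the pattern and rm are only illustrative):

    find . -maxdepth 2 -name '*.bak' -print0 | xargs -0 rm --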
This is only tangentially related, but after all the posts here the last few days about thought-terminating clichés, I can’t help but reflect on the “X considered harmful” title cliché.
Is it thought terminating, though? "X considered harmful" seems more intended to spark discussion in an intentionally inflammatory way than to stifle it.
(In any case, this surely is tangential, since the title is not "X considered harmful" for any value of X—at best it comments on a post by that title, as, indeed, you are doing.)
I've been thinking about titles, and it's hard to make a good one that doesn't look like a total cliché. "X considered harmful", "an opinionated guide to X", some kind of joke or reference, what could be a collection of tags (X, Y and Z), "things I have learned doing X", etc.
I specifically clicked on this topic because of the word “opinionated”. As I already know how to use xargs, I was curious what kind of non-conventional or controversial opinion the author might have.
As I've said to a sibling comment, I don't think it's a bad title, and "an opinionated guide to X" is one of the better clichés for titles that I see (the worst being the journalist who feels like they have to make a joke).
Of xargs, for, and while, I have limited myself to while. It's more typing every time but saves me from having to remember so many quirks of each command.
cat input.file | ... | while read -r unit; do <cmd> "${unit}"; done | ...
Between 'while read -r unit' and 'while IFS= read -r unit' I can probably handle 90% of the cases. (Maybe I should always use IFS= since I tend to forget the proper way to use it.)
That way will bite you when the tasks in question are cheaper than fork+exec. There was a thread just the other day in which folks were creating 8 million empty files with a bash loop over touch. But it's 60X faster (really, I measured) to use xargs, which will do batches (and parallelism if you tell it to).
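A sketch of that contrast (the file names are made up):

    # one fork+exec of touch per file:
    for i in $(seq 1 1000000); do touch "file_$i"; done

    # xargs packs as many names as fit into each touch invocation
    # (add -P on top if you also want parallelism):
    seq 1 1000000 | sed 's/^/file_/' | xargs touch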
The example of "foo bar" didn't work with while but inserting tr fixes it:
echo "foo bar" | tr ' ' '\n' | while read -r var; do echo ${var}; done
For examples in general, I guess something like "cat file.csv" could work. (The difference between using IFS= and not using it is essentially whether we want to preserve leading and trailing whitespace or not. If we want to preserve it, then we should use IFS=.)
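The difference in one line, using a leading-space example:

    printf '  hello\n' | while read -r unit; do printf '[%s]\n' "$unit"; done        # -> [hello]
    printf '  hello\n' | while IFS= read -r unit; do printf '[%s]\n' "$unit"; done   # -> [  hello]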
I always wonder why something like xargs is not a shell built-in. It's such a common pattern, but I dread formulating the correct incantation every time.
I was happy to read that the author comes to the same conclusion and proposes an `each` builtin (albeit only for the Oil shell)! I like that there is no need to learn another mini-language, as pointed out.
If you're a zsh user it offers a version of something like xargs in zargs¹. As the documentation shows it can be really quite powerful in part because of zsh's excellent globbing facilities, and I think without that support it wouldn't be all that useful as a built-in.
I'd also perhaps argue that the reason we don't want xargs to be a built-in is precisely because of zargs and the point in your second paragraph. If it was built-in it would no doubt be obscenely different in each shell, and five decades later a standard that no one follows would eventually specify its behaviour ;)
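For reference, a rough zargs invocation looks like this (the glob and command are only illustrative; see zshcontrib(1)):

    autoload -U zargs
    # the first -- ends zargs' own options, the second separates the input words from the command
    zargs -- ./**/*.log(.) -- grep -l ERROR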
> -n instead of -L (to avoid an ad hoc data language)
Apparently GNU xargs is missing it, but BSD xargs has -J, which is a `-I` which works with `-n`: with `-I` each replstr gets replaced by one of the inputs, with `-J` the replstr gets replaced by the entire batch (as determined by `-n`).
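Roughly, going by that description (untested sketch; dest/ is a placeholder):

    # -J substitutes the whole batch (here up to 100 names) where the replstr appears,
    # instead of one input at a time as -I does
    find . -name '*.jpg' -print0 | xargs -0 -n 100 -J {} cp {} dest/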
No idea how old this command is. Most of the AIX/Linux admins I knew were very bad shell programmers; their skills ended with awful for-loops, useless use of cat, and awk '{print $3}'.
I’m unconvinced by the post OP was responding to. It’s a utility; it provides some means to get things done. *nix provides many means of parsing text and running commands, each with its own idioms based on its own axioms. It seems as if a composer is lambasting the clarinet because they don’t care for its fingerings. I’ve only used xargs sparingly; can somebody enlighten me as to why it’s bad, aside from the fact that there are other ways to do some things it does?
I wish this was the default behavior of xargs (the 'tr \\n \\0 | xargs -0' bit). I don't know why xargs splits on spaces and tabs as well as newlines by default and doesn't even have a flag to just split on lines.
Ok filenames can theoretically have newlines in them but I'd be happy to deal with that weird case. I can't recall ever having encountered it in years of using bash on various systems.
Shell pipes would then orthogonally provide the stuff like substitution that xargs does in its own unique way (that I just can't be bothered learning) - instead you'd just pipe the find output through sed or 'grep -v' or whatever you wanted before piping into xargs.
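I.e. something along these lines (names are made up; it falls over if a filename actually contains a newline):

    find . -name '*.log' | grep -v /cache/ | tr '\n' '\0' | xargs -0 gzip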
I guess that's what aliases are for, but I'm too lazy these days to bother with configuring often short-lived systems all the time.
xargs defaults to all whitespace because it was designed to get around the problem of short argv lengths (like, I'm talking 4k or less on older Unix-y systems, sometimes as low as 255 bytes).
So the defaults went with the principle of least surprise, pretending it's like a very long args list that you could theoretically enter at the shell, including quotes.
You could, for example, edit the args list in vi and line split / indent as you please but not impact the end result.
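You can see that shell-ish treatment of quotes in the default splitting:

    printf '%s\n' "'foo bar' baz" | xargs -n 1 echo
    # foo bar
    # baz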
I'm not sure I like the `$1` and shell function pattern. It might avoid the -I minilanguage, but at the cost of "being clever" in a way that takes a minute to wrap your head around. It's a neat trick, but I don't think it would be easy to understand if you are reading the code for the first time.
I find that using the example of `rm` to discuss whether to pick `find -exec` or `find | xargs` rather strange, given the existence of `find -delete`. Maybe pick a different example operation to automate.
The linked article doesn’t suggest this. They explicitly suggest against it.
> Besides the extra ls, the suggestion is bad because it relies on shell's word splitting. This is due to the unquoted $(). It's better to rely on the splitting algorithms in xargs, because they're simpler and more powerful.
> A lobste.rs user asked why you would use find | xargs rather than find -exec. The answer is that it can be much faster. If you’re trying to rm 10,000 files, you can start one process instead of 10,000 processes!
Fair enough, but I still favor find -exec. I find it generally less error prone, and it's never been so slow that I wished I had instead used xargs.
Also, if you're specifically using -exec rm with find, you could instead use find with -delete.
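Side by side, for an illustrative cleanup:

    find . -name '*.tmp' -delete                  # no child processes at all
    find . -name '*.tmp' -exec rm {} +            # batches arguments, much like xargs
    find . -name '*.tmp' -print0 | xargs -0 rm    # batches too, and composes with filters in between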
I tend to prefer xargs because it works in more contexts e.g. I've got a tool which automatically generates databases but sometimes the cleanup doesn't work. `find -exec` does nothing, but `xargs -n1 dropdb` (following an intermediate grep) does the job. From there, it makes sense to… just use xargs everywhere.
And I always fail to remember that the -exec terminator must be escaped in zsh, so using -exec always takes me multiple tries. So I only use -exec when I must (for `find` predicates).
I agree. `find somewhere -exec some_command {} +` can be dramatically faster, but it does not guarantee a single invocation of `some_command`; it may make multiple invocations if you pass very large numbers of matching files.
After spending a bit of time reading the man page for find, I rarely use xargs any more. find is pretty good.
Tangent:
Another instance I've seen where spawning many processes can lead to bad performance is in bash scripts for git pre-receive hooks, to scan and validate the commit messages of a range of commits before accepting them. It is pretty easy to cobble together some loop in a bash script that executes multiple processes _per commit_. That's fine for typical small pushes of 1-20 commits -- but if someone needs to do serious graph surgery and push a branch of 1,000-10,000 commits, that can cause very long running times -- and more seriously, timeouts, where the entire push gets rejected because the pre-receive script takes too long. A small program using the libgit2 API can do the same work at the cost of a single process, although then you have the fun of figuring out how to build, install and maintain binary git pre-receive hooks.
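A hedged sketch of the difference (the variable names and the subject-line rule are made up; in a real pre-receive hook $old and $new come from stdin):

    # per-commit: two extra processes for every commit in the push
    for c in $(git rev-list "$old..$new"); do
        git log -1 --format=%s "$c" | grep -qE '^[A-Z]+-[0-9]+: ' || exit 1
    done

    # batched: one git and one grep for the whole range
    git log --format=%s "$old..$new" | grep -qvE '^[A-Z]+-[0-9]+: ' && exit 1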
Not OP but to me the best thing about PowerShell is that it recognizes that text is not always the best way to output results from commands if you care about creating pipelines. In short, it passes objects around so there's no need for parsing text.
You don't need to `foreach { Remove-Item $_.Name }` because Remove-Item can take the objects returned by Get-ChildItem directly.
Also, expanding the regex into `-Include` parameters is somewhat cheating since `-Include` only takes globs, and it just so happens that that particular regex can be converted into globs.
The general equivalent is:
gci -re | ?{ $_.Name -match '.*_test\.(py|cc)' } | ri
(I used the shorter aliases because someone will probably read yours and reinforce the stereotype that PS is overly verbose.)
Thanks, this is definitely closer to the original use of `egrep`! As for aliases, I prefer long forms because I don't need to think what the seemingly random collections of letters mean, and tab-completion / PowerShell ISE makes it mostly a non-issue when writing.