Since the blog author is commenting here: you have this statement partway down your blog:
> That is, grep doesn't support an analogous -0 flag.
However, the GNU grep variant does have an analogous flag:
-z, --null-data
Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. Like the -Z or --null option, this option can be used with commands like sort -z to process arbitrary file names.
Ah cool, I didn't know that! I'll update the blog post. (What a cacophony of flags)
Edit: It seems that grep -0 isn't taken for something else and they should have used it for consistency? The man page says it's meant to be used with find -print0, xargs -0, perl -0, and sort -z (another inconsistency)
It is taken in grep, just poorly documented; grep -5 means grep -C 5, and grep -0 means grep -C 0. It's not taken in sort, though, so I don't know why they didn't use -0 for sort.
Takeaways: (1) There is no consistency in flag names, even --long ones (2) impressively many tools do support it! Note that some affect only input or only output. (3) All do NUL-terminated, not NUL-separated. That's fortunate — matches \n usage, and gives distinct representations for [] vs [""].
It's best to give up on any kind of consistency between command options. Any project is free to do anything it wants, and they all do. Someone is eventually going to come up with standard N+1[1] which does things consistently, but they are going to have to either recreate a bazillion tools or create some sort of huge translation framework configuration on top of existing tools to get there. And even then it'll take literally decades before people migrate away from the current tools. Basically, the sad truth is this isn't going to happen.
I use this exact pattern a lot. One thing to consider is that in the process substitution version, do_something can't modify the enclosing variables. The vast majority of the time I want to modify variables in the loop body and not the generating process, but it's worth keeping in mind.
One common pattern I use this for is running a bunch of checks/tests, e.g.
EXIT_CODE=0
while read -r F
do
    do_check "$F" || EXIT_CODE=1
done < <(find ./tests -type f)
exit "$EXIT_CODE"
This is a more complicated alternative to the following:
find ./tests -type f | while read -r F
do
    do_check "$F" || exit 1
done
The simpler version will abort on the first error, whilst the first version will always run all of the checks (exiting with an error afterwards if any of them failed).
I usually write zsh scripts and I think there’s a shell option in zsh that allows the loop at the end of the pipe to modify variables in the enclosing body: I remember at least one occasion where I was surprised about this discrepancy between shells.
>Different shells exhibit different behaviors in this situation:
>- BourneShell creates a subshell when the input or output of anything (loops, case etc..) but a simple command is redirected, either by using a pipeline or by a redirection operator ('<', '>').
>- BASH, Yash and PDKsh-derived shells create a new process only if the loop is part of a pipeline.
>- KornShell and Zsh creates it only if the loop is part of a pipeline, but not if the loop is the last part of it. The read example above actually works in ksh88, ksh93, zsh! (but not MKsh or other PDKsh-derived shells)
>- POSIX specifies the bash behaviour, but as an extension allows any or all of the parts of the pipeline to run without a subshell (thus permitting the KornShell behaviour, as well).
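A quick way to check which camp a given shell falls into (a throwaway snippet; run it in each shell you want to compare):

    count=0
    seq 1 3 | while read -r line; do count=$((count + 1)); done
    echo "$count"    # bash/dash print 0 (the loop ran in a subshell); zsh and ksh93 print 3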
No, process substitution must be provided by the kernel/syslibs; it is not a feature of bash. For example, there is bash on AIX, but process substitution is not possible because the OS does not support it.
ksh93 depends exclusively on the kernel implementation of /dev/fd devices. I just checked `cat <(ls)` a moment ago on both Linux and AIX 7.2--the latter fails in ksh93t+.
Bash uses /dev/fd when available, but also appears to have an internal implementation which silently creates named pipes and cleans them up. In Bash 5.0.18 on AIX, fake process substitution works just fine, in my testing.
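If you're curious which flavor you're getting, you can just look at what the substitution expands to (bash/ksh/zsh only):

    echo <(true)    # typically /dev/fd/63 or /proc/self/fd/63; bash's fallback prints a named-pipe path instead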
Also for the `while` enthusiasts, here's how you zip the output of two processes in bash:
paste -d \\n <(do_something1) <(do_something2) | while read -r var1 && read -r var2; do
    ... # var1 comes from do_something1, var2 comes from do_something2
done
For thousands of arguments this solution is much slower (high CPU usage) than xargs, because it either implements the logic as a shell script (slow) or runs an external program for each argument (slow).
One thing parallel can do better than xargs is collect output.
If you use `xargs -P`, all processes share the same stdout and output may be mixed arbitrarily between them. (If the program being executed uses line buffering, lines usually won't be mixed together from multiple invocations, but they can be if they're long enough).
In contrast, `parallel` by default doesn't mix together output from different commands at all, instead buffering the entire output until the command exits and then printing it.
With `--line-buffer` the unit of atomicity can be weakened from an entire command output to individual lines of output, reducing latency.
Alternately, with `--keep-order`, `parallel` can ensure the outputs are printed in the same order as the corresponding inputs, which makes the output deterministic if the program is deterministic. Without that you'll get results in an arbitrary order.
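A sketch of those modes side by side (./slow_task.sh is just a placeholder for whatever you're running):

    seq 1 8 | xargs -P 4 -n 1 ./slow_task.sh          # output from the 4 workers may interleave arbitrarily
    seq 1 8 | parallel ./slow_task.sh                  # each task's output printed whole, as the task finishes
    seq 1 8 | parallel --line-buffer ./slow_task.sh    # whole lines only, but lower latency
    seq 1 8 | parallel --keep-order ./slow_task.sh     # outputs printed in input order (1..8)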
These aren't technically things that xargs and shell can't do; you could reimplement the same behavior by hand with the shell. But by the same token, there isn't anything xargs can do that the shell can't do alone; you could always use the shell to manually split up the input and invoke subprocesses. It's just a question of how much you want to reimplement by hand.
OK thanks, looks like there are several features of GNU parallel that users like.
For the output interleaving issue, what I do is use the $0 Dispatch Pattern and write a shell function that redirects to a file:
do_one() {
    task_with_stdout > "$dir/$task_id.txt"
}
So if there are 10,000 tasks then I get 10,000 files, and I can check the progress with "ls", and I can also see what tasks failed and possibly restart them.
You even have some notion of progress by checking the file size with ls -l.
I tend to use a pattern where each task also outputs a metadata file: the exit status, along with the data from "time" (rusage, etc.)
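For context, here is a minimal sketch of how I read that pattern fitting together (task_with_stdout, generate_task_ids, the _out directory and the tasks.sh name are all placeholders, and the metadata file is left out):

    #!/usr/bin/env bash
    # Sketch of the $0 Dispatch Pattern with one output file per task.
    set -euo pipefail
    dir=_out

    do_one() {
        local task_id=$1
        task_with_stdout "$task_id" > "$dir/$task_id.txt"
    }

    all() {
        mkdir -p "$dir"
        # xargs re-invokes this very script ("$0") with 'do_one' plus one task id,
        # running up to 8 tasks at a time.
        generate_task_ids | xargs -P 8 -n 1 "$0" do_one
    }

    "$@"    # e.g.  ./tasks.sh all   or   ./tasks.sh do_one 42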
But I admit that this is annoying to rewrite in every script that uses xargs! It does make sense to have this functionality in a tool.
But I think that tool should be a LANGUAGE like Oil, not a weirdo interface like GNU parallel :)
But thanks for the explanation (and thanks to everyone in this subthread) -- I learned a bunch and this is why I write blog posts :)
Thank you for writing this, it really crystallized for me why I feel the way I do about Oil. I hate it. When I want a language, I want a real language like Python, not a weirdo jumped-up shell (see what I did there?). What I want in a shell is a super small, fast, universally understood thing for basic tasks and easy expandability through tools like parallel and Python.
For what it's worth, I consider Oil to be closer to a unixy PowerShell rather than a more powerful bash. Note that this is not a slight; PowerShell is sweet for what it is. It (Oil) really takes a hard left from the POSIX philosophy of focusing on one thing and doing it well. I'm also bitter that, if it's going to veer so far away from POSIX, it didn't go the whole hundred and become a functional language with comprehensions and such.
For what it's worth, everything you mentioned above about your approach can be done with parallel.
The point of Oil is that there are really basic things like safe quoting that shells should do well, yet none of the POSIX shells do!
Functional: there are interesting shells like Elvish. But it really goes PowerShell by adding internal rich data pipelines that don't have a unixy stream-of-bytes representation.
Oil does NOT go that way; it works on stuff like QSN to make pure Unix interconnects more robust.
* does not buffer stderr
* does not check if the disk is full for a period of time during a task (thus risking incomplete output)
* does not clean up, if killed
* does not work correctly if task_with_stdout is a composed command
Given that GNU Parallel is a drop-in replacement for xargs, I am curious why you find it a 'weirdo interface'.
A lot of this comes down to familiarity. I tend to use "make -j 100" for what you're describing. If I write the Makefile carefully [1], it will handle resuming a half-finished job. I just looked and GNU parallel has a --resume argument which probably does something similar, and maybe with less hassle. But I don't do this often enough—and/or GNU parallel isn't "better enough"—that I'm likely to ever invest the time to learn GNU parallel.
BTW, Oil looks very cool. I hate how many footguns are in common shells.
[1] e.g. writing to a tempfile and atomically renaming into place: "task_with_stdout > $dir/task_id.txt.tmp && mv $dir/task_id.txt{.tmp,}"
Restart capability and remote execution make GNU parallel the tool of choice for HPC. For example, you might very well use GNU parallel to run thousands of CPU-hours of numerical simulation using patterns such as these ones,
Not to be pedantic, but that's a bit of a non-argument. _Of course_ you can do it with xargs and shell, but imho parallel is generally more convenient, especially for remote execution. It provides a higher level of abstraction for such tasks.
> What does it do that xargs and shell can't? (honest question)
For me, an essential feature of GNU parallel is that it is semantically equivalent to "sh". Imagine that you write a file that contains a long list of commands. You can pipe that file to "sh" to run the commands, or pipe it to "parallel" to do the same, but faster. If you are building the list of commands on the fly, then you can use xargs with a slightly different syntax. But somehow using "sh" or "parallel" gives a certain peace of mind due to its straightforward semantics. I never used any argument of GNU parallel apart from -j
My usage pattern is to build the list of commands explicitly and then run it (possibly teeing the list into a temporary file to inspect it):
for i in one two three; do
    printf 'echo %s\n' "$i"
done | sh    # or | parallel
I use GNU Parallel for long-running jobs for its --eta option. If a job will take days or longer it's useful to know that early in the process. You might want to cancel it and try something else, and if you want to proceed with the long job you can make plans around when your data will be ready.
GNU Parallel can be sourced into a bash session from a plain text file and used as a function. I've used it to get around overly-restrictive build environments. (overly restrictive because the team that manages the build image wasn't open to modifying their image for my use case)
Veering off course here, after experiencing how incredibly long it took to install Sqitch, I will go out of my way to avoid anything that is more than a single script, certainly anything requiring CPAN too. I don’t think there’s anything technically wrong with these programs or with Perl, they’re just presented in ways that are unique hassles in this day and age.
I remember when Perl was the coolest thing ever. I even wrote numerical simulations in it, just to try. Only with the invention of Python and Ruby did we realize how much better things could be. Of course that the Perl inventor was an IOCCC winner should've been a red flag.
If you need more visibility into long running processes, pueue is another alternative. You can of course use `xargs -P1 pueue add ./process_file.sh` to add the jobs in the first place. Sends a job to pueued, returns immediately. Great for re-encoding dozens of videos. For jobs that aren’t already multi-core, set the queue parallelism with pueue, after you’ve seen your cpu is under-utilised.
The obvious downside to the visibility and dynamism is that it redirects stdout. You can read it back later, in order. But it's not there for continued processing immediately.
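Roughly, the workflow looks like this (subcommand names from memory, so check them against pueue's help; process_file.sh is the placeholder from above):

    find . -name '*.mkv' -print0 | xargs -0 -n 1 pueue add -- ./process_file.sh
    pueue parallel 4    # run four queued jobs at a time
    pueue status        # see what's queued, running, or failed
    pueue log           # read back each job's captured stdout later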
(author here) Hm I don't see either of these points because:
GNU xargs has --verbose which logs every command. Does that not do what you want? (Maybe I should mention its existence in the post)
xargs -P can do everything GNU parallel does, which I mention in the post. Any counterexamples? GNU parallel is a very ugly DSL IMO, and I don't see what it adds.
--
edit: Logging can also be done by recursively invoking shell functions that log with the $0 Dispatch Pattern, explained in the post. I don't see a need for another tool; this is the Unix philosophy and the compositionality of shell at work :)
Parallel's killer feature is how it spools subprocess output, ensuring that it doesn't get jumbled together. xargs can't do that. I use parallel for things like shelling out to 10000 hosts and getting some statistics. If I use xargs the output stomps all over itself.
As far as I'm aware, xargs still has the problem of multiple jobs being able to write to stdout at the same time, potentially causing their output streams to be intermingled. Compare this with parallel's --group.
Also, parallel can run some of those jobs on remote machines. I don't believe xargs has an equivalent job management function.
In your examples you fail to put 'xargs -P' in the middle of a pipeline: You only put it at the end.
In other words:
some command | xargs -P other command | third command
This is useful if 'other command' is slow. If you buffer on disk, you need to clean up after each task: Maybe there is not enough free disk space to buffer the output of all tasks.
UNIX is great in that you can pipe commands together, but due to the interleaving issue 'xargs -P' fails here. It does not live up to the UNIX philosophy. Which is probably why you unconsciously only use it at the end of a pipeline.
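Concretely, the shape being described (command names are placeholders):

    some_command | xargs -P 8 -n 1 other_command | third_command    # workers' output may interleave
    some_command | parallel other_command | third_command           # each task's output arrives as one contiguous block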
parallel doesn't either, it just nags. I agree about how silly and annoying it is. Imagine if every time the parallel author opened Firefox he got a message reminding him to personally thank me if he uses his web browser for research, or if every time his research program calls malloc he has to acknowledge and cite Ulrich Drepper. Very very silly.
Parallel is the better tool but the nagware impairs its reputation.
I'm surprised the links don't mention find. The -print0 flag makes it safe for crazy filenames, which pairs with the xargs -0 flag, or the perl -0 flag, etc. And you have -maxdepth if you don't want it to trawl.
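For example (the pattern and rm are only illustrative):

    find . -maxdepth 2 -name '*.bak' -print0 | xargs -0 rm --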
This is only tangentially related, but after all the posts here the last few days about thought-terminating clichés, I can’t help but reflect on the “X considered harmful” title cliché.
Is it thought terminating, though? "X considered harmful" seems more intended to spark discussion in an intentionally inflammatory way than to stifle it.
(In any case, this surely is tangential, since the title is not "X considered harmful" for any value of X—at best it comments on a post by that title, as, indeed, you are doing.)
I've been thinking about titles, and it's hard to make a good one that doesn't look like a total cliché. "X considered harmful", "an opinionated guide to X", some kind of joke or reference, what could be a collection of tags (X, Y and Z), "things I have learned doing X", etc.
I specifically clicked on this topic because of the word “opinionated”. As I already know how to use xargs, I was curious what kind of non-conventional or controversial opinion the author might have.
As I've said to a sibling comment, I don't think it's a bad title, and "an opinionated guide to X" is one of the better clichés for titles that I see (the worst being the journalist who feels like they have to make a joke).
Of xargs, for, and while, I have limited myself to while. It's more typing every time but saves me from having to remember so many quirks of each command.
cat input.file | ... | while read -r unit; do <cmd> "${unit}"; done | ...
Between 'while read -r unit' and 'while IFS= read -r unit' I can probably handle 90% of the cases. (Maybe I should always use IFS= since I tend to forget the proper way to use it.)
That way will bite you when the tasks in question are cheaper than fork+exec. There was a thread just the other day in which folks were creating 8 million empty files with a bash loop over touch. But it's 60X faster (really, I measured) to use xargs, which will do batches (and parallelism if you tell it to).
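A sketch of that contrast (the file names are made up):

    # one fork+exec of touch per file:
    for i in $(seq 1 1000000); do touch "file_$i"; done

    # xargs packs as many names as fit into each touch invocation
    # (add -P on top if you also want parallelism):
    seq 1 1000000 | sed 's/^/file_/' | xargs touch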
The example of "foo bar" didn't work with while but inserting tr fixes it:
echo "foo bar" | tr ' ' '\n' | while read -r var; do echo ${var}; done
For examples in general, I guess something like "cat file.csv" could work. (The difference between using IFS= and not using it is essentially whether we want to preserve leading and trailing whitespace or not. If we want to preserve it, then we should use IFS=.)
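The difference in one line, using a leading-space example:

    printf '  hello\n' | while read -r unit; do printf '[%s]\n' "$unit"; done        # -> [hello]
    printf '  hello\n' | while IFS= read -r unit; do printf '[%s]\n' "$unit"; done   # -> [  hello]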
I always wonder why something like xargs is not a shell built-in. It's such a common pattern, but I dread formulating the correct incantation every time.
I was happy to read that the author comes to the same conclusion and proposes an `each` builtin (albeit only for the Oil shell)! I like that there is no need to learn another mini-language, as pointed out.
If you're a zsh user it offers a version of something like xargs in zargs¹. As the documentation shows it can be really quite powerful in part because of zsh's excellent globbing facilities, and I think without that support it wouldn't be all that useful as a built-in.
I'd also perhaps argue that the reason we don't want xargs to be a built-in is precisely because of zargs and the point in your second paragraph. If it was built-in it would no doubt be obscenely different in each shell, and five decades later a standard that no one follows would eventually specify its behaviour ;)
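For reference, a rough zargs invocation looks like this (the glob and command are only illustrative; see zshcontrib(1)):

    autoload -U zargs
    # the first -- ends zargs' own options, the second separates the input words from the command
    zargs -- ./**/*.log(.) -- grep -l ERROR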
> -n instead of -L (to avoid an ad hoc data language)
Apparently GNU xargs is missing it, but BSD xargs has -J, which is a `-I` which works with `-n`: with `-I` each replstr gets replaced by one of the inputs, with `-J` the replstr gets replaced by the entire batch (as determined by `-n`).
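Roughly, going by that description (untested sketch; dest/ is a placeholder):

    # -J substitutes the whole batch (here up to 100 names) where the replstr appears,
    # instead of one input at a time as -I does
    find . -name '*.jpg' -print0 | xargs -0 -n 100 -J {} cp {} dest/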
No idea how old this command is. Most of the AIX/Linux admins I knew were very bad shell programmers; their skills ended with awful for-loops, useless use of cat, and awk '{print $3}'.
I’m unconvinced by the post OP was responding to. It’s a utility; it provides some means to get things done. *nix provides many means of parsing text and running commands, each with its own idioms based on its own axioms. It seems as if a composer is lambasting the clarinet because they don’t care for its fingerings. I’ve only used xargs sparingly; can somebody enlighten me as to why it’s bad, aside from the fact that there are other ways to do some things it does?
I wish this was the default behavior of xargs (the 'tr \\n \\0 | xargs -0' bit). I don't know why xargs splits on spaces and tabs as well as newlines by default and doesn't even have a flag to just split on lines.
Ok filenames can theoretically have newlines in them but I'd be happy to deal with that weird case. I can't recall ever having encountered it in years of using bash on various systems.
Shell pipes would then orthogonally provide the stuff like substitution that xargs does in its own unique way (that I just can't be bothered learning) - instead you'd just pipe the find output through sed or 'grep -v' or whatever you wanted before piping into xargs.
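I.e. something along these lines (names are made up; it falls over if a filename actually contains a newline):

    find . -name '*.log' | grep -v /cache/ | tr '\n' '\0' | xargs -0 gzip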
I guess that's what aliases are for, but I'm too lazy these days to bother with configuring often short-lived systems all the time.
xargs defaults to all whitespace because it was designed to get around the problem of short argv lengths (like, I'm talking 4k or less on older Unix-y systems, sometimes as low as 255 bytes).
So the defaults went with the principle of least surprise, pretending it's like a very long args list that you could theoretically enter at the shell, including quotes.
You could, for example, edit the args list in vi and line split / indent as you please but not impact the end result.
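You can see that shell-ish treatment of quotes in the default splitting:

    printf '%s\n' "'foo bar' baz" | xargs -n 1 echo
    # foo bar
    # baz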
I'm not sure I like the `$1` and shell function pattern. It might avoid the -I minilanguage, but at the cost of "being clever" in a way that takes a minute to wrap your head around. It's a neat trick, but I don't think it would be easy to understand if you are reading the code for the first time.
I find that using the example of `rm` to discuss whether to pick `find -exec` or `find | xargs` rather strange, given the existence of `find -delete`. Maybe pick a different example operation to automate.
The linked article doesn’t suggest this. They explicitly suggest against it.
> Besides the extra ls, the suggestion is bad because it relies on shell's word splitting. This is due to the unquoted $(). It's better to rely on the splitting algorithms in xargs, because they're simpler and more powerful.
> A lobste.rs user asked why you would use find | xargs rather than find -exec. The answer is that it can be much faster. If you’re trying to rm 10,000 files, you can start one process instead of 10,000 processes!
Fair enough, but I still favor find -exec. I find it generally less error prone, and it's never been so slow that I wished I had instead used xargs.
Also, if you're specifically using -exec rm with find, you could instead use find with -delete.
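Side by side, for an illustrative cleanup:

    find . -name '*.tmp' -delete                  # no child processes at all
    find . -name '*.tmp' -exec rm {} +            # batches arguments, much like xargs
    find . -name '*.tmp' -print0 | xargs -0 rm    # batches too, and composes with filters in between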
I tend to prefer xargs because it works in more contexts e.g. I've got a tool which automatically generates databases but sometimes the cleanup doesn't work. `find -exec` does nothing, but `xargs -n1 dropdb` (following an intermediate grep) does the job. From there, it makes sense to… just use xargs everywhere.
And I always fail to remember that the -exec terminator must be escaped in zsh, so using -exec always takes me multiple tries. So I only use -exec when I must (for `find` predicates).
I agree. `find somewhere -exec some_command {} +` can be dramatically faster, but it does not guarantee a single invocation of `some_command`; it may make multiple invocations if you pass very large numbers of matching files.
After spending a bit of time reading the man page for find, I rarely use xargs any more. find is pretty good.
Tangent:
Another instance I've seen where spawning many processes can lead to bad performance is in bash scripts for git pre-receive hooks, to scan and validate the commit messages of a range of commits before accepting them. It is pretty easy to cobble together some loop in a bash script that executes multiple processes _per commit_. That's fine for typical small pushes of 1-20 commits -- but if someone needs to do serious graph surgery and push a branch of 1,000-10,000 commits, that can cause very long running times -- and more seriously, timeouts, where the entire push gets rejected because the pre-receive script takes too long. A small program using the libgit2 API can do the same work at the cost of a single process, although then you have the fun of figuring out how to build, install and maintain binary git pre-receive hooks.
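A hedged sketch of the difference (the variable names and the subject-line rule are made up; in a real pre-receive hook $old and $new come from stdin):

    # per-commit: two extra processes for every commit in the push
    for c in $(git rev-list "$old..$new"); do
        git log -1 --format=%s "$c" | grep -qE '^[A-Z]+-[0-9]+: ' || exit 1
    done

    # batched: one git and one grep for the whole range
    git log --format=%s "$old..$new" | grep -qvE '^[A-Z]+-[0-9]+: ' && exit 1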
Not OP but to me the best thing about PowerShell is that it recognizes that text is not always the best way to output results from commands if you care about creating pipelines. In short, it passes objects around so there's no need for parsing text.
You don't need to `foreach { Remove-Item $_.Name }` because Remove-Item can take the objects returned by Get-ChildItem directly.
Also, expanding the regex into `-Include` parameters is somewhat cheating since `-Include` only takes globs, and it just so happens that that particular regex can be converted into globs.
The general equivalent is:
gci -re | ?{ $_.Name -match '.*_test\.(py|cc)' } | ri
(I used the shorter aliases because someone will probably read yours and reinforce the stereotype that PS is overly verbose.)
Thanks, this is definitely closer to the original use of `egrep`! As for aliases, I prefer long forms because I don't need to think what the seemingly random collections of letters mean, and tab-completion / PowerShell ISE makes it mostly a non-issue when writing.