Hidden gems of moreutils (jpospisil.com)
228 points by jiripospisil 8 months ago | 59 comments



> `pee` [...] It runs the commands using popen, so it's actually passing them to /bin/sh -c (which on most systems is a symlink to Bash).

Do not assume /bin/sh is Bash!!

On Debian-based systems, including Ubuntu, Dash is the default non-interactive shell. Specifically, Dash does not support common Bashisms.

(Not to mention Alpine, OpenWRT, FreeBSD, ...)

This is a bit of a pet-peeve of mine. If you're dropping a reference to `/bin/sh -c` like the reader knows what that means, then you don't need to tell them that "it's a symlink to Bash." They know their own system better than you.


> Specifically, Dash does not support common Bashisms.

More importantly, Bash should not support Bashisms when called as /bin/sh (either).

If you want to use Bashisms just invoke Bash.


Huh. I didn't realize that.

https://wiki.debian.org/Shell


Rule of thumb about shells:

  You'll never know for sure what shell is the default, so write your scripts to a minimum shell family, and encode the name of that shell family in the shebang.
The default shell, for either the entire system or an individual user, could be:

  - A Bourne shell
  - A C shell
  - A POSIX shell
  - A "modern" shell, like Bash, Zsh, Fish, Osh
Use the following shebangs to call the class of shell you expect. Start with /usr/bin/env to allow the system to locate the executable based on the current PATH environment variable rather than a fixed filesystem path.

  #!/usr/bin/env sh
  #              ^ should result in a Bourne-like shell, on modern systems, but could also be
  #                a POSIX shell (like Ash), which is not really backwards compatible

  #!/usr/bin/env csh
  #              ^ should result in a C shell

  #!/usr/bin/env bash
  #              ^ should result in a Bash shell, though the only version you should
  #                expect is version 3.2.57, the last GPLv2 version

  #!/usr/bin/env dash
  #              ^ you expect the system to have a specific shell. if your code depends
  #                on Dash features, do this. otherwise write your code using an earlier
  #                version of the original shell family, such as Bourne or POSIX.

If you need to pass a command-line argument to the shell before it interprets your script, use the fixed filesystem path for the shell. Most kernels pass everything after the initial shebang path to the interpreter as a single argument, which is probably not what you (or the shell) expect. (E.g. use #!/bin/bash --posix instead of #!/usr/bin/env bash --posix; the latter may fail because env gets handed the single argument "bash --posix".)

Different shells have different implementation details. For example, Bash tends to skew toward POSIX conformance by default, even if it conflicts with traditional Bourne shell behavior. Dash may try to implement POSIX semantics, but also has its own specific behavior incompatible with POSIX.

In order to get specific behavior (like POSIX emulation), you may need to add a flag like --posix, or set an environment variable like POSIXLY_CORRECT. For shells that support it, you can also detect the version of the shell running, and bail if the version isn't compatible with your script.
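
For example, a minimal sketch of that version check, assuming Bash (BASH_VERSINFO is a Bash-specific variable):

    #!/usr/bin/env bash
    # Bail out early if the running Bash is older than 4.x.
    if [ -z "${BASH_VERSINFO:-}" ] || [ "${BASH_VERSINFO[0]}" -lt 4 ]; then
      echo "this script requires Bash >= 4" >&2
      exit 1
    fi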

Here are some references of the differences between shells:

  - Comparison of command shells[1]
  - Major differences between Bash and Bourne shell[2]
  - Practical differences between Bash and Zsh[3]
  - Fundamental differences between mainstream \*NIX shells[4]
[1] https://en.wikipedia.org/wiki/Comparison_of_command_shells

[2] https://www.gnu.org/software/bash/manual/html_node/Major-Dif...

[3] https://apple.stackexchange.com/questions/361870/what-are-th...

[4] https://unix.stackexchange.com/questions/3320/what-are-the-f...


Just checked; macOS comes with Dash in addition to Bourne shell (sh) and Bash.


I use `ts` quite often in adhoc logging/monitoring contexts. Because it uses strftime() under the hood, it has automatic support for '%F' and '%T', which are much easier to type than '%Y-%m-%d' and '%H:%M:%S'. Plus, it also has support for high-resolution times via '%.T', '%.s', and '%.S':

  echo 'hello world' | ts '[%F %.T]'
  [2023-12-30 16:25:40.463640] hello world


Assuming semi-recent bash(1), you can also get away with something like

    while read -r line; do printf '%(%F %T %s)T %s\n' "-1" "${line}"; done
as the right-hand side/reader of a shell pipe for most of what ts(1) offers. ("-1" makes the embedded strftime(3) format string assume the current time as its input).


I recommend the zmwangx/ets package; it's a modern version of ts. I'm using it in CI/CD pipelines in GitLab for debugging performance.


Thanks a lot for the link, I've needed an improved `ts` for a long time.


The ‘logger’ command can also be useful.


Not a moreutil, but I recently discovered `pv`, the pipe viewer, and it's so useful. Like tqdm (the Python progress bar library) but as a Unix utility. Just put it between two pipes and it'll display the rate of bytes/lines passing through.
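
For example (file names hypothetical):

    pv big.tar | gzip > big.tar.gz

pv prints a progress bar, throughput, and ETA here; it knows the total size because its argument is a regular file.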

Apparently it's neither a coreutil nor a moreutil.

Here's an HN discussion from 2022: https://news.ycombinator.com/item?id=33244768


If you have a long-running copy process but forgot to enable progress output, you can use the "progress" utility to show the progress of something that is already running.

It supports: cp mv dd tar bsdtar cat rsync scp grep fgrep egrep cut sort cksum md5sum sha1sum sha224sum sha256sum sha384sum sha512sum adb gzip gunzip bzip2 bunzip2 xz unxz lzma unlzma 7z 7za zip unzip zcat bzcat lzcat coreutils split gpg gcp gmv



You can also use pv as you would use cat, e.g.

    pv file.tar.gz.part1 file.tar.gz.part2 | tar -x -z
Just like cat, pv used this way will stream out the concatenation of the passed-in file paths; but it will also add up the sizes of these files and use them to calculate the total stream size, i.e. the divisor for its displayed progress percentage.


> Like tqdm (Python progressbar library) but as a Unix utility.

FYI: tqdm can be used in a shell pipeline as well. It's documented (at least) in their readme: https://github.com/tqdm/tqdm#module
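
E.g. something like this (going from memory of that readme, so treat it as a sketch):

    seq 9999999 | tqdm --bytes | wc -l

tqdm passes stdin through to stdout and draws its progress meter on stderr, so it doesn't disturb the pipeline.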


I have also discovered that certain implementations of dd have a progress printing functionality that can be used for similar purposes. You can put a "dd status=progress" in a pipeline and it will print the amount and rate of data being piped!

This dd option is not as nice as pipe viewer but it's handy for when pv isn't around for some reason.
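
For example, as a crude pipe meter (GNU dd; directory name hypothetical):

    tar -cf - somedir | dd status=progress | gzip > somedir.tar.gz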


Even if you don't pass this argument, you can poke most implementations of dd(1) with a certain POSIX signal, and they'll respond by printing a progress line.

On Linux, this is SIGUSR1, and you have to use kill(1) to send it.
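
Something like this, assuming a single running dd process:

    kill -USR1 "$(pgrep -x dd)"

dd prints its usual records-in/records-out progress line to stderr and keeps going.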

On BSDs (incl. macOS), though, the signal dd listens for is instead called SIGINFO (a name that makes it a lot more obvious why a process would respond to it this way). Shells/terminal emulators on these platforms emit SIGINFO when you type Ctrl+T into them!

(For a lot more useful info about this behavior: https://stuff-things.net/2016/04/06/that-one-stupid-dd-trick...)

Bonus fact not mentioned in the above article: dd used in the middle of a pipeline will still "hear" Ctrl+T and print progress, since keyboard-generated signals (think: SIGINT from Ctrl+C) are delivered to every process in the foreground process group started by the command. Test it yourself:

    cat /dev/zero | dd count=10000000 bs=1024 | cat > /dev/null


Whenever dd is mentioned, one should always point to https://www.vidarholen.net/contents/blog/?p=479

For like >99% of cases where people use dd, they would have been better off using a different tool.


...except in the absolute most-common case, as the article mentions: making or restoring backups of disks.

Not because dd is needed for this case, mind you. Rather just because disks are big, and so it's a lot faster to (be able to) specify a big block size. One much larger than the disk's own sector size; one much larger, even, than the current 128KiB default of cat(1)!

Consider: cat's design principle is that it's reading a stream (or series of streams) serially, and writing that concatenated stream serially, as a synchronous back-and-forth process. cat(1) aims to have a low, fixed memory usage: it reads data into a buffer, then writes data from that buffer to the output. It only ever uses the one buffer, and that buffer is static — part of the program's .bss section.

dd(1), meanwhile, is an asynchronous pipeline between the read and write ends. It doesn't care how much memory it uses; it'll use however much is required to do the job you've configured it to do. It doesn't mind submitting a bunch of simultaneous read requests; assembling them out-of-order into a buffer as large as you like; and then writing out the buffer with a bunch of simultaneous write requests. And it doesn't need to fully assemble the "read batch" before turning it into a "write batch" — dd(1) uses a read-write ring-buffer, so as soon as the earliest contiguous read-results are available, they get fed into the write queue.

This design means that dd(1) can take strong advantage of devices like NVMe that use high internal read/write parallelism when slammed with a high IO-queue-depth of read/write requests. Which is exactly what you're doing when you specify a `bs=` to dd(1) that's larger than the disk's sector size.

Try it out for yourself — do something like:

    for p in $(seq 5 20); do
      bs=$(echo 2^$p | bc)
      echo "bs=$bs"
      time dd if=/dev/nvme0n1 of=/dev/nvme1n1 bs=$bs
    done
And, while each one is executing, do an `iostat -x` and look at the `aqu-sz` column. The speed of the copy should peak at exactly the same time that your aqu-sz hits the effective limit of the read-parallelism of the source disk or the write-parallelism of the target disk.




Yeah, that bit me in the butt once on macOS: USR1 killed dd. When I want progress from dd and didn't set status, I mostly use progress now. It also works with many other utils like cp, xz, and all the usual suspects.


For copying disks, an even more powerful tool is ddrescue. It works exactly like dd but writes its progress into a map-file that can be used to pause and resume the operation any time. It can also skip certain sectors known to be broken.

Super useful if you have a big slow disk that you have to copy under unreliable conditions. Saved my ass several times with old partially broken hard drives.
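
A minimal sketch (device and file names hypothetical):

    ddrescue /dev/sdb disk.img disk.map

Interrupt it whenever you like; running the exact same command again resumes from where the map file left off.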


It's a small tool that really reduces anxiety when you have to transfer large files.


One typo that's easy to make is:

  sort file.txt | sponge > file.txt
(i.e., using redirection rather than passing the path as an argument to sponge)

This is wrong and will not work! I've been bitten by it before.
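
The working version passes the path as an argument, so sponge can soak up all the input before opening the file:

    sort file.txt | sponge file.txt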


I use `sponge` and `ts` (mentioned in the article) pretty regularly, and am really glad to have them.

I have used `isutf8` a fair amount in the past, but I find it mostly redundant these days (thankfully!)

The other one that I don't use very often, but is absolutely invaluable when I do need it, is `ifne` - "Run command if the standard input is not empty". It's like `-r` from GNU `xargs`, but for any command in a pipeline.
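
For example (paraphrasing the classic man-page-style usage; recipient hypothetical):

    find / -name core | ifne mail -s "Core files found" root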


For execsnoop, people running systems with DTrace can find the same:

* https://github.com/jorgev/dtrace-scripts/blob/master/execsno...

On macOS Monterey+ you'll probably have to install the Kernel Debug Kit per:

* https://developer.apple.com/forums/thread/692444

The Linux variant was written by Brendan Gregg (who previously did a lot of work on Solaris, where DTrace was created):

* https://github.com/brendangregg/perf-tools/blob/master/execs...

* https://github.com/iovisor/bcc/blob/master/tools/execsnoop.p...


I'd never heard of the :cq command in vim before. Seems useful, but in practice it's so unknown that things like editing the git commit message cannot rely on it and instead check whether the file has been changed. Also, reading its documentation, it probably would be better named :cqall .


I was wondering that too, although I don't have access to vim right now. What's the punch line? EDIT: The difference with :q! is the exit code!

(Yes, and :wall is actually the :update command on all your buffers; that is, unlike :w, buffers are written only if there have been changes. Bad naming is the mother of all pedagogical pain.)


As far as I remember, it works with git commit just fine. It's also far from being unknown.


Yeah, I should've written "cannot rely on that alone and also check". I've worked with vi, later vim, for 34 years and read about it here first; ddg'ing for it doesn't give many hits.


I use that often for aborting the current commit or the current git interactive rebase.


Any reason not to just :q! ?


:q! works just as well for aborting a commit, that's actually what I started the subthread with. Don't know about rebase.


Interactive rebase will execute the rebase if you just do :q!.


It's very useful if you want to abort, e.g. when editing an interactive rebase and deciding not to go through with it.


I just learned about vidir [1]. Emacs Dired [2] can rename & delete files by editing the buffer directly, and let's just say I was thrilled when I saw someone had replicated that behavior as a general Unix tool.

[1] https://github.com/trapd00r/vidir

[2] https://www.gnu.org/software/emacs/manual/html_node/emacs/Wd...


I have a personal vendetta against moreutils because it provides a vastly inferior version of parallel compared to GNU Parallel. GNU Parallel alone provides more value than moreutils, all of which can be replaced by UNIX one-liners or Emacs.


You are not the only one. Every parallel user has to fight moreutils.


Yeah, I use chronic all the time for my cron jobs, so they only email me if they fail and I can still print helpful output from them. Love moreutils.
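
A sketch of that crontab setup (script path hypothetical); cron emails any output, and chronic only produces output when the command fails:

    0 3 * * * chronic /usr/local/bin/backup.sh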


`pee` - no doubt the dev was delighted and amused


> In Vim / Helix you do that with :cq

Never heard of that before. I generally use :q! or ZQ

Is there a difference?


Yes, the exit code. See e.g. `:help cq` in vim. :q! and ZQ will yield exit code 0, which sometimes is not what you want if you want to ensure some task is properly aborted.
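
A quick way to see it, assuming vim is on your PATH:

    vim -c 'cq'; echo $?    # prints 1
    vim -c 'q!'; echo $?    # prints 0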


Thanks!


vidir within ranger is really nice. vipe is also pretty cool. Mostly I use vipe for editing my clipboard contents and then sending the modified version back to the clipboard, or occasionally editing some text stream before sending it to my clipboard, such as some grep output I only want some of.
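
That clipboard round-trip looks something like this on X11 (assuming xclip is installed):

    xclip -selection clipboard -o | vipe | xclip -selection clipboard -i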


moreutils parallel can also come in handy for quick command parallelization (not to be confused with GNU parallel which serves a similar purpose but can be more complicated)


And GNU parallel is very aggressive about citations, which I get, but it's also too much.


I have switched to using xargs to parallelize things: it has the benefit of being part of POSIX, and is not annoying about citations like parallel.
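
For example (sleep as a stand-in workload):

    # run up to 4 jobs at a time, one argument per invocation
    seq 1 8 | xargs -n1 -P4 sleep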


The parallelism isn't part of POSIX though (AFAIK), that's an extension by whoever wrote your xargs.

If what you really mean is that it's already installed on every machine you use, fair enough. But it's not strictly portable in some standards-based sense.


That they occupy the same namespace is always very annoying. Instead of just `brew upgrade` I must unlink and later link --overwrite parallel.


In case anyone was wondering, the moreutils tools:

  chronic: runs a command quietly unless it fails
  combine: combine the lines in two files using boolean operations
  errno: look up errno names and descriptions
  ifdata: get network interface info without parsing ifconfig output
  ifne: run a program if the standard input is not empty
  isutf8: check if a file or standard input is utf-8
  lckdo: execute a program with a lock held
  mispipe: pipe two commands, returning the exit status of the first
  parallel: run multiple jobs at once
  pee: tee standard input to pipes
  sponge: soak up standard input and write to a file
  ts: timestamp standard input
  vidir: edit a directory in your text editor
  vipe: insert a text editor into a pipe
  zrun: automatically uncompress arguments to command
from https://rentes.github.io/unix/utilities/2015/07/27/moreutils...

Similarly, there's some lesser-known useful stuff in GNU Coreutils:

https://en.wikipedia.org/wiki/List_of_GNU_Core_Utilities_com...

  paste: Merges lines of files
  expand: Converts tabs to spaces
  seq: prints a sequence of numbers
  shuf: shuffles its input


Confusingly, moreutils parallel is not "GNU parallel". Moreutils parallel is very simple, while the other parallel is very featureful. Linux distributions can deal with the conflict, but bad package managers like homebrew cannot.


"rename" shares the same fate, there are several implantations, completely different from each other


seq is “less known”? I’d assume anyone familiar with shell scripting would know about it.

It’s a great starting point for entertaining children on a terminal:

    seq 99 -1 0 | xargs printf '%s bottles of beer on the wall…\n'


Bash kinda broke seq for me since you can just write {99..0}
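
E.g. the bottles example without seq (Bash brace expansion; printf reuses its format string for each argument):

    printf '%s bottles of beer on the wall…\n' {99..0}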


Or if you need dynamic arguments, you probably want:

    N=9; for ((i=0; i<=N; ++i)); do echo "$i"; done


Why use this form of for when you can just use seq and it works in any shell including fish?


Because capturing the output of `seq` requires spawning a whole separate process (significant for small sequences) and shoving all the data into a single buffer (significant for large sequences) rather than working incrementally.


Less known maybe for most people who read HN, but you're right that a lot of shell scripting folks would know about it.


thanks for the tl;dr

moreutils and similar pop up every year but are still easy to forget.. they should be part of core distributions nowadays..



