Bash retry function with exponential backoff (gist.github.com)
121 points by geocrasher on Dec 29, 2022 | 47 comments



This won’t work if you retry another function and are relying on set -e to detect errors:

  set -e

  retry() {
    until "$@"
    do
      sleep 1
    done
  }

  f() {
    false
    echo oh no
  }

  retry f
  # oh no
set -e is ignored while a function executes as part of a condition (here, the until test), so the false inside f does not abort it and "oh no" is printed anyway.
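
For contrast, a quick sketch: call f directly (not through retry) under set -e and the script does stop at the false, so "oh no" is never printed:

  set -e

  f() {
    false
    echo oh no
  }

  f   # the script exits here with status 1; "oh no" never appears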

Shell scripts are great for when you just need to run stuff. If something bad happens you notice it and run another command to ameliorate the error. This is how most interactive shell sessions pan out.

As soon as you need to handle errors programmatically I heartily suggest that you switch to a language with fewer quirks than sh.


You should never use set -e. Because shell scripts use exit status both to indicate failure and as a kind of boolean, this option has to guess your intent using a bunch of rules that are convoluted and can differ between shells, or even between versions. It's only a matter of time before they do something you don't expect. Putting || exit or || return everywhere really is the better option if you can't switch to a better language like Python.
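
In that style every command that can fail carries its own handler, e.g. (a sketch with hypothetical commands and paths):

  cd "$build_dir" || exit
  make || exit
  cp out/app /usr/local/bin/app || exit 1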

I highly recommend https://mywiki.wooledge.org/BashPitfalls. It should be required reading before anyone writes a shell script.


No no no. Shell scripts without -e are one of the biggest footguns since computers were invented. The second biggest being shell scripts with -e enabled ;) As others mentioned, -u and pipefail are also highly recommended.

Even mundane shit like 'cd' can fail where you least expect it, requiring || exit on practically every line of the script. Forget one place and boom. That defeats the premise of "it's just a short, 'simple' script, so let's not write it in a sane language".

If you want to write serious scripts and spend weeks scrutinizing every line, NASA style, it is perhaps better to explicitly handle every potential error. But good luck doing that in a large team of junior web developers just gluing stuff together in the build pipeline. The pitfalls with -e linked above are mostly related to functions; if your script has reached the size of requiring functions, you have kind of lost already.


> Even mundane shit like ‘cd’ can fail where you least expect it, requiring || exit on practically every line of the script. Forget one place and boom.

This is true, but a much better solution is for everyone to use shellcheck. You can require that all shell scripts pass it without warnings and that every shellcheck disable comment has a good explanation attached; the first part can be automated with a pre-commit hook. In return, you get a script that is explicit and predictable rather than relying on a set of rules which you don't understand, and which may come back to bite you later when you change shells or even bash versions.
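
A minimal pre-commit hook along those lines might look like this (a sketch; the hook path and the *.sh glob are assumptions about the repo layout):

  #!/usr/bin/env bash
  # Hypothetical .git/hooks/pre-commit: block the commit if any staged
  # *.sh file fails shellcheck.
  set -u
  status=0
  while IFS= read -r f; do
      shellcheck "$f" || status=1
  done < <(git diff --cached --name-only --diff-filter=ACM -- '*.sh')
  exit "$status"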

> spend weeks scrutinizing every line, NASA style

This is a colossal exaggeration. I've switched from set -e to putting || return or || exit everywhere and it barely takes more time after I got in the habit. On the rare occasions I get distracted and forget, shellcheck reminds me. I have my editor configured to run it on the fly and underline lines with warnings.


While shellcheck is a fantastic tool, it doesn't test for lack of error handling generally. In this area it only has a predefined set of known pitfalls, like "cd". Try running this through shellcheck and it will happily let it through.

    curl icanhazip.com > myip.txt
    cat myip.txt
That takes you back to the every-line-must-be-scrutinized problem. Maybe you can do it. Maybe I can do it. But does the rest of your large organization, with its diverse levels of experience, have the discipline to do it? In my experience, from multiple organizations of different sizes - no. Even with shellcheck and mandatory reviews it just doesn't work. It's tiring to remind someone for the 100th time that curl can fail and that arguments must be quoted; you worry about being labeled the know-it-all besserwisser who shoots down even simple scripts, and that's assuming the reviewer spots the errors in the first place. What's worse is that the code doesn't immediately tell you whether errors have been considered or not, making it very difficult to read existing scripts.
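
For the record, a version of that snippet that fails loudly would be something like this (a sketch; -f makes curl treat HTTP errors as failures):

    curl -fsS icanhazip.com > myip.txt || exit 1
    cat myip.txt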

Shellcheck btw has an optional check for the case of functions used in combination with errexit https://www.shellcheck.net/wiki/SC2310


Thanks for sharing where you're coming from. My opinion comes from maintaining shell scripts either alone or in small teams of people who learn quickly. Perhaps the optimal solution is different when the team is large. I still think that manual error handling is best if you're disciplined enough to do it, but I'll make sure to qualify this view next time I'm talking about it.


This is terrible advice. -e doesn’t prevent you from adding more careful error handling but it does catch a significant majority of the problems which will, with absolute certainty, happen because nobody ever adds “|| exit” everywhere it’s necessary.

-e and shellcheck should be mandatory for any script which can’t be written in Python. Once you’re over a screen or two of code, it’s almost always shorter to write in Python anyway and it’ll be easier to understand.


If you already use shellcheck and know to add || exit, there's no need to rely on a half-solution that may fail in unintuitive and hard to predict ways. Counting on convoluted "automagical" rules to guess your intent correctly every time is how you write unreliable programs. set -e just gives you a false sense of security.

> Once you’re over a screen or two of code, it’s almost always shorter to write in Python anyway

That I agree with.


I would argue that it’s better to start with -e and deal with the handful of edge cases where something like grep is expected to return a non-zero value rather than have to remember to check every command. We now have half a century of evidence suggesting that won’t work reliably for all but the most conscientious shell scripters.
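
The usual way to deal with those expected non-zero cases is to handle them inline so that -e stays enabled for everything else, e.g. (a sketch with hypothetical pattern/file variables):

  set -e
  matches=$(grep -c "$pattern" "$file" || true)   # grep exiting 1 on "no match" is fine here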


That quote from Jurassic Park comes to mind.

  As soon as you need to handle errors programmatically I heartily
  suggest that you switch to a language with fewer quirks than sh.
Basically that. `set -e` is evil.


I think a script that chugs along in the face of a failure, blindly running commands that relied on what happened before it, is probably more evil, though.


set -e creates two problems. The first is that what is considered "failure" by the shell is not consistent or intuitive. The second is that what each of your tools consider failure is also not necessarily intuitive. You're getting a false sense of security at best.

E.g. grep considers no matches worthy of a non-zero (a.k.a. error) exit code, while df considers an invalid block size an error worthy of an exit code of zero. Ostensibly printf(1) exits with non-zero on an error, but a format specifier with insufficient arguments is not an error.
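
grep's side of that is easy to see at a prompt:

  $ grep foo /dev/null; echo $?
  1
  $ echo foo | grep foo > /dev/null; echo $?
  0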


So basically, Go? ;)


No, it's not. Nor did the parent indicate such.

set -euo pipefail should be the default so that the script exits on at least some unexpected states/command results. If you want to do more than that with various fallbacks etc., then use another language.
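
As a preamble that is just:

  #!/usr/bin/env bash
  set -euo pipefail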


The problem with set -e is that you end up failing in inconsistent and non-intuitive ways. e.g.

https://stackoverflow.com/questions/4072984/how-do-i-get-the...

https://serverfault.com/questions/143445/what-does-set-e-do-...

pipefail is worse because tools like grep return non-zero for things that aren't inherently errors. Error handling in sh-like shells is archaic enough that if you're reaching for it you should strongly consider reaching for a different language.
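
For example (hypothetical daemon name), an ordinary "find the PID" pipeline becomes fatal under errexit plus pipefail whenever grep matches nothing:

  set -eo pipefail
  # if the daemon isn't running, grep exits 1, pipefail propagates that
  # through the pipeline, and errexit kills the script on the assignment
  pids=$(ps aux | grep '[s]ome-daemon' | awk '{print $2}')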


You're misinterpreting what I said.

I'm saying that a script which would have the potential to fail unintuitively from `set -euo pipefail` probably shouldn't be written in shell, and most definitely shouldn't continue to execute once an unexpected state has occurred.

You're right that there are a lot of cases in which non-zero exit codes are the expected behaviour. But if you're accepting these commands with non-zero exit codes then you're already doing error handling by verifying the output of the commands (at least hopefully), which was the original criterion from the OP for writing the script in another language.


I posted this because it worked perfectly for what I needed it for: a watchdog script on a cron job that issues a PING to a process that occasionally times out. I'm sure it has a million flaws. Everything in Bash does, depending on who you ask. I'm certain there are a thousand reasons it won't work for somebody's application. It worked for mine.


Changed a bit in case of wrong or weird input. (The quotes on the right side of assignments aren't necessary but I find it to be a good practice to use quotes everywhere.)

  #!/usr/bin/env bash

  function retry {
      local retries="$1"
      shift
  
      case "$retries" in
          ""|*[!0-9]*)
              echo "retry: First argument must be a number, got '$retries'" 1>&2
              return 1
          ;;
      esac
  
      local count=0
      until "$@"; do
          local exit="$?"
          local wait="$((2 ** count))"
          count="$((count + 1))"
          if [ "$count" -lt "$retries" ]; then
              echo "Retry $count/$retries exited $exit, retrying in $wait seconds..." 1>&2
              sleep "$wait"
          else
              echo "Retry $count/$retries exited $exit, no more retries left." 1>&2
              return "$exit"
          fi
      done
      return 0
  }

  retry "$@"
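
Saved as, say, retry.sh (a hypothetical name) and made executable, it is invoked with the retry count first and then the command to run, e.g.:

  ./retry.sh 5 curl -fsS https://example.com/health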


This technique is useful for running things that normally run quickly, but sometimes need more time.

The technique is used extensively in the GNU coreutils test suite, as detailed in the "performance" section at https://www.pixelbeat.org/docs/coreutils-testing.html


Pure exponential backoff can result in very long delays. It is often useful to truncate (cap) the sleep time, and for jobs contending over a shared resource, to add some jitter as well.
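
Capping is a small change to the delay computation in a function like the one above (a sketch, with a hypothetical max_wait):

  max_wait=60
  wait="$((2 ** count))"
  if [ "$wait" -gt "$max_wait" ]; then
      wait="$max_wait"
  fi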


This function is called with an argument that limits the number of retries, so unlimited exponential backoff isn't an issue. Jitter is an interesting idea.


Truncation/capping the sleep time is useful even under a finite number of retries?


Yeah a plateau and repeat limit are useful.

Also fibonacci could be considered.


I would add some random jitter to the timing of the backoff. If the failure is due to two such processes getting themselves into some sort of deadlock and they both fail at the same time, they may retry at the same moment if they are at the same point in the retry cycle, and fail again for the same reason.
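
In bash, $RANDOM is a cheap way to add that jitter to the computed delay (a sketch):

  wait="$((2 ** count))"
  wait="$((wait + RANDOM % (wait + 1)))"   # spread retries out so two failing peers drift apart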


`until "$@"; do` is pretty crafty.


Crafty or crufty?


Could you elaborate? What does it do?


bash's 'until' keyword executes its condition as a command and runs the loop body as long as that command's exit code is non-zero. Usually we use "[", the test built-in, as that condition (like "until [ $n -gt 5 ]"), and it works the same way. But here the script uses "$@", which is the rest-of-args "splat" in bash. 'until' thus executes "$@" and enters the inner body whenever that command returns non-zero. This is exactly what we want, since that inner body is the retry logic. Just a fun and weird bashism in the wild.

(Sorry if there’s minor errors here, on a phone and going off of memory)


In this day and age I'm surprised there isn't something like this built into the standard library of every language. Or at least Python.


Indeed. At least there is a very good decorator on PyPI: https://pypi.org/project/retry/ But having it in the stdlib would make a ton of sense.


Here is something similar in pwsh; it retries a script block while an exception is thrown, until there is no exception or a timeout:

https://github.com/majkinetor/posh/blob/master/MM_Sugar/Wait...


I was today years old when I learned that bash can do simple arithmetic with $(($i + 1)). Thanks!


If you just want to add/subtract 1, in bash you can do just '((i++))', i.e.:

  [mmh@x670]$ i=0
  [mmh@x670]$ ((i++)); echo $i
  1
  [mmh@x670]$ ((i++)); echo $i
  2
Beware that the above is very -bashy- and purists will rip your head off for using it.


There is a gotcha here if you use:

  set -o errexit
Because *any bash arithmetic expression that evaluates to zero returns the exit code of 1*.

So this works:

  #!/usr/bin/env bash
  set -o errexit
  i=0
  ((++i))
  echo $i # 1 will be printed to stdout
But this doesn't:

  #!/usr/bin/env bash
  set -o errexit
  i=0
  ((i++)) # terminates here with $? of 1
  echo $i
The relevant doc is for `let`:

       let arg [arg ...]
              Each arg is an arithmetic expression to be evaluated (see
              ARITHMETIC EVALUATION above).  If the last arg evaluates
              to 0, let returns 1; 0 is returned otherwise.
https://www.gnu.org/software/bash/manual/bash.html#index-let


Yeah, it's best to just use POSIX shell arithmetic, and not be clever, and not use the other 2 ways bash has to do it! That is:

    i=$((i + 1))
The (( )) construct, without the $, is different, and it's also not POSIX. It has the gotchas around the exit code, which a plain assignment doesn't have.


Also, the wooledge wiki is an amazing resource for Bash/shell scripting. To learn more magic see: https://mywiki.wooledge.org/ArithmeticExpression


Whoa. No dollar signs at all?

I’m at Lowes — does ((i+1)) work? Or is it just for incrementing / decrementing (which is still very useful!)

EDIT: thinking this over, does this work for variable substitution too? E.g. ((i)) being equivalent to $i. Not that you’d necessarily want to…


`(( ))` is an arithmetic evaluation block. Its content has to be an arithmetic expression. Arithmetic expressions don't require `$` before simple variable names and some more complex expressions like array indexing.

`(( i + 1 ))` will evaluate the result of adding one to `$i` and then throw away the result. It doesn't do anything useful (other than having a different exit code depending on whether the expression evaluated to 0 or not).

`$(( ... ))` evaluates the expression and then returns its value. ie `i=$(( i + 1 ))` will increment `$i`, just like `(( i++ ))` did.


To add arbitrary numbers, you'd have to do an assignment, like this:

  [mmh@x670]$ ((i=i+2)); echo $i
  6
  [mmh@x670]$ ((i=i+2)); echo $i
  8


It does also matter where the plus signs are:

  $ i=0
  $ echo $((++i))
  1
  $ echo $((i++))
  1
  $ echo $((++i))
  3
  $
((++i)) adds 1 before the value is expanded, and ((i++)) adds 1 after the old value has been expanded, which is why the second echo above still printed 1.


You can also use the builtin command `let`:

    let count=0
    until "$@"; do
        let wait=2**count
        sleep $wait
        let count=count+1
    done


Or with two brackets instead of four parens:

$[i+4]

And, btw, res=$(command) instead of those pesky backticks res=`command` ,-)


Or, more simply, $((i + 1)).

And it's not only bash; it's a feature of POSIX shell.


Or as they did in ancient times $[i+1]


> The old format $[expression] is deprecated and will be removed in upcoming versions of bash.


Don't forget about:

   declare -i i
   i+=1


You need some randomness in how long you'll wait.



