Maybe some controversial advice: Go ahead, fall in these pits.
I write my fair share of shell scripts and I've hit practically every one of these snags in the past. However, for the majority of tasks I perform with bash, I genuinely don't care if I support spaces in filenames, or if I throw away a little efficiency with a few extra sub-shells, or if I can't test numbers vs strings or have a weird notion of booleans.
Your scripts are going to have bugs. The important question is: What happens when they fail?
Are your scripts idempotent? Are they audit-able? Interruptible? Do you have backups before performing destructive operations? How do you verify that they did the right job?
For example, if your shell scripts operate only on files under version control, you can simply run a diff before committing. Rather than spend a bunch of time tracking down a word expansion bug, you can simply rename the one file that failed so that its name doesn't include a space.
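As a rough illustration of what "idempotent, with a backup first" can look like in practice, here is a minimal sketch. The helper name `ensure_line` and the `.bak` suffix are made up for this example; nothing here comes from the linked page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: make sure a file contains a given line.
# Running it twice leaves the file in the same state as running it once.
ensure_line() {
    local line=$1 file=$2
    # Already present as an exact, whole-line match: nothing to do.
    if [[ -f $file ]] && grep -qxF -- "$line" "$file"; then
        return 0
    fi
    # Keep a copy before modifying an existing file (assumed naming: FILE.bak).
    if [[ -f $file ]]; then
        cp -p -- "$file" "$file.bak" || return 1
    fi
    printf '%s\n' "$line" >> "$file"
}
```

Something like `ensure_line 'export EDITOR=vi' ~/.profile` can then be re-run or interrupted and re-run without piling up duplicate lines.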
I live and die by the shell. I'm constantly composing little one-liners, and keep an absurdly long Bash / zsh history to draw from. There are places where the obvious answer is almost always "how about you just write a shell script?"
That said, I long ago reached a place where I realized that, while shell scripting is entertaining, I'd much rather write anything more than a handful of lines in a general purpose programming language. Perl, Python, Ruby, whatever - even PHP involves far less syntactic suffering and general impedance than Bash. It's not that I'm exceptionally worried about correctness in stuff that no one besides me is ever going to use, it's just that once you're past a certain very low threshold of complexity, the agony you spend for a piece of reusable code is so much less. Even just stitching together some standard utilities, there are plenty of times it'll take a tenth as long and a thousandth as much swearing to just write some Perl that uses backticks here and there or mangles STDIN as needed.
> Are your scripts idempotent? Are they
> audit-able? Interruptible? Do you have
> backups before performing destructive
> operations? How do you verify that they
> did the right job?
Every single one of these questions is easier to answer if you're using a less agonizing language than Bash and its relatives.
> Every single one of these questions is easier to answer if you're using a less agonizing language than Bash and its relatives.
I disagree. While the set of things that are "hard" to do is probably larger in shell than the alternatives, the specific questions posed by the grandparent are hard in any language. They all boil down to "how can I correctly do something which has side effects (on external state)?"
Statefulness itself is a pain, and shell is in some sense the ultimate language for simply and flexibly dealing with external state.
Simplicity: the filesystem is an extremely simple and powerful state representation. Show me a language that interacts with the fs more concisely than
tr '[A-Z]' '[a-z]' < upper.txt > lower.txt
Flexibility: if shell can't do it, just use another program in another language that can, like `tr` in the above example. What other language enables polyglot programming like this? Literally any program in any language can become a part of a shell program.
> it's just that once you're past a certain very low threshold of complexity, the agony you spend for a piece of reusable code is so much less.
Here's where I admit I was playing devil's advocate to an extent, because I fully agree with you here. I write lots of shell scripts. I never write big shell scripts. Above some length they just get targeted for replacement in a "real" language, or at the very least, portions of them get rewritten so they can remain small.
Empirically, it also seems true that shell is harder for people to grasp, harder to read, and harder for people to get right. These are real costs that have to be figured in.
PS. Speaking of shell, brennen, we should be working on our weekend project. :)
That's the biggest problem: some things are very simple, but other things fall off a cliff. For example, a related task I ran into recently: how do you replace FOO with the contents of foo.txt? The natural way would be expanding it into a command line, but at least with sed that's no good even for nice short text files, because / and \n are special. You can use a sed command to read a file, which I didn't know existed until I looked it up, but it apparently has the delightful feature that "If file cannot be read for any reason, it is silently ignored and no error condition is set." You can use perl... you can use perl to easily do a lot of things that are really hard to do otherwise (including things as simple as matching a regex and printing capture groups), but at least to me it feels really awkward or wrong to mix two different full-fledged languages. Maybe I should just get over that, but I wish the whole thing were more coherent.
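For what it's worth, one way to do the FOO substitution without leaving bash is pattern substitution on a variable. This is only a sketch: `fill_template` and its argument order are invented for the example, it reads both files fully into memory, and `$(< file)` strips trailing newlines from the inserted content. Quoting the replacement keeps `/`, `\n` and `&` literal (the latter matters on bash 5.2 and later), and an unreadable file makes the function fail instead of being silently ignored.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print TEMPLATE_FILE with every literal PLACEHOLDER
# replaced by the contents of CONTENT_FILE.
fill_template() {
    local template_file=$1 placeholder=$2 content_file=$3
    local template content result
    # $(< file) returns non-zero if the file cannot be read, so we can bail out.
    template=$(< "$template_file") || return 1
    content=$(< "$content_file") || return 1
    # Quoted pattern: matched literally, not as a glob.
    # Quoted replacement: no special meaning for /, \ or &.
    result=${template//"$placeholder"/"$content"}
    printf '%s\n' "$result"
}
```

Usage would be something like `fill_template page.tmpl FOO foo.txt > page.html`, where the file names are placeholders.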
Interesting problem. Some quick head-scratching and googling didn't turn up anything useful on merging templates with awk and sed... then it hit me --- m4 is used for that:
Interesting solution; I should learn to use m4 for various tasks. Probably would have already if I didn't have such a negative visceral reaction to autotools :)
As for "capture groups", you can use lex. I wrote a "code generator" shell script to produce .l files and another script that compiles .l files to one-off utilities.
There is perhaps more coherence to the whole thing than you are aware of. Whether "Linux distros" or "OSX" have maintained that coherency I do not know.
Sounds useful, but it's not portable, and doesn't work on OS X. I suppose I could just switch to GNU sed, since I mostly care about interactive use, but thus far I haven't done so.
I totally agree. Sometimes the strength of making a 'quick bash script' is that you are making a quick bash script. I have made some pretty strong, well-tested projects in bash before, including one that was a big part of an open-source qmail project.
Sometimes though you just need to get stuff done with the least amount of fuss, without worrying about the extreme edge cases that most of the webpage attached to this story talks about. Heck, I'd say the majority of bash work probably ends up like that.
On the other hand, if you examine the given examples, you'll see that very often the "correct" way isn't really longer or harder to type. If you make a point of sticking to the correct way, it will eventually become automatic, and there'll be no fuss and worry. This might end up saving some sorry ass later on.
And having learned the correct way, you'll instantly see it when a script you review is doing something in a way that will eventually bite someone.
There's no downsides to learning and doing things right. Of course it does take some extra time and effort at the start, as with everything.
Sure, there are extreme examples that can be really hard to handle portably and safely if you're doing something more complicated (embedded newlines in filenames come to mind). So in the end some corner cutting is often inevitable :-)
> There's no downsides to learning and doing things right.
There is never "no downsides".
For one example, there is "feedback fatigue". It's easy to pick on syntax or "small" semantics during a code review, but that's not nearly as useful as analyzing the big picture. I have been party to many code reviews that involved a dozen nit-picky stylistic comments. The reviewer feels like they did their job, the reviewee feels like they have appeased the reviewer, and so the now style-guide-correct code gets a half-hearted "LGTM" and is merged. That code looks great... And it handles filenames with spaces in them!.. But does the totally wrong thing.
The Unix shell may be a highly powerful interactive programming environment, but it's sure hard to think of anything that comes anywhere close to sucking as badly. With the shell and the standard Unix commands, some things that are hard in other languages are easy, and most of the things that are easy in other languages are hard to impossible... I'd love to see a clean slate replacement for the shell that still feels Unix-like and retains most of its existing benefits.
(I suspect PowerShell would be a good environment to take design cues from or even port, but I've never used it so I can't say for sure.)
> I'd love to see a clean slate replacement for the shell that still feels Unix-like and retains most of its existing benefits.
Have you looked at scsh, the Scheme Configurable Shell? It's a nice clean language with a REPL (well, it's just scheme) that has syntax for all the usual things you'd want in a unix shell -- pipelines, redirections, environment, signals, etc.
PowerShell is nice and in principle seems more Unixy than the normal Unix core utils in that cmdlets really do only one thing and that one well (most of ls' options for example can be solved by simply combining Get-ChildItem with Sort-Object or one of the Format-* cmdlets).
I think the problem is that the terrible shell semantics are inextricable from the larger semantics of using Unix. So you could replace sh with something less dumb-stupid, but you'd still be interacting with the garbage that is the shell environment. You could then replace the latter, but at that point, well, you're off into the weeds.
I hold my nose and write shell, even as I look over at, for instance, scsh, and think ... yeah.
> I'd love to see a clean slate replacement for the shell that still feels Unix-like and retains most of its existing benefits.
I've been working on one, in my spare time for about a year. I can't tell you how exciting it is. It's still a ways from being ready for actual use, but I'll have a website up in a couple of weeks to showcase the approach, and will surely post to HN when it's up.
If you already have a machine serving stuff at home, adding a remote server is only time, money, and effort spent. Is it worth it? Not all sites are high in volume, and even well prepared sites on decent lines can go down if it hits the HN front page.
People do occasionally bitch and moan about slow downloads from my home host, but nobody ever offered money for a pain-free alternative.
Well it is the main reference but it won't sustain the "HN effect", thanks for the mirror link. I typically use http://wiki.bash-hackers.org/doku.php which also provides a mirror and some extras.
Apparently a mod changed the URL (to relieve a not so powerful host). The original URL (and thus the one you should be bookmarking/remember) was http://mywiki.wooledge.org/BashPitfalls
So why do we put up with classic command line tools in general that are so full of horrible, counterintuitive pitfalls? Is it just tradition? Backwards compatibility?
The "Unix should be hard" crew has gotten a lot quieter in the last ten years with the rise of Ubuntu and other relatively user-friendly distros, but I feel like there's still an underlying current of elitism there; people are proud of mastering these bizarre, arcane methods, and they're offended that someone else might be able to accomplish just as much without doing half as much work.
Would someone recommend an automatic bash style checker, such as a 'linter'? Perhaps something along the lines of Chef's Food Critic? http://acrmp.github.io/foodcritic/
Ah yes, the ultimate reference from Freenode #bash. Once you learn from Greycat's wiki, you won't simply Google/DDG "bash tutorial" again; you'll just head straight here.
I find it sad and amusing that we're writing a ton of mission-critical code in this language that has an incredible number of obscure quirks. Yes, most of these pitfalls are directly connected to the semantics of Unix, but I wish someone made a concerted effort to get rid of them in an otherwise evolutionary way.
For most things, Bash is one of the easier languages I have dealt with, easier even than Python and the like. That is, though, for shell-type scripting.
There's many problems with it, but the only one I've run into that keeps it from being more useful is that there are no multi dimensional arrays built-in. There are super hacky ways I have seen them implemented, but by default it's something I basically never am able to turn to when scripting in bash and have to turn to other languages, even when the particular task I was working with would be mostly simpler in bash.
That said, there are associative arrays in bash these days.
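For illustration, associative arrays can stand in for a two-dimensional array by using a composite "row,col" string as the key. This is a sketch of one such workaround, not a built-in feature; it assumes bash 4.0 or later, and the names `grid`, `rows` and `cols` are arbitrary.

```shell
#!/usr/bin/env bash
# Emulate a 2x3 matrix with an associative array keyed by "row,col".
# Requires bash 4.0+ (declare -A).
declare -A grid
rows=2
cols=3
for ((r = 0; r < rows; r++)); do
    for ((c = 0; c < cols; c++)); do
        grid["$r,$c"]=$((r * cols + c))   # store a distinct value per cell
    done
done
echo "${grid[1,2]}"   # prints 5 (1 * 3 + 2)
```

The key is just a string, so nothing enforces bounds or rectangular shape; that bookkeeping stays with the script.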
The following does not work on files with spaces according to the article:
for i in $(ls *.mp3); do
    some command $i
done
So does that mean, that "for" will do something per word of the output of $, rather than per line of output of it?
What to do if I want to do something for every line? What for example if I really want the output of ls, find (or any other command you can put in the $()) and loop through that line by line, even if some output has spaces?
There are a couple of ways you can do it. Don't do this with `ls` though; there are better ways to get that info in scripts.
# Loops over lines, in a subshell
prints_lines | while IFS= read -r line; do
    some_command "$line"
done
# Loops over lines, in the current shell
while IFS= read -r line; do
    some_command "$line"
done < <(prints_lines)
# If for some reason you really want a for loop
oldIFS=$IFS
IFS=$'\n' lines=($(prints_lines))
IFS=$oldIFS
for line in "${lines[@]}"; do
    some_command "$line"
done
> So does that mean, that "for" will do something per word of the output of $, rather than per line of output of it?
Correct. The argument to "for" is a list of words.
> What to do if I want to do something for every line?
Use a while loop.
find /some/dir/ -type f |
while read -r line; do
    : # something with "$line"
done
PS. You should almost always use `find` instead of `ls` in shell scripts. Given a pattern, `ls` will exit non-zero if nothing matches it, and you should be treating non-zero exits like you would exceptions in other languages.
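A line-based loop still breaks on filenames that contain newlines. A common way around that is to have `find` separate names with NUL bytes and have `read` split on them. The sketch below assumes a `find` that supports `-print0` (GNU and BSD versions do); the helper name `for_each_file` is made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command once per regular file under a directory,
# passing the path as the last argument. Paths are NUL-delimited, so spaces
# and even newlines in names survive intact.
for_each_file() {
    local dir=$1
    shift
    local path
    while IFS= read -r -d '' path; do
        "$@" "$path"
    done < <(find "$dir" -type f -print0)
}
```

For example, `for_each_file ./music printf '%s\n'` prints each path on its own line. Note that the command inherits the loop's stdin, so a command that reads stdin would swallow the remaining paths.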
One thing to be careful of when doing "while read..." is that a new shell is started on each iteration, so you cannot for example set a variable within the loop that you can use later in the script, as its value will be lost when the shell process exits.
printf "\n\n\n" | while read i; do a="x$a"; echo "$a"; done
x
xx
xxx
The accumulator value even carries over after the while loop:
printf "\n\n\n" | ( while read i; do a="x$a"; echo "$a"; done ; echo "$a" )
x
xx
xxx
xxx
(Technically, whether or not the loop body is executed in a subshell may be implementation dependent. Haven't looked at the POSIX shell spec in a while, but I seem to remember an old ksh that actually used subshells. At any rate, none of the modern sh's and bash force a subshell.)
What is true, however, is that a pipeline will execute in a subshell. Maybe that's what you're getting at here, and it is an important caveat.
a=y; printf "\n\n\n" | while read i; do a="x$a"; echo "$a"; done; echo "$a"
xy
xxy
xxxy
y
Ah, ok. True, when I have had this issue it was after doing something like 'grep "pattern" file | while read ...'. I did not realize it was the pipe that caused this.
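To make the difference concrete, here is a small side-by-side sketch. It assumes bash with default settings (in particular, the `lastpipe` option is not enabled); the variable names are arbitrary.

```shell
#!/usr/bin/env bash
# Count lines two ways and compare what survives after the loop.

piped_count=0
printf 'a\nb\nc\n' | while IFS= read -r line; do
    piped_count=$((piped_count + 1))     # runs in the subshell created by the pipe
done

direct_count=0
while IFS= read -r line; do
    direct_count=$((direct_count + 1))   # runs in the current shell
done < <(printf 'a\nb\nc\n')

echo "after pipe: $piped_count"
echo "after process substitution: $direct_count"
```

Under those assumptions the first count is still 0 after the loop, while the second is 3, because only the piped loop ran in a subshell.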
I know someone will sooner or later propose that we ban spaces and special characters in names. Let me just put my two cents forward.
We should absolutely ban special characters from names. Specifically, all whitespace, the colon, semicolon, forward slash, backward slash, question mark, star, ampersand, and whatever else I'm missing that will confuse the shell. Also files cannot start with a dash.
However, people should be able to name files with these characters. So I propose that these characters in filenames be percent-encoded like they would be in a URL. Specifically, the algorithm should be
1. Take the file name and encode it as UTF-8. Enforce some sort of normalization.
2. Substitute each problematic byte with equivalent percent-encoded form. This does not touch bytes over 0x80 - they are assumed non-problematic.
3. Write the file in the file system under that name.
4. When displaying files, run the algorithm in reverse.
In the general case files like "01 - Don't Eat the Yellow Snow.mp3" would simply become 01%20-%20Don't%20Eat%20the%20Yellow%20Snow.mp3 in the filesystem and cause absolutely no further problems. To make it completely backwards-compatible we should also add the following rule: If a filename includes a problematic byte or a percent-encoded byte higher than 0x80, then it is assumed to be raw and will not undergo percent decoding.
Basically, I propose that every program which receives free text input for a file name percent-encode the filenames before writing them to the filesystem and decode them for display. Everything else remains unchanged.
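A rough sketch of the encoding step in bash, to make the proposal concrete. The function names are invented, and the exact set of "problematic" bytes is an assumption based on the list above. The sketch also encodes `%` itself so that decoding is unambiguous, which the proposal does not spell out. It works byte by byte, leaves bytes of 0x80 and above alone, and skips the normalization step entirely.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percent-encode bytes that tend to confuse the shell.
encode_name() {
    local LC_ALL=C            # iterate over bytes, not multibyte characters
    local name=$1 out='' c i
    for ((i = 0; i < ${#name}; i++)); do
        c=${name:i:1}
        case $c in
            [[:space:]] | [[:cntrl:]] | ':' | ';' | '/' | '\' | '?' | '*' | '&' | '%')
                printf -v c '%%%02X' "'$c"   # e.g. a space becomes %20
                ;;
        esac
        out+=$c
    done
    # A leading dash would look like an option to most commands.
    if [[ $out == -* ]]; then
        out="%2D${out:1}"
    fi
    printf '%s\n' "$out"
}

# Hypothetical inverse, for display. Assumes its input came from encode_name,
# so it contains no raw backslashes and every % starts a two-digit escape.
decode_name() {
    printf '%b\n' "${1//%/\\x}"
}
```

With this, `encode_name "01 - Don't Eat the Yellow Snow.mp3"` yields `01%20-%20Don't%20Eat%20the%20Yellow%20Snow.mp3`, and `decode_name` turns it back.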
Why this will not work:
Requiring programmers to keep track of two filenames instead of just one is rather a lot of work. File APIs will have to take both encoded and non-encoded forms and encode the non-encoded form, creating problems when people inadvertently use the wrong function with a name, either double-encoding it or not encoding it and leading to "this file does not exist" errors.
It will be possible to create two files with different names on disk which are nonetheless shown with the same name to the user.
Why it is ugly:
We're taping over a deficiency of an ancient language by inflicting pain on programmers.
Double-encoded filenames? MADNESS.
Why I like it:
I'll be able to have ?, * and : in filenames in windows.
> Substitute each problematic byte with equivalent percent-encoded form. This does not touch bytes over 0x80 - they are assumed non-problematic.
You know what's crazy? Currently, in Unix, control characters are allowed in filenames. Like, \t and \n and \b and even \[. Those shouldn't be allowed, percent-escaped or not. Everything else you said is sensible.
Technically NTFS allows those too. The filesystem, being a very low-level tool, hardly thinks of the upper layers and what pain it might inflict there. Its purpose is to store blobs under a name and retrieve them upon request. Since a char[] (or wchar_t[]) looks enough like a name that's what it uses.
That being said, enforcing such restrictions in upper layers brings pain as well, because suddenly you can have files that you cannot delete anymore (happens sometimes on Windows).
True; there's no reason that the filesystem should be storing anything other than char[]. The filesystem is a serialized domain, and char[] buffers are for storage and retrieval of serialized data. But that also means that each filesystem should explicitly specify a serialization format for what's stored in that char[] -- hopefully UTF-8.
However, the filesystem should really be where that serialized representation begins and ends. The filesystem should be interacting with the VFS layer using runes (Unicode codepoints), not octets.
And then, given that all filesystems route through the VFS, it can (and should) be enforcing preconditions on those runes in its API, expecting users to pass it something like a printable_rune_t[]. (Or even, horror of Pascalian horrors, a struct containing a length-prefixed printable_rune_t[].)
And for the situation where there's now files floating around without a printable_rune_t[] name -- this is why NTFS has been conceptually based around GUIDs (really, NT object IDs) for a decade now, with all names for a file just being indexed aliases. I wonder when Linux will get on that train...
Well, history sadly dictates that the interface to the upper layers is based around code units, because those have always been fixed-length. Unicode came too late to most operating systems to really be ingrained in their design, and where it was (Windows springs to mind) it all took a turn for the worse with the 16-to-21-bit shift in Unicode 2, with Unicode-by-default systems being no better than 8-bit-by-default systems had been a decade earlier.
That NTFS uses GUIDs internally to reference streams is news to me, though. But I think on Unix-like systems the equivalent would be inodes, I guess, right?
Percent-escaped control characters are fine. They'll just show up in a gui file manager as a <?> symbol, and on the shell as %02 for instance. The shell will never parse percent-encoded filenames and gui file managers don't interpret control characters.
Non-percent-encoded control chars are strictly verboten. The VFS layer should contain a ban list of bytes (or codepoints) not allowed as part of a filename. It won't be a large list, just every nonprintable character from ASCII, every blank character (space, tab, newline, vertical tab, carriage return etc.), the characters /\;:?* - and that's it. This list should cover everything that might be problematic in windows OR linux OR MacOS. For full compatibility, we must also add the %uxxxx and %Uxxxxxxxx percent escapes for arbitrary unicode codepoints (I can sense that it might make sense to also escape all the unicode spaces, combining characters and the like, to make file manipulation from the shell easier).
It sounds sort of sensible, but we're dealing with two layers of encoding here, leading to three byte sequences.
1. You have a string the user entered. That's just a generic name which can be anything.
2. You take that string and substitute "problematic" characters with their percent-encoded form. For example, every space becomes %20; a non-breakable space might become %ua0, or it might be left alone.
3. You now have a string of unicode codepoints, which are all "clean". This is encoded yet again to a sequence of bytes that are stored by the filesystem.
At least the second coding is done by the system, either by the standard file manipulation routines, or by the filesystem itself.
But it is the first one that seems infeasible. It has to be done at a layer above the standard "open" function and I can see developers being very confused on what and how to escape.
You know, maybe the answer might be not to have every other program do all this complicated dancing, but for the shell itself to escape filenames when it reads them. So when you say "cmd file%20with%20space", cmd is called with argument one set to "file with space". And when ls or find lists files, bad characters can be replaced with their percent-encoded forms. And xargs can unescape them.