One tip I have for making large-ish awk programs readable is to name the columns in the BEGIN section. Then you'd use $colname instead of $1, $2, etc.
You can also put a similar code block at the start of a general processing entry. This applies to both flat (uniform-record) and hierarchical (multiple record-type) data.
E.g.:
{
name = $1
dob = $2
grade = $3
# ...
# Do stuff with name / dob / grade, etc.
}
If the data are structured so that there are multiple record types (typically identified by a prefix or some other regex), you can put the variable assignments within each block.
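A minimal sketch of that pattern, with a made-up record format where the first field tags the record type: a STUDENT record sets the name, and the GRADE records that follow refer back to it.

```shell
# Hypothetical two-record-type input: each block handles one type and
# assigns its own named variables.
awk '
$1 == "STUDENT" { name = $2 }
$1 == "GRADE"   { print name, $2, $3 }
' <<'EOF'
STUDENT alice
GRADE math 92
GRADE physics 88
EOF
```

This prints each grade with the most recent student's name attached (alice math 92, alice physics 88).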
If you want to do that, use vnlog instead. You're 90% there already. It formalizes the idea of a header to label columns, and allows normal unix data-processing tools to just work with those labels. Your awk use case is quite literally "vnl-filter --eval"
Good on the author for listing other similar programs. I have been testing one of them, datamash. Its documentation claims it is faster than uniq in some situations.
When I wrote my introduction to JQ someone mentioned JQ was tricky but super-useful like AWK. I nodded along with this, but actually, I had no idea how Awk worked.
So I learned how it worked and wrote this up. It is a bit long, but if you don't know Awk that well, or at all, I think it should get the basics across to you by going step by step through examining the book reviews for The Hunger Games trilogy.
Let me know what you think. And also let me know if you have any interesting Awk one-liners to share.
The funny thing is, by and large my only use case for awk is to print out whitespace delimited columns where the amount of whitespace is variable. Surprisingly hard to do with other Unix tools.
The syntax isn't nearly as nice, but Perl can be handy if you're doing something more after splitting into columns. And it's usually already there / installed, like awk. For just columns:
$ printf "a b c d e\n1 2 3 4 5" | perl -lanE 'say "$F[2] $F[4]"'
c e
3 5
It surprised me that AWK had dictionaries and no variable declarations, which makes it feel like a modern scripting language even though it was written in the 70s.
It turns out, though, that this is because Perl and later Ruby were inspired by AWK, and they even support these line-by-line processing idioms with BEGIN and END as well.
ruby -n -a -e 'puts "#{$F[0]} #{$F[1]}"'
ruby -ne '
BEGIN { $words = Hash.new(0) }
$_.split(/[^a-zA-Z]+/).each { |word|
$words[word.downcase] += 1 }
END {
...
I think it's pretty obvious that awk syntax is ultimately the main inspiration for JavaScript syntax, with optional semicolon as stmt terminator, regexp literals, for (x in y), the function keyword, a[x] associative array accessors, etc.
A long while ago I wrote up a little processor to determine field lengths in a given file - I forgot the original reason. ( https://github.com/sullivant/csvinfo )
However, I feel I really should have taken the time to learn Awk better as it could probably be done there, and simply! (It was a good excuse to tinker with rust, but that's an aside.)
I'll mark this on my GitHub when I get back on a computer, I take public datasets and make graphs and transforms and reports. The big survey companies have weird data records and having to write a parser is my least favorite part. I think other people who ingest my content don't appreciate the effort, but that's a near universal feeling I think, heh.
I really appreciate you writing this guide. As a long time Linux user, I've always wanted to learn AWK, but it seemed too daunting. Three minutes into your guide and I immediately saw how I could use it in my day-to-day usage.
I blame GNU's man page. I was in the same situation for the longest time, but stumbled over a man page for a simpler implementation of awk (plan9's, in my case) and learned it in 10-15 minutes (not claiming I understood it more than partially in that time of course, but enough to write my own small programs).
Since then I've made a point of finding man-pages from other systems whenever the manual for a GNU tool is a bit daunting. It tends to lower the learning threshold quite a lot, honestly.
$ man gawk | wc
1568 13030 94207
$ man -l /usr/share/man/man1/awk.1plan9.gz | wc
214 1579 10956
Not trying to detract from this great guide. Just a general tip :)
And for the flash card type of learners it is good to see the "HANDY ONE-LINE SCRIPTS FOR AWK" page is still available. See the links in the Credits section at the bottom for more great reading:
Having relied heavily on the (unofficial, non-GNU) gawk manpage (it's quite good), I instantly started learning very useful features when I read the GNU docs. (I still need to fully internalise those.) Yes, the full manual is very much better than the manpage.
(Also recommend The AWK Programming Language mentioned here, though I'd suggest the GNU manual adds to that as well.)
I'm always happy when I see posts that promote AWK. It's a very underappreciated tool in my opinion. I was a Linux user for 20 years before I got familiar with it. AWK is super powerful for text processing, and I like that it's included in Busybox for use on the embedded systems that I design.
For any complex text processing, it's way better and more robust than having a super long pipeline of a bunch of sed/grep.
Most recently, I used awk in a script that parses /proc/mounts to grab the mountpoint of a partition, or print something different if the partition isn't mounted. Doable with a bunch of sed/grep and some shell logic? Definitely. But it's easier and cleaner in AWK, and equally easy to inline in a shell script.
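For illustration, a sketch of what such a snippet might look like (the device path is assumed; the comment doesn't show the real script's details):

```shell
# Print the mountpoint of /dev/sda1 from /proc/mounts, or a fallback
# message if the device does not appear there.
awk -v dev=/dev/sda1 '
$1 == dev { print $2; found = 1; exit }
END { if (!found) print "not mounted" }
' /proc/mounts
```

The exit in the first rule stops scanning after the first match; the END block only fires the fallback if no line matched.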
I've never bothered to learn much AWK, but that's mostly because Perl is my bread and butter language and has been for 20 years, and focusing on knowledge of that seemed a better investment (especially since with a few judicious flags, Perl is a passable AWK replacement even for very small one liners).
That said, if you just want to supplement your knowledge of other shell tools and pull out something that can do some obvious text munging, AWK has always looked attractive for the task to me.
Sorry, why is this different from installing perl? They both still need to be downloaded. Is this because awk is contained in a larger toolbox, so it's easier to get for other reasons?
p.s. thanks, I didn't know I had awk in git bash...
This comment reminds me of the joke about the sewage worker who is chatting about his job and says, 'to you it may be just shit, but to me it's my bread and butter'.
The Internet always seems similar: pipes, garbage in, and "munging" on Urban Dictionary...
I do a lot of work with structured data--json, yaml, etc. For me, this is how I feel about jq. One of my favorite use-cases is querying Kubernetes resources. E.g., `kubectl get secret <secret-name> -o json | jq -r '.data | map_values(@base64d)'` (fetch a secret and decode all of its values).
Came here to say this. Glad to see /bin getting respect.
To anyone processing huge quantities of text and text files, someone very likely had the same problem you faced back in the 1980's and there's a Unix/GNU tool for it already.
I was introduced to *nix through processing very large text files that the text editors I was familiar with choked and died on. Someone showed me sed/awk/grep, and it took seconds to process a file that the GUI editors couldn't even open. I never looked back.
For some purposes, awk+xargs can replace hours of work to write a tool to automate some process. It's my go-to for ops work that I don't expect to live very long and just needs to _happen_.
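A made-up example of the awk+xargs pattern: filter a status listing down to the interesting rows, then hand the names to a command (here echo stands in for whatever cleanup you'd actually run):

```shell
# Select the failed entries from a status listing and pass their
# names as arguments to a single command via xargs.
printf 'job1 ok\njob2 failed\njob3 failed\n' |
  awk '$2 == "failed" { print $1 }' |
  xargs echo cleanup
```

This prints `cleanup job2 job3`; swapping echo for a real command turns a report into an action in one pipeline.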
If one considers the idea of map reduce to be taking a set of data and ending up with a subset that is relevant, I've used tons of simple things to do that, and never Hadoop.
I think parsing logs to find pain areas or potential exploit/exfil is a map reduce job, for instance, and grep or awk can manage that just fine.
Not having to parse the output at all is even better. I really like the way Powershell can pass structured data like this.
I'm a huge Linux/Unix fan but sometimes a rethink really works out. I hope Linux will get something similar. I know Powershell is available for Linux but without an adapted userland there's not much benefit
I have used it for many thing such as extract bibtex entries and checking for missed definition of abbreviations, converting remind entry to ICS format etc. Recent example is to print a calendar like cal but with calendar dates highlighted.
There are things I've come to dislike and avoid when programming in general:
- Avoid programming in strings (especially in Bash, where nested quotes are full of pitfalls)
- Avoid magic switches that change behavior (like -F)
- Avoid terse or cryptic variable names (like $NF)
- Avoid terse and magical syntax (sorry Perl, happy to leave you behind me)
- Avoid programs that are hard to read
- Avoid programs that are difficult to debug while writing them
- Avoid programs that ignore types
For these reasons, I prefer to avoid awk for anything except the most trivial of tasks. I think the prevalence of scripting languages and the speed of execution and debugging today has made awk not as necessary as it may have been in the 70s. And as to the first point, I'm aware you can write awk scripts in files, and I feel like if your script has gotten complex enough that you need a file, you're creating something unmaintainable and unreadable that would be better suited in a different programming language.
Edit: I should add this article is great and a good introduction to awk, regardless of my personal taste for the tool.
I've been doing systems work for 20 years. Here's why most of those things are actually good:
- Strings are subtly complex, but strings are not variables. You can assign a string, and later handle it as a variable, and not deal with any of the specifics of string-iness. Likewise, you can take a variable, and later treat it as a string (for loosely or not-typed variables).
- Magic switches are not magic, they are options. Virtually every program takes options. Sometimes they impact a lot of things, sometimes a little. Only the context determines how much is "too much".
- Terse/cryptic variables allow you to write complex expressions in a compact form. This allows you to read more in a small space, making it easier to reason about or form complex expressions. Human languages are flush with these, as is mathematics. But you have to balance the terse, cryptic and magical with guilelessness, or it becomes a mess.
- Terse and magical syntax is, again, a feature, not a bug. Using magical syntax I can do in a few characters what would take me many lines with a traditional language, and as we all know, increased number of lines correlates to bugs, in addition to simply making it harder to grok.
- Types aren't ignored, but they may be very loosely enforced. If you want to write a quick program to get something done, typing is a curse. If you want to write a very thorough program, typing is a blessing. In many cases, loosely or untyped programs actually work better than their typed cousins, because they allow for more unexpected behaviors without failing. Failing early and often may be a modern trend, but... it literally means things fail more, and this is often not desirable.
Caveats:
- Programs that are hard to read do indeed suck, and it takes lots of experience to make some kinds of programs easier to read. But that's not an indictment of the program, it's an indictment of the person who wrote it. We don't indict English when somebody writes a document that's impossible to comprehend.
- Interestingly, some of the more popular languages are the worst to debug. Perl is probably one of the easiest languages to debug, not least because of how good the interpreter is at suggesting to the user what the actual problem was and almost exactly how to fix it.
There is a lot of wisdom in the things you avoid, however I would ask one question, "How often do you use it?"
For me, the best systems are those that can be wordy and prescriptive but as you get to know them you can use more short hand so they "get out of the way" as it were. A good example of that philosophy is keyboard short cuts. When I'm learning a program I'm happy to pause and sling the mouse around to find the thing I need in the labeled menu stack with an appropriate name which also tells me what the keyboard short cut is for that thing. Then as I get better I can just use the short cut and my workflow gets faster. Once I've internalized the keymap my flow is held up by how fast I can think, not by how fast I can take my hand off the keyboard, move the mouse, click and then put it back on the keyboard.
Awk is one of those things that once you internalize what it can do, you can use it for a lot of stuff, and you can do it quickly.
Yep, anything more than 1 line would be a python script. The author vaguely tries to answer that but doesn't really give anything more satisfying than "you can rewrite it in python" and "why not?".
And if it's something you don't know and are experimenting with, you can also open ipython and play around with the data (kinda like he's doing throughout the article) and keep your variables between each run without having to write intermediate values to disk or keep piping over and over.
Let alone the huge stdlib and 3rd party lib you have access to if needed.
The thing that prevents awk from being a major part of my daily routine is that it (amazingly) has poor CSV support. Consider the following:
col1,col2,col3
1,2,3
4,"hello, \"world\"",6
"7 buckets",,9
To get the usual awk experience with this very common file format, exactly the type of thing you want to parse with awk, you first need to install gawk, then use a big FPAT regex that needs to be adjusted for any new CSV variant.
I would love to see awk with a "CSV mode", where it intelligently handles formats like this if you just pass a flag. I think awk would do well to differentiate itself with excellent 2D dataset parsing functionality, but at least catching up to the average scripting language would be great.
I'm half expecting someone to say "just pass -csv it does what you want" and if so I'll be very excited.
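For reference, the FPAT approach from the gawk manual looks like this; it copes with simple quoted fields but not with embedded escaped quotes, which is exactly the fragility complained about above. (And, as it happens, newer gawk, 5.3 and later, did add a --csv flag.)

```shell
# FPAT defines fields by what they match rather than by a separator:
# either a run of non-commas, or a double-quoted string.
printf '1,"hello, world",3\n' |
  gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print $2 }'
# prints "hello, world" (quotes included)
```

POSIX longest-match semantics make the quoted alternative win over the non-comma run, so the embedded comma stays inside field 2.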
Who's writing a library? Just use xsv or miller to extract the bits you want from the CSV, change the delimiter or escapes to something more convenient, etc., then feed that to awk or other CSV-unaware text processors.
I was agreeing with your point about regexes, that it's good to avoid trying to deal with all the corner cases yourself when you're just trying to write a small script.
Ah, understood! CSV is funny, it seems like a more trivial thing than it really is, and its human readability sort of invites broken approaches in a way that something like Parquet would not.
XML is somewhere in the middle--I've seen some horrible abuses of CDATA sections way back when--but at least there are accepted ways to prove what's invalid.
There is a small program I wrote called csvquote[1] that can be used to sanitize input to awk so it can rely on delimiter characters (commas) to always mean delimiters. The results from awk then get piped through the same program at the end to restore the commas inside the field values.
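The intended pipeline looks roughly like this (csvquote must be installed; the -u flag, which restores the substituted characters, is per its README):

```shell
# csvquote temporarily replaces commas/newlines inside quoted fields
# with nonprinting characters, so awk's -F, splitting is safe; the
# trailing csvquote -u puts the original characters back.
csvquote data.csv |
  awk -F, '{ print $2 }' |
  csvquote -u
```

The nice part of this design is that awk itself stays completely CSV-unaware; any delimiter-based tool can sit in the middle of the pipeline.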
In my day job I use Java, and all the items you list are good ideas to maintain my sanity.
But I sometimes use Awk scripts to quickly analyse a log file to get information you need or to generate a configuration file or something.
I think of the strict strongly-typed nature of a language like Java as equivalent to the formal language a lawyer might use in a legal document: There shouldn't be any place for ambiguity or misunderstanding.
Programming in Awk, on the other hand, is more like having an informal conversation with an old friend; those cryptic variable names is like the slang words that both of you understand and the magical syntax is like the dialect you share that may sound a bit strange to outsiders.
I think there is a place for both approaches in the broad space of computing.
For those kinds of tasks I use Awk to process the data into a SQLite database. Then I do the queries on that since it’s easier and more advanced things (grouping, having) are much easier declaratively.
I used to love Awk! I still do, even if I don't use it much any more.
Awk has a reputation for being hard to read (as noted in stevebmark's comment), but when I was using it actively, I tried to treat it as a serious programming language and write readable programs in it.
Several years ago I tracked down a couple of my old Awk programs from around 1990 and posted them here:
This was probably the first program that made me really impressed with Awk. People were writing rather complicated Shaney implementations in C, and I thought, "this could be really simple in Awk." And it was!
LJPII.AWK is the Awk program I'm most proud of. This was in the days when we had tiny screens and no multiple monitors, and you always printed out your code to read it. In my circles we were also fond of inserting "separator lines" between functions, in various formats such as this one:
// - - - - - - - - - - - - - - - - - -
So I wrote LJPII to print source code in "two up" format (two pages side by side in landscape mode) on my LaserJet II. It also converted the separator lines into graphical boxes, and tried to avoid splitting a function across multiple pages. It wasted some paper but made nicely readable printouts.
I wish I still had some of my old printouts, but they are long gone. One of these days I will have to see if I can update the code to work with the LaserJet emulation in my Brother printer! (It should mostly work, but I wrote this in the old Thompson Awk for DOS, so there are a couple of non-standard things in it.)
Looking at the code again, it's amusing to see some old Windows Hungarian notation which was popular/notorious back then, for example an "f" prefix for a boolean (flag) value, and "af" prefix for an array of flags.
Hungarian aside, I tried to make this code as readable as I could.
Random fun fact! Someone who used to be an avid Awk programmer is Will Hearst (William Randolph Hearst III). It's been many years since I talked with him, so no idea if he still does any Awk programming.
Actually, AWK is a domain-specific programming language. When you start treating AWK as such, you can really gain an appreciation for it. I too treated it as a dumb one-liner tool relegated to ingesting cryptic regexp one-liners in shell scripts. Reading the original AWK book completely changed my outlook on the language. I had no idea you could define functions or perform basic math, so one could use it for very basic tabular operations such as spreadsheets. AWK can even be used as a standalone language outside of shell scripts by writing a program, inserting a shebang on the first line calling awk, and marking the file as executable.
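A minimal sketch of that standalone setup (the awk path in the shebang is system-dependent; adjust for your machine):

```shell
# Write a self-contained AWK program: shebang on line 1, then
# ordinary pattern/action rules.
cat > sum-col2 <<'EOF'
#!/usr/bin/awk -f
{ total += $2 }
END { print total }
EOF
chmod +x sum-col2

# Run it like any other executable.
printf 'a 1\nb 2\nc 3\n' | ./sum-col2
# prints 6
```

The `#!` line is just a comment to awk itself, so the same file also runs fine via `awk -f sum-col2`.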
shebangs and more complex scripts are covered in the article.
But yes, I agree that the original AWK book is really good. After covering some basics and the language reference, it has some fun projects that you can build with AWK.
At some point in every bioinformatics lecture I always manage to slip in something akin to: "Learn awk! (or perl) You'll need it. Your data will come from various disparate sources, and you need to get them into some well-defined, useful format from the get-go."
Honest question: Is Awk worth learning to someone who already knows Perl?
Every time I read about Awk, I feel very intrigued, but AFAIK, everything Awk can do, Perl can just as well. Is there a compelling reason to learn Awk if you know Perl beyond pure curiosity?
While you can write in one line of Perl anything that you can write in one line of Awk, the Perl line will be longer.
So for the simple tasks that you would do from the command line or in a shell script, you can usually save time by writing Awk.
For complex tasks, where you need more complex control than the implicit loop of Awk and you also need to define various extra variables besides those pre-defined, Awk is no longer more concise.
Then obviously Perl or other scripting languages become more convenient.
To give a quantitative value to what I mean by "more" concise, I have once written a very simple script to find some hash collisions in some files, both in Awk and in Perl.
The Awk variant had 350 bytes, while the identical Perl variant had 450 bytes.
The extra length in Perl was due to "$" prefix on all variables, an extra "split()" and an extra "while()", which were not needed in Awk.
For slightly simpler versions of those scripts executed as one-liners, the difference in length was even much more in favor of Awk, because Perl needed extra command-line options to execute the script in the command line, while that is the default behavior of Awk.
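To make the comparison concrete, here's the same trivial field-printing task in both; awk's implicit per-line loop and bare field references save the flags and sigils that Perl needs:

```shell
# awk: implicit loop, fields are $1..$NF, no options required.
printf 'a b c\n' | awk '{ print $2 }'

# perl: needs -n (loop), -a (autosplit into @F), -l (line endings),
# plus the $F[1] array syntax.
printf 'a b c\n' | perl -lane 'print $F[1]'
```

Both commands print `b`; the difference is just how much scaffolding each language needs to get there.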
Perl is not generally faster than awk. Obviously there are cases where perl is faster, but for the text smashing that awk is meant for, perl is usually much slower, particularly compared to mawk or recent gawk versions. mawk in particular is shockingly fast.
Well, that's what I meant by "pure curiosity". ;-) Which is a perfectly valid reason to learn a language, of course, possibly the best reason one can have.
But I guess when I need a quick and dirty one-shot script, I'll stick to Perl for the time being. It's not officially standard like Awk, but it is available in the default installation of pretty much every open source Unix-like system I use except FreeBSD where it's one of the first packages I install.
Hypothetically, Perl is also available on systems that are not very Unix-like, such as OS/400 (or "IBM i" as it is called these days, apparently) or VMS. (Hypothetically in the sense that I have never had direct contact with one of these systems and have no expectation I ever will, sadly. ... Well, maybe I should count myself lucky - either way, I will probably never know for sure.)
I actually learned Perl before I learned Awk, and the reason I switched is that Awk is a much smaller language, so I can keep it all in my head.
I just about never have to refer to the man page. I also found that I can go for months without writing any Awk and then knock out a quick script without having to relearn anything.
I'll concede that there are things for which Awk just isn't powerful enough, and perhaps I was never that good with Perl to begin with, so if you're already well versed in Perl, YMMV.
Nice article. Seems we went through a very similar progression! :-D
If anyone is interested in learning more, I built a conference talk to teach awk, and a set of exercises also that has gotten pretty positive feedback:
It's common, as in the OP, to see awk recommended for something as simple as extracting a column from tab- or space-separated values. IMO, it's quite a bit of typing to do on the fly at a command prompt. Performance-wise, it could be significantly slower than other utilities that are equally as ubiquitous as awk.
echo one two three|awk '{print $2}'
Are there other ways to do this? Are they faster?
cat > awc
#!/bin/sh
test $# -eq 1||exit
exec tr \\40 \\11|exec cut -f "$1"|exec tr \\11 \\40
^D
echo one two three|awc 2
Test it on a file to see if it is faster than awk.
I've never gone further than thinking about it, but I've always been curious as to how simple it would be to use Awk as an interpreter for a really simple Tcl-like language:
set a 1
set b 2
define add (n,m) $n + $m
set result [add a b]
I think it would be simple enough to come up with some Awk pattern/actions to parse the above and execute the commands.
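A minimal sketch of that idea, handling only set and a puts-style command (no [command] substitution or define yet):

```shell
# Awk as a toy command interpreter: one pattern/action per command,
# dispatching on the first word and using an associative array as
# the variable table.
awk '
$1 == "set"  { vars[$2] = $3 }
$1 == "puts" { print vars[$2] }
' <<'EOF'
set a 1
set b 2
puts a
puts b
EOF
```

This prints 1 and then 2. The [command] substitution from the original sketch is where it gets interesting; you'd need a small recursive evaluator on top of this dispatch loop.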
I find it amusing that AWK is coming back. I used it extensively back in the day, but let it go when I picked up Perl 4 and then Perl 5.
So Perl is no longer king for unix scripting. It was replaced by other languages, but it seems there is a niche they were not able to fill, since AWK is back.
Thanks for this tutorial and everyone else that posted some great tips and links.
I find myself needing to use awk once in a blue moon and every time it eats a lot of my time. I hope I remember your tutorial next time I need it.
This was one of the best awk tutorials I've read; it's very concise and digestible. I sometimes use awk, but the more complex things get, the more I feel like I can't use it. This tutorial made me feel otherwise.
So this isn't related to the article so much as to something the article reminded me about: why do people use /usr/bin/env to find a program rather than setting PATH within the script to a known-good value and then using that to locate things?
The PATH that /usr/bin/env consults is (essentially) a global variable that can change underneath you, right? I mean, that just screams "variable that may be changed by others" to me.
The /usr/bin/env trick will work on a wide range of systems, in which even common utilities might live in numerous locations: /bin, /sbin, /usr/bin, /usr/local/bin, /opt, or others. If you're writing scripts for portability and for others, this has value.
That said, /usr/bin/env fails on Android/Termux AFAIU.
Well, this is OK I guess. But if you really want to learn Awk you want the book "The AWK Programming Language", mostly written by Brian Kernighan (he's the K in AWK and in K&R), and as usual for all of his books, it's brilliant.
For instance:
BEGIN { item_type = 1; item_name = 2; price = 3; sale = 4 }  # etc.
Now, in place of $1, you'd say $item_type, which significantly improves the overall readability of the code.