Understanding Awk (earthly.dev)
616 points by todsacerdoti on Sept 30, 2021 | 121 comments



One tip I have to make large-ish awk programs readable is to name the columns in the BEGIN section. Then, you'd use $colname instead of $1, $2, etc. for instance:

BEGIN{ item_type = 1; item_name = 2; price = 3; sale = 4; #etc }

Now, in place of $1, you'd say $item_type which significantly improves overall readability of the code.
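
A minimal sketch of the idea in action (file name and columns are made up):

    awk 'BEGIN { item_type = 1; price = 3 }
         $item_type == "book" { total += $price }
         END { print "book total:", total }' sales.txt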


You can also put a similar code block at the start of a general processing entry. This works for both flat (uniform record) and hierarchical (multiple record-type) data.

E.g.:

  {
     name = $1
     dob = $2
     grade = $3
     # ...

     # Do stuff with name / dob / grade, etc.
  }
If the data are structured so that there are multiple record types (typically distinguished by a prefix or some other regex), you can put variable assignments within each block.

   /^rectype1/ { var1 = $1; var2 = $2; ... }
   /^rectype2/ { varA = $1; varB = $2; ... }
I prefer to leave BEGIN blocks for defining constants or tables and such.


I prefer using colname = n (as opposed to colname = $n) and then writing $colname, which distinguishes the column names from local variables.


I've also used this to address columns by name for files with lots of columns that I'm too lazy to count: https://unix.stackexchange.com/a/359699


If you want to do that, use vnlog instead. You're 90% there already. It formalizes the idea of a header to label columns, and allows normal unix data-processing tools to just work with those labels. Your awk use case is quite literally "vnl-filter --eval"

https://github.com/dkogan/vnlog/


Good on the author for listing other similar programs. I have been testing one of them, datamash. Its documentation claims it is faster than uniq in some situations.


Nice tip, so basically like Excel with tables


Thanks for sharing this. I'm the author.

When I wrote my introduction to JQ someone mentioned JQ was tricky but super-useful like AWK. I nodded along with this, but actually, I had no idea how Awk worked.

So I learned how it worked and wrote this up. It is a bit long, but if you don't know Awk that well, or at all, I think it should get the basics across to you by going step by step through examining the book reviews for The Hunger Games trilogy.

Let me know what you think. And also let me know if you have any interesting Awk one-liners to share.


The funny thing is, by and large my only use case for awk is to print out whitespace delimited columns where the amount of whitespace is variable. Surprisingly hard to do with other Unix tools.

Neat discussions around that sort of thing at least here: https://news.ycombinator.com/item?id=23427479
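
For reference, awk's default field splitting treats any run of spaces or tabs as a single separator, which is exactly what makes that use case trivial:

    $ printf 'a   b\t\tc\n' | awk '{print $3}'
    c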


The syntax isn't nearly as nice, but Perl can be handy if you're doing something more after splitting into columns. And it's usually already there / installed, like awk. For just columns:

  $ printf "a b  c d   e\n1 2  3 4 5" | perl -lanE 'say "$F[2] $F[4]"'
  c e
  3 5


It surprised me that AWK had dictionaries and no declaration of vars, which make it feel like a modern scripting language even though it was written in the 70s.

It turns out, though, that this is because Perl and later Ruby were inspired by AWK, and they even support these line-by-line processing idioms with BEGIN and END as well.

    ruby -n -a -e 'puts "#{$F[0]} #{$F[1]}"'

    ruby -ne '
    BEGIN { $words = Hash.new(0) }

    $_.split(/[^a-zA-Z]+/).each { |word|
      $words[word.downcase] += 1 }

    END {
      $words.each { |word, count| puts "#{word} #{count}" }
    }'


I think it's pretty obvious that awk syntax is ultimately the main inspiration for JavaScript syntax, with optional semicolon as stmt terminator, regexp literals, for (x in y), the function keyword, a[x] associative array accessors, etc.
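
To illustrate the overlap, here are those shared constructs in awk form (a sketch of mine, not from the original comment):

    function shout(s) { return toupper(s) }            # the function keyword
    /error/ { hits[$1]++ }                             # regexp literal + associative array
    END { for (h in hits) print shout(h), hits[h] }    # for (x in y)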


They spent a lot of effort making one-line Perl able to handle most of the functionality of awk, sed, et al.


A long while ago I wrote up a little processor to determine field lengths in a given file - I forgot the original reason. ( https://github.com/sullivant/csvinfo )

However, I feel I really should have taken the time to learn Awk better as it could probably be done there, and simply! (It was a good excuse to tinker with rust, but that's an aside.)


For some idea, a one liner to find the (last) longest username and length in /etc/passwd:

  $ awk -F: '{len=length($1);if(len>max){max=len;user=$1}}END{print user,max}' /etc/passwd


Thanks for that reply! It's good to work with an example.


I'll mark this on my GitHub when I get back on a computer, I take public datasets and make graphs and transforms and reports. The big survey companies have weird data records and having to write a parser is my least favorite part. I think other people who ingest my content don't appreciate the effort, but that's a near universal feeling I think, heh.


If I don't use awk, I throw tr -s ' ' into the pipeline, and then the delimiter is a single space, so you can just cut.


That will collapse multiple spaces, but won't handle a mix of spaces and tabs, which awk will handle.


choose from your link does look nice for simple column selection.

   echo -e "foo   bar   baz" | choose -1 -2
vs awks

   echo -e "foo   bar   baz" | awk '{ print $2, $3}'
I love the effort people are putting into reinventing the core unix tools.

I think I'll stick with Awk for now though.


The problem with new tools is

$ choose

bash: choose: command not found...


  ls -l | tr -s ' ' | cut -d ' ' -f 5


Exactly! Exactly! And now fix it to work with tabs :-)


And leading whitespace. Compare:

  $ printf " one two  three"  | tr -s ' ' | cut -d ' ' -f 1

  $ printf " one two  three"  | awk '{print $1}'
  one


  ps ax | sed 's/^\s\+//; s/\s\+/ /g;' | cut -d ' ' -f 4


  echo -e '1\t2\t3\t4\t5' | expand -t 1 | cut -d ' ' -f 3


I really appreciate you writing this guide. As a long time Linux user, I've always wanted to learn AWK, but it seemed too daunting. Three minutes into your guide and I immediately saw how I could use it in my day-to-day usage.


I blame GNU's man page. I was in the same situation for the longest time, but stumbled over a man page for a simpler implementation of awk (plan9's, in my case) and learned it in 10-15 minutes (not claiming I understood it more than partially in that time of course, but enough to write my own small programs).

Since then I've made a point of finding man-pages from other systems whenever the manual for a GNU tool is a bit daunting. It tends to lower the learning threshold quite a lot, honestly.

    $ man gawk | wc
       1568   13030   94207
    $ man -l /usr/share/man/man1/awk.1plan9.gz | wc 
        214    1579   10956
Not trying to detract from this great guide. Just a general tip :)


I tend to use cat-v since you can check really old versions which tend to be far simpler.

http://man.cat-v.org/unix_7th/1/awk


Thank you! It took me longer to write than I expected it would. I was originally just going to do some small examples of each idea.

But once I got the idea of aggregating the book review data from amazon I felt I had to see it through.


As someone who's never used awk before, I really enjoyed this write-up and I think it was very well written!


Chiming in: I had a feeling that the article and the comments here would contain some jewels, and both have exceeded expectations.


Lots of great AWK tutorials in here that are more in depth, but I'll share another. I always go back to Brian Kernighan's personal help file:

https://www.cs.princeton.edu/courses/archive/spring19/cos333...

Brian Kernighan has a knack for explaining languages very precisely and elegantly.


And for the flash card type of learners it is good to see the "HANDY ONE-LINE SCRIPTS FOR AWK" page is still available. See the links in the Credits section at the bottom for more great reading:

https://www.pement.org/awk/awk1line.txt

That author also edited the "USEFUL ONE-LINE SCRIPTS FOR SED" page:

https://www.pement.org/sed/sed1line.txt


Can't recommend the gawk manual and "The awk manual" enough:

https://www.gnu.org/software/gawk/manual/gawk.pdf

and

http://www.cs.unibo.it/~sacerdot/doc/awk/nawkA4.pdf


The original language specification, written by the authors, is now free online. Chapter 2 covers the whole language in a little over 40 pages.

https://archive.org/download/pdfy-MgN0H1joIoDVoIC7/The_AWK_P...


I have a copy on my bookshelf! Didn't have a PDF though, nice.

The gawk one is useful if you're into some of the GNU-specific extensions.


Severely underrated comment.

Having relied heavily on the (unofficial, non-GNU) gawk manpage (it's quite good), I instantly started learning very useful features reading the GNU docs. (I still need to fully internalise those.) Yes, the full manual is very much better than the manpage.

(Also recommend The AWK Programming Language mentioned here, though I'd suggest the GNU manual adds to that as well.)


I'm always happy when I see posts that promote AWK. It's a very underappreciated tool in my opinion. I was a Linux user for 20 years before I got familiar with it. AWK is super powerful for text processing, and I like that it's included in Busybox for use on the embedded systems that I design.

For any complex text processing, it's way better and more robust than having a super long pipeline of a bunch of sed/grep.

Most recently, I used awk in a script that parses /proc/mounts to grab the mountpoint of a partition, or print something different if the partition isn't mounted. Doable with a bunch of sed/grep and some shell logic? Definitely. But easier and cleaner in AWK, and equally easy to inline in a shell script.
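
A minimal sketch of that kind of script (the device path here is hypothetical, not the original):

    mnt=$(awk -v dev=/dev/mmcblk0p1 \
        '$1 == dev { print $2; found = 1 }
         END { if (!found) print "not mounted" }' /proc/mounts)
    echo "$mnt"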


I've never bothered to learn much AWK, but that's mostly because Perl is my bread and butter language and has been for 20 years, and focusing on knowledge of that seemed a better investment (especially since with a few judicious flags, Perl is a passable AWK replacement even for very small one liners).

That said, if you just want to supplement your knowledge of other shell tools and pull out something that can do some obvious text munging, AWK has always looked attractive for the task to me.


The problem is that awk is in POSIX, and perl is not.

There are two common sources of awk for Windows, for example, that drop one exe to provide the interpreter:

http://unxutils.sourceforge.net/

https://frippery.org/busybox/

Perl simply wasn't designed to do that.


https://sourceforge.net/projects/ezwinports/files/

Has a recent version of GAWK (GNU AWK).

I think if you install git for Windows, you get AWK as well.


> I think if you install git for Windows, you get AWK as well.

Yes, but you also get perl there too.


Sorry, how is this different from installing Perl? Both still need a download. Is it because awk is contained in a larger toolbox, so it's easier to end up with awk for other reasons? P.S. Thanks, I didn't know I already had awk in Git Bash...


But perl is available by default in almost every free *nix, and for most people, Windows isn't a requirement


In HP-UX 10.20, there was a single perl4 binary that dates from 1993 in /usr/contrib/bin (I checked our K380).

Had Perl held to a single binary, it could be inside busybox. That hasn't been possible since 1993, so Perl cannot be inside busybox.


Usually not a requirement but it's a "nice to have" sometimes.


This comment reminds me of the joke about the sewage worker who is chatting about his job and says, "to you it may be just shit, but to me it's my bread and butter."

The Internet always seems similar: pipes, garbage in, and "munging" on Urban Dictionary.....


I do a lot of work with structured data--json, yaml, etc. For me, this is how I feel about jq. One of my favorite use-cases is querying Kubernetes resources. E.g., `kubectl get secret <secret-name> -o json | jq -r '.data | map_values(@base64d)'` (fetch a secret and decode all of its values).


Came here to say this. Glad to see /bin getting respect.

To anyone processing huge quantities of text and text files: someone very likely had the same problem you're facing back in the 1980s, and there's a Unix/GNU tool for it already.


I was introduced to *nix through processing very large text files that the text editors I was familiar with choked and died on. Someone showed me sed/awk/grep, and it took seconds to process what other GUI editors couldn't even open. Never looked back.


For some purposes, awk+xargs can replace hours of work to write a tool to automate some process. It's my go-to for ops work that I don't expect to live very long and just needs to _happen_.
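
A toy example of the pairing (purely illustrative): checking the groups of every regular user, in parallel.

    awk -F: '$3 >= 1000 { print $1 }' /etc/passwd | xargs -n1 -P4 id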

Also, happy 1337 karma day :).


> awk+xargs can replace hours of work

Including machine hours of work.

Wasn't there a famous story of replacing a Hadoop cluster with an awk script (which was a couple orders of magnitude faster)?

Oh yes, there was: https://news.ycombinator.com/item?id=17135841


In fairness, it's xargs that provides the command parallelization, not awk, but I agree the two combined are a good match.


If one considers the idea of map reduce to be taking a set of data and ending up with a subset that is relevant, I've used tons of simple things to do that, and never Hadoop.

I think parsing logs to find pain areas or potential exploit/exfil is a map reduce job, for instance, and grep or awk can manage that just fine.


Not having to parse the output at all is even better. I really like the way Powershell can pass structured data like this.

I'm a huge Linux/Unix fan but sometimes a rethink really works out. I hope Linux will get something similar. I know Powershell is available for Linux but without an adapted userland there's not much benefit


I have used it for many things, such as extracting BibTeX entries, checking for missing definitions of abbreviations, converting remind entries to ICS format, etc. A recent example is printing a calendar like cal but with calendar dates highlighted.


Yep, awk is lovely and well worth the time to learn.

This is probably not important for embedded, but doesn't a pipeline of small scripts (which could be in awk) give you better threading support?

xargs, GNU parallel, or even make can then scale that out really quickly.
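
For instance, a hypothetical fan-out of one awk script over many inputs (file names made up):

    # run summarize.awk over every log, 8 jobs at a time
    printf '%s\n' logs/*.log | xargs -n1 -P8 awk -f summarize.awk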


There are things I've come to dislike and avoid when programming in general:

- Avoid programming in strings (especially in Bash, where nested quotes are full of pitfalls)

- Avoid magic switches that change behavior (like -F)

- Avoid terse or cryptic variable names (like $NF)

- Avoid terse and magical syntax (sorry Perl, happy to leave you behind me)

- Avoid programs that are hard to read

- Avoid programs that are difficult to debug while writing them

- Avoid programs that ignore types

For these reasons, I prefer to avoid awk for anything except the most trivial of tasks. I think the prevalence of scripting languages and the speed of execution and debugging today has made awk not as necessary as it may have been in the 70s. And as to the first point, I'm aware you can write awk scripts in files, but I feel like if your script has gotten complex enough that you need a file, you're creating something unmaintainable and unreadable that would be better suited to a different programming language.

Edit: I should add this article is great and a good introduction to awk, regardless of my personal taste for the tool.


I've been doing systems work for 20 years. Here's why most of those things are actually good:

- Strings are subtly complex, but strings are not variables. You can assign a string, and later handle it as a variable, and not deal with any of the specifics of string-iness. Likewise, you can take a variable, and later treat it as a string (for loosely or not-typed variables).

- Magic switches are not magic, they are options. Virtually every program takes options. Sometimes they impact a lot of things, sometimes a little. Only the context determines how much is "too much".

- Terse/cryptic variables allow you to write complex expressions in a compact form. This allows you to read more in a small space, making it easier to reason about or form complex expressions. Human languages are flush with these, as is mathematics. But you have to balance the terse, cryptic and magical with guilelessness, or it becomes a mess.

- Terse and magical syntax is, again, a feature, not a bug. Using magical syntax I can do in a few characters what would take me many lines with a traditional language, and as we all know, increased number of lines correlates to bugs, in addition to simply making it harder to grok.

- Types aren't ignored, but they may be very loosely enforced. If you want to write a quick program to get something done, typing is a curse. If you want to write a very thorough program, typing is a blessing. In many cases, loosely or untyped programs actually work better than their typed cousins, because they allow for more unexpected behaviors without failing. Failing early and often may be a modern trend, but... it literally means things fail more, and this is often not desirable.

Caveats:

- Programs that are hard to read do indeed suck, and it takes lots of experience to make some kinds of programs easier to read. But that's not an indictment of the program, it's an indictment of the person who wrote it. We don't indict English when somebody writes a document that's impossible to comprehend.

- Interestingly, some of the more popular languages are the worst to debug. Perl is probably one of the easiest languages to debug, not inconsequently because of how good the interpreter is at suggesting to the user what the actual problem was and almost exactly how to fix it.


I take it you LOVE ada :-)

There is a lot of wisdom in the things you avoid, however I would ask one question, "How often do you use it?"

For me, the best systems are those that can be wordy and prescriptive but as you get to know them you can use more short hand so they "get out of the way" as it were. A good example of that philosophy is keyboard short cuts. When I'm learning a program I'm happy to pause and sling the mouse around to find the thing I need in the labeled menu stack with an appropriate name which also tells me what the keyboard short cut is for that thing. Then as I get better I can just use the short cut and my workflow gets faster. Once I've internalized the keymap my flow is held up by how fast I can think, not by how fast I can take my hand off the keyboard, move the mouse, click and then put it back on the keyboard.

Awk is one of those things that once you internalize what it can do, you can use it for a lot of stuff, and you can do it quickly.


Yep, anything more than 1 line would be a python script. The author vaguely tries to answer that but doesn't really give anything more satisfying than "you can rewrite it in python" and "why not?".

And if it's something you don't know and are experimenting with, you can also open ipython and play around with the data (kinda like he's doing throughout the article) and keep your variables between each run without having to write intermediate values to disk or keep piping over and over.

Let alone the huge stdlib and 3rd party lib you have access to if needed.


The thing that prevents awk from being a major part of my daily routine is that it (amazingly) has poor CSV support. Consider the following:

  col1,col2,col3
  1,2,3
  4,"hello, \"world\"",6
  "7 buckets",,9

To get the usual awk experience with this very common file format, exactly the type of thing you want to parse with awk, you first need to install gawk, then use a big FPAT regex that needs to be adjusted for any new CSV variant.

I would love to see awk with a "CSV mode", where it intelligently handles formats like this if you just pass a flag. I think awk would do well to differentiate itself with excellent 2D dataset parsing functionality, but at least catching up to the average scripting language would be great.

I'm half expecting someone to say "just pass -csv it does what you want" and if so I'll be very excited.
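
For context, the FPAT approach alluded to above looks roughly like this (the pattern from the gawk manual; it does not handle embedded quotes or newlines):

    gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print $2 }' file.csv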


There is an answer to CSV mode a bit further down the page

https://news.ycombinator.com/item?id=28708145

...but if your files are CSV, there is a CSV extension for gawk

    @include "csv"
    BEGIN { CSVMODE = 1 }


Well there you go, for the sake of my pride at least it's an extension.

It's funny that searches for awk CSV seem to yield a bunch of SO questions where the answers are increasingly cumbersome regexes instead of this extension.

Of course, you can't count on this extension being widely installed, but it's great for my own desktop.


That's because the extension only works in gawk; it's not portable anywhere else.


You can just use https://github.com/Nomarian/Awk-Batteries/blob/master/Units/... like so:

    awk -f ./ucsv.awk -e '{print $5}'
Also, this:

> 4,"hello, \"world\"",6

is incorrect per https://tools.ietf.org/html/rfc4180, so you should just fix it with sed -i 's/\\"/""/g' and then parse as normal.

https://github.com/Nomarian/Awk-Batteries/wiki/Formats


'miller' and 'xsv' are pretty good tools for wrangling CSV. (And regexp is kind of a terrible tool for it, too many edge cases.)


Yeah, I don't want to have to write a CSV library each time, that's what I'm trying to get at.

I just end up using Python/Perl but I do have a soft spot for awk so it would be cool if good support was built-in.


Who's writing a library? Just use xsv or miller to extract the bits you want from the CSV, change the delimiter or escapes to something more convenient, etc., then feed that to awk or other CSV-unaware text processors.


I was agreeing with your point about regexes, that it's good to avoid trying to deal with all the corner cases yourself when you're just trying to write a small script.


Ah, understood! CSV is funny, it seems like a more trivial thing than it really is, and its human readability sort of invites broken approaches in a way that something like Parquet would not.

XML is somewhere in the middle--I've seen some horrible abuses of CDATA sections way back when--but at least there are accepted ways to prove what's invalid.


There is a small program I wrote called csvquote[1] that can be used to sanitize input to awk so it can rely on delimiter characters (commas) to always mean delimiters. The results from awk then get piped through the same program at the end to restore the commas inside the field values.

In principle:

  cat textfile.csv | csvquote | awk -f myprogram.awk | csvquote -u > output.csv
Also works for other text processing tools like cut, sed, sort, etc.

[1] https://github.com/dbro/csvquote


I use awk for one-liners, no more.

Looking at my command history, I mostly use awk to extract a field like this:

   <something> | awk '{print $3}'
(I know "cut" is supposed to do the same thing, but it was never reliable for me - maybe tabs/spaces?)


Here is a GAWK program of mine that implements outgoing SMTP. While not a one-liner, this is much shorter and less tedious than trying to do it in C.

    $ cat /bin/awkmail
    #!/bin/gawk -f

    BEGIN { smtp="/inet/tcp/0/smtp.yourco.com/25";
    ORS="\r\n"; r=ARGV[1]; s=ARGV[2]; sbj=ARGV[3]; # /bin/awkmail to from subj < in

    print "helo " ENVIRON["HOSTNAME"]        |& smtp;
    smtp |& getline j; print j
    print "mail from: " s                    |& smtp;  smtp |& getline j; print j
    if(match(r, ","))
    {
     split(r, z, ",")
     for(y in z) { print "rcpt to: " z[y]    |& smtp;  smtp |& getline j; print j }
    }
    else { print "rcpt to: " r               |& smtp;  smtp |& getline j; print j }
    print "data"                             |& smtp;  smtp |& getline j; print j

    print "From: " s                         |& smtp;  ARGV[2] = ""   # not a file
    print "To: " r                           |& smtp;  ARGV[1] = ""   # not a file
    if(length(sbj)) { print "Subject: " sbj  |& smtp;  ARGV[3] = "" } # not a file
    print ""                                 |& smtp

    while(getline > 0) print                 |& smtp

    print "."                                |& smtp;  smtp |& getline j; print j
    print "quit"                             |& smtp;  smtp |& getline j; print j

    close(smtp) } # /inet/protocol/local-port/remote-host/remote-port


Cheap fix: the space after MAIL FROM: and RCPT TO: is not standards-compliant.


That's a beautiful use of the language, it reminds me of some of the awk CGI efforts out there.

For example: https://www.gnu.org/software/gawk/manual/gawkinet/html_node/...


IMHO, that's too big for awk, why not python?

for example:

    #!/usr/bin/python
    import smtplib
    from email.mime.text import MIMEText

    msg = 'hi'
    subj='read this!'

    smtp_server='mail.example.com'
    smtp_from='me@example.com'
    smtp_to='you@example.com'

    m = MIMEText(msg)

    m['To'] = smtp_to
    m['From'] = smtp_from
    m['Subject'] = subj

    s = smtplib.SMTP(smtp_server)
    s.sendmail(smtp_from, [smtp_to], m.as_string())
    s.quit()

of course, you seem to think in gawk so if that works for you that's what you should continue doing!

by the way, I hacked this example from another script which attached a logfile:

    with open(arg.logfile) as f:
        log_contents = f.read()

    m = MIMEText(log_contents)
you can also use:

    from email.mime.image import MIMEImage
    from email.mime.text import MIMEText
    from email.mime.multipart import MIMEMultipart
and then:

    m = MIMEMultipart()
    m.attach(MIMEText('\n\n%s\n\n'%xkcd_img_title))
    m.attach(MIMEImage(xkcd_img))


Your script doesn't even do the same thing. You are importing a library that implements SMTP, which is missing the point.

The AWK script doesn't need libraries, so it can actually be useful in places where you have awk but not Python.


Consider the input "a  b" (two spaces between the fields).

Awk will treat it as having two columns (by default), while cut will treat each space as its own delimiter, so it sees three fields with an empty one in the middle.

Awk is also a little nicer for whitespace; cut makes specifying the delimiter (with, say, "-d\ ") a little more vexing.
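
The difference is easy to demonstrate (two spaces between a and b):

  $ printf 'a  b\n' | cut -d ' ' -f 2

  $ printf 'a  b\n' | awk '{print $2}'
  b

cut returns the empty field between the two delimiters; awk collapses the run.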


In my day job I use Java, and all the items you list are good ideas to maintain my sanity.

But I sometimes use Awk scripts to quickly analyse a log file to get information you need or to generate a configuration file or something.

I think of the strict strongly-typed nature of a language like Java as equivalent to the formal language a lawyer might use in a legal document: There shouldn't be any place for ambiguity or misunderstanding.

Programming in Awk, on the other hand, is more like having an informal conversation with an old friend; those cryptic variable names are like the slang words that both of you understand, and the magical syntax is like the dialect you share that may sound a bit strange to outsiders.

I think there is a place for both approaches in the broad space of computing.


For those kinds of tasks I use Awk to process the data into a SQLite database. Then I do the queries on that, since it's easier, and more advanced things (grouping, having) are much easier to express declaratively.
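
A sketch of that workflow, with a made-up log format and table:

    awk 'BEGIN { print "CREATE TABLE hits (path TEXT, bytes INT);" }
         { printf "INSERT INTO hits VALUES (\047%s\047, %d);\n", $1, $2 }' access.log | sqlite3 hits.db
    sqlite3 hits.db 'SELECT path, SUM(bytes) FROM hits GROUP BY path HAVING SUM(bytes) > 10000;'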


Yes! Another thread recently discussed best practices and whether something like that exists. I believe this is a good example.


I used to love Awk! I still do, even if I don't use it much any more.

Awk has a reputation for being hard to read (as noted in stevebmark's comment), but when I was using it actively, I tried to treat it as a serious programming language and write readable programs in it.

Several years ago I tracked down a couple of my old Awk programs from around 1990 and posted them here:

https://github.com/geary/awk

SHANEY.AWK is an implementation of the infamous Mark V. Shaney:

https://www.clear.rice.edu/comp200/09fall/textriff/sci_am_pa...

This was probably the first program that made me really impressed with Awk. People were writing rather complicated Shaney implementations in C, and I thought, "this could be really simple in Awk." And it was!

LJPII.AWK is the Awk program I'm most proud of. This was in the days when we had tiny screens and no multiple monitors and you always printed out your code to read it. In my circles we were also fond of inserting "separator lines" between functions, in various formats such as this one:

  // - - - - - - - - - - - - - - - - - -
So I wrote LJPII to print source code in "two up" format (two pages side by side in landscape mode) on my LaserJet II. It also converted the separator lines into graphical boxes, and tried to avoid splitting a function across multiple pages. It wasted some paper but made nicely readable printouts.

I wish I still had some of my old printouts, but they are long gone. One of these days I will have to see if I can update the code to work with the LaserJet emulation in my Brother printer! (It should mostly work, but I wrote this in the old Thompson Awk for DOS, so there are a couple of non-standard things in it.)

Looking at the code again, it's amusing to see some old Windows Hungarian notation which was popular/notorious back then, for example an "f" prefix for a boolean (flag) value, and "af" prefix for an array of flags.

Hungarian aside, I tried to make this code as readable as I could.

Random fun fact! Someone who used to be an avid Awk programmer is Will Hearst (William Randolph Hearst III). It's been many years since I talked with him, so no idea if he still does any Awk programming.


> Awk is a record processing tool

Actually, AWK is a domain-specific programming language. When you start treating AWK as such, you can really gain an appreciation for it. I too treated it as a dumb one-liner tool relegated to ingesting cryptic regexps in shell scripts. After reading the original AWK book, it completely changed my outlook on the language. I had no idea you could define functions or perform basic math, so one could use it for very basic tabular operations, like spreadsheets. AWK can even be used as a standalone language outside of shell scripts by writing a program, inserting a shebang on the first line calling awk, and marking the file as executable.
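
For instance, a minimal standalone script:

    #!/usr/bin/awk -f
    # number the lines of whatever is passed in
    { print NR ": " $0 }

Save it, chmod +x, and run it like any other program.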


shebangs and more complex scripts are covered in the article.

But yes, I agree that the original AWK book is really good. After covering some basics and the language reference, it has some fun projects that you can build with AWK.


At some point in every bioinformatics lecture I always manage to say something akin to: "Learn awk! (or Perl) You'll need it. Your data will come from various disparate sources, and you need to get them into some well-defined, useful format from the get-go."


Honest question: Is Awk worth learning to someone who already knows Perl?

Every time I read about Awk, I feel very intrigued, but AFAIK, everything Awk can do, Perl can just as well. Is there a compelling reason to learn Awk if you know Perl beyond pure curiosity?


For simple tasks, Awk is more concise than Perl.

While you can write on one line of Perl anything that you can write in one line of Awk, the Perl line will be longer.

So for the simple tasks that you would do from the command line or in a shell script, you can usually save time by writing Awk.

For complex tasks, where you need more complex control than the implicit loop of Awk and you also need to define various extra variables besides those pre-defined, Awk is no longer more concise.

Then obviously Perl or other scripting languages become more convenient.


To give a quantitative value to what I mean by "more" concise: I once wrote a very simple script to find some hash collisions in some files, both in Awk and in Perl.

The Awk variant had 350 bytes, while the identical Perl variant had 450 bytes.

The extra length in Perl was due to "$" prefix on all variables, an extra "split()" and an extra "while()", which were not needed in Awk.

For slightly simpler versions of those scripts executed as one-liners, the difference in length was even more in favor of Awk, because Perl needed extra command-line options to execute the script on the command line, while that is the default behavior of Awk.
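
A small illustration of the same effect (my example, not the scripts above): summing a column.

    awk '{ s += $2 } END { print s }' file
    perl -lane '$s += $F[1]; END { print $s }' file

The Perl needs -l, -a, -n, and the sigils to match what awk does by default.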


Perl is faster and more capable than Awk, and is pretty much a superset.

So don't bother as long as Perl is ubiquitous.


Perl is not generally faster than awk. Obviously there are cases where perl is faster, but for the text smashing that awk is meant for, perl is usually much slower, particularly compared to mawk or recent gawk versions. mawk in particular is shockingly fast.


That's a tough one. Probably not in the sense "I will replace Perl with AWK" but more in a sense of "I want to learn AWK to learn AWK."

It is installed pretty much everywhere, so it's nice to know. I keep thinking the same, since I know Perl.


> "I want to learn AWK to learn AWK."

Well, that's what I meant by "pure curiosity". ;-) Which is a perfectly valid reason to learn a language, of course, possibly the best reason one can have.

But I guess when I need a quick and dirty one-shot script, I'll stick to Perl for the time being. It's not officially standard like Awk, but it is available in the default installation of pretty much every open source Unix-like system I use except FreeBSD where it's one of the first packages I install.

Hypothetically, Perl is also available on systems that are not very Unix-like, such as OS/400 (or "IBM i" as it is called these days, apparently) or VMS. (Hypothetically in the sense that I have never had direct contact with one of these systems and have no expectation I ever will, sadly. ... Well, maybe I should count myself lucky - either way, I will probably never know for sure.)


I actually learned Perl before I learned Awk, and the reason I switched is that Awk is a much smaller language, so I can keep it all in my head.

I just about never have to refer to the man page. I also found that I can go for months without writing any Awk and then knock out a quick script without having to relearn anything.

I'll concede that there are things for which Awk just isn't powerful enough, and perhaps I was never that good with Perl to begin with, so if you're already well versed in Perl, YMMV.


Every time I learn Awk, I forget it. It is an infrequently used tool for me.


I love Perl, but it’s starting to fade from use. Awk is still everywhere.


Nice article. Seems we went through a very similar progression! :-D

If anyone is interested in learning more, I built a conference talk to teach awk, along with a set of exercises that have gotten pretty positive feedback:

Presentation: https://youtu.be/43BNFcOdBlY

Exercises (for you to try): https://github.com/FreedomBen/awk-hack-the-planet

Exercises (me solving): https://youtu.be/4UGLsRYDfo8


And if you learned awk(1) first, then when you saw perl for the first time it immediately made sense to you as a 'super awk'.


That happened to me. AWK -> Perl -> Ruby.


One thing that I would love to hear about is suggestions of how to make my files/output more awk-friendly.


This isn't your question but if your files are CSV, there is a CSV extension for gawk

    @include "csv"
    BEGIN { CSVMODE = 1 }


Tab separated values all the things


It's common, as in the OP, to see awk recommended for something as simple as extracting a column from tab- or space-separated values. IMO, it's quite a bit of typing to do on the fly at a command prompt. Performance-wise, it could be significantly slower than other utilities that are equally as ubiquitous as awk.

    echo one two three|awk '{print $2}'
Are there other ways to do this? Are they faster?

    cat > awc
       #!/bin/sh
       test $# -eq 1||exit
       exec tr \\40 \\11|exec cut -f "$1"|exec tr \\11 \\40
    ^D

    echo one two three|awc 2
Test it on a file to see if it is faster than awk.

    time awk '{print $2}' file
    time awc 2 < file


Is the exec needed in the pipeline?


       x(){ tr \\11 \\40;}
       x|cut -f "$1"|x


Correction

       x(){ tr \\40 \\11;}
       y(){ tr \\11 \\40;}
       x|cut -f "$1"|y


I've never gone further than thinking about it, but I've always been curious as to how simple it would be to use Awk as an interpreter for a really simple Tcl-like language:

    set a 1
    set b 2

    define add (n,m) $n + $m

    set result [add a b]
I think it would be simple enough to come up with some Awk pattern/actions to parse the above and execute the commands.
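
A tiny sketch of just the "set" command plus a hypothetical "puts", to show the pattern/action shape this would take (obviously incomplete):

    $1 == "set" && NF == 3 { vars[$2] = $3; next }
    $1 == "puts"           { print vars[$2]; next }

The [bracket] substitution and "define" would need a real tokenizer and recursive evaluation, which is where it stops being trivial.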


Significant past threads. I had to leave a ton of submissions out! Any others that are particularly good?

Awk: The Power and Promise of a 40-Year-Old Language - https://news.ycombinator.com/item?id=28441887 - Sept 2021 (118 comments)

Awk is the coolest tool you don't know - https://news.ycombinator.com/item?id=27039608 - May 2021 (20 comments)

CGI with Awk on OpenBSD Httpd (2020) - https://news.ycombinator.com/item?id=27037113 - May 2021 (22 comments)

The State of the Awk - https://news.ycombinator.com/item?id=25142867 - Nov 2020 (58 comments)

Awk: `Begin { ` Part 1 - https://news.ycombinator.com/item?id=24940661 - Oct 2020 (106 comments)

Show HN: Awk-JVM – A toy JVM in Awk - https://news.ycombinator.com/item?id=23612910 - June 2020 (27 comments)

Running Awk in parallel to process 256M records - https://news.ycombinator.com/item?id=23394024 - June 2020 (101 comments)

The State of the AWK - https://news.ycombinator.com/item?id=23240800 - May 2020 (86 comments)

Awk in 20 Minutes (2015) - https://news.ycombinator.com/item?id=23048054 - May 2020 (126 comments)

Show HN: An eBook with hundreds of GNU Awk one-liners - https://news.ycombinator.com/item?id=22758217 - April 2020 (48 comments)

Learn Awk by Example (2019) - https://news.ycombinator.com/item?id=22455779 - March 2020 (29 comments)

Awk As A Major Systems Programming Language, Revisited (2018) - https://news.ycombinator.com/item?id=22304017 - Feb 2020 (80 comments)

Why Learn Awk? (2016) - https://news.ycombinator.com/item?id=22108680 - Jan 2020 (235 comments)

Learn Just a Little Awk (2010) - https://news.ycombinator.com/item?id=21101478 - Sept 2019 (69 comments)

Awk by Example - https://news.ycombinator.com/item?id=20308865 - June 2019 (21 comments)

Removing duplicate lines from files keeping the original order with Awk - https://news.ycombinator.com/item?id=20037366 - May 2019 (154 comments)

GNU Awk 5.0 - https://news.ycombinator.com/item?id=19671983 - April 2019 (49 comments)

Learn just a little Awk (2010) - https://news.ycombinator.com/item?id=17322412 - June 2018 (244 comments)

The Awk Programming Language (1988) [pdf] - https://news.ycombinator.com/item?id=17140934 - May 2018 (207 comments)

Learn to use Awk with hundreds of examples - https://news.ycombinator.com/item?id=15549318 - Oct 2017 (116 comments)

Awk for multimedia - https://news.ycombinator.com/item?id=15410259 - Oct 2017 (24 comments)

Awk driven IoT - https://news.ycombinator.com/item?id=14735752 - July 2017 (35 comments)

Skip grep, use awk - https://news.ycombinator.com/item?id=14692233 - July 2017 (130 comments)

Awk vs. Perl (2009) - https://news.ycombinator.com/item?id=14647022 - June 2017 (71 comments)

The Awk Programming Language (1988) [pdf] - https://news.ycombinator.com/item?id=13451454 - Jan 2017 (103 comments)

Show HN: 3D shooter in your terminal using raycasting in Awk - https://news.ycombinator.com/item?id=10896901 - Jan 2016 (55 comments)

Awk in 20 Minutes - https://news.ycombinator.com/item?id=8893302 - Jan 2015 (85 comments)

An Awk Primer - https://news.ycombinator.com/item?id=7961848 - June 2014 (28 comments)

A Crash Course In Awk - https://news.ycombinator.com/item?id=6578960 - Oct 2013 (37 comments)

Why Awk for AI? (1997) - https://news.ycombinator.com/item?id=5725291 - May 2013 (53 comments)

Ask HN: Do people build websites in Awk? - https://news.ycombinator.com/item?id=5041323 - Jan 2013 (12 comments)

Why you should learn just a little Awk - A Tutorial by Example - https://news.ycombinator.com/item?id=2932450 - Aug 2011 (76 comments)

Announcing my first e-book "Awk One-Liners Explained" - https://news.ycombinator.com/item?id=2674284 - June 2011 (24 comments)

AWK-ward Ruby - https://news.ycombinator.com/item?id=2486231 - April 2011 (31 comments)

Music with AWK - https://news.ycombinator.com/item?id=2294909 - March 2011 (15 comments)

Exercise #1: Learning awk Basics - https://news.ycombinator.com/item?id=2210085 - Feb 2011 (20 comments)

Why you should learn at least a little bit of Awk - https://news.ycombinator.com/item?id=1738688 - Sept 2010 (62 comments)

Don't MAWK AWK - the fastest and most elegant big data munging language - https://news.ycombinator.com/item?id=815529 - Sept 2009 (22 comments)


I find it amusing that AWK is coming back. I used it extensively back in the day, but let it go when I picked up Perl 4 and then Perl 5. So Perl is no longer king for unix scripting; it was replaced by other languages. But it seems like there is a niche they were not able to fill, since AWK is back.


If that is true it could be that people are tired of the packaging/dependency hell of "other languages".


"If you like this you might also like" https://ferd.ca/awk-in-20-minutes.html

I too am happy to see more Awk material in the world, once I learned a bit about it I started reaching for it more and more.


Thanks for this tutorial and everyone else that posted some great tips and links. I find myself needing to use awk once in a blue moon and every time it eats a lot of my time. I hope I remember your tutorial next time I need it.


This was one of the best awk tutorials I've read; it's very concise and digestible. I sometimes use awk, but the more complex things get, the more I feel like I cannot use it. This tutorial made me feel otherwise.


So this isn't related to the article so much, but to something the article reminded me about: why do people use /usr/bin/env to find a program rather than setting PATH within the script to a known-good value and then using that to locate things?

The PATH that /usr/bin/env searches is (essentially) a global variable that can change underneath you, right? I mean, that just screams "variable that may be changed by others" to me.

I've never understood why /usr/bin/env exists.


Portability.

The /usr/bin/env trick will work on a wide range of systems, in which even common utilities might have numerous locations: /bin, /sbin, /usr/bin, /usr/local/bin, /opt, or others. If you're writing scripts for portability and others, this has value.

That said, /usr/bin/env fails on Android/Termux AFAIU.


Thanks! I had been putting it off, but after looking at the article, I wrote a little but useful script with a line of awk in it.


Well, this is OK I guess. But if you really want to learn Awk you want the book "The AWK Programming Language", mostly written by Brian Kernighan (he's the K in AWK and in K&R), and as usual for all of his books, it's brilliant.


It's awksome :)


This is a great model of how to do a tutorial.





