Why Learn Awk? (2016) (jpalardy.com)
420 points by LinuxBender on Jan 21, 2020 | 235 comments



I use awk because there's an almost 100% chance that it's going to be installed on any unix system I can ssh into.

I use awk because I like to visually refine my output incrementally. By combining awk with multiple other basic unix commands and pipes, I can get the data that I want out of the data I have. I'm not writing unit tests or perfect code, I'm using rough tools to do a quick one-off job.

For instance, "mail server x is getting '81126 delayed delivery' from google messages in the logs, find out who is sending those messages".

# get all the lines with the 81126 message. Get the queue IDs, exclude duplicates, save them in a file.

cat maillog.txt | grep 81126 | awk '{print $6}' | sort | uniq | cut -d':' -f1 > queue-ids.txt

# Grep for entries in that file, get the from addresses, exclude duplicates.

cat maillog.txt | grep -F -f queue-ids.txt | grep 'from=<' | awk '{print $7}' | cut -d'<' -f2 | cut -d'>' -f1 | sort | uniq

Each of those 2 one-liners was built up pipe-by-pipe, looking at the output, finding what I needed. It's not pretty, it's not elegant, but it works. I'm sure there's a million ways that a thousand different languages could do this more elegantly, but it's what I know, and it works for me.


I know you’re not asking for awk protips but you can prefix the block with a match condition for processing.

... | grep foo | awk '{print $6}' | ...

becomes

... | awk '/foo/{print $6}' | ...

If you start working this into your awk habits you’ll find delightful little edge cases that you can handle with other expressions before the block (you can, for example, match specific fields).
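For instance, you can match on a specific field instead of the whole line (the field number here is made up, not taken from the actual maillog):

    ... | awk '$7 ~ /81126/ {print $6}' | ...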


No one has mentioned changing the default field separator, e.g.,

  awk FS=:   '{print $1}' instead of cut -d: -f1

  awk FS="<" '{print $2}' instead of cut -d'<' -f2

  awk FS=">" '{print $1}' instead of cut -d'>' -f1


No need to explicitly set FS! Just use:

echo test,123 | awk -F, '{print $1}'


Yikes. The syntax I had was wrong anyway. Should have been

   awk 'BEGIN {FS=":"};{print $1}'

One benefit of the FS variable over -F, at least in original awk, is that by using FS the delimiter can be more than one character. I guess that's why I remember FS before I remember -F. More flexible.


-F does allow multicharacter separators (at least true for me on bash shell and gawk)

    $ echo 'Sample123string42with777numbers' | awk -F'[0-9]+' '{print $2}'
    string


you were close! the following works as well

  awk -v FS="\t"


If I am not mistaken, -v is GAWK only.


Every contemporary AWK supports -v. Real AWK from UNIX® supported -v since at least the '80s.


True. But there are differences when -v is used, as opposed to FS. Try this, where "nawk" is Lucent awk used by BSD

     cat > 1.awk << eof
     { print ARGC }
     eof

     echo | nawk -f 1.awk FS=":"
     echo | gawk -f 1.awk FS=":"
     echo | nawk -f 1.awk -v FS=":"
     echo | gawk -f 1.awk -v FS=":"


That is not how FS is set; it's set with -F. And there is actually no need to use -v; passing variables at the end works consistently across all AWKs and always has:

  echo "" | awk '{print Bla;}' Bla="Hello."


What if you set FS with -F but then later in the script want to change FS to something else.


The results will be unpredictable at best; either set it with -F, or use 'BEGIN {FS = "...";}', but not both.


So is -F, IIRC.


-F has always been supported by real UNIX®️ AWK; that's where -v and -F come from.


BUGS The -F option is not necessary given the command line variable assignment feature; it remains only for backwards compatibility.

EXAMPLES Print and sort the login names of all users:

    BEGIN  { FS = ":" }
           { print $1 | "sort" }

The above is from the GAWK manpage. FWIW, the first example under EXAMPLES uses FS not -F.

There is nothing wrong with using FS instead of -F.


GAWK is not a real AWK!!! When will you people learn that GNU is not UNIX®️?

FS is not used on the command line and doing so is asking for trouble. FS is a built-in variable and as such is treated specially.


To pile on :-) you often want the -w (match word) flag to grep.

In awk, I couldn't find how to do this. I tried /\bfoo\b/ and /\<foo\>/ but neither worked. I don't know why, and don't care enough, which brings me to my major awk irritation ...

It doesn't use extended or Perl REs, which makes it quite different from Ruby, Perl, Python, and Java. Now, according to the man page it does (at least on OSX; see man re_format), but as mentioned it didn't work for me.

Details

   $ echo fish | awk  '/\bfish\b/' 
gets nothing, vs

   $ echo fish | perl -ne  '/\bfish\b/ && print' 
fish


UGH! Found the problem; it simply doesn't work. Assuming the OSX awk is the same as the freebsd awk there is a very old open bug on this:

awk(1) does not support word-boundary metacharacters https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=171725


GNU awk supports \< and \> for start and end of word anchors, which works for GNU grep/sed as well

GNU awk also supports \y which is same as \b as well as \B for opposite (same as GNU grep/sed)
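For example (assuming gawk; \y only matches at word boundaries, so the prefixed and embedded cases are skipped):

    $ printf 'fishstick\nfish\ngoldfish\n' | gawk '/\yfish\y/'
    fish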

Interestingly, there's a difference between the three types of word anchors:

    $ # \b matches both start and end of word boundaries
    $ # 1st and 3rd line have space as second character
    $ echo 'I have 12, he has 2!' | grep -o '\b..\b'
    I 
    12
    , 
    he
     2

    $ # \< and \> strictly match only start and end word boundaries respectively
    $ echo 'I have 12, he has 2!' | grep -o '\<..\>'
    12
    he

    $ # -w ensures there are no word characters around the matching text
    $ # same as: grep -oP '(?<!\w)..(?!\w)'
    $ echo 'I have 12, he has 2!' | grep -ow '..'
    12
    he
    2!


Sure, but a fair bit of the value of the tool is its consistency across platforms.

There's no point in awk if perl etc are ubiquitous and more consistent.


\< and \> work with GNU's awk:

  $ printf "fishstick\nfish\ngoldfish\n" | awk '/\<fish\>/' 
  fish


\b is Perl RE, not ERE. AWK not only supports ERE's, but POSIX RE's as well.


On the other hand, grep can be far faster for searching alone than awk. I almost always use an initial grep for the string that will most reduce the input to the rest of the pipeline. Later, it feels idiomatic to mix in awk with matches like you suggested


Depends on the awk. mawk is surprisingly fast.


Right. I don't consider that particularly exhaustive at all, and this has helped me when I wanted to do quick searches.


I always forget about that, and I should try more to remember it. Thank you for the tip!


I disagree, it's quite elegant if you think in terms of relational algebra operators:

* Projection (Π): awk and cut for simple cases

* Selection (σ): grep for simple cases, otherwise sed & awk

* Rename (ρ): sed

* Set operators: join, comm...
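A rough sketch of how that mapping might look with concrete commands (the file names, field numbers, and values are made up; the comm/join lines assume bash):

    awk -F'\t' '{print $1, $3}' a.tsv           # projection (Π): keep columns 1 and 3
    awk -F'\t' '$2 == "paid"' a.tsv             # selection (σ): rows where column 2 is "paid"
    sed 's/^invoice/receipt/' a.tsv             # rename (ρ): relabel a value
    sort -u a.tsv b.tsv                         # set union
    comm -12 <(sort a.tsv) <(sort b.tsv)        # set intersection (bash process substitution)
    join -t$'\t' <(sort a.tsv) <(sort b.tsv)    # natural join on the first column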


Bravo! This is one of the most insightful comments I've read in a long time! I have been using some of these tools for years but I never thought of describing them this way. Now I can think of writing a complex query in relational algebra and translating it into these commands in a very natural way.


Here's an interesting article that links shell scripting and relational algebra - http://matt.might.net/articles/sql-in-the-shell/


Indeed, and with a bit of tuning (e.g., using mawk for most things), one can get quite good performance. [1] The project also provides a translator from Datalog to bash scripts [2].

Disclaimer: I was one of the authors

[1] https://www.thomasrebele.org/publications/2018_report_bashlo... [2] https://www.thomasrebele.org/projects/bashlog/datalog


Thank you, and thank you (really, not sarcasm) for the new stuff I have to learn about relational algebra. I'm a huge fan of wide/shallow knowledge that allows me to dive into a subject quickly.


I’m pretty mathsy but I don’t get this


It is from relational algebra used in database theory. There is an excerpt from one of the first MOOCs, now offered on Stanford's Lagunita platform.[1] It is pretty intuitive once you get the hang of it.

[1] https://lagunita.stanford.edu/courses/DB/RA/SelfPaced/course...


Thank you for this context


ntfsql dreams


Its ubiquity and performance open up all kinds of sophisticated data processing on a huge variety of *nix implementations. Whether it's one liners or giant data scrubs, awk is a tool that you can almost always count on having access to, even in the most restrictive or arcane environments.


It's far more elegant and concise than any other scripting language I can think of using to accomplish the same thing.

As the article points out, other languages will have a lot more ceremony around opening and closing the file, breaking the input into fields, initializing variables, etc.


The practical component of any software engineering degree should include a simple course on common Unix tools, covering grep, awk, sed, PCRE, and git.

A little bit of knowledge here goes a LONG way.


I wholeheartedly agree. I've seen people agonize for days over results from Splunk that they want to turn into something more user-friendly. 15 minutes of messing around with the basic command line Unix tools has that information in a perfect format for their needs.

This is something I need to bring up with my coworkers; I should write some sort of basic guide to Unix tools for them.


> it's not elegant

I completely disagree.


Thank you, I too often talk down what I do.


Eloquently put!


Because for some bizarre reason, "cut" doesn't ship with any decent column selection logic that is the equivalent of awk's $1, $2, etc., even in 2020.

That's like 90% of my use of awk right there. I don't know of any easier equivalent of "awk '{ print $2 }'" for what it does.

Posted partially so the Internet Correction Squad can froth at the mouth and set me straight, because I'd love to be shown to be wrong here.


> I don't know of any easier equivalent of "awk '{ print $2 }'" for what it does.

Does `cut -f2` not work? My complaint with cut is that you can't reorder columns (e.g. `cut -f3,2` )

Awk is really great for general text munging, more than just column extraction, highly recommend picking up some of the basics

Edit to agree with the commenters below me: If the file isn't delimited with a single character, then cut alone won't cut it. You need to use awk or preprocess with sed in that case. Sorry, didn't realize that's what the parent comment might be getting at.


It does not. Compare:

    $ echo '   1     2   3' | cut -f2
       1     2   3
    $ echo '   1     2   3' | cut -f2 -d' '
    
    $ echo '   1     2   3' | awk '{print $2}'
    2
"-f [...] Output fields are separated by a single occurrence of the field delimiter character."


echo ' 1 2 3' | tr -s ' ' | cut -b 2- | cut -d' ' -f2

or

echo ' 1 2 3' | tr -s ' ' '\t' | cut -b 2- | cut -f2


oh yes, totally agree. if the data isn't delimited by a single character, then you definitely need awk or sed+cut


Also, the field separator (FS) can be a regular expression.

    FS = "[0-9]"


IIRC, there is an invocation of cut that basically does what I want, but every time I try, I read the manual page for 3 or 4 minutes, craft a dozen non-functional command lines, then type "awk '{ print $6 }'" and move on.


> IIRC, there is an invocation of cut that basically does what I want

I don't think there is, because cut separates fields strictly on one instance of the delimiter. Which sometimes works out, but usually doesn't.

Most of the time, you have to process the input through sed or tr in order to make it suitable for cut.

The most frustrating and asinine part of cut is its behaviour when it has a single field: it keeps printing the input as-is instead of just going off and selecting nothing, or printing a warning, or anything which would bloody well hint let alone tell you something's wrong.

Just try it: `ls -l | cut -f 1` and `ls -l | cut -f 13,25-67` show exactly the same thing, which is `ls -l`.

cut is a personal hell of mine, every time I try to use it I waste my time and end up frustrated. And now I'm realising that really cut is the one utility which should get rewritten with a working UI. exa and fd and their friends are cool, but I'd guess none of them has wasted as much time as cut.


Perfect example of how "do one thing and do it well" is a lie.


> Does `cut -f2` not work?

Most utilities don't use a tab character as separator, and that's what cut operates on by default. You can't cut on whitespace in general, which is what's actually useful and what awk does.

Only way to get cut to work is to add a tr in between, which is a waste of time when awk just does the right thing out of the box.


> which is a waste of time when awk just does the right thing out of the box.

Agree in general. Only exception I'd make to this is when you're selecting a range of columns, as someone else mentioned elsewhere in the thread. I typically find (for example) `| sed -e 's/ \+/\t/g' | cut -f 1,3-10,14-17` to be both easier to type and easier to debug than typing out all the columns explicitly in an awk statement.


Instead of piping to sed, I would simply use

  | tr -s ' ' '\t'


"Does `cut -f2` not work?"

As others have pointed out, no. It should! (Said the guy sitting comfortably in front of his supercomputer cluster in 2020. No, I don't do HPC or anything; everything's a supercomputer by the standards of the time cut was written.) But it doesn't. Going out on a limb, it's just too old. Cut comes from a world of fixed-length fields. Arguably it's not really a "unix" tool in that sense.

"highly recommend picking up some of the basics"

I have, that's the other 10%. I've done non-trivial things with it... well... non-trivial by "what I've typed on the shell" standards, not non-trivial by "this is a program" standards.


Not if the columns are separated by variable number of spaces. By default, the delimiter is 1 tab. You can change it to 1 space, but not more and not a variable number.

In my experience, most column based output uses variable number of spaces for alignment purposes. Tabs can work for alignment, but they break when you need more than 8 spaces for alignment.


The Internet Correction Squad would like to remind you that 1) they are different programs that do different things, 2) if they changed over time, they wouldn't be portable, 3) if all you use awk for is '{print $2}', that is perfectly fine.

You can submit a new feature request/patch to GNU coreutils' cut, but they'll probably just tell you to use awk.

Edit: Nevermind, it's already a rejected feature request: https://lists.gnu.org/archive/html/bug-coreutils/2009-09/msg... (from https://www.gnu.org/software/coreutils/rejected_requests.htm...)


One of the bad things about having an ultra-stable core of GNU utils is that they've largely ossified over time. Even truly useful improvements can often no longer get in.

It's a sharp and not-entirely-welcome change from the 80s and 90s.

Here's another that would be great but will never be added: I want bash's "wait" to take an integer argument, causing it to wait until only that number (at most) of background processes are still running. That would make it almost trivial to write small shell scripts that could easily utilize my available CPU cores.
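A rough approximation that works today with bash 4.3+'s `wait -n` (not the built-in counter the parent wishes for; the gzip loop is just a placeholder workload):

    MAX=4                                  # cap on concurrent background jobs
    for f in *.log; do
        while [ "$(jobs -pr | wc -l)" -ge "$MAX" ]; do
            wait -n                        # block until any one background job exits
        done
        gzip "$f" &                        # placeholder per-file work
    done
    wait                                   # wait for the stragglers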


> I don't know of any easier equivalent of "awk '{ print $2 }'" for what it does.

I'm not sure if you refer specifically to cut, but Perl has something similar and approximately as terse:

> echo 'a b c' | perl -lane 'print $F[1]'

Also, Perl can slice arrays, which is something that I really miss in Awk.


PERL is bloatware by comparison and less likely to be installed on distros than AWK (e.g., embedded or slim distros; that's why you rarely see nonstandard /bin execs in shell scripts).


Perl used to be part of most distros, but I think favor shifted to Python a few years ago.

I wouldn't call it bloat, but yes, it is much bigger. At the time you had C (really fast, but cumbersome) and Awk/Bash (good prototyping tools, but not good for large codebases). Perl was the perfect answer: something that is fairly fast, relatively easy to develop in, and easier to write full-sized codebases in.


Larry Wall referred to the old dichotomy of the “manipulexity” of C for reaching into low-level system details versus the “whipuptitude” of shell, awk, sed, and friends. He targeted Perl for the sweet spot in the unfilled gap between them.


Thanks for explaining it better than me!


The awk on an embedded system is most likely a non-mainstream awk implementation with fewer or different features.


Proof? Or are you just guessing?


Can confirm. GNU awk is GPLv3, which means it can't be legally included on any system that prevents a user from modifying the installed firmware. This is a result of GPLv3's "Installation Instructions" requirement.

Every commercial embedded Linux product that I've seen uses Busybox (or maybe Toybox) to provide the coreutils. If awk is available on a system like that, it's almost certainly Busybox awk.

And Busybox awk is fine for a lot of things. But it's definitely different than GNU awk, and it's not 100% compatible in all cases.


That rule only applies if the manufacturer has the power to install a modified version after sale. If the embedded Linux product is unmodifiable, with no update capability, then you do not need to provide Installation Instructions under GPLv3.

The point of the license condition is that once a device has been sold the new owner should have as much control as the developer to change that specific device. If no one can update it then the condition is fulfilled.


Thanks. I forgot GPL is the touch of death in many cases due to how it infects entire codebases.

I can't edit my OP but I'm already downvoted so that will suffice.


Specifically GPLv3 is the sticking point - not the GPL in general. GPLv2 is a great license, and I use it for a lot of tools that I write. That's the license that the Linux kernel uses.

GPLv3 (which was written in 2007) has much tougher restrictions. It's the license for most of the GNU packages now, and GPLv3 packages are impractical to include in any firmware that also comes with secret sauce. So most of us in the embedded space have ditched the GNU tools in our production firmware (even if they're still used to _compile_ that firmware).


That's not an entirely accurate understanding of the GPLv3 "anti-tivoisation" restrictions. The restrictions boil down to "if you distribute hardware that has GPLv3 code on it, you must provide the source code (as with GPLv2) and a way for users to replace the GPLv3 code -- but only if it is possible for the vendor to do such a replacement". There's no requirement to relicense other code to GPLv3 -- if there were then GPLv3 wouldn't be an OSI-approved license.

It should be noted that GPLv2 actually had a (much weaker) version of the same restriction:

> For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. [emphasis added]

(Scripts in this context doesn't mean "shell scripts" necessarily, but more like instructions -- like the script of a play.)

So it's not really a surprise that (when the above clause was found to not solve the problem of unmodifiable GPL code) the restrictions were expanded. The GPLv3 also has a bunch of other improvements (such as better ways of resolving accidental infringement, and a patents clause to make it compatible with Apache-2.0).


I do appreciate the intention behind GPLv3. And it does has a lot of good updates over GPLv2.

The reason why I said it's impractical to include GPLv3 code in a system that also has secret sauce (maybe a special control loop or some audio plugins) is more about sauce protection.

If somebody has access to replace GPLv3 components with their own versions, then they effectively have the ability to read from the filesystem, and to control what's in it (at least partially).

So if I had parts that I wanted to keep secret and/or unmodifiable (maybe downloaded items from an app store), I'd have to find some way to separate out anything that's GPLv3 (and also probably constrain the GPLv3 binaries with cgroups to protect against the binaries being replaced with something nefarious). Or I'd have to avoid GPLv3 code in my product. Not because it requires me to release non-GPL code, but more because it requires me to provide write access to the filesystem.

And I guess that maybe GPLv3 is working as intended there. Not my place to judge if the license restrictions are good or bad. But it does mean that GPLv3 code can't easily be shipped on products that also have files which the developer wants to keep as a trade secret (or files that are pirateable). With the end result that most GNU packages become off-limits to a lot of embedded systems developers.


I will post the code fragment if I can find it (this was 10 years ago). I had a tiny awk script on an embedded system (busybox) to construct MAC addresses. There was some basic arithmetic involved and I couldn't quite figure out how to do it with a busybox shell script. The awk script didn't work at all on my Linux desktop.


Even assuming the odd "bloatware" characterization, this is irrelevant. From the article's point of view of "simple tasks", bloat or not doesn't matter; what matters is the language syntax and features used to accomplish a task (and I'd add consistency across platforms).

Regarding slim/embedded distros, it depends on the use cases, and the definition of "slim". It's hard to make broad statements on their prevalence, and regardless, I've never stated that one should use Perl instead and/or that it's "better"; only stated that the option it gives is a valid one.


Do you have data on the relative sizes of the Perl and awk install bases?

Perl is the language, and perl is the implementation. Spelling it with ALL CAPS announces that someone knows little about the language.


Unfortunately, for large files perl is significantly faster than awk. I was working on some very large files doing some splitting, and perl was over an order of magnitude faster.


A tool that is stable, well supported, thoroughly tested, has outstanding documentation, won't capriciously break your code, and outperforms the rest of the pack is not an unfortunate case.


To be clearer, the unfortunate part was due to the title of this article. I think awk is great, but if you know Perl well enough it can easily replace it and be much more versatile.


> PERL is bloatware

never heard that before


There is a first time for everything, and it's true: Perl is mega bloatware, especially when compared to AWK.


That is closest, yes. I'd say it's clearly a couple more things to remember, but if I can get it into my fingers it will be just as fluid. Awk's invocation has its own issues around needing to escape the things the shell thinks it owns, too, not that it's at all unique about that.


If you're golfing, there's also

    echo a b c | perl -pale '$_=$F[1]'


Perl stole array slicing from AWK's split() function... which slices arrays.


I define aliases c1, c2, c3, c4, etc. in my .bashrc as "awk '{print $1}'" etc.

But it's nice to have awk for the slightly more complicated cases, up until it's easier to use Python or another language.
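A sketch of what those aliases might look like in .bashrc (the names follow the comment above; the bodies are assumed):

    alias c1="awk '{print \$1}'"
    alias c2="awk '{print \$2}'"
    alias c3="awk '{print \$3}'"
    alias c4="awk '{print \$4}'"
    # usage: ls -l | c1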


I don't suppose your dotfiles are available anywhere. I'm just wondering what other useful things I can steal ;)


I don't have my dotfile here (on my phone) but here's some ideas from things I've aliased that I use a lot:

cdd : like cd but can take a file path

- : cd - (like 1-level undo for cd)

.. : cd ..

... : cd ../.. (and so on up to '.......')

nd : like cd but enter the dir too

udd : go up (cd ..) and then rmdir

xd : search for arg in parent dirs, then go as deep as possible on the other branch (like super-lazy cd, actually calls a python script).

ai : edit alias file and then source it

Also I set a bindkey (F5 I think) that types out | awk '{print}' followed by the left key twice so that the cursor is in the correct spot to start awk'in ;D

# Bind F5 to type out awk boilerplate and position the cursor

bindkey -c -s "^[[[E" " | awk '{print }'^[[D^[[D"

Edit: better formatting (and at work now so pasted the bindkey)


Sorry, not much else going on in my dot file except stuff that is peculiar to my current environment.


awk '{print $1}' can also be written as awk '$0=$1'
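One caveat with that trick (not mentioned above): the pattern is the value of the assignment, so lines whose first field evaluates as false -- "0" or empty -- get dropped:

    $ printf 'a b\n0 c\n' | awk '$0=$1'
    a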


And awk doesn't offer cut's column range selection ;)


Absolutely. Everything comes with costs & benefits. But I'm not sure I've, in my entire 23-year professional programming career, ever encountered a fixed-width text format in the wild. I've used cut even so for places where by coincidence the first couple of columns happened to be the same size, but that's really a hack.

Obviously, other people have different experiences, which is why I qualify it so. (I only narrowly missed it at the beginning, but I started in webdev, and we never quite took direct feeds from the mainframes.) But I don't think it's too crazy to say UNIX shell, to the extent it can be pinned down at all, works most natively with variable-length line-by-line content.


Some topics where you will most definitely come across fixed-width formats:

- processing older system printouts that are saved to text files

- banking industry formats for payment files and statements

- industrial machine control

and my favourite... source code.

My first intro to awk was using it to process COBOL code to find redundant copy libs, consolidate code, and generally clean up code (very, very crude linting). And it was brilliant. Fast, logical, readable, reliable - it was everything I needed.

It is also an eminently suitable language for teaching programming because it introduces the basic concept of SETUP - LOOP - END, which is exactly what one will find in most business systems; you find it in Arduino sketches; hell, you even find it in a browser, which is basically just a whole universe of stuff sitting atop a very fast loop that looks for events.

AWK fan for sure - my hierarchy of languages these days would be the command line where there is a specific command doing all I need, AWK for anything needing to slice and dice records that don't contain blobs, Python for complete programs, and Python+Nuitka or Lazarus or C# when I need speed and/or cross-platform.


> But I'm not sure I've, in my entire 23-year professional programming career, ever encountered a fixed-width text format in the wild.

SAP comes to mind. I think it does support various different formats, but for one reason or another fixed-width seemed to be some kind of default (that's what I usually got when I asked for an SAP feed, at least, but that was years ago).


I can confirm that they are not a common problem.

Admittedly I have encountered fixed-width text formats in the wild. But the last such occasion was about 15 years ago. (It was for interacting with a credit card processor to issue reward cards.)


Within my first year of professional development, I encountered several fixed-width files I needed to read and write. I suppose exposure depends a lot on the specific industry.


Also big mainframe users (banks, insurance) often send fixed width data to us.


Several scientific data formats in my industry have fixed-width columns that trace back to the era of punch cards.


I'm an expert on neither AWK nor cut, but AWK allows you to select a field, or the entire line, and then substrings within those.

Select characters 3-6 (inclusive) in the second field:

$ echo Testing LengthyString | awk '{print substr($2,3,4)}'

> ngth

If you want to select columns from the entire line, then:

$ echo Testing LengthyString | awk '{print substr($0,3,4)}'

> stin

Is that what you meant?


I don't think so. I think they're referring to cut's ability to select an arbitrary range of columns, e.g. `cut -f 2-7` to select the 2nd through 7th columns, while awk requires you to explicitly enumerate all desired columns, i.e. `awk '{print $2, $3, $4, $5, $6, $7}'`


> "while awk requires you to explicitly enumerate all desired columns"

Or you can use loops, e.g.,

  echo $(seq 100) | awk '{ for (i = 2; i <= 7; i++) { print $i; }; }'


It's more characters than the manual way. There should just be a built-in function.


What cauthon said: cut lets you select a range of fields with a simple option, awk doesn't, e.g. (CSV line):

  cut -d, -f10-30
(selects from field 10 to 30)

Not saying this can't be written in awk with more code, but we were talking about field selection ergonomics.


OK, understood. And yes, in my experience AWK is less good at that, cut would definitely be the right tool.

It doesn't detract from the point at hand - which is perfectly valid - but it's worth noting that there's some confusion here regarding the terminology: "fields" vs "columns". I thought they were referring to "columns of characters" whereas the added explanations[0] are about "columns of fields". That makes a difference.

But as I say, yes, I agree that to select a range of columns of fields, especially several fields from each line, is definitely better with cut.

[0] https://news.ycombinator.com/item?id=22109551


I wrapped awk in a thin shell script, f, so that you can say "f N" to get "awk '{print $N}'" and "f N-M" (where either end point is optional) to do what cut does, except it operates on all blanks.

Repo has a few other shortcuts, too:

https://github.com/kseistrup/filters
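Not the linked implementation -- just a rough sketch of how such a wrapper might look:

    #!/bin/sh
    # f: print field N ("f N") or fields N..M ("f N-M", either end optional),
    # splitting on runs of blanks the way awk does.
    spec="${1:?usage: f N | f N-M}"
    case "$spec" in
      *-*) lo="${spec%-*}"; hi="${spec#*-}" ;;
      *)   lo="$spec";      hi="$spec"      ;;
    esac
    exec awk -v lo="${lo:-1}" -v hi="${hi:-0}" '{
        h = (hi == 0 || hi > NF) ? NF : hi
        out = ""
        for (i = lo; i <= h; i++) out = out (i > lo ? OFS : "") $i
        print out
    }'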


The reason isn't that bizarre; it's a POSIX utility and must conform to the specification, which you can read here:

https://pubs.opengroup.org/onlinepubs/9699919799/utilities/c...


Surely we can infer that jerf would just say that it's bizarre for the POSIX specification to not include any easy way to get columns.


Does POSIX forbid extensions? That would be terrible.


Well... no, it doesn't, obviously, we can take a quick look at gawk to confirm this.

But, just as we can't say "awk offers the -b flag to process characters as bytes", we can't really say that cut offers any extensions not defined in the standard.

An implementation could, sure. I'd prefer that it didn't, writing conformant shell scripts is hard enough.


Before standards happen, creative developers are free to use their imagination and come up with useful features. Then someone makes up a standard, and from then on progress is halted. If people aren't making extensions that could one day make it into the next revision, the only way you grow new features and functionality is through design-by-committee. I think that is ridiculous.

Tools should improve, and standards should eventually catch up & pick up the good parts.

People who need to work with legacy systems can just opt not to use those extensions, but one day they too will benefit. Others benefit immediately.


I find that, for these kinds of utilities, all the extra add-ons tend to cost me more in the form of, "Whoops, we tried using this on $DIFFERENT_UNIX and it broke," than they ever save me.

When I'm looking to do something more complicated, I'd rather look to tools from outside the POSIX standard. The good ones give me less fuss, because they can typically be installed on any POSIX-compliant OS, which is NBD, whereas extensions to POSIX tools tend to create actual portability hassles.



I'm quite sure we just linked to the same document, were you meaning to address the grandparent?


Awk, like shell, is a fraught programming environment, full of silent failure and hidden gotchas. Even for one-liners.

I use shell because I have to, not because I like it. I dread maintaining shell scripts which have a bunch of awk and sed in them.

The Unix ideal of small single-purpose tools and text processing is separable from these old warhorses.


PowerShell is a lot better at this. It was designed for reading by humans, not for saving precious bytes over 300-baud terminal connections.

So for example, selecting columns is easy, and can be done by name.

Here's a useful little snippet demonstrating converting and processing the CSV output of the legacy "whoami" Windows command. It lists the groups a user is a member of without poking domain controllers using LDAP. It always works, including cross-forest scenarios.

    WHOAMI /GROUPS /FO CSV `
    | ConvertFrom-Csv `
    | Where-Object Attributes -NotLike '*used for deny only*' `
    | Select-Object -ExpandProperty 'Group Name'

It looks like a lot of typing, but everything tab-completes and the end result is human readable. I find that people who prefer terseness over verbose syntax are selfish. They simply don't care about the future maintainers of their scripts.

PowerShell can also natively parse JSON and XML into object hierarchies and then select from them. That's difficult in UNIX. The Invoke-RestMethod command is particularly brilliant, because instead of a stream of bytes it returns an object that can be manipulated directly.

    $p = Invoke-RestMethod 'https://ipapi.co/8.8.8.8/json/' 
    echo $p.org
    echo $p.country_name
PowerShell Core is available on Linux and is great for ad-hoc tabular data processing! Give it a go...


PowerShell Core (v6+, open source) is crushing it in this space over Bash/AWK specifically.

I'm a pretty big advocate for how it works in these use cases now.

Generally I still use Bash/sed/AWK for the one-off instances whenever I need something done there and then. But if I'm writing something that's going to be involved in my CI/CD with a high chance of needing to be maintained, I'll nearly always write it in PowerShell.


While I agree that the silent failures and "opaqueness" can be off-putting, once you understand the tool and how to implement it in your workflow it is wonderfully efficient. Aside from being syntactically terse I haven't found any compelling reasons not to use it.


I've achieved a high level of expertise in Perl. Even though I can, I won't write complex Perl one liners; instead, I will write tested, documented utilities which someone who has not achieved such a level of mastery has a shot at grokking, maintaining and debugging.

Even better if I can write such utilities in a more modern programming language like Go or Rust, if the organization has personnel with expertise in such languages.


rectang: "Awk, like shell, is a fraught programming environment, full of silent failure and hidden gotchas. Even for one-liners..."

Also rectang: "Even better if I can write such utilities in a more modern programming language like Go or Rust..."

Thank you for demonstrating how Linux decays due to lack of experience and NIH syndrome.


What if the task is one-off? would you still care to write a proper utility?

As an example, you have some data you need to fix as input to some program. you incrementally try to fix it with perl:

1) run program with data, observe errors, infer what needs fixing ->

2) write perl to fix data, modify data ->

3) repeat from (1) until no errors

You have no expectation you'll ever need to fix data corrupted in the same way ever again.


> Aside from being syntactically terse I haven't found any compelling reasons not to use it.

For me, that's THE reason to use it. Its terseness is what allows it to be efficient enough to be used primarily interactively.


I wholeheartedly agree, but I think that is why a lot of folks get turned off from it.


There is a school of thought in CS that equates terseness to bad programming practices. That is unfortunate. It is possible for software to be terse and well designed, and of course verbose solutions can easily morph into an unmaintainable nightmare.


I wish there were a set of tools like coreutils but like..."less surprising" I guess? Like I spend 98% of my focus-time in algol-likes. 1-indexing drives me batty. I always have to try a few times to craft cut or tail expressions to match [i:j] slice notation.

And anything more complex than that I typically just use python.


scripts are the antithesis of modern software development. No Unit testing, No CI, no monitoring, no consistency, often no source control. Appeals to me as a hacker but only my scripts are intelligible - which is a bad sign. :)


I'm not actually convinced unit testing is all that valuable, unless the unit under test has a very clear input -> output transformation (like algorithms, string utilities, etc). If it doesn't (and most units don't), unit tests just encumber you.


The value of unit tests when creating software is perhaps debatable, but I find their greatest value to be in maintaining software. If you lock in your expectations of a software component, when it's time to make changes (due to shifting requirements or what not), you can add tests capturing your intent and make sure that you aren't breaking existing expectations. Or at least, that you know which expectations your change will break.


I’m really glad to read that.

I never understood the whole religion around unit tests. Integration tests are often far easier to write and far more valuable.

Like you said, unit tests are really nice when testing for a known, expected output.

Unit tests that are effectively testing mocks and crazy stubs because your method has side effects? Not for me.


It is easy to explain: unit tests give a documented proof that you care about code robustness. It is used more for social and psychological effect than for its advantages. In fact, outside a few domains, unit tests make it harder to evolve software because the more tests you write, the harder it is to make changes that move the design in a different direction. This is, by the way, my main problem with verbose techniques of programming: the more you have to code, the harder to make needed changes.


Integration tests can be easier to write, are necessary, but can also be much slower to run. Yes, it’s possible to write bad unit tests; the same is true of integration tests.

You need both unit tests and integration tests.


Yes, I worked in the print production industry and unit testing was almost impossible. To generate a string of text reliably and correctly is not difficult, but to position it on the page respecting other elements, flow, overlap, current branded font and size, and white space - unit testing is basically useless. And these are the errors we faced most.


Every single one of those things can be done when using scripts.


Actually, that is not true.

I have written a tool (also a script) myself that allows you to write unit tests, manage your scripts in git, load scripts in other scripts, etc. Maybe I will post a Show HN in the coming weeks, but at the moment I would like to round up some edges before posting it.

In my experience, the biggest problem is that there are many different runtime environments out there that differ in detail and make it hard to write scripts that run everywhere. But programmers not applying all their skills (e.g. writing tests) to build scripts are also part of the problem.


PowerShell begs to differ.


I gave awk a sincere attempt, but I have to say that it wasn't worth it. As soon as one tries to write anything bigger than a one-liner, the language shows its warts. I found myself writing lots of helper routines for things that should be part of the base language/library, e.g. sets with existence check. I also had to work around portability issues, because awk is not awk, contrary to what this post claims. E.g. matching a part of a line is different among different awks.

And some language decisions are just asinine, e.g. how function parameters and variable scope work, fall-through by default although you almost never want that, etc.

But hey, you have socket support! Sounds to me like things have developed in the wrong direction.

And of course no one on your team will know awk.

I found the idea of rule-based programming interesting, but the way it interacts with delimiters and sections (switching rules when encountering certain lines) doesn't work well in practice.

I also found the performance to be very disappointing when I compared it to C and Python equivalents.


You realize AWK is about 1/100th the size of Python, right? That's like comparing a Leatherman multi-tool to a Craftsman 2000 piece tool set that weighs 1,000 lbs. This matters significantly when addressing compatibility and when building distros that are space constrained.

Awk is there for a reason: to be small. That's why the O'Reilly press book is called "sed & awk": they were originally written to work together in the early days of Unix, dating back to the late '70s. Sed (1974) & Awk (1977) are in the DNA of Unix; Python is something totally different.


First of all I'm not a distro maintainer. I also doubt that people would use awk for seriously space constrained environments. And distros ship both awk and python anyway. And again, I don't understand why they'd support networking but not basic data types/functions.

The only reason I could've seen to use awk was to throw code together more quickly in a DSL.

However this is much less the case than I had hoped. For the one liners there are usually specialized tools like fex that are easier to use and faster (for batch processing).

When I compared my C/python/awk programs the difference was msec/sec/minutes. As soon as I use such a program repeatedly it starts to hurt my productivity. And the development time is not orders of magnitude slower in non-awk languages.


> I also doubt that people would use awk for seriously space constrained environments. And distros ship both awk and python anyway.

Python is absolutely not available everywhere one can find Awk. I've never seen a system with Python but not Awk, but have seen many systems with Awk but not Python (excluding the BSDs, where Python is never in base, anyhow).

Actually, not many years ago I used to claim that I never saw a Linux system with Bash that lacked Perl, but had seen systems with Perl that lacked Bash. (And forget about Python.) This was because most embedded distros use an Ash derivative, often because they used BusyBox for core utilities or a simple Debian install. Perl might not have been default installed, either, but invariably got pulled in as a core dependency for anything sophisticated. Anyhow, the upshot was that you'd be more portable, even within the realm of Linux, with a Perl script than a Bash-reliant shell script. Times have changed, but only in roughly the past 5 years or so. (Nonetheless, IME Perl is still slightly more reliable than Python, but variance is greater, which I guess is a consequence of Docker.)

One thing to keep in mind regarding utility performance is locale support. Most shell utilities rely on libc for locale support, such as I/O translation. Last time I measured, circa 2015, setting LC_ALL=C resulted in significantly improved (2x or better, I forget but am being conservative) standard I/O throughput on glibc systems.[1] I never investigated the reasons. glibc's locale code is a nightmare[2], and that's more than enough explanation for me.

Heavy scripting languages like Perl, Python, Ruby, etc, do most of their locale work internally and, apparently, more efficiently. If you don't care about locale, or are just curious, then set LC_ALL=C in the environment and test again. I set LC_ALL=C in the preamble of all my shell scripts. It makes them faster and, more importantly, has sidestepped countless bugs and gotchas.

For the things I do, and I imagine for the vast majority of things people write shell scripts for, you don't need locale support, or even UTF-8 support. And even if you do care, the rules for how UTF-8 changes the semantics of the environment are complex enough that it's preferable to refactor things so you don't have to care, or can isolate the parts that need to care to a few utility invocations. In practice, system locale work has gone hand-in-hand with making libc and shell utilities 8-bit clean in the C/POSIX locale, which is what most people care about even when they care about locale.
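A quick way to see the effect on a given box (the pattern and file name are placeholders; timings will vary by system and libc):

    $ time LC_ALL=en_US.UTF-8 grep -c 'pattern' big.txt
    $ time LC_ALL=C grep -c 'pattern' big.txt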

[1] The consequence was that my hexdump implementation, http://25thandclement.com/~william/projects/hexdump.c.html, was significantly faster than the wrapper typically available on Linux systems. My implementation did the transformations from a tiny, non-JIT'd virtual machine, while the wrapper, which only supports a small subset of options, did the transformation in pure C code. My code was still faster even compared to LC_ALL=C, which implied glibc's locale architecture has non-negligible costs.

[2] To be fair, it's a nightmare partly because they've had strong locale support for many years, and the implementation has been mostly backward compatible. At least, "strong" and "backward compatible" relative to the BSDs. Solaris is arguably better on both accounts, though I've never looked at their source code. Solaris' implementation was fast, whatever it was doing. musl libc has the benefit of starting last, so they only support the C and UTF-8 locales, and in most places in libc UTF-8 support simply means being 8-bit clean, so zero or perhaps even negative cost.


There was a long period of time where it was easy to find a non-Linux Unix with Perl installed but not Bash: SunOS, Solaris, IRIX, etc., admins would typically install Perl pretty early on, while Bash was more niche. Like, maybe 1990 to 2000. Now we're getting into an era where lots of Unix boxes run MacOS, and although they have Bash, it's a version of only archaeological interest. But they do have Perl.


Most complaints that awk doesn't have have this or that feature ignore the fact that awk is not supposed to be used in isolation. Any substantial use of awk has to be tied to other UNIX utilities. I don't think you can, or should, write a medium to large size script completely in awk, the whole idea is to compose it with one or more UNIX commands.


Honorable mention for Taco Bell Programming (it fits this genre).

http://widgetsandshit.com/teddziuba/2010/10/taco-bell-progra...

Someone ought to write - Zen and the art of Unix tools usage.


(warning, mandatory HN contrarian comment)

"This is the opposite of a trend of nonsense called DevOps, where system administrators start writing unit tests and other things to help the developers warm up to them - Taco Bell Programming is about developers knowing enough about Ops (and Unix in general) so that they don't overthink things, and arrive at simple, scalable solutions"

It's not possible for developers to know enough about Ops, just as it's not possible for Ops to know enough about development, because they are different jobs. Moreover, devs are doomed to create terrible solutions because of their job, and Ops are doomed to create kludgy hacks for those terrible solutions because of their job. DevOps is just an attempt to get them to talk to each other frequently so that horrible shit doesn't happen as frequently.

Also, the real Taco Bell programming is actually to use only wget, no xargs. It takes a whole lot of basically every option Wget has, and a very reliable machine with a lot of RAM, but you can crawl millions of pages with just that tool. xargs and find make it worse because you don't get the implicit caching features of the one giant threaded process, so you waste tons of time and disk space re-getting the same pages, re-looking up the same hostnames, etc. (And that's Ops knowledge...)

The Zen of Unix is to try to move towards not using the computer at all. One-liners are part of that path, but so is minimizing the one-liner. http://www.catb.org/~esr/writings/unix-koans/ten-thousand.ht...


I don't understand this "devs don't understand ops" nonsense.

> so you waste tons of time and disk space re-getting the same pages, re-looking up the same hostnames, etc. (And that's Ops knowledge...)

If it's my job to write a web scraper, it's absolutely my job to think about/solve this problem.

Is this a new trend thing?


The difference between dev & Ops is an Italian grandma vs a restaurant chef. It's different experience that gives you different knowledge and a different skillset.


Simple, Composable Pieces is practically the whole ethos of both Unix and Functional Programmers, and they both converged on the same basic flow model: Pipelines with minimal side-effects. This naturally leads to a sequential concurrency, which seems to be easy for humans to reason about: Data flows through the pipeline, different parts of the pipeline can all be active at once, and nobody loses track. It doesn't solve absolutely every possible problem, but the right group of pieces (utility programs, functions) will solve a surprising number of them without much trouble.


Works until time plays a role in the computation you're doing.


>> suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it?

I don't know enough about the 'real way' or the 'taco bell way', but interested to know --- is this doable the way Ted describes in the article via xargs and wget?


Yes, absolutely. This is absolutely how ~~we~~ many (most?) of us used to scrape web pages in the Dark Ages.


I would assume a combination of

- sed/awk to extract URLs, one by line

- xargs and wget to download each page from the previous output
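Roughly, something like this (a sketch; the URL regex, file names, and flags are assumptions, not from the article):

    grep -Eho 'https?://[^"<> ]+' pages/*.html | sort -u > urls.txt
    xargs -n 1 -P 8 wget -q -x < urls.txt    # 8 parallel fetches; -x mirrors the URL paths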


Skip learning sed and awk and jump straight to perl instead.

  $ perl --help
  ...
  -F/pattern/       split() pattern for -a switch (//'s are optional)
  -l[octal]         enable line ending processing, specifies line terminator
  -a                autosplit mode with -n or -p (splits $_ into @F)

  -n                assume "while (<>) { ... }" loop around program
  -p                assume loop like -n but print line also, like sed
  -e program        one line of program (several -e's allowed, omit programfile)

Example. List file name and size:

  ls -l | perl -ae 'print "@F[8..$#F], $F[4]"'


Because syntactically, as a language/tool, it is super easy to remember. Writing one-liners with awk feels more intuitive to me.

Awk example:

ls -l | awk '{print $9, $5}' or ls -lh | awk '{print $9, $5}'

Seems a whole lot simpler. To me. I find if you have to write exhaustive shell scripts then maybe you can look for something more verbose like Perl, I guess.


Yep, but you have a bug in your awk one-liner.


If you mean the lack of quotations, then the behavior is well-defined and is presumably what was intended. Per POSIX,

> The print statement shall write the value of each expression argument onto the indicated output stream separated by the current output field separator (see variable OFS above), and terminated by the output record separator (see variable ORS above).

The default value for OFS is <space> and for ORS, <newline>.
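So, for example, setting OFS changes what the comma in print emits:

    $ echo 'a b' | awk 'BEGIN { OFS = "," } { print $1, $2 }'
    a,b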


> If you mean the lack of quotations,

No, lack of commas in output and broken filenames with spaces.


I see your point regarding spaces in names. Suppose I could use FILENAME. But I think my point was made.


In my defense, I did this fairly quickly (which was the point) and was not trying to illustrate proper syntax (I mean, it does run and does produce an output).

ls -l | awk '{print $9 "\t" $5}'

That is about as much as i'm willing to do for this.


Also, because there's a whole culture of one-liners in Perl, you can also conveniently import libraries and call them:

Even though there's frequently value in adding whitespace to programs, many of them are just fine as one liners :)

e.g. this gets the title of a webpage for you:

    $ perl -Mojo -E 'say g("mojolicious.org")->dom->at("title")->text'


I don't know much Perl but 10+ years ago I read "Minimal Perl". For these purposes, I think it can act as the go-to tool.


My favorite Awk reference is this:

https://ferd.ca/awk-in-20-minutes.html

Also, a handy trick is to combine awk and cut. For example I had a log line that had a variable amount of columns just in one field, but immediately after the field was a comma. I cut based on the comma:

cut -d, -f1,2

and then awk'd the last column:

cut -d, -f1,2 | awk '{ print $2" "$5" "$NF }'

So, sometimes awk and cut can help each other.


I've been procrastinating learning awk for a while now. Thanks for this, I read it and it's just what I needed.


Another good resource is "Why you should learn just a little Awk: An Awk tutorial by Example"[0]

[0] https://gregable.com/2010/09/why-you-should-know-just-little...


I've been parsing some documents converted from PDF (using the Poppler library's "pdftotext" command with the "--layout" option).

I found that reading these -- sort-of half-assed structured data, but with page-chunked artefacts and idiosyncrasies -- was difficult on a line-by-line basis, and thought idly "this would be a lot easier if I could process by page instead".

Text was laid out in columns, and the amount of indenting (the whitespace between columns) was significant. So preserving this somehow would be Very Useful.

Suddenly those pesky '^L' formfeeds were an asset, not a liability. Let's treat the formfeed ("\f") as a record delimiter, and the newline ("\n") as a field delimiter. We can parse out the actual columns based on whitespace, for each line:

    BEGIN { RS="\f"; FS="\n" }
    {
        pageno = NR
        lines = NF
        for( line=1; line<=lines; line++ ) {
            ncols = split( $line, columns, " {2,}", gaps )
        }
    }
This gives me:

- The running tally of pages.

- Each line of the page as an individual record.

- Via the split() function, an array of columns separated by two or more spaces, which are saved as an array of gaps so I have the whitespace to play with.

Edge cases and fiddling ensue, but that's the essential bit of the code there.

Since the lines are an array, I can roll back and forth through the page (basically being able to read forward and backwards through the text record), testing values, finding out where column boundaries are, etc., and then output a page's worth of content, transposing to a single-column format, with appropriate whitespacing, when done.

In testing and debugging the output (working off of 20+ documents of 100s to ~1,000 pages), a lot of test cases, scaffolding, diagnostics, etc., have been created and removed to make sure the Right Things are happening. Easy with awk.


And to be clear: After realising this ... and knowing what to look for ... I found this documented in the GNU Awk User's guide:

https://www.gnu.org/software/gawk/manual/html_node/Multiple-...


I love awk, but I'm still waiting for the structural regex awk: http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf


I just learned about SE a month or so ago, and it is indeed pretty awesome. I tried out the "vis" editor, using SE to create multiple cursors that I then manipulated with vi-like commands. That was a pretty sweet use case.


Awk is a command I turn to time and again. For me it's the single most valuable command for enabling the piped single-purpose pattern.

As an example, if I want a sorted list of all open files under the home directories on CentOs I can do this:

lsof | awk '{ print $10 }' | grep ^/home/ | sort | uniq


Don't stop there. You've solved a real problem in your life, and you might want that information another day. Make a tiny script that encapsulates it. Generalize it a tiny bit, and give it a memorable name (perhaps lsof-tree). That done, you can stop worrying about the mechanics of the solution and build on it.

  #! /bin/bash                                                                                                           
  # lsof-tree: list open files in a given directory tree (default /home)                                                 
                                                                                                                       
  NAME=9 # set to 10 for CentOS                                                                                          
  BASE="${1:-/home}"                                                                                                     
                                                                                                                       
  lsof | awk -v NAME=$NAME '{print $NAME}' | grep "^$BASE" | sort -u


YMMV, but I find it easier and faster to know the basic utilities and how to compose them with pipes than remembering the name of a zillion such scripts.

I guess it depends on how often you need that particular pipeline. Every day? Sure, make a script. Every few months? Nah, I won't remember it anyway, or probably I'll remember that I've made a script like that, but then I have to start searching my bin directory, in the end using more time than just writing the pipeline in the first place.


Spot on. My ~/bin has a large number of such little scripts, and my current project has a 'scripts' repository for the same reason.


you can save one command by using sort -u which eliminates dupes in the sort, removing the need for uniq


Tnx, that one I didn't know...

You can drop another command as such:

   lsof |awk '($10 ~ /^\/home\//) {print $10}' |sort -u


It is fast, robust, and frequently far more performant than a lot of modern tools that can be overkill for most data manipulation. I use it all the time in our ETL processes and it always works as advertised.


Perl is much faster[0], has many more features and a bunch of ready-to-use libraries, comes with a package manager (CPAN), and has syntax similar to awk's.

Why do you use awk?

[0]: http://rc3.org/2014/08/28/surprisingly-perl-outperforms-sed-...


I use it for a couple of reasons: one, it is installed as a base app on almost every single *nix implementation on the planet, so you can count on having it even in ancient or restrictive environments (which I work in frequently); two, awk is frequently fast enough for most needs, and generally far faster than a number of off-the-shelf "modern" tools. The first reason is the one that generally leads me to its use; its ubiquity and power make it a compelling tool.


Perl was my second programming love, but awk is much shorter and easier to remember for the simple cases where I need it.

Remembering which Perl command-line arguments simulate awk’s line-by-line processing is harder than just remembering awk.


I use Perl similarly to awk if I need to use a regex rather than whitespace-delimited fields.

I think if you know Perl really well and can remember the command-line arguments - particularly -E, -n, -l and -p - then it's a good drop-in substitute for grep, sed, awk, cut, bash, etc. when whatever 5-minute task you're working on gets a tiny bit more complex.
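
Roughly, the awk idioms in this thread map onto Perl one-liners like this (a sketch; the filenames are made up, and note that Perl's @F array is zero-indexed):

  # print the 2nd whitespace-separated field of lines matching a pattern
  awk '/error/ { print $2 }' app.log
  perl -lane 'print $F[1] if /error/' app.log

  # same idea with a custom field separator
  awk -F: '{ print $1 }' /etc/passwd
  perl -F: -lane 'print $F[0]' /etc/passwd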

Similarly, a decent version of Perl 5 seems to be installed everywhere by default.

I’m curious to know if anyone would say the same about python or any other programs? I’m not particularly strong in short python scripting.


I would say Perl’s native support for regular expressions makes it more useful on the CLI than Python, but Python is also very low on my preferred languages list.

I do, however, use it for JSON pretty printing in a pipeline: python -mjson.tool IIRC.
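
For example, with any JSON on stdin:

  echo '{"name": "awk", "year": 1977}' | python -m json.tool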


Because you can learn awk in 1/10 of the time it takes to learn Perl.


If you don't need GNU extensions, I've found mawk to be 4x faster than gawk on some scripts.


Last week I threw out AWK and replaced it with Ruby (could've been Python, Perl or even PHP).

Because AWK is not suited for CSV. Please prove me wrong!

I had to parse 9 million lines. Some of them contain "quoted records"; others, in the same column, are unquoted. Some contain commas in the fields; most don't. CSV is like that: more like a guideline than an actual standard.

Two hours of googling and hacking later, I gave up and rewrote the importer in Ruby, in under 5 minutes.

Lesson learned: I'll steer clear of AWK when I know a one-liner of Ruby (or Python) can solve it just as well, because I know for certain the latter can deal with all the edge cases that will certainly pop up.


> I had to parse 9 million lines.

Awk would chew through that no problem.

> Some of them contain "quoted records"; others, in the same column, are unquoted.

In which case, there is the FPAT variable, which can be used to define what a field is. FPAT = "\"[^\"]*\"|[^,]*", which means "stuff between quotes, or runs of anything that isn't a comma", would probably have worked for you.
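
Here is that pattern in action (it needs gawk, since FPAT is a gawk extension; the sample record is made up):

  echo 'alice,"1 Main St, Springfield",42' | gawk 'BEGIN { FPAT = "\"[^\"]*\"|[^,]*" } { print $2 }'
  # prints the quoted field intact: "1 Main St, Springfield"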

> Some contain commas in the fields; most don't. CSV is like that: more like a guideline than an actual standard.

Well, I would say that's absolutely false. You can't just put the delimiter wherever you fancy and call it a well-formed file. Quoting exists for the unfortunate cases where your data includes the delimiting character (ideally the author would have the sense to use a more suitable character, like a tab).

This is just a retort to prevent your post from dissuading readers from awk, which is a fantastic tool. If you actually sit down for half an hour and learn it, rather than googling to cobble together code that works, it is wonderful. I also don't think it is valid to base your judgement of a tool on what was apparently garbage data.


Garbage and poorly specified csv files are a fact of life and people have to deal with them all the time.

But if you want to live in a world where people only deal with well-specified files like RFC 4180 (for some definition of well specified), your quick field pattern doesn't conform either. It doesn't handle escaped double quotes or quoted line breaks. If you're using your quick awk command to transform an RFC 4180 file into another RFC 4180 file, you've just puked out the sort of garbage you were railing against.
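
To make that concrete: RFC 4180 escapes an embedded quote by doubling it, and a field that combines doubled quotes with an internal comma defeats the simple FPAT above (made-up sample; exact splitting may vary between awk versions):

  echo 'a,"he said ""hi"", then left",c' | gawk 'BEGIN { FPAT = "\"[^\"]*\"|[^,]*" } { print NF }'
  # the quoted field gets broken at its inner comma, so this reports 4
  # fields where a CSV-aware parser would report 3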

While awk is a great tool if you're dealing with a CSV format with a predictable specification, and probably could be made to bend to the GP's will with a little more knowledge, it gets trickier if you're handling some of the garbage that comes up in the real world. What's worse is that the programming model leads you down the path of never validating your assumptions and silently failing.

I love awk for interactive sessions when I can manually sanity check the output. But if I’m writing something mildly complex that has to work in a batch on input I’ve never seen, I too would reach for ruby.


I had the exact same experience, but sub python for Ruby.

The lesson I took wasn't that awk sucks, though. The lesson was that CSV is not trivial, and should not be parsed with regex or string matching. It's a standard with variants, and rolling in a library will pay dividends, especially if you're parsing a wide variety of different dialects of CSV.

A related lesson I took is that once your awk script grows beyond a certain level, graduate it up to a real language. I love awk, but it excels at small scale text munging. It's not suited to anything more involved than that. If translating an awk program is a major task, then the program was already too big to begin with.


The author of ripgrep (Andrew "burntsushi" Gallant) has two tools for working with CSV data: 'xsv' (a command-line Swiss Army knife for CSV) and a CSV parsing library. Check those out: https://blog.burntsushi.net/projects/
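
A rough flavour of xsv (the file and column names here are hypothetical):

  xsv headers contacts.csv                          # list the column names
  xsv select name,email contacts.csv | xsv table    # pick columns, align for reading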


xsv is awesome!


If you’re doing a lot of csv ops on the cli, you might like https://csvkit.readthedocs.io/en/latest/

(It’s Python, and you can use it as a library as well)


Sure, for some things awk is a good fit, for others it isn't.

But if your file is correct csv, and you use gawk, this does the trick: https://www.gnu.org/software/gawk/manual/html_node/Splitting...


CSV is not really a standard; various programs have their own interpretation (even MS Excel allows a header line before the data that contains specifications for the file, such as the delimiter, which you probably won't account for in a home-grown solution).

So use a dedicated tool or library or you will run into trouble one day.

For one-off jobs, I open it in Excel or similar and save it as tab- or pipe-delimited text. This usually plays much more nicely with command-line utilities, assuming it didn't mangle any numbers.


I wrote this for parsing CSV in Awk: http://yumegakanau.org/code/awk-csv/ It may not handle all the CSV out there, but it does handle the cases you mention (quoting and commas).

I usually load CSV data into PostgreSQL to do anything with it; mostly wrote this Awk library for fun. So I'm not going to argue that Awk is the best language for doing this kind of thing, but it is possible.


To simplify working with CSV data using command line tools, I wrote csvquote ( https://github.com/dbro/csvquote ). There are some examples on that page that show how it works with awk, cut, sed, etc.
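
The usual pattern, roughly (see the README for details), is to sandwich the ordinary tools between an encoding pass and a decoding pass:

  # csvquote temporarily replaces separators hidden inside quoted fields,
  # so plain awk sees one field per comma; -u undoes the substitution
  csvquote data.csv | awk -F, '{ print $3 }' | csvquote -u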


That's very clever.


I'm just starting to learn awk, would you mind sharing some of the issues / edge cases that you ran into that you found it didn't handle well? Was it maybe just that you found it tricky to write the regular expressions?


Parenthetically, since there are a bunch of UNIX greybeards on HN: if anyone has artwork of the AWK t-shirt I will happily pay any reasonable price. The shirt has a parachute-wearing bird about to jump from an airplane and is captioned with AWK's most famous error message: awk: bailing out near line 1.

Contact information is hn handle @ yahoo.com.


Here are some clever one-liners for awk [1] Please be sure to add your own.

[1] - https://www.commandlinefu.com/commands/matching/awk/YXdr/sor...


I have a repo dedicated to some of the CLI text-processing tools like grep/sed/awk/perl/sort/etc. Here's my one-liner collection for awk [1]

[1] https://github.com/learnbyexample/Command-line-text-processi...


That is a very well written set of examples. Clear and concise. Thank you!


Wow... a great resource! It's very well written and consistent.


reminds me of the old Knuth vs McIlroy word count battle


Hmm. I'm sure this question will induce a flamewar of practical "#NeverAwk"-ers fighting toolbelt bloat, versus tech-hoarding AWK apologists arguing against throwing something out given it fills <niche>.

Here's the thing: these arguments all too commonly focus on subjective notions of "simplicity" and toy examples divorced from actual common practice, rather than solid, comparable benchmarks.

Show me a range of practical examples, for each competing env (awk, sed, perl, python, ruby, maybe bash).

Include:

- time it takes to teach a total novice (the time it takes to learn whatever is needed for that example, not the entire language)

- how easy it is to recall said knowledge at a later date

- how fast the example runs, multiplied by how often you are likely to use it = actual time saved in execution. For small, fast tasks the difference is irrelevant; a 10x speedup that is 0.1s vs 0.01s is meaningless.

- how extendable an example is. Hence the original example should include a series of extensions to the original task, to demonstrate how flexible / composable they are: e.g. task 1) count lines in a file; task 2) count lines in a file, then add 42 to it;

I suspect awk falls behind in practicality vs perl (which can do simple one-liners, but also more complicated constructs), but perhaps has a hidden virtue wrt speed in more expensive tasks, ala https://news.ycombinator.com/item?id=17135841 or https://news.ycombinator.com/item?id=20293579
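
For what it's worth, the two toy tasks above are about as small as awk gets (the file name is arbitrary):

  awk 'END { print NR }' file.txt        # task 1: count lines
  awk 'END { print NR + 42 }' file.txt   # task 2: count lines, then add 42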


> Here's the thing: these arguments all too commonly focus on subjective notions of "simplicity", and toy examples divorced from actual common practise, and or solid comparable benchmarks.

I think you're missing the point of awk. The O'Reilly sed & awk book has some complex examples, but when I look at my own usage, it's all toy examples within a much larger scope. It's more like a special DSL extension for my shell than something I'd pick to build an entire solution, so a comparison to Perl, Python and Ruby doesn't really make sense; they are general-purpose languages, while awk just has a couple of features that make it a very specialized yet useful one.

As an example, I have a system for importing and parsing log files that is mostly done from a shell script; awk is used in two parts. The first is to transform a structured, easy-to-read file (records separated by '\n\n') into a CSV that is easier for bash to consume; there are probably quite a few other options for this, from tr onwards, and it's done inline. The second is to filter the results down to what I need, so I have scripts like:

  #!/usr/bin/awk -f
  /some common error I don't care about/ { next }  # skip line
  /other common error/ { next }
  /Error/ { print $0 }  # this prints error lines, alternatively:
  /Error/ && !errors[gensub($1, "", "g", $0)]++ { print $0 }  # print each error once, keyed on the line minus its first field (gensub is gawk-specific)
  { next }  # skip everything else
Apart from one line which wasn't in the original, that's something you could teach a total novice in minutes; the /pattern/ { action } syntax is about as simple as programming can be. Execution speed could probably be improved with a purpose-built program, but I suspect the bottleneck would be the spinning disk anyway. I run this over hundreds of MB every few minutes and it's not a problem; when I run it manually it's near instantaneous. I spend longer waiting for the desktop calculator to open these days.
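
Saved as, say, filter-errors.awk (the name is made up) and made executable, it runs like any other filter:

  chmod +x filter-errors.awk
  ./filter-errors.awk big-application.log > interesting-errors.txt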


Can you give an example of when you'd choose AWK over perl?

perl is general purpose, but that doesn't mean it can't be used for one-liners.


I sat down and read through the Awk manual once just to learn it, and was pleasantly surprised at how orthogonal the language is and how much it offers.

The only problem is that it was written by people using ed on PDP computers, and that kind of shows. The primary logic of "filter lines matching X and apply transform Y" is completely natural to someone using an editor like ed, but fairly foreign to modern computer users. Most people aren't going to take the time to learn an obscure command-line tool these days, especially since it comes from the Dennis Ritchie school of "errors are like angry housewives, you know what you did" debugging.


I like to tell the story of when I was doing some genetics data wrangling: I spent three days writing some Perl code and kept failing, then sent one of the researchers an email, he suggested an awk approach, and I turned 3 pages of Perl into an awk one-liner that works just fine. Now, it's probably because I don't know Perl very well, but as an ops type who doesn't have the classical dev/CS education, tools like it are my go-to.


I find Python list/dict comprehensions, zip, range, enumerate etc. and the itertools module very good for such things.

And you have immediate access to so many useful modules (csv, json, xml) and can easily extract code fragments into functions.

You can also execute shell-like commands with subprocess.check_output() without ever worrying again about escaping strings or accidentally splitting them at spaces or whatever.

Clever one-liners are difficult to comprehend. It's better to break them up into a few variable assignments with descriptive, long names without abbreviations.


I agree Python seems to have taken over in this space since then (7+ years ago) and is usually a better tool in bioinformatics; I was just using the tools I knew as an ops dude.


do you know that feeling when a problem is too much for the shell but too little for C? that's where Perl was supposed to fit.

in everyday life however, many small problems are a bit too much for the shell but too little for Perl/Python/whatever.

awk fits very nicely in there.


> Imagine programming without regular expressions.

I live in this future and it's beautiful.

Steps to programming without regular expressions:

1) find a PEG library for your language of choice

There is no step 2.


I put quite a few alternatives to REs in this answer:

https://softwareengineering.stackexchange.com/questions/1949...


Sorry for asking such a dumb question, but what's a PEG library?


Parsing expression grammar;

It is a recursive descent parser with the tiny tweak that productions are ordered (not a set) and short-circuit.

A '90s language you did not have to imagine, which saved you from regex in this way, was (and is) REBOL, with its built-in `parse` DSL.

examples here http://www.rebol.org/search.r?find=parse&form=yes

[0]https://en.wikipedia.org/wiki/Parsing_expression_grammar [1]http://www.rebol.com/


No idea... had to look it up. Parsing Expression Grammar.

https://en.wikipedia.org/wiki/Parsing_expression_grammar


TL;DR: PEG looks to me like simplified lex+yacc

For those not familiar with PEG, like me, here: https://en.wikipedia.org/wiki/Parsing_expression_grammar

I found that fairly abstract; a Python implementation provided more concrete examples: https://github.com/erikrose/parsimonious


So why, for example, Awk instead of, let's say, Perl? The argument of "it's likely to be installed" isn't very compelling.


If you're versed in both, it comes down to taste, though there really are cases where you'll have awk (usually via busybox) but not Perl. OpenWRT comes to mind (just verified it's not present by default, though yes, packages are available).

For a huge number of simple tasks, awk is available and sufficient. It's largely a subset of Perl, so yes, there's some skills overlap, but there are times when awk is both the right tool and the available tool, and knowing it will pay off.


I like Perl regexes more though, especially the shorthand classes. Using \d is a lot neater than [0-9] or [[:digit:]].


I hear Perl may support those ;-)

(Use the right tool for the job.)


Because you often don't need anything else to deal with a surprisingly large class of tasks in IT.


After a decade of writing application software in C family languages, I am now working on a big devops effort in a Linux environment. A lot of the syntax is janky but it's pretty amazing what you can accomplish with a shell and the Linux command line tools.


In the early aughts, fresh out of college, one of my first projects was to take semi-structured text from database blobs and convert it to XML. I quickly realized it was not a task to be fully automated; there were too many edge cases that really required human judgement, because the text was only very loosely semi-structured. I turned to sed, awk, and pico. Sed and awk did 99% of the work and dumped me into pico when they didn't know what to do, and I would resolve the issue. Doing it semi-supervised in this way was 100 times faster than doing it all manually, and 10 times faster than full automation, and probably more accurate.

But it was the ability to string together these types of command line tools that made it possible.


The author forgot one: C-like syntax

I believe this was intentional on the part of the authors.

In the early days of UNIX, I think more of its users knew C. Today, it is probably a much smaller number. However, learning AWK today, IMO, can help someone who also intends to learn C.


"Imagine programming without regular expressions."

Best of all worlds. I wish there were a way to get back the time I spent debugging edge cases of different regex implementations on various OSes.


I have used awk for decades, but I simply stopped and dropped the habit of using it, along with sed and Perl. Nowadays I would rather write a program than a script, just to avoid memorizing all these tricks and hacks and glitches.


It would be so great if awk had a csv mode. For whatever reason (Excel), CSV seems to be the default text format for field oriented data.

Maybe I’m dumb but I’ve never come up with a separator regex that is quite right.


The simple case is:

    FS = ","
or:

    awk -F ',' <program>
If you're working with CSV data that has quoted strings with embedded commas, FPAT is your friend:

    FPAT = "([^,]+)|(\"[^\"]+\")"
See: https://www.gnu.org/software/gawk/manual/gawk.html#Splitting...


Just want to say thanks for this. I haven't had to deal with CSV for years, and now, only a week after you posted this, I needed it.

brew install gawk

good to go

:)


For _actually_ comma separated values, just use awk -F , '...'

Is this not what you mean?

Edit: This comment might be helpful to you: https://news.ycombinator.com/item?id=22110036


The problem is that the fields themselves can contain quotes ('"'), which protect embedded commas. So the standard FS / -F approach doesn't work properly.

It looks like FPAT from your linked article is for gawk. Gawk is great, but it's not everywhere. Still - it's good to know. Thanks!


There used to be a website shared on HN a lot, which was a table of: Task, SED, Awk. It was really useful but I haven’t seen it for years.



Does anyone know how to invoke awk to make sure that the awk code I wrote is POSIX-compliant?


Needs a '(2016)' at the end of the title.


[updated] ty.


Why not just use Emacs for simple scripts and text manipulation? Way easier, lots of one-liners as well, just as expressive.


Too big and slow. Also, I disagree about the expressiveness: (emacs-has-chosen-a-very-longwinded 'way (to (express things))) that would be shorter in languages with more syntax.


Many of us do precisely this. Emacs is like a visual perl.


(2016)


When writing a shell script, its robustness/reliability can be inferred from its scope:

1. Uses shell-only commands (echo, for) - most robust; but things like basename/dirname and regexes vary by shell (sh, bash, zsh, ksh)

2. Uses /bin - might run into missing a binary but not likely, still robust and allows a richer set of tools (e.g., uname, chmod and admin-ish things live in /bin)

3. Uses /usr/bin - runs risk if missing packages, likely not very robust (packages drop things in here, like gzip, yacc, gcc)

4. Uses /usr/local/bin or /opt/local/bin - definitely requires package installs, least robust
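
A minimal sketch of guarding a script against the shakier tiers, using only POSIX shell (the tool name is just an example):

  # fail early if a non-base dependency is missing
  command -v gawk >/dev/null 2>&1 || { echo "gawk is required" >&2; exit 1; }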



