I use awk because there's an almost 100% chance that it's going to be installed on any unix system I can ssh into.
I use awk because I like to visually refine my output incrementally. By combining awk with multiple other basic unix commands and pipes, I can get the data that I want out of the data I have. I'm not writing unit tests or perfect code, I'm using rough tools to do a quick one-off job.
For instance, "mail server x is getting '81126 delayed delivery' from google messages in the logs, find out who is sending those messages".
# get all the lines with the 81126 message. Get the queue IDs, exclude duplicates, save them in a file.
cat maillog.txt | grep 81126 | awk '{print $6}' | sort | uniq | cut -d':' -f1 > queue-ids.txt
# Grep for entries in that file, get the from addresses, exclude duplicates.
cat maillog.txt | grep -F -f queue-ids.txt | grep 'from=<' | awk '{print $7}' | cut -d'<' -f2 | cut -d'>' -f1 | sort | uniq
Each of those 2 one-liners was built up pipe-by-pipe, looking at the output, finding what I needed. It's not pretty, it's not elegant, but it works. I'm sure there's a million ways that a thousand different languages could do this more elegantly, but it's what I know, and it works for me.
I know you’re not asking for awk protips but you can prefix the block with a match condition for processing.
... | grep foo | awk '{print $6}' | ...
becomes
... | awk '/foo/{print $6}' | ...
If you start working this into your awk habits you’ll find delightful little edge cases that you can handle with other expressions before the block (you can, for example, match specific fields).
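For instance (the field numbers and strings here are made-up illustrations):

... | awk '$1 == "ERROR" {print $6}' | ...
... | awk '$3 ~ /foo/ && NF > 5 {print $6}' | ...

The first fires only when the first field is exactly "ERROR"; the second combines a regex test on field 3 with a field-count guard.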
Yikes. The syntax I had was wrong anyway. Should have been
awk 'BEGIN {FS=":"};{print $1}'
One benefit of the FS variable over -F, at least in original awk, is that by using FS the delimiter can be more than one character. I guess that's why I remember FS before I remember -F. More flexible.
That is not how FS is set; it's set with -F. And there is actually no need to use -v; passing variables at the end works consistently across all AWKs and always has:
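For example (a sketch, not the poster's original; /etc/passwd and data.txt are just sample inputs):

awk '{print $1}' FS=':' /etc/passwd
awk 'BEGIN {FS=", "} {print $2}' data.txt

The first passes the FS assignment as a trailing argument, no -v needed; the second sets a multi-character separator in a BEGIN block.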
To pile on :-) you often want -w (match word) flag to grep.
In awk, I couldn't find how to do this. I tried /\bfoo\b/ and /\<foo\>/ but neither worked. I don't know why and don't care enough, which brings me to my major awk irritation ...
It doesn't use extended or perl REs, which makes it quite different to ruby, perl, python, java. Now, according to the man page it does; at least on OSX (man re_format) but as mentioned it didn't work for me.
GNU awk supports \< and \> for start and end of word anchors, which works for GNU grep/sed as well
GNU awk also supports \y which is same as \b as well as \B for opposite (same as GNU grep/sed)
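A quick illustration with GNU awk (sample input made up):

$ printf 'food\nfoo\nfoolish\n' | gawk '/\<foo\>/'
foo
$ printf 'food\nfoo\nfoolish\n' | gawk '/\yfoo\y/'
foo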
Interestingly, there's a difference between the three types of word anchors:
$ # \b matches both start and end of word boundaries
$ # 1st and 3rd line have space as second character
$ echo 'I have 12, he has 2!' | grep -o '\b..\b'
I
12
,
he
2
$ # \< and \> strictly match only start and end word boundaries respectively
$ echo 'I have 12, he has 2!' | grep -o '\<..\>'
12
he
$ # -w ensures there are no word characters around the matching text
$ # same as: grep -oP '(?<!\w)..(?!\w)'
$ echo 'I have 12, he has 2!' | grep -ow '..'
12
he
2!
On the other hand, grep can be far faster for searching alone than awk. I almost always use an initial grep for the string that will most reduce the input to the rest of the pipeline. Later, it feels idiomatic to mix in awk with matches like you suggested
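Something along these lines, say (the file name and field positions are invented for the sake of the example):

grep -F 'status=deferred' maillog.txt | awk '$7 ~ /gmail\.com/ {print $6}' | sort | uniq -c

The cheap fixed-string grep shrinks the input first, then awk does the finer per-field matching.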
Bravo! This is one of the most insightful comments I've read in a long time! I have been using some of these tools for years but I never thought of describing them this way. Now I can think of writing a complex query in relational algebra and translating it into these commands in a very natural way.
Indeed, and with a bit of tuning (e.g., using mawk for most things), one can get quite good performance. [1]
The project also provides a translator from Datalog to bash scripts [2].
Thank you, and thank you (really, not sarcasm) for the new stuff I have to learn about relational algebra. I'm a huge fan of wide/shallow knowledge that allows me to dive into a subject quickly.
It is from relational algebra used in database theory. There is an excerpt from one of the first MOOCs offered here on Lagunitas now.[1] It is pretty intuitive once you get the hang of it.
Its ubiquity and performance open up all kinds of sophisticated data processing on a huge variety of *nix implementations. Whether it's one liners or giant data scrubs, awk is a tool that you can almost always count on having access to, even in the most restrictive or arcane environments.
It's far more elegant and concise than any other scripting language I can think of using to accomplish the same thing.
As the article points out, other languages will have a lot more ceremony around opening and closing the file, breaking the input into fields, initializing variables, etc.
The practical component of any software engineering degree should include a simple course on common Unix tools, covering grep, awk, sed, PCRE, and git.
I wholeheartedly agree. I've seen people agonize for days over results from Splunk that they want to turn into something more user-friendly. 15 minutes of messing around with the basic command line Unix tools has that information in a perfect format for their needs.
This is something I need to bring up to my coworkers, I should write some sort of basic guide to unix tools for them.
> I don't know of any easier equivalent of "awk '{ print $2 }'" for what it does.
Does `cut -f2` not work? My complaint with cut is that you can't reorder columns (e.g. `cut -f3,2` )
Awk is really great for general text munging, more than just column extraction, highly recommend picking up some of the basics
Edit to agree with the commenters below me: If the file isn't delimited with a single character, then cut alone won't cut it. You need to use awk or preprocess with sed in that case. Sorry, didn't realize that's what the parent comment might be getting at.
IIRC, there is an invocation of cut that basically does what I want, but every time I try, I read the manual page for 3 or 4 minutes, craft a dozen non-functional command lines, then type "awk '{ print $6 }'" and move on.
> IIRC, there is an invocation of cut that basically does what I want
I don't think there is, because cut separates fields strictly on one instance of the delimiter. Which sometimes works out, but usually doesn't.
Most of the time, you have to process the input through sed or tr in order to make it suitable for cut.
The most frustrating and asinine part of cut is its behaviour when it has a single field: it keeps printing the input as-is instead of just going off and selecting nothing, or printing a warning, or anything which would bloody well hint let alone tell you something's wrong.
Just try it: `ls -l | cut -f 1` and `ls -l | cut -f 13,25-67` show exactly the same thing, which is `ls -l`.
cut is a personal hell of mine, every time I try to use it I waste my time and end up frustrated. And now I'm realising that really cut is the one utility which should get rewritten with a working UI. exa and fd and their friends are cool, but I'd guess none of them has wasted as much time as cut.
Most utilities don't use a tab character as separator, and that's what cut operates on by default. You can't cut on whitespace in general, which is what's actually useful, and what awk does.
Only way to get cut to work is to add a tr inbetween, which is a waste of time when awk just does the right thing out of the box.
> which is a waste of time when awk just does the right thing out of the box.
Agree in general. Only exception I'd make to this is when you're selecting a range of columns, as someone else mentioned elsewhere in the thread. I typically find (for example) `| sed -e 's/ \+/\t/g' | cut -f 1,3-10,14-17` to be both easier to type and easier to debug than typing out all the columns explicitly in an awk statement.
As others have pointed out, no. It should! (Said the guy sitting comfortably in front of his supercomputer cluster in 2020. No, I don't do HPC or anything; everything's a supercomputer by the standards of when cut was written.) But it doesn't. Going out on a limb, it's just too old. Cut comes from a world of fixed-length fields. Arguably it's not really a "unix" tool in that sense.
"highly recommend picking up some of the basics"
I have, that's the other 10%. I've done non-trivial things with it... well... non-trivial by "what I've typed on the shell" standards, not non-trivial by "this is a program" standards.
Not if the columns are separated by variable number of spaces. By default, the delimiter is 1 tab. You can change it to 1 space, but not more and not a variable number.
In my experience, most column based output uses variable number of spaces for alignment purposes. Tabs can work for alignment, but they break when you need more than 8 spaces for alignment.
The Internet Correction Squad would like to remind you that 1) they are different programs that do different things, 2) if they changed over time, they wouldn't be portable, 3) if all you use awk for is '{print $2}', that is perfectly fine.
You can submit a new feature request/patch to GNU coreutils' cut, but they'll probably just tell you to use awk.
One of the bad things about having an ultra-stable core of GNU utils as that they've largely ossified over time. Even truly useful improvements can often no longer get in.
It's a sharp and not-entirely-welcome change from the 80s and 90s.
Here's another that would be great but will never be added: I want bash's "wait" to take an integer argument, causing it to wait until only that number (at most) of background processes are still running. That would make it almost trivial to write small shell scripts that could easily utilize my available CPU cores.
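In the meantime, a rough workaround sketch using what bash already has (wait -n needs bash 4.3 or newer; the gzip loop is just a stand-in task):

max=4
for f in *.log; do
    while (( $(jobs -rp | wc -l) >= max )); do
        wait -n        # block until any one background job finishes
    done
    gzip "$f" &
done
wait                   # catch the stragglers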
PERL is bloatware by comparison and less likely to be installed on distros than AWK (e.g., embedded or slim distros; that's why you rarely see nonstandard /bin execs in shell scripts).
Perl used to be part of most distros, but I think favor shifted to Python a few years ago.
I wouldn't call it bloat, but yes it is much bigger. At the time you had C (really fast, but cumbersome) and Awk/Bash (good prototyping tools, but not good for large codebases). Perl was the perfect answer to something that is fairly fast, relatively easy to develop in, and easier to write full-sized codebases
Larry Wall referred to the old dichotomy of the “manipulexity” of C for reaching into low-level system details versus the “whipuptitude” of shell, awk, sed, and friends. He targeted Perl for the sweet spot in the unfilled gap between them.
Can confirm. GNU awk is GPLv3, which means it can't be legally included on any system that prevents a user from modifying the installed firmware. This is a result of GPLv3's "Installation Instructions" requirement.
Every commercial embedded Linux product that I've seen uses Busybox (or maybe Toybox) to provide the coreutils. If awk is available on a system like that, it's almost certainly Busybox awk.
And Busybox awk is fine for a lot of things. But it's definitely different than GNU awk, and it's not 100% compatible in all cases.
That rule only applies if the manufacturer has the power to install a modified version after sale. If the embedded Linux product is unmodifiable, with no update capability, then you do not need to provide Installation Instructions under GPLv3.
The point of the license condition is that once a device has been sold the new owner should have as much control as the developer to change that specific device. If no one can update it then the condition is fulfilled.
Specifically GPLv3 is the sticking point - not the GPL in general. GPLv2 is a great license, and I use it for a lot of tools that I write. That's the license that the Linux kernel uses.
GPLv3 (which was written in 2007) has much tougher restrictions. It's the license for most of the GNU packages now, and GPLv3 packages are impractical to include in any firmware that also comes with secret sauce. So most of us in the embedded space have ditched the GNU tools in our production firmware (even if they're still used to _compile_ that firmware).
That's not an entirely accurate understanding of the GPLv3 "anti-tivoisation" restrictions. The restrictions boil down to "if you distribute hardware that has GPLv3 code on it, you must provide the source code (as with GPLv2) and a way for users to replace the GPLv3 code -- but only if it is possible for the vendor to do such a replacement". There's no requirement to relicense other code to GPLv3 -- if there were then GPLv3 wouldn't be an OSI-approved license.
It should be noted that GPLv2 actually had a (much weaker) version of the same restriction:
> For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. [emphasis added]
(Scripts in this context doesn't mean "shell scripts" necessarily, but more like instructions -- like the script of a play.)
So it's not really a surprise that (when the above clause was found to not solve the problem of unmodifiable GPL code) the restrictions were expanded. The GPLv3 also has a bunch of other improvements (such as better ways of resolving accidental infringement, and a patents clause to make it compatible with Apache-2.0).
I do appreciate the intention behind GPLv3. And it does have a lot of good updates over GPLv2.
The reason why I said it's impractical to include GPLv3 code in a system that also has secret sauce (maybe a special control loop or some audio plugins) is more about sauce protection.
If somebody has access to replace GPLv3 components with their own versions, then they effectively have the ability to read from the filesystem, and to control what's in it (at least partially).
So if I had parts that I wanted to keep secret and/or unmodifiable (maybe downloaded items from an app store), I'd have to find some way to separate out anything that's GPLv3 (and also probably constrain the GPLv3 binaries with cgroups to protect against the binaries being replaced with something nefarious). Or I'd have to avoid GPLv3 code in my product. Not because it requires me to release non-GPL code, but more because it requires me to provide write access to the filesystem.
And I guess that maybe GPLv3 is working as intended there. Not my place to judge if the license restrictions are good or bad. But it does mean that GPLv3 code can't easily be shipped on products that also have files which the developer wants to keep as a trade secret (or files that are pirateable). With the end result that most GNU packages become off-limits to a lot of embedded systems developers.
I will post the code fragment if I can find it (this was 10 years ago). I had a tiny awk script on an embedded system (busybox) to construct MAC addresses. There was some basic arithmetic involved and I couldn't quite figure out how to do it with a busybox shell script. The awk script didn't work at all on my Linux desktop.
Even assuming the odd "bloatware" characterization, this is irrelevant. From the article's point of view of "simple tasks", bloat or not doesn't matter; what matter is the language syntax and features used to accomplish a task (and I'd add consistency across platforms).
Regarding slim/embedded distros, it depends on the use cases, and the definition of "slim". It's hard to make broad statements on their prevalence, and regardless, I've never stated that one should use Perl instead and/or that it's "better"; only stated that the option it gives is a valid one.
Unfortunately, for large files perl is significantly faster than awk. I was working on some very large files doing some splitting, and perl was over an order of magnitude faster.
A tool that is stable, well supported, has outstanding documentation, thoroughly tested, won’t capriciously break your code, and outperforms the rest of the pack is not the unfortunate case.
To be more clear, the "unfortunately" part was due to the title of this article. I think awk is great, but if you know Perl well enough it can easily replace awk and be much more versatile.
That is closest, yes. I'd say clearly a couple more things to remember, but if I can get it into my fingers will be just as fluid. Awk's invocation has its own issues around needing to escape the things the shell thinks it owns, too, not that it's at all unique about that.
I don't have my dotfile here (on my phone) but here's some ideas from things I've aliased that I use a lot:
cdd : like cd but can take a file path
- : cd - (like 1-level undo for cd)
.. : cd ..
... : cd ../.. (and so on up to '.......')
nd : like cd but enter the dir too
udd : go up (cd ..) and then rmdir
xd : search for arg in parent dirs, then go as deep as possible on the other branch (like super-lazy cd, actually calls a python script).
ai : edit alias file and then source it
Also I set a bindkey (F5 I think) that types out | awk '{print}' followed by the left key twice so that the cursor is in the correct spot to start awk'in ;D
# Bind F5 to type out awk boilerplate and position the cursor
bindkey -c -s "^[[[E" " | awk '{print }'^[[D^[[D"
Edit: better formatting (and at work now so pasted the bindkey)
Absolutely. Everything comes with costs & benefits. But I'm not sure I've, in my entire 23-year professional programming career, ever encountered a fixed-width text format in the wild. I've used cut even so for places where by coincidence the first couple of columns happened to be the same size, but that's really a hack.
Obviously, other people have different experiences, which is why I qualify it so. (I only narrowly missed it at the beginning, but I started in webdev, and we never quite took direct feeds from the mainframes.) But I don't think it's too crazy to say UNIX shell, to the extent it can be pinned down at all, works most natively with variable-length line-by-line content.
Some topics where you will most definitely come across fixed-width formats:
- processing older system printouts that are save to text file
- banking industry formats for payment files and statements
- industrial machine control
and my favourite.......
- source code.
My first intro to awk was using it to process COBOL code to find redundant copy libs, consolidate code, and generally cleanup code (very very crude linting). And it was brilliant. Fast, logical, readable, reliable - was everything i needed.
It is also an eminently suitable language for teaching programming because it introduces the basic concept of SETUP - LOOP - END, which is exactly the same as one will find in most business systems; you find it in Arduino sketches, hell, you even find it in a browser, which is basically just a whole universe of stuff sitting atop a very fast loop that looks for events.
AWK fan for sure - my hierarchy of languages these days would be: the command line where there is a specific command doing all I need, AWK for anything needing to slice and dice records that don't contain blobs, Python for complete programs, and Python+Nuitka or Lazarus or C# when I need speed and/or cross platform.
> But I'm not sure I've, in my entire 23-year professional programming career, ever encountered a fixed-width text format in the wild.
SAP comes to mind. I think it does support various different formats, but for one reason or another fixed-width seemed to be some kind of default (that's what I usually got when I asked for an SAP feed, at least, but that was years ago).
Admittedly I have encountered fixed-width text formats in the wild. But the last such occasion was about 15 years ago. (It was for interacting with a credit card processor to issue reward cards.)
Within my first year of professional development, I encountered several fixed-width files I needed to read and write. I suppose exposure depends a lot on the specific industry.
I don't think so. I think they're referring to cut's ability to select an arbitrary range of columns, e.g. `cut -f 2-7` to select the 2nd through 7th columns, while awk requires you to explicitly enumerate all desired columns, i.e. `awk '{print $2, $3, $4, $5, $6, $7}'`
OK, understood. And yes, in my experience AWK is less good at that, cut would definitely be the right tool.
It doesn't detract from the point at hand - which is perfectly valid - but it's worth noting that there's a confusion here with regards the terminology: "fields" vs "columns". I thought they were referring to "columns of characters" whereas the added explanations[0] are about "columns of fields". That makes a difference.
But as I say, yes, I agree that to select a range of columns of fields, especially several fields from each line, is definitely better with cut.
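That said, if you do want to stay in awk, a loop gets you a field range too, just more verbosely (a sketch):

awk '{ for (i = 2; i <= 7; i++) printf "%s%s", $i, (i < 7 ? OFS : ORS) }'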
I wrapped awk in a thin shell script, f, so that you can say “f N” to get “awk '{print $N}'” and “f N-M” (where either of the end points are optional) to do what cut does, except it operates on all blanks.
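The script itself isn't posted here, but a rough reconstruction of the idea might look like this (hypothetical; argument handling kept minimal):

#! /bin/bash
# f: print blank-separated field N, or fields N-M (either endpoint optional)
spec="$1"; shift
case "$spec" in
    *-*) lo="${spec%-*}"; hi="${spec#*-}" ;;
    *)   lo="$spec"; hi="$spec" ;;
esac
awk -v lo="${lo:-1}" -v hi="${hi:-0}" '{
    last = (hi ? hi : NF)
    for (i = lo; i <= last; i++) printf "%s%s", $i, (i < last ? OFS : ORS)
}' "$@"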
Well... no, it doesn't, obviously, we can take a quick look at gawk to confirm this.
But, just as we can't say "awk offers the -b flag to process characters as bytes", we can't really say that cut offers any extensions not defined in the standard.
An implementation could, sure. I'd prefer that it didn't, writing conformant shell scripts is hard enough.
Before standards happen, creative developers are free to use their imagination and come up with useful features. Then someone makes up a standard, and from thereon progress is halted. The only way you grow new features and functionality is through design-by-committee, if people aren't making extensions that would one day make it to the next revision. I think it is ridiculous.
Tools should improve, and standards should eventually catch up & pick up the good parts.
People who need to work with legacy systems can just opt not to use those extensions, but one day they too will benefit. Others benefit immediately.
I find that, for these kinds of utilities, all the extra add-ons tend to cost me more in the form of, "Whoops, we tried using this on $DIFFERENT_UNIX and it broke," than they ever save me.
When I'm looking to do something more complicated, I'd rather look to tools from outside the POSIX standard. The good ones give me less fuss, because they can typically be installed on any POSIX-compliant OS, which is NBD, whereas extensions to POSIX tools tend to create actual portability hassles.
PowerShell is a lot better at this. It was designed for reading by humans, not for saving precious bytes over 300-baud terminal connections.
So for example, selecting columns is easy, and can be done by name.
Here's a useful little snippet demonstrating converting and processing the CSV output of the legacy "whoami" Windows command. It lists the groups a user is a member of without poking domain controllers using LDAP. It always works, including cross-forest scenarios.
It looks like a lot of typing, but everything tab-completes and the end-result is human readable. I find that people that prefer terseness over verbose syntax are selfish. They simply don't care about the future maintainers of their scripts.
PowerShell can also natively parse JSON and XML into object hierarchies and then select from them. That's difficult in UNIX. The Invoke-RestMethod command is particularly brilliant, because instead of a stream of bytes it returns an object that can be manipulated directly.
PowerShell Core (PS5/open source) is crushing it in this space over Bash\AWK specifically.
I'm a pretty big advocate for how it works in these use cases now.
Generally I still use Bash\Sed\AWK for the one off instances when ever I need something done there and then. But if I'm writing something that's going to be involved in my CI\CD with a high chance of needing to be maintained. I'll nearly always write it in PowerShell.
While I agree that the silent failures and "opaqueness" can be off-putting, once you understand the tool and how to implement it in your workflow it is wonderfully efficient. Aside from being syntactically terse I haven't found any compelling reasons not to use it.
I've achieved a high level of expertise in Perl. Even though I can, I won't write complex Perl one liners; instead, I will write tested, documented utilities which someone who has not achieved such a level of mastery has a shot at grokking, maintaining and debugging.
Even better if I can write such utilities in a more modern programming language like Go or Rust, if the organization has personnel with expertise in such languages.
There is a school of thought in CS that equates terseness to bad programming practices. That is unfortunate. It is possible for software to be terse and well designed, and of course verbose solutions can easily morph into an unmaintainable nightmare.
I wish there were a set of tools like coreutils but like..."less surprising" I guess? Like I spend 98% of my focus-time in algol-likes. 1-indexing drives me batty. I always have to try a few times to craft cut or tail expressions to match [i:j] slice notation.
And anything more complex than that I typically just use python.
scripts are the antithesis of modern software development. No Unit testing, No CI, no monitoring, no consistency, often no source control. Appeals to me as a hacker but only my scripts are intelligible - which is a bad sign. :)
I'm not actually convinced unit testing is all that valuable, unless the unit under test has a very clear input -> output transformation (like algorithms, string utilities, etc). If it doesn't (and most units don't), unit tests just encumber you.
The value of unit tests when creating software is perhaps debatable, but I find their greatest value to be in maintaining software. If you lock in your expectations of a software component, when it's time to make changes (due to shifting requirements or what not), you can add tests capturing your intent and make sure that you aren't breaking existing expectations. Or at least, that you know which expectations your change will break.
It is easy to explain: unit tests give a documented proof that you care about code robustness. It is used more for social and psychological effect than for its advantages. In fact, outside a few domains, unit tests make it harder to evolve software because the more tests you write, the harder it is to make changes that move the design in a different direction. This is, by the way, my main problem with verbose techniques of programming: the more you have to code, the harder to make needed changes.
Integration tests can be easier to write, are necessary, but can also be much slower to run. Yes, it’s possible to write bad unit tests; the same is true of integration tests.
Yes I worked in a print production industry and unit testing was almost impossible. To generate a string of text reliably and correctly is not difficult, but to position it on the page respecting other elements, flow, overlap, current branded font and size, and white space - unit testing is basically useless. And these are the errors we faced most.
I have written a tool (also a script) myself that allows you to write unit tests, manage your scripts in git, load scripts in other scripts, etc. Maybe I will post a Show HN in the coming weeks, but at the moment I would like to round up some edges before posting it.
In my experience, the biggest problem is that there are many different runtime environments out there that differ in detail and make it hard to write scripts that run everywhere. But programmers not applying all their skills (e.g. writing tests) to build scripts are also part of the problem.
I gave awk a sincere attempt, but I have to say that it wasn't worth it. As soon as one tries to write anything bigger than a one-liner, the language shows its warts. I found myself writing lots of helper routines for things that should be part of the base language/library, e.g. sets with existence check. I also had to work around portability issues, because awk is not awk, despite what this post claims. E.g. matching a part of a line is different among different awks.
And some language decisions are just asinine, e.g. how function parameters and variable scope works, fall through by default although you almost never want that, etc..
But hey, you have socket support! Sounds to me like things have developed in the wrong direction.
And of course no one on your team will know awk.
I found the idea of rule-based programming interesting, but the way it interacts with delimiters and sections (switching rules when encountering certain lines) doesn't work well in practice.
I also found the performance to be very disappointing when I compared it to C and python equivalents.
You realize AWK is about 1/100th the size of Python, right? That's like comparing a Leatherman multi-tool to a Craftsman 2000 piece tool set that weighs 1,000 lbs. This matters significantly when addressing compatibility and when building distros that are space constrained.
Awk is there for a reason: to be small. That's why the O'Reilly press book is called "sed & awk", because they were originally written to work together in the early days of unix, dating back to the late '70s. Sed (1974) & Awk (1977) are in the DNA of unix; Python is something totally different.
First of all I'm not a distro maintainer. I also doubt that people would use awk for seriously space constrained environments. And distros ship both awk and python anyway. And again, I don't understand why they'd support networking but not basic data types/functions.
The only reason I could've seen to use awk was to throw code together more quickly in a DSL.
However this is much less the case than I had hoped. For the one liners there are usually specialized tools like fex that are easier to use and faster (for batch processing).
When I compared my C/python/awk programs the difference was msec/sec/minutes. As soon as I use such a program repeatedly it starts to hurt my productivity. And the development time is not orders of magnitude slower in non-awk languages.
> I also doubt that people would use awk for seriously space constrained environments. And distros ship both awk and python anyway.
Python is absolutely not available everywhere one can find Awk. I've never seen a system with Python but not Awk, but have seen many systems with Awk but not Python (excluding the BSDs, where Python is never in base, anyhow).
Actually, not many years ago I used to claim that I never saw a Linux system with Bash that lacked Perl, but had seen systems with Perl that lacked Bash. (And forget about Python.) This was because most embedded distros use an Ash derivative, often because they used BusyBox for core utilities or a simple Debian install. Perl might not have been default installed, either, but invariably got pulled in as a core dependency for anything sophisticated. Anyhow, the upshot was that you'd be more portable, even within the realm of Linux, with a Perl script than a Bash-reliant shell script. Times have changed, but only in roughly the past 5 years or so. (Nonetheless, IME Perl is still slightly more reliable than Python, but variance is greater, which I guess is a consequence of Docker.)
One thing to keep in mind regarding utility performance is locale support. Most shell utilities rely on libc for locale support, such as I/O translation. Last time I measured, circa 2015, setting LC_ALL=C resulted in significantly improved (2x or better, I forget but am being conservative) standard I/O throughput on glibc systems.[1] I never investigated the reasons. glibc's locale code is a nightmare[2], and that's more than enough explanation for me.
Heavy scripting languages like Perl, Python, Ruby, etc, do most of their locale work internally and, apparently, more efficiently. If you don't care about locale, or are just curious, then set LC_ALL=C in the environment and test again. I set LC_ALL=C in the preamble of all my shell scripts. It makes them faster and, more importantly, has sidestepped countless bugs and gotchas.
For the things I do, and I imagine for the vast majority of things people write shell scripts for, you don't need locale support, or even UTF-8 support. And even if you do care, the rules for how UTF-8 changes the semantics of the environment are complex enough that it's preferable to refactor things so you don't have to care, or can isolate the parts that need to care to a few utility invocations. In practice, system locale work has gone hand-in-hand with making libc and shell utilities 8-bit clean in the C/POSIX locale, which is what most people care about even when they care about locale.
[1] The consequence was that my hexdump implementation, http://25thandclement.com/~william/projects/hexdump.c.html, was significantly faster than the wrapper typically available on Linux systems. My implementation did the transformations from a tiny, non-JIT'd virtual machine, while the wrapper, which only supports a small subset of options, did the transformation in pure C code. My code was still faster even compared to LC_ALL=C, which implied glibc's locale architecture has non-negligible costs.
[2] To be fair, it's a nightmare partly because they've had strong locale support for many years, and the implementation has been mostly backward compatible. At least, "strong" and "backward compatible" relative to the BSDs. Solaris is arguably better on both accounts, though I've never looked at their source code. Solaris' implementation was fast, whatever it was doing. musl libc has the benefit of starting last, so they only support the C and UTF-8 locales, and in most places in libc UTF-8 support simply means being 8-bit clean, so zero or perhaps even negative cost.
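An easy way to check the effect on your own data (a sketch; 'pattern' and big.log are placeholders, and the numbers will vary by libc and locale):

$ time grep -c 'pattern' big.log
$ time LC_ALL=C grep -c 'pattern' big.log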
There was a long period of time where it was easy to find a non-Linux Unix with Perl installed but not Bash: SunOS, Solaris, IRIX, etc., admins would typically install Perl pretty early on, while Bash was more niche. Like, maybe 1990 to 2000. Now we're getting into an era where lots of Unix boxes run MacOS, and although they have Bash, it's a version of only archaeological interest. But they do have Perl.
Most complaints that awk doesn't have have this or that feature ignore the fact that awk is not supposed to be used in isolation. Any substantial use of awk has to be tied to other UNIX utilities. I don't think you can, or should, write a medium to large size script completely in awk, the whole idea is to compose it with one or more UNIX commands.
"This is the opposite of a trend of nonsense called DevOps, where system administrators start writing unit tests and other things to help the developers warm up to them - Taco Bell Programming is about developers knowing enough about Ops (and Unix in general) so that they don't overthink things, and arrive at simple, scalable solutions"
It's not possible for developers to know enough about Ops, just as it's not possible for Ops to know enough about development, because they are different jobs. Moreover, devs are doomed to create terrible solutions because of their job, and Ops are doomed to create kludgy hacks for those terrible solutions because of their job. DevOps is just an attempt to get them to talk to each other frequently so that horrible shit doesn't happen as frequently.
Also, the real Taco Bell programming is actually to use only wget, no xargs. It takes a whole lot of basically every option Wget has, and a very reliable machine with a lot of RAM, but you can crawl millions of pages with just that tool. xargs and find make it worse because you don't get the implicit caching features of the one giant threaded process, so you waste tons of time and disk space re-getting the same pages, re-looking up the same hostnames, etc. (And that's Ops knowledge...)
The difference between dev & Ops is an Italian grandma vs a restaurant chef. It's different experience that gives you different knowledge and a different skillset.
Simple, Composable Pieces is practically the whole ethos of both Unix and Functional Programmers, and they both converged on the same basic flow model: Pipelines with minimal side-effects. This naturally leads to a sequential concurrency, which seems to be easy for humans to reason about: Data flows through the pipeline, different parts of the pipeline can all be active at once, and nobody loses track. It doesn't solve absolutely every possible problem, but the right group of pieces (utility programs, functions) will solve a surprising number of them without much trouble.
>> suppose you have millions of web pages that you want to download and save to disk for later processing. How do you do it?
I don't know enough about the 'real way' or the 'taco bell way', but interested to know --- is this doable the way Ted describes in the article via xargs and wget?
Skip learning of sed and awk and jump straight to perl instead.
$ perl --help
...
-F/pattern/ split() pattern for -a switch (//'s are optional)
-l[octal] enable line ending processing, specifies line terminator
-a autosplit mode with -n or -p (splits $_ into @F)
-n assume "while (<>) { ... }" loop around program
-p assume loop like -n but print line also, like sed
-e program one line of program (several -e's allowed, omit programfile)
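Put together, the awk staple from upthread translates to something like:

... | awk '{print $2}' | ...
... | perl -lane 'print $F[1]' | ...

-a splits each line into @F (0-indexed), -n wraps it in the while (<>) loop, and -l takes care of the newlines.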
Because syntactically, as a language/tool, it is super easy to remember. Writing one-liners with awk feels more intuitive to me.
Awk example:
ls -l | awk '{print $9, $5}' or
ls -lh | awk '{print $9, $5}'
Seems a whole lot simpler. To me. I find if you have to write exhaustive shell scripts then maybe you can look for something more verbose like Perl, I guess.
If you mean the lack of quotations, then the behavior is well-defined and is presumably what was intended. Per POSIX,
> The print statement shall write the value of each expression argument onto the indicated output stream separated by the current output field separator (see variable OFS above), and terminated by the output record separator (see variable ORS above).
The default value for OFS is <space> and for ORS, <newline>.
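For example, with the ls -l one-liner from above, the comma form inserts OFS between the values, and changing OFS changes the join:

ls -l | awk '{print $9, $5}'
ls -l | awk 'BEGIN {OFS="\t"} {print $9, $5}'

The first joins name and size with the default single space, the second with a tab.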
In my defense I did this fairly quickly (which was the point) and was not trying to illustrate proper syntax (I mean, it does run and does produce output).
ls -l | awk '{print $9 "\t" $5}'
That is about as much as I'm willing to do for this.
Also, a handy trick is to combine awk and cut. For example I had a log line that had a variable amount of columns just in one field, but immediately after the field was a comma. I cut based on the comma:
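The original pipeline didn't make it into the comment, but the shape would be something like this (purely illustrative field and delimiter choices):

... | cut -d',' -f1 | awk '{print $NF}' | ...

cut isolates everything up to the comma that ends the awkward field, and awk then grabs the token you actually want from it.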
I've been parsing some documents converted from PDF (using the Poppler library's "pdftotext" command with the "--layout" option).
I found that reading these -- sort-of half-assed structured data, but with page-chunked artefacts and idiosyncrasies -- was difficult on a line-by-line basis, and thought idly "this would be a lot easier if I could process by page instead".
Text was laid out in columns, and the amount of indenting (the whitespace between columns) was significant. So preserving this somehow would be Very Useful.
Suddenly those pesky '^L' formfeeds were an asset, not a liability. Let's treat the formfeed ("\f") as a record delimiter, and the newline ("\n") as a field delimiter. We can parse out the actual columns based on whitespace, for each line:
- Via the split() function, an array of columns separated by two or more spaces, which are saved as an array of gaps so I have the whitespace to play with.
Edge cases and fiddling ensue, but that's the essential bit of the code there.
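The original code isn't included in the comment, but a minimal sketch of that setup might look like this (gawk, since the four-argument split() that also captures the separators is a gawk extension; pages.txt stands in for the pdftotext output):

gawk 'BEGIN { RS = "\f"; FS = "\n" }      # one record per page, one field per line
{
    for (i = 1; i <= NF; i++) {
        n = split($i, cols, /  +/, gaps)  # columns are runs of 2+ spaces apart; gaps[] keeps the whitespace
        # ... inspect cols[1..n] and the widths in gaps[] to work out column boundaries
    }
}' pages.txt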
Since the lines are an array, I can roll back and forth through the page (basically being able to read forward and backwards through the text record), testing values, finding out where column boundaries are, etc., and then output a page's worth of content, transposing to a single-column format, with appropriate whitespacing, when done.
In testing and debugging the output (working off of 20+ documents of 100s to ~1,000 pages), a lot of test cases, scaffolding, diagnostics, etc., have been created and removed to make sure the Right Things are happening. Easy with awk.
I just learned about SE a month or so ago, and it is indeed pretty awesome. I tried out the "vis" editor, using SE to create multiple cursors that I then manipulated with vi-like commands. That was a pretty sweet use case.
Don't stop there. You've solved a real problem in your life, and you might want that information another day. Make a tiny script that encapsulates it. Generalize it a tiny bit, and give it a memorable name (perhaps lsof-tree). That done, you can stop worrying about the mechanics of the solution and build on it.
#! /bin/bash
# lsof-tree: list open files in a given directory tree (default /home)
NAME=9 # set to 10 for CentOS
BASE="${1:-/home}"
lsof | awk -v NAME=$NAME '{print $NAME}' | grep "^$BASE" | sort -u
YMMV, but I find it easier and faster to know the basic utilities and how to compose them with pipes than remembering the name of a zillion such scripts.
I guess it depends on how often you need that particular pipeline. Every day? Sure, make a script. Every few months? Nah, I won't remember it anyway,or probably I remember that I've made a script like that but then I have to start searching my bin directory in the end using more time than just writing the pipeline in the first place.
It is fast, robust, and frequently far more performant than a lot of modern tools that can be overkill for most data manipulation. I use it all the time in our ETL processes and it always works as advertised.
I use it for a couple reasons: one, it is installed as a base app on almost every single *nix implementation on the planet, so you can count on having it even on ancient or restrictive environments (which I work in frequently); Two, awk is frequently fast enough for most needs, and generally far faster than a number of off the shelf "modern" tools. The first reason is the one that generally leads me to its use, its ubiquity and power make it a compelling tool.
I use Perl similarly to awk if I need to use regex rather than white space delimited fields.
I think if you know Perl really well and can remember the command line arguments - particularly -E, -n, -I and -p - then it’s a good swap in substitute for grep, sed, awk, cut, bash, etc when whatever 5 min task you’re working on gets a tiny bit more complex.
Similarly a decent version perl 5 seems to be installed everywhere by default.
I’m curious to know if anyone would say the same about python or any other programs? I’m not particularly strong in short python scripting.
I would say Perl’s native support for regular expressions makes it more useful on the CLI than Python, but Python is also very low on my preferred languages list.
I do, however, use it for JSON pretty printing in a pipeline: python -mjson.tool IIRC.
Last week I threw out AWK and replaced it with Ruby (Could've been Python, Perl or PHP even).
Because AWK is not suited for CSV. Please prove me wrong!
I had to parse 9 million lines. Some of which contain "quoted records"; others, in the same column, are unquoted. Some contain commas in the fields, most don't. CSV is like that: more like a guideline than an actual standard.
Two hours of googling and hacking later, I gave up and rewrote the importer in Ruby, in under 5 minutes.
Lesson learned: I'll stay clear of AWK, when I know a oneliner of Ruby (or Python) can solve it just as well. Because I know for certain the latter can deal with all the edgecases that will certainly pop up.
> Some of which contain "quoted records", others, same column, are unquoted.
In which case, there is the FPAT variable which can be used to define what a field is. FPAT="\"[^\"]*\"|[^,]*", which means "stuff between quotes, or things that are not commas", would probably have worked for you.
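A one-line check of that pattern (gawk only, since FPAT is a gawk feature; the sample record is invented):

$ echo 'plain,"quoted, with comma",last' | gawk 'BEGIN { FPAT = "\"[^\"]*\"|[^,]*" } { print $2 }'
"quoted, with comma"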
> Some contain comma's, in the fields, most don't. CSV is like that: more like a guideline than actual sense.
Well, I would say that's absolutely false. You can't just put the delimiter wherever you fancy and call it a well-formed file. Quoting exists for the unfortunate cases where your data includes the delimiting character (ideally the author would have the sense to use a more suitable character, like a tab).
This is just a retort to prevent your post from dissuading readers from awk, which is a fantastic tool. If you actually sit for half an hour and learn it rather than google to cobble together code that works, it is wonderful. I also don't think it is valid to base your judgement of a tool on what was apparently garbage data.
Garbage and poorly specified csv files are a fact of life and people have to deal with them all the time.
But if you want to be in a world where people only deal with well specified files like RFC 4180 (for some definition of well specified), your quick field pattern doesn’t conform. It doesn’t handle escaped double quotes or quoted line breaks. If you’re using your quick awk command to transform an RFC 4180 file into another RFC 4180 file you’ve just puked out the sort of garbage you were railing against.
While awk is a great tool if you're dealing with a csv format with a predictable specification, and probably could be made to bend to the GP's will with a little more knowledge, it gets trickier if you're dealing with some of the garbage that comes up in the real world. What's worse is the programming model leads you down the path of never validating your assumptions and silently failing.
I love awk for interactive sessions when I can manually sanity check the output. But if I’m writing something mildly complex that has to work in a batch on input I’ve never seen, I too would reach for ruby.
I had the exact same experience, but sub python for Ruby.
The lesson I took wasn't that awk sucks, though. The lesson was that CSV is not trivial, and should not be parsed with regex or string matching. It's a standard with variants, and rolling in a library will pay dividends, especially if you're parsing a wide variety of different dialects of CSV.
A related lesson I took is that once your awk script grows beyond a certain level, graduate it up to a real language. I love awk, but it excels at small scale text munging. It's not suited to anything more involved than that. If translating an awk program is a major task, then the program was already too big to begin with.
The author of ripgrep (Andrew burntsushi Gallant) has 2 tools for working with CSV data - 'xsv' (a command line swiss army knife for working with CSV data) and 'CSV parser'. Check those out:
https://blog.burntsushi.net/projects/
CSV is not really a standard, various programs have their own interpretation (even MS Excel allows a header line before the data that contain specifications for the file, such as delimiter, which you probably won't account for in a home-grown solution).
So use a dedicated tool or library or you will run into trouble one day.
For one off jobs, I open it in Excel or similar and save as tab or pipe delimited text. This usually plays much more nicely with command line utilities, assuming it didn't mangle any numbers.
I wrote this for parsing CSV in Awk: http://yumegakanau.org/code/awk-csv/
It maybe doesn't handle all the CSV out there, but the cases you mention (quoting and commas) it does handle.
I usually load CSV data into PostgreSQL to do anything with it; mostly wrote this Awk library for fun. So I'm not going to argue that Awk is the best language for doing this kind of thing, but it is possible.
To simplify working with CSV data using command line tools, I wrote csvquote ( https://github.com/dbro/csvquote ). There are some examples on that page that show how it works with awk, cut, sed, etc.
I'm just starting to learn awk, would you mind sharing some of the issues / edge cases that you ran into that you found it didn't handle well? Was it maybe just that you found it tricky to write the regular expressions?
Parenthetically, since there are a bunch of UNIX greybeards on HN: if anyone has artwork of the AWK t-shirt I will happily pay any reasonable price. The shirt has a parachute-wearing bird about to jump from an airplane and is captioned with AWK's most famous error message: awk: bailing out near line 1.
Hmm. I'm sure this question will induce a flamewar of practical "#NeverAwk"-ers fighting toolbelt bloat, versus tech-hoarding AWK apologists arguing against throwing something out given it fills <niche>.
Here's the thing: these arguments all too commonly focus on subjective notions of "simplicity" and toy examples divorced from actual common practice and/or solid comparable benchmarks.
Show me a range of practical examples, for each competing env (awk, sed, perl, python, ruby, maybe bash).
Include:
- time it takes to teach a total novice (the time it takes to learn whatever is needed that example, not the entire language)
- how easy it is to recall said knowledge at a later date
- how fast the example is, multiplied by how much you are likely to use it = actual time saved in terms of execution. For small, fast examples the difference is irrelevant; a 10x speedup that is 0.1s vs 0.01s is meaningless.
- how extendable an example is. Hence the original example should include a series of extensions to the original task, to demonstrate how flexible / composable they are: e.g. task 1) count lines in a file; task 2) count lines in a file, then add 42 to it;
> Here's the thing: these arguments all too commonly focus on subjective notions of "simplicity" and toy examples divorced from actual common practice and/or solid comparable benchmarks.
I think you're missing the point of awk. The O'Reilly sed and awk book has some complex examples, but when I look at my own usage they are all toy examples within a much larger scope. It's more like a special DSL extension for my shell than something I'd pick to build the entire solution, so a comparison to perl, python and ruby doesn't really make sense; they are general purpose languages, but awk just has a couple of features that make it a very specialized yet useful one.
As an example, I have a system for importing and parsing log files that's mostly done from a shell script; awk is used in two parts. The first is to transform a structured and easy-to-read file (records '\n\n' separated) into a csv easier to consume for bash (there are probably quite a few options to do this, from tr to bash itself) and it's done inline. The second is to filter the results down to what I need, so I have scripts like:
#!/usr/bin/awk -f
/some common error I don't care about/ { next } #skip line
/other common error/ { next }
/Error/ { print $0 } #this prints error lines, alternatively:
/Error/ && !errors[gensub($1, "", "g", $0)]++ { print $0 } #print each error once
{next} #skip everything else
Apart from one single line which wasn't in the original that's something you could teach a total novice in minutes, the /pattern/{action} syntax is about as simple as programming can be. Execution speed could probably be improved with a specific program but I suspect the bottleneck would be the spinning disk anyway, I run this over hundreds of MB every few minutes and it's not a problem, when I run it manually it's near instantaneous, I spend longer waiting for the desktop calculator to open up these days.
I sat down and read through the Awk manual once just to learn and was pleasantly surprised at how orthogonal the language is and how much it offers.
The only problem is that it was written by people using ed on PDP computers and that kind of shows. The primary logic is "filter out lines X and apply transform Y" is completely natural to someone using an editor like ed, but is fairly foreign to modern computer users. Most people aren't going to take the time to learn an obscure commandline tool these days, especially since it comes from the Dennis Ritchie school of "errors are like angry housewives, you know what you did" debugging.
I like to tell the story of when I was doing some genetics data wrangling and spent three days writing some perl code and I kept failing, then sent one of the researchers an email and he suggested an awk method and I turned 3 pages of perl into an awk one-liner that works just fine. Now, it's probably because I don't know perl very well, but as an ops type who doesn't have the classical dev/cs education, tools like it are my goto.
I find Python list/dict comprehensions, zip, range, enumerate etc. and the itertools module very good for such things.
And you have immediate access to so many useful modules (csv, json, xml) and can easily extract code fragments into functions.
You can also execute shell-like commands with subprocess.check_output() without ever worrying again about escaping strings or accidentally splitting them at spaces or whatever.
Clever one liners are difficult to comprehend. It's better to break them up to a few variable assignments with descriptive, long names without abbreviations.
I agree python seems to have taken over in this space since then (7+ yrs ago) and is usually a better tool in bioinformatics, I was just using the tools I knew as an ops dude.
If you're versed in both, it comes down to taste, though there really are cases where you'll have awk (usually via busybox) but not Perl. OpenWRT comes to mind (just verified it's not present by default, though yes, packages are available).
For a huge number of simple tasks, awk is available and sufficient. It's largely a subset of Perl, so yes, there's some skills overlap, but there are times where knowing awk is the right tool and the available tool will pay off.
After a decade of writing application software in C family languages, I am now working on a big devops effort in a Linux environment. A lot of the syntax is janky but it's pretty amazing what you can accomplish with a shell and the Linux command line tools.
In the early aughts, fresh out of college, one of my first projects was to take semi-structured text from database blobs and convert it to XML. I quickly realized it was not a task to be fully automated, too many edge cases that really required human judgement because the it was only very loosely semi-structured. I turned to sed, awk, and pico. Sed & awk did the 99%, and dumped into pico when it didn't know what to do and I would resolve the issue. Doing it semi-supervised in this way was 100 times faster than doing it all manually, and 10 times faster than full automation, and probably more accurate.
But it was the ability to string together these types of command line tools that made it possible.
In the early days of UNIX, I think more of its users knew C. Today, it is probably a much smaller number. However, learning AWK today, IMO, can help someone who also intends to learn C.
"Imagine programming without regular expressions."
Best of all worlds. I wish there would be a way to get back the time I spent with debugging edge cases of different regex implementations on various OS's.
I have used awk for decades, but I simply stopped and dropped the habit of using it along with sed and perl. Nowadays I would rather write a program than a script, just to avoid memorizing all these tricks and hacks and glitches.
Too big and slow. Also I disagree with the expressiveness, (emacs-has-chosen-a-very-longwinded 'way (to (express things))) that would be shorter in languages with more syntax.
When writing a shell script, its robustness/reliability can be inferred from its scope:
1. Uses shell-only commands (echo, for) - most robust; but things like basename/dirname and regex's vary by shell (sh, bash, zsh, ksh)
2. Uses /bin - might run into missing a binary but not likely, still robust and allows a richer set of tools (e.g., uname, chmod and admin-ish things live in /bin)
3. Uses /usr/bin - runs risk if missing packages, likely not very robust (packages drop things in here, like gzip, yacc, gcc)
4. Uses /usr/local/bin or /opt/local/bin - definitely requires package installs, least robust
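A quick way to see which tier a script's commands actually fall into on a given host (a sketch; the command list is just an example):

for cmd in awk sed gzip yacc; do
    printf '%-8s %s\n' "$cmd" "$(command -v "$cmd" || echo MISSING)"
done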