What does $0=$2 in awk do? (kau.sh)
131 points by sorcercode on Sept 25, 2022 | 85 comments



Here are some more learning resources:

* https://backreference.org/2010/02/10/idiomatic-awk/ — how to write more idiomatic (and usually shorter and more efficient) awk programs

* https://learnbyexample.github.io/learn_gnuawk/preface.html — my ebook on GNU awk one-liners, plenty of examples and exercises

* https://www.grymoire.com/Unix/Awk.html — covers information about different `awk` versions as well

* https://earthly.dev/blog/awk-examples/ — start with short Awk one-liners and build towards a simple program to process book reviews


Also, the original awk book is available on the Internet Archive, IIRC.

https://www.google.com/search?q=site%3Aarchive.org+the+awk+p...


Elegant, but not something you should ever use outside of ad-hoc situations.

I think this is more comprehensible, and is also more robust because it actually specifies the field you're looking for rather than just any line with quotes:

gawk -F'"' '/^ name:/ {print $2}' appVersion.gradle

(hacker news is clobbering the spaces after ^)
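For illustration, assuming appVersion.gradle contains a line like `  name: "7.4.2"` (the version string here is made up):

    $ printf '  name: "7.4.2"\n' | gawk -F'"' '/^  name:/ {print $2}'
    7.4.2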


> (hacker news is clobbering the spaces after ^)

Indent with 2 spaces to get code formatting: https://news.ycombinator.com/formatdoc


It’s a pity that awk doesn’t support capture groups in line patterns. Then you could get rid of the field separator and make the script even more comprehensible. (You can simulate this with match() in gawk, but then you must know how match() works.)

Personally, I’d probably rather use sed here:

  sed -nE 's/^\s*name:\s*"([^"]*)"\s*$/\1/p'
While that regex is more complex, it is also safer and more explicit.
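With the same kind of made-up input (GNU sed; `\s` is a GNU extension):

    $ printf '  name: "7.4.2"\n' | sed -nE 's/^\s*name:\s*"([^"]*)"\s*$/\1/p'
    7.4.2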


Grep is often better suited for extracting matched portion, especially if you also have PCRE option:

    grep -oP -m1 'name:\s*"\K[^"]+'


I thought I was pretty good at regex, but I could never have written this one and had to consult both `man grep` and regex101.com.

Explanations for the beginner and intermediate regex and grep user:

`-o`: Only return the match, instead of the entire line

`-P`: use Perl compatible regex

`-m NUM` (max-count): stop reading a file after NUM matching lines

And now for the regex:

`name:`: find the exact match

`\s*"`: Zero or more spaces leading up to and including an double quote

`\K`: This was the kicker for me. "resets the starting point of the reported match. Any previously consumed characters are no longer included in the final match" - basically tells the regex engine that the characters _before_ `\K` need to be there in order to form a match, but it should only return the characters _after_ `\K` as the match. This is super handy! Is there a "reverse \K"?

`[^"]+`: One or more characters that are not a double quote. This basically means "Find the line that has a key called "name" and return all the characters after the first double quote and until the last double quote"


If you'd like to learn more about such grep powers, check out my free ebook [0]

What do you mean by "reverse \K"? Are you aware of lookarounds? Perhaps you meant positive lookahead?

    # match digits only if there is a semicolon afterwards
    $ echo '12; 42,31;100' | grep -oP '\d+(?=;)'
    12
    31

[0] https://learnbyexample.github.io/learn_gnugrep_ripgrep/intro...


In vim there are `\zs` and `\ze`, where `\zs` is the `\K` equivalent in grep (and `\ze` marks where the match ends).

Basically

  :%s/hello \zsworld\ze out there/planet/g
would find all `hello world out there` and replace `world` with `planet`.


Consider this:

    grep -P 'start: (\d+) end'
"How do I make it print only the captured group with the number, not the whole line?" is a pretty common Stack Overflow question. The "\K" thing gets rid of the "start: " part, but what about " end"? That's were "reverse \K" would come in handy.


That's where ripgrep's -r/--replace flag comes in handy:

    $ echo 'foobar start: 123 end quuxbar' | rg 'start: ([0-9]+) end'
    foobar start: 123 end quuxbar
    $ echo 'foobar start: 123 end quuxbar' | rg 'start: ([0-9]+) end' -r '$1'
    foobar 123 quuxbar
    $ echo 'foobar start: 123 end quuxbar' | rg 'start: ([0-9]+) end' -or '$1'
    123


That's where lookarounds help:

    grep -oP 'start: \K\d+(?= end)'
`\K` is kinda similar to lookbehind (but not exactly the same, as it is not zero-width), and particularly helpful for variable-length patterns.

If you need to process further, you can make use of `-r` option in `ripgrep` or move to other tools like sed, awk, perl, etc.
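A quick illustration of why `\K` matters for variable-length prefixes (classic PCRE requires lookbehinds to be fixed-length; the exact error text may vary by version):

    # variable-length lookbehind is rejected:
    $ echo 'name:   "x"' | grep -oP '(?<=name:\s*")[^"]+'
    grep: lookbehind assertion is not fixed length
    # \K has no such restriction:
    $ echo 'name:   "x"' | grep -oP 'name:\s*"\K[^"]+'
    x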



> all the characters after the first double quote and until the last double quote

Until the next double quote, not necessarily the last one.


Why doesn't POSIX include the "-o" option?

The word "often" in the parent comment implies there are times when GNU grep is not better suited than sed with ERE. (I often use flex instead of sed.)

For example,

https://debbugs.gnu.org/cgi/bugreport.cgi?bug=44754

Or from the 3.8 manual:

6.1 Known Bugs

Large repetition counts in the `{n,m}' construct may cause grep to use lots of memory. In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory.

Back-references can greatly slow down matching, as they can generate exponentially many matching possibilities that can consume both time and memory to explore. Also, the POSIX specification for back-references is at times unclear. Furthermore, many regular expression implementations have back-reference bugs that can cause programs to return incorrect answers or even crash, and fixing these bugs has often been low-priority: for example, as of 2021 the GNU C library bug database contained back-reference bugs 52, 10844, 11053, 24269 and 25322, with little sign of forthcoming fixes. Luckily, back-references are rarely useful and it should be little trouble to avoid them in practical applications.


>there are times when GNU grep is not better suited than sed with ERE

Because, among other things, sed has filtering features, such as those based on line numbers, address ranges, etc., that aren't present in grep.

Regarding known bugs, that probably extends to GNU sed/awk as well. An issue I filed (https://debbugs.gnu.org/cgi/bugreport.cgi?bug=26864) led to the manual update about back-reference bugs.


GNU awk does support them. The macOS CLI userland is less capable by default.


You can't reference the results of capture groups in line patterns in gawk, unless you use the match() function. You can write:

  match($0, /...(...).../, arr) { ...refer to arr[i]... }
But you can't write:

  /...(...).../ { ...refer to a capture group somehow... }
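For a concrete version of the first form, here is a sketch against the same made-up `name:` line (in gawk, arr[1] holds the first capture group):

    $ printf '  name: "7.4.2"\n' | gawk 'match($0, /name:[[:space:]]*"([^"]*)"/, arr) { print arr[1] }'
    7.4.2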


Nice! I think your solution is definitely more legible.

The key to legible awk I’ve found is to not rely on the defaults that awk assumes.

(but it’s quite the mental exercise to decipher it). Feels like regex but less painful


Most of my AWK scripts are actually standalone scripts, as with `#!/usr/bin/awk -f` or similar. It's actually a very readable and comfortable little language once you free yourself from the one-liner constraint.
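For instance, a minimal standalone sketch (the file name and task here are made up):

    #!/usr/bin/awk -f
    # longlines.awk: report lines longer than 80 characters
    length($0) > 80 { print FILENAME ":" FNR ": " length($0) " chars" }

Then `chmod +x longlines.awk && ./longlines.awk src/*.c`.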


If you used a regex to match a line, would it run slower than just splitting on the field separator and counting NF?

On a tiny file the speed won't matter at all, but it might on a big file (say, when you're looking through log files hundreds of gigs in size).


The regex solution might be faster.

The $0=$2 solution has to find the second field and test it for truthiness for every line, while the regex one only has to do that for those matching the regex, and matching that regex will be fast for most lines (skipping spaces and then testing for a fixed string ‘name’ is easy).

Regardless, the regex solution is more robust and maintainable.

Depending on how this gets used, I might add code to detect the ‘appVersion’ start and end lines, set/clear a flag there, and only match the ‘name’ lines when that flag is set, to make it even more robust (who knows what other lines might contain ‘name’ now or in the future?)
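A sketch of that flag idea, assuming (this structure is guessed, not taken from the article) the file has an `appVersion { ... }` block:

    gawk -F'"' '
      /appVersion[[:space:]]*\{/      { inblock = 1 }   # entering the block
      inblock && /^[[:space:]]*name:/ { print $2 }      # only match name: inside it
      inblock && /\}/                 { inblock = 0 }   # leaving the block
    ' appVersion.gradle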


That's way more comprehensible and maintainable.


Awk is such a weird tool--it's powerful and so few people know how to leverage it.

Yesterday, someone in chat wanted to extract special comments from their source code and turn them into a script for GDB to run. That way they could set a break point like this:

  void func(void) {
    //d break
  }
They had a working script, but it was slow, and I felt like most of the heavy lifting could be done with a short Awk command:

  awk -F'//d[[:space:]]+' \
    'NF > 1 {print FILENAME ":" FNR " " $2}' \
    source/*.c
This one command finds all of those special comments in all of your source files. For example, it might print out something like:

  source/main.c:105 break
  source/lib.c:23 break
The idea of using //d[[:space:]]+ as the field separator was not obvious; like many Awk tricks, it isn't apparent to people who don't use Awk often (that includes me).
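As a hypothetical next step, swapping the field order turns those lines directly into a GDB command file (the file names and binary here are made up):

    # emit "break source/main.c:105"-style lines and feed them to gdb
    awk -F'//d[[:space:]]+' 'NF > 1 {print $2, FILENAME ":" FNR}' source/*.c > breaks.gdb
    gdb -x breaks.gdb ./a.out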

(One of the other cases I've heard for using Awk is for deploying scripts in environments where you're not permitted to install new programs or do shell scripting, but somehow an Awk script is excepted from the rules.)


Unfortunately, the universal POSIX standard of awk only supports single-character, non-regular expression field separators (https://pubs.opengroup.org/onlinepubs/9699919799/utilities/a...). It's arguable whether one should write POSIX-compliant awk or not (similar arguments apply for shell scripting).

When feasible, I try to write POSIX-compliant awk, so the script could have been written as:

    awk '/\/\/d / {gsub(/.*\/\/d /, ""); print FILENAME ":" FNR " " $0; }' source/*.c


I just write Perl instead


Yeah, people say Perl is write-only, but I guarantee it's less write-only than that awk incantation.


That "Awk incantation" looks clear to me.

- On lines that contain //d,

- Delete the part of the line up to and including //d,

- Print the file, line number, and the rest of the line.

Maybe it's familiarity with regular expressions? If you're not familiar with regular expressions, that Awk is gonna look a bit funny. The regular expressions look a bit messy just because they have to match literal slashes, so you get /\/\/

There is no use of funny features or clever tricks in the code, it's just kind of straightforward, mindless code that does exactly what it says it does. It's definitely less clever than the Awk invocation that I wrote (which is a good thing).


If I see that regex in code somewhere, I'd have to stop and break it down the way you did to understand what it's doing, and I've spent a fair bit of time with regex. I think reading it and understanding what it does in one go would require far more than mere familiarity. (and that's not including the awk-isms like FNR etc.)


“That regex” makes me suspect that you think that the entire script is a regex. It’s not. It’s an Awk script, with two regexes in it.

First, know that Awk goes line by line. The script is implicitly executed for each line in the input. That’s just the entire thing Awk does, normally—if you want to process files line by line, and your needs are simple, well, Awk fills in the gaps where stuff like “cut” fall short (and I can never remember how to use cut, so I just use Awk anyway).

Second, know that “if” is implicit in Awk. You don’t write this:

  if (condition) { code }
You write this instead:

  condition { code }
This is like how Sed works, or Vim, except you get braces and the syntax is a bit easier to read.

The code block contains two statements: one function call (gsub) and then print.

So the first regular expression is just “//d ”, with some escaping for the slashes. The second regular expression is “.*//d ”. I do think that someone with basic familiarity with regexes should have no problem understanding these.
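So the whole thing, run on a made-up line, behaves like this (FILENAME and FNR dropped for brevity):

    $ echo 'code(); //d break' | awk '/\/\/d / { gsub(/.*\/\/d /, ""); print }'
    break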


Cool! Now do this with every line of awk you're asking someone else to maintain, give up, and then write it in a better, more explicit language instead.


Feels like I’m on Usenet again when I read comments like this.

Awk is mostly nice for one-liners and is something you can just write into a command-line ad-hoc. It’s good at that. I could write the same thing as a Python script but it would take longer, and I would need to know that Python is installed on the system—Awk has a larger install base and is found on “minimal” installs.

If you hate Awk, and think it’s stupid, don’t use it. Seems like a big waste of time trying to argue with people who like Awk. That’s the kind of discussion I remember from my experience on Usenet in the 90s, and it seems like some people haven’t learned to move on.


Well, Perl kinda started where SED and AWK finished, and it does incorporate a lot of syntax from them, for better or worse.

It makes for quick & powerful one-liners, but sadly people then put the one-liners into actual programs instead of writing them in a nicer way...


I thought about this, but both the macOS Awk and GNU Awk support regexps as field separators.


> it's powerful and so few people know how to leverage it

Because otherwise it is useless. It has the same fate as AHK scripting and similar little languages. The language may be okay for its task, but if you cannot or do not use it for other things (unlike e.g. perl) at least from time to time, chances that you will learn and remember it are low. People know sed because regexps are everywhere. People use perl instead of awk, because they have a muscle memory for it. They may know [, find and glob for their relative generic-ness. They ignore awk and ahk because these are too niche to pay enough attention to. You either find a snippet or just move on.

If you are not constrained by a single line in a script, it’s easier to feed a heredoc into an interpreter of choice.
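For example, a sketch of the heredoc approach (the interpreter choice and script are illustrative):

    python3 - appVersion.gradle <<'EOF'
    import re, sys
    for line in open(sys.argv[1]):
        m = re.match(r'\s*name:\s*"([^"]*)"\s*$', line)
        if m:
            print(m.group(1))
    EOF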


ahk is a lot easier than any alternatives for a lot of Windows stuff like window management.

I use ahk scripts to arrange my windows for me and some other simple hotkey things. Wouldn't use it for "real" programming stuff though.


Yes, ahk is a great framework for its purpose. It's hard to recreate it from scratch. But it would hugely benefit from literally any mainstream language around it. Imagine how many plugins simply wouldn't exist, or wouldn't have been delivered as quickly, if e.g. vscode had invented its own arcane vscodescript instead of js/ts. Or how much richer desktop automation could be if ahk could use pip, npm, IDEs, and interop with other "junction" tech seamlessly.


Just like Perl. Most people prefer dumb verbose code, not smart terse.


The question I prefer to ask is “how much functionality can I comprehend in a given amount of time” rather than “how many lines of code can I comprehend in a given amount of time”.


Most people prefer legible code, not hieroglyphics


(Honest question) what do you feel that your comment added to the one above it?

Are you suggesting that "dumb verbose code" might not be legible (I suppose that's technically possible, but seems unlikely to happen by accident)?

Or are you implying that Perl consists of "hieroglyphics" and so is not a suitable language for writing legible code? This, I think, would miss the point - deepsun was saying that, in both Perl and in awk, readers prefer legible code over cleverness - to claim that Perl cannot be legible at _all_ requires a little more justification, and would probably be disputed on the grounds that familiarity with a language's conventions is often a prerequisite for legibility.


I think it’s additive. We like to feel smart and can over complicate things. I much prefer a boring non-trendy approach that’s easy to maintain over new hotness every time.


Yes, I was basically saying that terse code is often unreadable due to the terseness coming from what amounts to a weird compression algorithm that can satisfy a compiler but buries information for humans in weird syntax instead of something resembling English.


Not OP, but perhaps they are trying to say that terse code can be quite illegible.

I've written sed scripts and more complicated regular expressions that, when I come back to them, I don't remember what all that mess was supposed to accomplish.


As an aside, I find a little macro that expands to asm("int3") is useful for this.


Awk, like vim, is a tool I love dearly that I absolutely would never recommend someone else learn. It’s like a form of mental illness but it’s so lodged in my brain and I’ll give it up when they put me in the grave.


Depends on the person and their specifics. A CS student should absolutely take time to learn to use these: vim, awk, gdb, etc. For a self-taught dev who is already working and already has their habits, I don't think this is worth their time.


Why is it worth the time of a CS student but not worth the time of a self-taught developer?

Feels like the reasoning should be something like: either learning it makes you more efficient, or it doesn't. If it does, learn it; if it doesn't, don't, regardless of your background.


Learning this will make you more efficient, always. Using gdb (pdb for me rn, but it's the same) or vim especially, but you have to consider your working week, the habits you forged, and like a sibling said, the opportunity cost.

Let's be honest, CS students do have time to try things, do vimtutor or regex golf (yeah, add regex to the list too) and other stuff like that. Once you're working, you lose some agency. And gain some.

Recently at a daily, I offered to help write my coworker's regex. He is a fine dev/ops guy, but self-taught (ex-electronics guy) and misses some basics that aren't useful 99% of the time. He could've written his regex without help, but this is typically the case where getting more efficient isn't really worth the cost once you're working.


The opportunity cost of time is not zero. If you already have tools which are working for you, time is (possibly) better spent learning other things.


> For a self-taught dev who is already working and already has their habits

This implies that the self-taught dev has habits that are just as efficient as the Unix toolset being recommended for the CS student.

But you don't know if that's actually true: it might be for some people, but not for others. It's a skill in itself to find out whether your current toolkit is lacking and whether a better one exists.


> A CS student should absolutely take time to learn to use these vim, awk, gdb etc.

You forgot emacs.


He mentioned vim.


> He mentioned vim.

He did, but just as less is more, vi and emacs should always be mentioned together.


I shouldn't have mentioned vim. What people should learn is to use any modal editor; the specifics aren't important.


Nobody should learn vim when vis exists.


Except you have to recreate the awesome plugins and community that exist around vim for vis. I wish they existed.


And this is why we in CS never advance and are still stuck in the '70s. Well, just Linux.


Replace vim with a modal editor.


Off-topic, but I _love_ the "sidenote" format for footnotes. I've been meaning to implement that in my own blog for a while, now. I'll check out the source for inspiration.



Love the sound of it, but I don't see this on mobile, even in desktop mode; it must be tied to a media query. I'll be checking it out later too c:


Thanks for the kind words and for noticing.

I enable the sidenotes based on how “wide” your current viewport is. I’ve just personally found they don’t work as well on smaller screens.

I wrote it with simple jQuery. Happy to share if anyone is interested and doesn't want to write it from scratch.


Thanks a ton!


Do not do this in awk when the same in sed would be easier to read.

But also do not match this data format in the first place. Match "name:" instead if that is what you mean. This makes the intention clear.

For matching text, just use grep, as that is what most people would expect:

  grep -m 1 "name:" appVersion.gradle | grep -o "[0-9][0-9.]*"
You can shorten it to one regexp, using an extended format that can handle backrefs, if the context would make that clearer.


Yeah, this just proves to me that the intuitive shell tools are way better. I'll keep all the heads, tails, cuts, and trs, etc.


I think awk and sed are great, and it's great to eschew pipes when a powerful component that's already in the pipeline, like sed, awk, or grep, can pull double duty. But I do think one should favor `grep -P` with a positive lookbehind in a case like this. I can just picture an innocuous string being added anywhere above the version string and throwing this one-liner off, when a regex could have pulled the version string out from an arbitrary position within the file and would hold up much more robustly. You don't want to impose opaque constraints on future edits even if you're the only one who will be editing the file; it's just too easy to forget and trip over later.


I'm whatever the opposite of a code golf extremist is.

E.g. I hate the so-called "useless use of cat." I uselessly use cat all the dang time; visualizing the pipeline is infinitely more useful than, what, "elegance"? Who cares?


Same here. I use `cat` as a tool to read data from the filesystem into a pipeline, regardless of whether that's the original intention of the tool. I don't want the input filename to be to the right of the first manipulation command in a pipeline that's supposed to read left to right!


It's actually part of the POSIX standard that redirections can be put anywhere in the command line, so one can do:

  <file.txt sed 's/some/filter/' | other_cmd
on any standards-compliant-ish shell :) (I use zsh, and I know it also works for bash and dash)


Never knew this. Looking at it now. First impression, I can't tell if this is idiotic or genius :)


Well, it's only intuitive because you already memorized what all the options and command lines do. I for one don't even remember how to use head, I just

    sed 15q

instead of head -n 15. And s/// is clearer than cut sometimes. If you're starting out with Unix, learning sed/awk is enough to do everything.
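For reference, these all print the first 15 lines:

    head -n 15 file.txt
    sed 15q file.txt
    awk 'NR <= 15' file.txt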


As someone who teaches this stuff, I don't think so? Down the line you can do memorization of quicker ways, but I think "memorization" and "intuition" are opposites here.

As in, starting the pipeline with "cat" every time makes more intuitive sense than taking the time to figure out which command to start with?


Sure: is egrep a | egrep b | egrep c more intuitive than awk '/a/ && /b/ && /c/'? Sure, until you need more control over the pipe, in which case you've already forked to awk. So why not go all the way and keep the operation as an awk expression? The great thing about it is, the more you use a tool, the faster you are at using it.


Wonderful. My old license plate was awk sed.


Completely off topic now, but there’s a local politician named “Cat Ping.” I figured she has the UNIX enthusiast vote wrapped up.


if you do an image search for Cat Ping, she faces some challenges gaining recognition

https://duckduckgo.com/?t=ffcm&q=Cat+Ping&iax=images&ia=imag...


As others here pointed out: this is bad awk.

I think good awk scripts lie somewhere between "This is atrociously overfit to this file" and "this is so general you should have done it in python/perl/etc".

Of course there are legitimate uses for both super-specific and super-general awk scripts, but finding the right compromise is what makes a good awk script. You want your script to be concise yet robust to changes in the input files. Also, readability and maintainability are really important if you plan to add it to an important script.

Short awk != good awk.


> This is really such a gorgeous piece of code. Clever and poetic.

I suppose that is subjective, but I think this is terrible code.

The fact that someone had to write a lengthy blogpost to figure out what it was doing should be an obvious warning against it.

Write code for readability / comprehensibility.


I understand the sentiment. As a software engineer I refrain from clever code especially if somebody else has to come and maintain it.

In this case, once you realize how the defaults stack, whoever came up with the one-liner has indeed come up with something clever. Also, the exercise of understanding the defaults cements your understanding of awk and, as others have pointed out, enables you to write much cleaner awk.


Too much text to explain what could have been explained clearly in 3-4 sentences. Making simple things complicated and then writing a long essay with dramatic phrases is a disservice to awk.


It's not about explaining the one-liner; it was more of an awk tutorial.


Awk strikes me as a marvelous text parsing tool that too few understand.


$0=$2 sounds a lot like my wallet recently...


takes 2 dollars and spends it!



