I'm a huge fan of awk but the "Python vs awk" page this links to [1] shows python code that's almost deliberately atrocious.
Take this function the author wrote for converting a list of integers (or strings) into floats:
    def ints2float(integerlist):
        for n in range(0,len(integerlist)):
            integerlist[n]=float(integerlist[n])
        return integerlist
Using `range(0,len(integerlist))` immediately betrays that the author doesn't understand python. The first arg to `range` is entirely redundant. Mutating the input list like this is also just bad design. Anyone who has used python for longer than a month would write this as just `[float(i) for i in integerlist]`.
Further down, in the function `format_captured`, you see another attempt at obfuscation, this time with a `filter`. Why bother with a `filter`? Who hurt you? That said, the author's implementation in awk does look pretty clean. I'm just peeved that they straw-manned the other language.

[1] https://pmitev.github.io/to-awk-or-not/Python_vs_awk/
> python code that's almost deliberately atrocious
That code was so bad I felt I had to step in too. I used ChatGPT to simplify it a bit, but it also introduced some errors, so I found what appears to be an input file to test it on [1]. The only difference from the awk program is that it uses spaces, while the original python program used tabs.
    #!/usr/bin/env python3
    import sys
    freq, fc, ir = [], [], []
    with open(sys.argv[1]) as f:
        for line in f.readlines():
            words = line.split()
            if "Frequencies" in line:
                freq.extend(words[2:])
            elif "Frc consts" in line:
                fc.extend(words[3:])
            elif "IR Inten" in line:
                ir.extend(words[3:])
    for i in range(len(freq)):
        print(f"{freq[i]}\t{fc[i]}\t{ir[i]}")
Agreed - this is pretty much the perfect use case for list comprehensions, which are one of the best features of Python. Normally "oh but there's a better way to do it in that language" isn't a particularly interesting observation, but here it completely turns the author's point on its head. I can't think of many more elegant ways to convert a list of ints to floats, in any language, than `[float(i) for i in integerlist]`.
I prefer the Python syntax generally, but throw in some typing (with TypeScript inferring the post-map type from the map function's return type) and I'd definitely go with `integerlist.map(float)`.
I wanted to point out that the Python code was written to be 2.7 compatible, and maybe the atrociousness was due to that, but then I looked up when list comprehensions were introduced - 2.0, with PEP202.
Looks like he copy-pasted the python version from another forum post and didn't look at it carefully. I'd suspect it can be made to look a lot cleaner (edit: yes, e.g. by just translating each of the main lines in the awk script to an if statement). I agree with the strawman comment.
I don’t think you’re right to put Awk into the one-liner category. It actually scales up remarkably well to a couple hundred lines or so, as long as the problem does not strain its anaemic data-structure capabilities.
Compared to straight Python (i.e. not Numpy/pandas), it can also be surprisingly fast[1]. I experienced this personally on a trivial problem: take a TSV of (I,J) pairs, histogram I/J into a given number of bins. I can’t remember the exact figures now, but it went like this: on a five-gig file, pure Python is annoyingly slow, GNU awk is okayish but I still have to wait, mawk is fast enough that I don’t wait for it, C is of course still at least an order of magnitude faster but at that point it doesn’t matter.
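(For flavour, here is a rough sketch of that kind of binning in awk. It is not my original script: it assumes tab-separated i/j pairs, j never zero, and the bin count and upper bound passed in as variables.)

    mawk -F'\t' -v nbins=50 -v hi=10 '
        {
            # bin index for the ratio i/j, clamped into [0, nbins)
            b = int(($1 / $2) * nbins / hi)
            if (b < 0) b = 0
            if (b >= nbins) b = nbins - 1
            count[b]++
        }
        END {
            # print the lower edge of each bin and its count
            for (i = 0; i < nbins; i++)
                printf "%g\t%d\n", i * hi / nbins, count[i]
        }
    ' pairs.tsv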
Oh, I agree with you and the article that Awk is a wonderful scripting language. It is just that personally I use it mostly for one-liners (even wrote a book) and prefer Python for longer scripts.
If I have a text file of CSV data that I need to do something with, I'll usually start with a shell script and coreutils + sed and awk.
If I need a script to generate some output (a regular use case that seems to come up for me is random numbers with some property), I tend to use python.
I also use python if I need to do more complicated aggregations or something on tabular data that pandas is better at. Though it's fun to try with `join` and awk sometimes (parsing csv can get tricky).
If I need to plot something I tend to use jupyter notebook but it's way more satisfying to use gnuplot, which I mention because it fits naturally into workflows that use shell tools like awk.
Honestly for what you describe R might be quite comfortable. pandas was built to try and get R sensibilities into python, so you might as well drop the dependency and use the real deal all in one place.
I'm glad you mention that because I forgot something: I try and stick with only stuff I can find on most computers in order to minimize dependencies. That's a big advantage imo of learning how to use awk and tr for basic data analysis, for example. You can ssh into any server and work without installing anything.
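(For example, and with a made-up file name, a quick word-frequency count needs nothing beyond tools that are already on the box:)

    tr -s '[:space:]' '\n' < access.log |
        awk '{ count[$0]++ } END { for (w in count) print count[w], w }' |
        sort -rn | head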
I don't like this approach. One is being peddled a bucket of old shit that might have been okay in the '80s for portability reasons. This portability is usually not necessary for my one-time tasks anyway. Today there are really cool and small tools to do everything with comfort, and if you want some kind of portability you can still put statically linked tools on the given system quickly. But of course that is a personal opinion and depends on the situation.
Really depends on your use case though. If you're always going to be working on your own machine, of course customize it to make yourself maximally productive. But if you're regularly going into new situations, it's good to know how to be effective with tools you can count on being there.
My (admittedly very lame) analogy: you're hiring a ninja assassin and have two candidates. One tells you about all the special swords and staffs and smoke bombs he carries, the other says I just need my hands. Who do you hire?
I have a simple thought about this. If your work involves many machines, you usually have enough permissions to transfer your programs. I think that the benefits of a pleasant way of working outweigh the effort.
And by the way, I would prefer the first ninja if the second one is handless.
I'm surprised with some of the responses you've received on this.
But I absolutely know what you are talking about. I often SSH into new machines, that I may not have root access on, that may not even have internet access, or be a distro with a package manager (e.g. some switch running a custom distribution).
In those situations it is a huge advantage to know the tool, rather than try to do some gymnastics to get your tools onto the box or the data off the box.
These "old" and "broken" tools are installed as defaults on all these systems for a reason.
I agree 99%, but sometimes a tool is such a huge productivity boost that it’s worth installing. Fzf is one of those tools for me. If I had to deal with CSV files a lot, then xsv would be worthwhile.
Here is a script that I use, mainly composed with the Toybox version of awk, that will extract all of the WiFi passwords stored on an Android device.
This is enormously portable, and does not require any new software installations for Android users who have root.
Requiring xsv would reduce availability to a fraction of where it can be run now.
    #!/bin/sh

    find /data \
        -name WifiConfigStore.xml \
        -print0 |
    xargs -0 awk '
        /"SSID/ { s = 1 }
        /PreShared/ { p = 1 }
        s || p {
            gsub(/[<][^>]+[>]/, "")
            sub(/^[&]quot;/, "")
            sub(/[&]quot;$/, "")
            gsub(/[&]quot;/, "\"")
            gsub(/[&]amp;/, "\\&")
            gsub(/[&]lt;/, "<")
            gsub(/[&]gt;/, ">")
        }
        s { s = 0; printf "%-32.32s ", $0 }
        p { p = 0; print }
    ' | sort -f
Note that the -print0 null processing is not POSIX. This is a reasonable compromise on standards compliance, as it does not reduce the base of available users.
I did try to do this first with arrays, but awk segfaulted.
That is quite a nifty implementation of reverse HTML escaping. But in python it could be done with much less work:

    import html
    print(html.unescape(foo))

And the best part: you don't need to debug/update the (g)sub list every time you stumble upon a new weird &whatever; entity. And there are a lot of those out there.
XSV is a tool for exploring and manipulating X-separated value files (CSV, TSV, …), which is why I mentioned it in reply to a comment which talks about the exploration of CSV files, and furthermore specifically mentions that parsing CSV can get tricky.
Awk, sed, bash, and perl are extremely underrated, and they nearly always beat python for elegance and succinctness on the repetitive problems that make up the daily chores of sysadmin work.
Then you know neither awk nor sed. Most of awk can be redone in Perl with ease, and for the regex matching and replacing of sed, PCRE and Perl were practically made for that.
I mean, sometimes sed is better for small edits, and awk for some tabular files, using xml2tsv or lots of TSV-related tools.
But for medium-sized projects, Perl is the obvious tool when you need something similar to awk/sed but with more complex data parsing.
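(To illustrate the overlap, with made-up file names: the basic substitution is near-identical in both, and Perl only really pulls ahead once you want PCRE features like non-greedy matches.)

    sed 's/foo/bar/g' notes.txt
    perl -pe 's/foo/bar/g' notes.txt

    # non-greedy matching is where basic sed regexes run out of road
    perl -pe 's/<a href="(.*?)">/$1/g' page.html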
> sometimes sed is better for small edits, and awk for some tabular files, using xml2tsv or lots of TSV-related tools.
That, and shell, sed and awk are standard tools which are easy to learn, whose basics every unix user should know (even a java developer), and this won't change anytime soon.
However, this can't be said of Perl - it is a powerful tool, but it became culturally obsolete and deprecated, and most unix users in the present and the future won't be bothered to learn it, when learning more modern languages like Python is a better investment.
A long time ago, I built a relatively complex program that managed some other systems in awk. It was really a great fit for the problem and I was, at the time, working in an environment with poor developer tooling. The target systems were heterogeneous and I could not depend on Perl even being available. But awk was guaranteed to be there.
The problem was that every time someone else was asked to add any feature to it, they freaked out at the language choice and I had to get on a plane.
If you find yourself piping any combination of cut, grep, sed, uniq (and likely others I'm missing) together, you can probably do it all in awk. If you can guarantee usage of gawk, you can add sort to that list (tbf you can also implement any algorithm you want in awk, but arguably at that point you're wasting time) - and it's also worth noting that you can dedupe in awk _without_ having sorted input, albeit at the cost of storing all unique lines in memory.
Pipes are great because they enable you to trivially send data between programs, and they're terrible for the same reason. While the execution time on modern computers for the average data size isn't noticeable, on larger datasets or repeated execution, it absolutely is. If you don't have to pipe, don't.
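(For instance, with a hypothetical log file and field: a filter-project-dedupe chain collapses into a single process, using the unsorted-dedupe idiom mentioned above.)

    # three processes and two pipes
    grep 'ERROR' app.log | awk '{ print $2 }' | sort -u

    # one awk: same set of values, in first-seen order rather than sorted,
    # with every unique value held in memory
    awk '/ERROR/ && !seen[$2]++ { print $2 }' app.log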
Awk is actually amazing as long as you operate within its limitations and stick to the core problems it solves well: it is definitely not limited to one-liners, but it does have other limits.
It is just so damn good at the things it is good at, but nobody learns to use it, because learning Awk is inefficient in the grand scheme of things. There are better things you can learn.
There is one place where Awk has undeniable superiority, and that is its use in environments where bureaucratic rules prohibit the distribution of programs / code (Perl, Python), but where Awk is permitted.
My case for learning AWK is if you spend any significant chunk of time in the terminal (sysadmin, DevOps, data eng). Notwithstanding any other scripting languages you know.
The CLI world is full of bespoke text interfaces. AWK is the tool for dealing with those programmatically.
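A typical example (column positions assumed from GNU coreutils df; other platforms lay the columns out differently): flag filesystems that are more than 80% full.

    df -h | awk 'NR > 1 && $5 + 0 > 80 { print $6 "\t" $5 " used" }'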
In 1996 I was paid to teach Unix to a group of customers. As graduation work for my course, they had to write an awk program for budgeting without the use of any databases. I'm sure they had a lot of fun and cursed me for this sadistic approach to teaching at the time. However, decades later some of them were still thankful they had learned regex and awk back then.
PDF / CSV / Excel exports from my three web banks, a bit of pdftotext or soffice conversion, just to pipe into awk to augment the data and render a properly formatted spreadsheet.
I found the awk syntax to be surprisingly discoverable, once I got the rough structure of scripts.
I think the confusing factor with awk is that it allows you to leave out various levels of structure in the really simple scripts, meaning that equivalent scripts you see around can look quite different.
E.g. all the following would be the same (looking for the string "something" in column 1, and printing only those lines):
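(Illustrative one-liners to that effect; "file" stands in for the input:)

    awk '$1 ~ /something/ { print $0 }' file    # everything spelled out
    awk '$1 ~ /something/ { print }' file       # print defaults to printing $0
    awk '$1 ~ /something/' file                 # a pattern with no action defaults to { print }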
Awk syntax is basically what became core Javascript, according to its creator [1]. Bourne shell syntax is very different, so I take your comment as a frustrated reaction to the "Python obsolete" comment, which must be seen in the context of Python introducing itself as an awk replacement, among other things (though not nearly as aggressively as Perl, which used to have the a2p script to bogusly convert awk to Perl code).
I can agree on bash syntax being crazy, but certainly not on awk. Awk is very simple, a man page is all you need if you need to find something. Otherwise what's so complex with awk?
If you don't like learning from a man page (understandable), I can recommend a quick read of the Awk book by A, K, and W. It's very short and has fun examples.
For larger programs, I wrote and use ngetopt.awk: https://github.com/joepvd/ngetopt.awk. This is a loadable library for gawk that lets you add option parsing to your programs.
This can be a very powerful idiom (basically, code generation at the shell prompt).
It’s well suited to iterative composition of the commands: I’ll write the query/find part, and (with ctrl P) add the awk manipulations, and then pipe to sh.
If it doesn’t have side effects you can pass through “head” before “sh” to check syntax on a subset.
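Something like this, say (a made-up find/awk pipeline; it assumes filenames without spaces):

    # run only the first few generated commands as a sanity check
    find . -name '*.log' -type f | awk '{ print "wc -l " $0 }' | head -n 3 | sh

    # happy with the output? drop the head
    find . -name '*.log' -type f | awk '{ print "wc -l " $0 }' | sh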
It is a guilty pleasure but I like writing awk scripts that write shell scripts that get piped into sh, for example
ls | awk '{if (length($1)==7) {print "cat " $1 }}' | sh
It is something you really aren't supposed to do, because bad inputs could be executed by the shell. Personally, the control structures for bash never stick in my mind because they are so unlike conventional programming languages (and I only write shell scripts sporadically), so I have to look them up in the info pages each time. I could do something like the above with xargs, but same thing: I find it painful to look at the man page for xargs.
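(For reference, the xargs version of the example above is roughly this, assuming simple filenames without spaces:)

    ls | awk 'length($1) == 7' | xargs cat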
When I show this trick to younger people it seems most of them aren't familiar with awk at all.
For me the shell is mostly displaced by "single file Python" where I stick to the standard library and don't pip anything. For simple scripting it can be a little more code than bash, but there is no cliff where things get more difficult, and I code Python almost every day, so I know where to find everything in the manual that isn't at my fingertips.
That’s not so terrible if you at least verify the output before the final “| sh”.
Though you’d have to be confident that running it twice is going to give the same results. If it’s remote data that could change then weird/bad/nasty things could happen.
For anything non-trivial, it's best to separate those steps and generate a temp script to execute.
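E.g., with hypothetical paths: write the generated commands to a file, read them over, then run the file.

    ls | awk 'length($1) == 7 { print "cat " $1 }' > /tmp/generated.sh
    less /tmp/generated.sh    # eyeball the generated commands first
    sh /tmp/generated.sh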
Once I wrote a 300-line Awk script to install a kernel driver. It would scan a hardware bus, and ask the user questions before loading the driver onto the system. Lots of fun!
The existence of things like "git implemented in awk"* serves as a great reminder that there are a lot of developers out there who are far, far, far, far more talented than I will ever be. I salute them.
*Especially keeping in mind that these people wrote things like this for fun.
I asked ChatGPT to explain that and "prettify" it, for anyone interested (everything below from ChatGPT):
This is an AWK script that serves HTTP requests by listening on port 8080.
It defines several functions:
1. `send` function takes in status code, status message, content, content type, and content length and sends an HTTP response with the provided information.
2. `cf` function checks if a path contains `..` and returns 0 if it does, otherwise returns 1.
3. `mt` function determines the MIME type of a file using the `file` command.
The script sets the record separator RS and output record separator ORS to \r\n which is the line ending used in HTTP.
It enters an infinite loop listening for incoming connections on port 8080. When a connection is established, it reads the HTTP request line by line using `getline` and processes the `GET` request. If no path is provided, the script serves `index.html`. It checks if the requested path is safe using the `cf` function, and if it is a file, reads the file and sends an HTTP response using the `send` function. If the path is not valid or the file is not found, it sends a "404 Not Found" response.
    gawk '@load "filefuncs"
    @load "readfile"

    function send(s, e, d, t, b) {
        print "HTTP/1.0 " s " " e |& S
        print "Content-Length: " b |& S
        print "Content-Type: " t |& S
        print d |& S
        close(S)
    }

    function cf(x) {
        split(x, y, "/")
        for (z in y) {
            print "FOUND " y[z]
            if (y[z] == "..") {
                return 0
            }
        }
        return 1
    }

    function mt(f) {
        c = "file -b --mime-type " f
        r = ""
        while ((c | getline z) > 0) {
            r = r z
        }
        close(c)
        return r
    }

    BEGIN {
        # Change to the specified directory
        if (ARGV[1] != "") {
            if (chdir(ARGV[1])) {
                print "Failed to chdir to " ARGV[1]
                exit
            }
            ARGC = 1
        }

        # Set the record separator and output record separator
        RS = ORS = "\r\n"

        # Listen for incoming connections
        while (1) {
            S = "/inet/tcp/8080/0/0"
            while ((S |& getline l) > 0) {
                split(l, f, " ")
                if (f[1] == "GET") {
                    p = substr(f[2], 2)
                }
                if (p == "") {
                    p = "index.html"
                }
                stat(p, s)
                if (cf(p) && s["type"] == "file") {
                    m = mt(p)
                    o = readfile(p)
                    send(200, "OK", o, m, s["size"])
                    break
                }
                n = "<html>Not Found</html>"
                send(404, "Not Found", n, "text/html" RS, length(n))
                break
            }
        }
    }'
I added some comments to explain what each section of the code does. Let me know if you have any questions!
Looks like ChatGPT got it right with the description. Perhaps obvious, but the ".." check is to stop path traversal. Also, if you pass an additional arg, it takes that as a directory for the script to serve over http, defaults to CWD. The rest seems like enough explanation.
You can also pretty-print gawk source with -o[filename], using - as the filename to print to stdout; so you can take the one-liner version, lead with "gawk -o- ", and it will print the pretty version.
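E.g. (a throwaway one-liner just for illustration):

    # prints a formatted, multi-line version of the program to stdout
    gawk -o- 'BEGIN { x = 1; if (x) print "yes"; else print "no" }'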
Edit: Also, the performance is terrible. Gawk's listen sockets suck because you get no granular control over listen/accept. The socat based one I replied to is probably much better.