Awk is such a nice little tool! It doesn’t even have to be an arcane one-liner akin to a Stack Overflow answer. You can write well-structured, easy-to-follow awk programs that use variables, sane conditional logic, matching functions, etc. You can do all of this by referencing the man page that you already have, and nothing else. It’s a bit like something between Bash and Perl: enough functionality to accomplish non-trivial file processing tasks, but not a fully featured programming language. Which is perfect when it’s perfect.
Give it a go! For a fun starter example, I recently wrote an awk program to add-or-replace a key/value pair in a .env file. Easy enough to do with a one-liner, but it was a fun little mid-day code challenge! The tricky part is the “add if not present”; the END part of an awk program comes in handy here :)
IRL I ended up moving it all into a Go CLI since there were other complicated moving parts, but just that bit was completely doable with pure (and clear) awk!
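For anyone curious, here is a minimal sketch of what such an add-or-replace could look like. It is not the original program (which isn't shown here); the `upsert_env` wrapper name is made up for illustration, and it assumes plain `KEY=value` lines with no `export` prefix or quoting rules to honour.

```sh
# upsert_env KEY VALUE FILE: print FILE with KEY=VALUE replaced, or appended if absent.
# Hypothetical helper; assumes simple KEY=value lines.
# Note: awk -v interprets backslash escapes in the assigned values.
upsert_env() {
    awk -v key="$1" -v val="$2" '
        BEGIN { FS = "="; found = 0 }
        # The key is the text before the first "="; replace the whole line on a match.
        $1 == key { print key "=" val; found = 1; next }
        # Every other line passes through unchanged.
        { print }
        # END runs after the last line: append the pair if it was never seen.
        END { if (!found) print key "=" val }
    ' "$3"
}
```

Since it writes to stdout, something like `upsert_env API_KEY secret .env > .env.tmp && mv .env.tmp .env` would update the file in place.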
Querying by rsid is clearly a bad idea. You want to partition by chromosome (and sample-id in this case) and sort by position. When looking for a given SNP, the Parquet reader will use the metadata to read only the data page that contains the given position. Unless your pages are huge, read time will be super small (and cost-efficient since you don't fetch too much data). Since the data is static I would want to try storing all the sample data and metadata in arrays. (For non-static data you can't do that because you won't be able to edit the arrays later; you can only add new rows to the Parquet files.)
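The layout idea (partition by chromosome, sort by position) can be sketched with plain text files. This is only an analogue and not Parquet itself: there are no pages or metadata here, and the `partition_snps` name and the headerless tab-separated `chrom, pos, rsid` column layout are assumptions made for illustration.

```sh
# partition_snps INPUT OUTDIR: write one file per chromosome, rows sorted by position.
# Hypothetical helper; assumes headerless TSV with columns: chrom, pos, rsid.
partition_snps() {
    mkdir -p "$2"
    # Sort by chromosome (as text), then by position (numerically).
    sort -t "$(printf '\t')" -k1,1 -k2,2n "$1" |
    awk -F '\t' -v dir="$2" '
        # When the chromosome changes, close the previous file and pick the next one.
        $1 != cur { if (out != "") close(out); cur = $1; out = dir "/chr" $1 ".tsv" }
        # Route the row to the file for its chromosome.
        { print > out }
    '
}
```

A lookup then only has to open one chromosome's file, and because rows are ordered by position it can stop as soon as it passes the target position.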
I am not really sure I understand what the author is doing. It sounds like he wanted to sort by position but failed to do so and decided to bin instead?
I agree that Awk is very useful for this kind of problem.
I'm a database guy so everything looks like a database problem to me, but I'm not sure how this would fit in (as I'm completely unfamiliar with the data used here). Can anyone more knowledgeable than me suggest whether a database, on a conventional server with some decent RAM and a bunch of SSDs, would have worked and perhaps been cheaper?
(Edit: OK, SSDs in 2019 might not have been affordable but spinny disks were cheap and still pretty fast)