Using Awk and R to parse 25tb (2019) (livefreeordichotomize.com)
88 points by xrayarx 12 months ago | 9 comments



Reminds me of "Command-line Tools can be 235x Faster than your Hadoop Cluster (2014)" https://news.ycombinator.com/item?id=30595026


Awk is such a nice little tool! It doesn't even have to be an archaic one-liner akin to a Stack Overflow answer. You can write well-structured, easy-to-follow awk programs that use variables, sane conditional logic, matching functions, etc. You can do all of this by referencing the man page that you already have, and nothing else. It's a bit like something between bash and perl: enough functionality to accomplish non-trivial file-processing tasks, but not a fully featured programming language. Which is perfect when it's perfect.
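
For instance, here's a minimal sketch of that kind of "structured" awk program (the log format is hypothetical: HTTP status in field 9, response size in field 10), just to show variables, a user-defined function, and pattern conditions:

    #!/usr/bin/awk -f
    # Hypothetical example: tally requests and total bytes per HTTP status.
    # Usage (made-up file name): awk -f stats.awk access.log

    function human(bytes) {                 # user-defined helper
        if (bytes >= 1048576) return sprintf("%.1f MB", bytes / 1048576)
        if (bytes >= 1024)    return sprintf("%.1f KB", bytes / 1024)
        return bytes " B"
    }

    # Only count lines whose status field is a three-digit number
    $9 ~ /^[0-9][0-9][0-9]$/ {
        count[$9]++
        size[$9] += $10
    }

    END {
        for (status in count)
            printf "%s  %6d requests  %s\n", status, count[status], human(size[status])
    }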


I'm hoping the availability of LLM code generation will encourage people to try more things in awk (myself included!).


Give it a go! For a fun starter example, I recently wrote an awk program to add or replace a key/value pair in a .env file. Easy enough to do with a one-liner, but it was a fun little mid-day code challenge! The tricky part is the “add if not present”; the END block of an awk program comes in handy there :)

IRL I ended up moving it all into a Go CLI since there were other complicated moving parts, but just that bit was completely doable with pure (and clear) awk!
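
A minimal sketch of that add-or-replace idea in pure awk (the key/val variable names and the -v flags are illustrative, not the original script):

    # Hypothetical usage: awk -v key=API_URL -v val=example.com -f setkey.awk .env
    BEGIN { FS = OFS = "="; found = 0 }

    # If the part before the first "=" matches the key, print the new pair instead
    $1 == key { print key, val; found = 1; next }

    { print }    # everything else passes through unchanged

    # The "add if not present" bit: append the pair only if it was never replaced
    END { if (!found) print key, val }

(Redirect to a temp file and move it over the original, since plain awk doesn't edit in place.)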


Querying by rsid is clearly a bad idea. You want to partition by chromosome (and sample ID in this case) and sort by position. When looking for a given SNP, the parquet reader will go through the metadata and only read the data page that contains the given position. Unless your pages are huge, read time will be super small (and cost-efficient, since you don't fetch too much data).

Since the data is static, I would want to try storing all the sample data and metadata in arrays. (For non-static data you can't do that, because you won't be able to edit the arrays later - you can only add new rows to the parquet files.)

I'm not really sure I understand what the author is doing; it sounds like he wanted to sort by position but failed to do so and decided to bin instead? I agree that awk is very useful for this kind of problem.


I'm a database guy, so everything looks like a database problem to me, but I'm not sure how this would fit in (as I'm completely unfamiliar with the data used here). Can anyone more knowledgeable than me suggest whether a database on a conventional server with some decent RAM and a bunch of SSDs would have worked, and perhaps been cheaper?

(Edit: OK, SSDs in 2019 might not have been affordable, but spinny disks were cheap and still pretty fast.)


Related:

Using AWK and R to parse 25TB - https://news.ycombinator.com/item?id=20293579 - June 2019 (104 comments)

Recent and also related:

Exploratory data analysis for humanities data - https://news.ycombinator.com/item?id=37792916 - Oct 2023 (38 comments)


Heh, using lowercase doesn't make 25TB any smaller. :-)

Unless one meant bits vs bytes.


Now that DuckDB has S3 support, I guess a Linux box, DuckDB, and some light SQL'ing is all you need?



