Typically, any high performance (low latency or high throughput) genomics/bioinf...

adgjlsfhk1 · on Sept 15, 2021

IMO, Spark isn't the way forward. The typical pattern is that it lets you scale up to 100 cores really easily, which is almost enough to compete with a good single-threaded implementation in a fast language.

dekhn · on Sept 15, 2021

100 cores? I forgot how to count that low.

The workflows I deal with generally involve moving hundreds of terabytes of data from storage into memory, processing it, and writing it out. Single machines (even beefy ones) tend to hit their limits (networking, max RAM, cache size, TLB, etc.).

Maybe there's a better tool than Spark, I don't know. The important thing is that Spark is the most ubiquitous.

east2west · on Sept 15, 2021

I recall that the group that created Spark had a bioinformatics project on Spark, but I don't know what happened to it. All I can find now is a paper [1] hosted by Databricks.

[1]https://databricks.com/wp-content/uploads/2018/08/SSE15-40-D...

heuermh · on Sept 16, 2021

We're here, still plugging along.

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

https://github.com/bigdatagenomics/adam

dekhn · on Sept 16, 2021

Yep, that's the one I was thinking of (along with gnomAD, which IIRC uses ADAM or some similar tech). My main complaint with ADAM was that they came up with their own file format (which had some flaws). But the general idea is the right one.

heuermh · on Sept 16, 2021

I'm interested in chatting with you about this, and about genomics on Spark more generally. Feel free to reach out on GitHub or via my username at the usual suspects.

dekhn · on Sept 16, 2021

I left this field, actually. I cofounded Google Cloud Genomics, and when I proposed that we pivot from working with the GA4GH (very stupid APIs) to working with ADAM (real data processing) I got kicked off the team. Since then I've come to see genomics as a minefield of bad practices and don't really work in the field any more, except to help scientists run their workflows in the cloud.