
Typically, any high-performance (low-latency or high-throughput) genomics/bioinformatics application is not going to be written in plain Python, except possibly for prototyping. Instead, nearly all codes today are written in C++ or Java, with some sort of command and control in Python or a DAG-based workflow scheduler.
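
To make the "command and control in Python" pattern concrete, here is a minimal sketch of a DAG-ordered controller. It assumes Python 3.9+ for the standard-library `graphlib` module. The step names and the `run_step` placeholder are hypothetical and not taken from any particular pipeline; a real controller would launch the compiled C++ or Java tool at that point.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: step name -> set of steps it depends on.
PIPELINE = {
    "align": set(),
    "sort": {"align"},
    "call_variants": {"sort"},
    "qc_report": {"sort"},
    "annotate": {"call_variants"},
}


def run_step(name):
    # Placeholder for the heavy lifting. A real controller would start a
    # compiled tool here (for example with subprocess.run) and check its
    # exit status; this sketch only records that the step was visited.
    return f"ran {name}"


def run_pipeline(dag):
    # static_order() yields every step after all of its dependencies.
    order = list(TopologicalSorter(dag).static_order())
    return [(step, run_step(step)) for step in order]


if __name__ == "__main__":
    for step, result in run_pipeline(PIPELINE):
        print(step, "->", result)
```

Python only decides the order here; the per-step work stays in whatever fast language the tool is written in. A production workflow scheduler would also add parallelism, retries, and caching of finished steps.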

I don't expect the community will adopt other languages at a large scale. My hope, though, is that more of these algorithms move to real distributed processing systems like Spark, to take advantage of all the great ideas in systems like that. But genomics will continue to trail the leading edge by about 20 years for the foreseeable future.




IMO, Spark isn't the way forward. The typical pattern is that it lets you scale up to 100 cores really easily, which is almost enough to compete with a good single-threaded implementation in a fast language.


100 cores? I forgot how to count that low.

The workflows I deal with generally involve moving hundreds of terabytes of data from storage into memory, processing it, and writing it out. Single machines (even beefy ones) tend to hit their limits (networking, max RAM, cache size, TLB, etc.).

Maybe there's a better tool than Spark, I don't know. The important thing is that Spark is the most ubiquitous.


I recall that the group that created Spark had a bioinformatics project on Spark, but I don't know what happened to it. All I could find now is a paper[1] hosted by Databricks.

[1] https://databricks.com/wp-content/uploads/2018/08/SSE15-40-D...


We're here, still plugging along.

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

https://github.com/bigdatagenomics/adam


Yep, that's the one I was thinking of (along with gnomAD, which IIRC uses ADAM or some similar tech). My main complaint with ADAM was that they came up with their own file format (which had some flaws). But the general idea is the right one.


I'm interested in chatting with you about this, and about genomics on Spark more generally. Feel free to reach out on GitHub or via my username at the usual suspects.


I left this field, actually. I cofounded Google Cloud Genomics, and when I proposed that we pivot from working with the GA4GH (very stupid APIs) to working with ADAM (real data processing) I got kicked off the team. Since then I've come to see genomics as a minefield of bad practices and don't really work in the field any more, except to help scientists run their workflows in the cloud.



