It has to be noted that this is quite a strange approach all round: this framework reads all of the data into memory. So if you have a 100GB genome, it will read all 100GB into memory. Presumably it sits there uncompressed, so we are talking hundreds of GB to process even a single whole-genome sample.
This may indeed have some performance benefits, but it's a very impractical approach from a hardware point of view. Few places processing genomic data will have many compute nodes with > 256GB of memory, yet that would barely cover one sample with this framework. God forbid you have a family of samples or tumor/normal pairs to analyse and need several genomes in memory together.
Genome analyses are for the most part massively parallelisable, and nearly every other toolkit I have seen puts that first and foremost in its design. Ensuring that tools process data in a streaming manner and can pipe into each other is a basic expectation of most genomic data tools.
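To make the streaming point concrete, here is a minimal sketch (Python, not taken from any real tool) of the kind of filter that model assumes: it reads SAM records from stdin one line at a time, keeps only reads above a mapping-quality threshold, and writes the survivors to stdout, so memory use stays flat no matter how large the input is. The MIN_MAPQ value and the script itself are made up for illustration; only the SAM column layout is standard.

    #!/usr/bin/env python3
    # Hypothetical streaming filter: SAM records in on stdin, filtered
    # SAM records out on stdout. Nothing is buffered beyond one line.
    import sys

    MIN_MAPQ = 30  # arbitrary example threshold, not from any real pipeline

    def main() -> None:
        for line in sys.stdin:
            # Header lines (@HD, @SQ, ...) pass through untouched.
            if line.startswith("@"):
                sys.stdout.write(line)
                continue
            fields = line.split("\t")
            # SAM column 5 (0-based index 4) is the mapping quality.
            if int(fields[4]) >= MIN_MAPQ:
                sys.stdout.write(line)

    if __name__ == "__main__":
        main()

A filter like this slots into a shell pipeline, e.g. something like samtools view -h sample.bam | python3 mapq_filter.py | samtools view -b -o filtered.bam -, with each stage streaming records to the next rather than materialising the whole genome in memory.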
Which is all to say ... this is a very strange beast and I'm not sure a lot of conclusions can be drawn from it that generalise to other activities or approaches.
Untrue. Large-memory nodes are par for the course for genomics workloads. Only a few stages of the analysis pipeline can stream their input/output effectively. Even then, writing a result to disk only for it to be read straight back is going to bottleneck your pipeline.
We once asked our cluster department to give us bare-metal access to their smallest machine. They gave us a dedicated server with 256GB of RAM. The real cluster nodes are bigger than that because they handle multiple jobs at the same time, and there are hundreds of them.
This isn't some Electron app where it would be inexcusable to use that much RAM. The hardware available to scientists is more than capable of handling these workloads.