Hacker News

I was a sysadmin at a genetics company for a while, and learned quite a bit while there. This article is a very accurate picture of the current state of things.

I have two main takeaways for anyone curious about this:

1) Sequencing costs are getting lower, yes, but computation and data complexity keep growing as the scientists want to analyse more and more, especially those smart enough to realize the microbiome is where a holistic approach really pays off, and that means massive amounts of data. We are talking about petabytes over time, with 200 GB+ of data generated per day from just a few sequencers.
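A quick back-of-envelope from the figures above (illustrative numbers only, not from any specific lab):

```python
# Rough arithmetic behind "petabytes over time": a site generating
# ~200 GB/day from a few sequencers.
daily_gb = 200
yearly_tb = daily_gb * 365 / 1000      # TB produced per year
years_to_petabyte = 1000 / yearly_tb   # years for one such site to hit 1 PB

print(f"{yearly_tb:.0f} TB/year; ~{years_to_petabyte:.1f} years to 1 PB")
# → 73 TB/year; ~13.7 years to 1 PB
```

So a single small site takes over a decade to reach a petabyte, but dozens of sites (or higher-throughput instruments) get there much faster.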

2) The end goal, I think, is that eventually there will be a sequencer in every hospital, catching tons of diseases before the patient even experiences any symptoms. It's going to be great for healthcare, and not just clinically: healthcare also has tons of money floating around.

I think the really interesting developments will be among the sequencing machine manufacturers. Illumina, Roche 454, and Ion Torrent are all doing great things, and keeping each other in innovation mode is great for the industry.

I can't wait to see what comes of it all, plus my non-compete is finally over. There is a huge gap to be bridged between IT and the scientists, and I don't think many companies have done that very well.




Yes, the problem is that Moore's law is not keeping up with the sequencing data deluge. The cost limitations of genomics are now caused by data analysis, not data generation.

Personally, I think the unusual genetic history of humans is going to make it really difficult to reach the full potential of genomics. The recent population explosion and changes in selective pressures will make it nearly impossible to tease out the true linkages between human genes and phenotypes like disease risk.


There really isn't any crisis in the storage and analysis of sequencing data. The data isn't really that large: the sum total of all sequencer output is in the exabytes per year, but most of that data isn't retained, nor does it need to be retained after processing.

I founded the Google Cloud Genomics project, and at the time people were citing some very scary stats about the growth of sequencing data. After doing some analysis, I found that most people were concerned about retaining the original raw data files from the sequencer, usually saying they wanted to go back and reanalyze the data when the algorithms got better. But nobody wanted to pay the (commodity) storage costs to keep all that data around.

Well, the first thing to realize is that the quality scores (which represent about 75% of the sequencer data post-compression) were overly precise. Simple quantization techniques shrink the value space of quality scores tremendously, with no relevant effect on the resulting variant calls.
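For illustration, a minimal sketch of what that quantization looks like. The bin edges here are assumptions in the spirit of Illumina's published 8-level quality binning, not anyone's actual production scheme:

```python
# Each bin: (low, high, representative value). Collapsing ~40 distinct
# Phred scores down to 8 representatives makes quality strings far more
# compressible, which is where most of the storage win comes from.
BINS = [(0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
        (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 99, 40)]

def quantize_qual(q: int) -> int:
    """Map a Phred quality score to its bin's representative value."""
    for lo, hi, rep in BINS:
        if lo <= q <= hi:
            return rep
    raise ValueError(f"quality score out of range: {q}")

quals = [2, 11, 23, 31, 38, 40]
print([quantize_qual(q) for q in quals])  # → [6, 15, 22, 33, 37, 40]
```

The variant caller still sees roughly the right confidence for each base; it just no longer distinguishes, say, Q36 from Q38.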

The second thing to realize is that keeping tons of old BAM or FASTQ files around is typically wasteful. Few people go back and reanalyze, and even when they do, they get only marginally better results; in those cases it's better not to spend the money banking exabytes of raw data.

Next, once people get away from thinking they need to store exabytes of raw reads, the costs shift from "$X/year to store a bunch of archival data, plus a smaller amount for analysis" to "nearly all the money spent analyzing a relatively small (tens of petabytes) data set". That moves the bottleneck from storage to CPUs. I can assure you, Moore's law is still well ahead of sequencing data analysis: the sum of all genomic data processing is a tiny fraction of what it takes to run a medium-sized Google service. Most processing algorithms are fairly naive and unoptimized, but when I built processing pipelines on Google's BigQuery, I got results in minutes.
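To make the cost shift concrete, a back-of-envelope sketch. All of the prices and volumes here are assumptions picked to be roughly commodity-cloud-shaped, not actual Google (or anyone's) pricing:

```python
# Hypothetical scenario: archive an exabyte of raw reads vs. keep only
# a "tens of petabytes" processed working set in analysis-ready storage.
archive_pb = 1000                   # assumed: 1 EB of raw reads, in PB
working_pb = 20                     # assumed: processed working set, in PB
cold_usd_per_pb_month = 4_000       # assumed archival-tier price
hot_usd_per_pb_month = 20_000       # assumed analysis-ready price

archive_cost = archive_pb * cold_usd_per_pb_month * 12   # $/year
working_cost = working_pb * hot_usd_per_pb_month * 12    # $/year

print(f"raw archive: ${archive_cost/1e6:.0f}M/yr, "
      f"working set: ${working_cost/1e6:.1f}M/yr")
# → raw archive: $48M/yr, working set: $4.8M/yr
```

Under these assumptions the raw archive dwarfs the working set even at a 5x cheaper per-PB rate, which is why dropping the archive leaves compute as the dominant cost.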

As to your second point, I honestly don't know what the outcome of large-scale sequencing and disease-correlation analysis will be. It seems, as was suggested quite some time ago, that sequence data is very limited in its ability to produce actionable medical results for human populations, and will continue to be an ancillary tool, rather than a silver bullet, for the foreseeable future.





