Hacker News

I was a sysadmin at a genetics company for a while, and learned quite a bit while there. This article is a very accurate picture of the current state of things.

I have two main takeaways for anyone curious about this:

1) Sequencing costs are getting lower, yes, but computation and data complexity keep growing as the scientists want to analyse more and more, especially those smart enough to realize the microbiome is where a holistic approach really pays off, and that means massive amounts of data. We are talking about petabytes over time, with 200 GB+ of data generated per day from just a few sequencers.
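A quick back-of-envelope from the figures above (illustrative numbers only, not from any specific lab):

```python
# Rough arithmetic behind "petabytes over time": a site generating
# ~200 GB/day from a few sequencers.
daily_gb = 200
yearly_tb = daily_gb * 365 / 1000      # TB produced per year
years_to_petabyte = 1000 / yearly_tb   # years for one such site to hit 1 PB

print(f"{yearly_tb:.0f} TB/year; ~{years_to_petabyte:.1f} years to 1 PB")
# → 73 TB/year; ~13.7 years to 1 PB
```

So a single small site takes over a decade to reach a petabyte, but dozens of sites (or higher-throughput instruments) get there much faster.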

2) The end goal, I think, is that eventually there will be a sequencer in every hospital, catching tons of diseases before the patient even experiences any symptoms. It's going to be great for healthcare, and not just clinically: healthcare also has tons of money floating around.

I think the really interesting developments will be among the sequencing machine manufacturers. Illumina, Roche 454, and Ion Torrent are all doing great things, and keeping each other in innovation mode is great for the industry.

I can't wait to see what comes of it all, plus my non-compete is finally over. There is a huge gap to be bridged between IT and the scientists, and I don't think many companies have done that very well.




Yes, the problem is that Moore's law is not keeping up with the sequencing data deluge. The cost limitations of genomics are now caused by data analysis, not data generation.

Personally, I think the unusual genetic history of humans is going to make it really difficult to reach the full potential of genomics. The recent population explosion and changes in selective pressures will make it nearly impossible to tease out the true linkages between human genes and phenotypes like disease risk.


There really isn't any crisis in the storage and analysis of sequencing data. The data isn't really that large: the sum total of all sequencer output is in the exabytes per year, but most of that data isn't retained, nor does it need to be retained after processing.

I founded the Google Cloud Genomics project, and at the time people were citing some very scary stats about the growth of sequencing data. After doing some analysis, I found that most people were concerned about retaining the original raw data files from the sequencer, usually saying they wanted to go back and reanalyze the data when the algorithms got better. But nobody wanted to pay the (commodity) storage costs to keep all that data around.

Well, the first thing to realize is that the quality scores (which represent about 75% of the sequencer data post-compression) were overly precise. Simple quantization techniques shrink the value space of quality scores tremendously, with no relevant effect on the resulting variant calls.
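For illustration, a minimal sketch of what that quantization looks like. The bin edges here are assumptions in the spirit of Illumina's published 8-level quality binning, not anyone's actual production scheme:

```python
# Each bin: (low, high, representative value). Collapsing ~40 distinct
# Phred scores down to 8 representatives makes quality strings far more
# compressible, which is where most of the storage win comes from.
BINS = [(0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
        (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 99, 40)]

def quantize_qual(q: int) -> int:
    """Map a Phred quality score to its bin's representative value."""
    for lo, hi, rep in BINS:
        if lo <= q <= hi:
            return rep
    raise ValueError(f"quality score out of range: {q}")

quals = [2, 11, 23, 31, 38, 40]
print([quantize_qual(q) for q in quals])  # → [6, 15, 22, 33, 37, 40]
```

The variant caller still sees roughly the right confidence for each base; it just no longer distinguishes, say, Q36 from Q38.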

The second thing to realize is that keeping tons of old BAM or FASTQ files around is typically wasteful. Few people go back and reanalyze, and even when they do, they get only marginally better results; in those cases it's better not to spend the money banking exabytes of raw data.

Next, once people get away from thinking they need to store exabytes of raw reads, the costs shift from "$X/year to store a bunch of archival data, plus a smaller amount for analysis" to "nearly all the money spent analyzing a relatively small (tens of petabytes) data set". That moves the bottleneck from storage to CPUs. I can assure you, Moore's law is still well ahead of sequencing data analysis: the sum of all genomic data processing is a tiny fraction of what it takes to run a medium-sized Google service. Most processing algorithms are fairly naive and unoptimized, but when I built processing pipelines on Google's BigQuery, I got results in minutes.
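To make the cost shift concrete, a back-of-envelope sketch. All of the prices and volumes here are assumptions picked to be roughly commodity-cloud-shaped, not actual Google (or anyone's) pricing:

```python
# Hypothetical scenario: archive an exabyte of raw reads vs. keep only
# a "tens of petabytes" processed working set in analysis-ready storage.
archive_pb = 1000                   # assumed: 1 EB of raw reads, in PB
working_pb = 20                     # assumed: processed working set, in PB
cold_usd_per_pb_month = 4_000       # assumed archival-tier price
hot_usd_per_pb_month = 20_000       # assumed analysis-ready price

archive_cost = archive_pb * cold_usd_per_pb_month * 12   # $/year
working_cost = working_pb * hot_usd_per_pb_month * 12    # $/year

print(f"raw archive: ${archive_cost/1e6:.0f}M/yr, "
      f"working set: ${working_cost/1e6:.1f}M/yr")
# → raw archive: $48M/yr, working set: $4.8M/yr
```

Under these assumptions the raw archive dwarfs the working set even at a 5x cheaper per-PB rate, which is why dropping the archive leaves compute as the dominant cost.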

As to your second point, I honestly don't know what the outcome of large-scale sequencing and disease-correlation analysis will be. It seems, as was suggested quite some time ago, that sequence data is very limited in its ability to produce actionable medical results for human populations, and will continue to be an ancillary tool, rather than a silver bullet, for the foreseeable future.





