Hacker News
Game of Genomes (statnews.com)
87 points by dsr_ on July 11, 2016 | hide | past | favorite | 19 comments



I strongly recommend any biology articles by Carl Zimmer. He is very good at providing an approachable perspective for non-scientists, even on complex topics. In my experience, he is one of the rare journalists covering science who does not make scientists cringe.


I did cringe. This is my field. (I know virtually all of the people he's been talking with.)

The descriptions are accurate, but murky. One huge missing piece is a description of the structure of the genome. There is one mention of the word chromosome. The words haplotype and allele never make an appearance. There is no mention of the fact that we are diploid, nor that one copy of our chromosomes comes from each of our parents. Nor is there any hint as to the way that we share our genomes.

Maybe I'm allergic to assertions of novelty, but I don't enjoy his claim that he's the first journalist to get access to their "raw" genomic data. That's just being fancy. Is it necessary to say that? In any case, what's "raw" changes every few years, so someone else will soon get the chance to say the same thing. (I gripe, but I have to admit that it's nerdy/cool to see someone writing so much about BAM files :) !)

I'm still waiting for the first journalist to mention genome structure in a popular article on genomics. As badly as we need to understand what's going on in the genome, the public, and particularly professionals whose work brings them to articles like this, need to develop an intuition about how genomes work.


>I did cringe. ... The descriptions are accurate, but murky.

I agree that the article does not have the precision that biologists expect; however, I think broad accuracy is important when writing for the general public and sadly lacking in most science reporting.

I attended a conference on evolution and medicine last month where Zimmer spoke about how he approaches reporting science and the struggle to cover important, but complex ideas for a general audience. Having heard his side of the story, I may be a bit biased but I would love to hear recommendations for other science journalists that tend to get the story right.


The article touches on, or at least hints at, ploidy. The author talks briefly about being a carrier of a disease.

But as you state, it is devoid of the 4D structure leading to the central dogma of molecular biology (DNA->RNA->Protein->function). This might be the natural discourse between science and the public. NGS technology didn't really explode until 2005. So roughly ten years to get the media to start talking about re-sequencing experiments. I don't know the normal length of time before trends in the scientific literature bleed into general mass media, but I would think it would be at least several years, if not a decade.

With that said, genome structural variation (which the author technically foreshadows with insertions/deletions) and its spatial effects on RNA expression are still a large mystery in the scientific literature - so it wouldn't surprise me at all that it gets missed or left out by a journalist.


All the points you say are missing I learned at secondary school decades ago. What I did not learn is how people might judge the overwhelming complexity of their own genome in terms of fitness. I found his views appropriate, if not foreshadowing.


You did, but if others did they promptly forgot their lessons. These things need to be repeated many times for people to appreciate them. This is an opportunity to do so, and the author apparently found it unimportant.


Well, what do you think he found important?


Would you recommend some resources that might help me better understand genome sequencing and its consequences? I too was thinking that the author could have gone deeper and explained more in the article. I am a computer scientist with high-school biology knowledge.


I wonder about the novelty. He doesn't give very specific dates (was 'mid-January' this January, last January, or the January before that?) but does say

"The process, including my registration at an Illumina-sponsored seminar, cost $3,100."

Hasn't Illumina been at $1k/genome now for at least a year?


Illumina claims they can do $1k/genome using their latest HiSeq X Ten technology. But a MiSeq or NextSeq costs more and produces less throughput. The author states ~70 GB of data for $3.1k. That could be two lanes of a MiSeq or a quarter lane of a HiSeq (sharing with other samples); prices can vary by sequencing center.
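As a sanity check on those numbers, here's a back-of-envelope coverage estimate for ~70 GB of sequence. The genome size and the 1 byte ≈ 1 base simplification are my assumptions, not figures from the article:

```python
# Rough coverage estimate for the ~70 GB of data the author reports.
# Assumptions (mine, not the article's): ~3.1 Gb haploid human genome,
# and 1 byte of sequence data ~ 1 sequenced base.

GENOME_SIZE_BP = 3.1e9   # approximate haploid human genome length
YIELD_BP = 70e9          # ~70 GB of bases

coverage = YIELD_BP / GENOME_SIZE_BP
print(f"approximate mean coverage: {coverage:.1f}x")  # ≈ 22.6x
```

That lands in the ballpark of a standard ~30x whole-genome run, which is consistent with a single personal genome rather than a multi-sample pooled lane.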


I was a sysadmin at a genetics company for a while, and learned quite a bit while there. This is a very accurate article on the current state of things.

I have two main takeaways for anyone curious about this:

1) Sequencing costs are getting lower, yes, but the computation and data complexity keep growing as scientists want to analyze more and more, especially those smart enough to realize the microbiome is really where it's at for a holistic approach, which means massive amounts of data. We are talking about petabytes over time, with 200 GB+ of data generated a day from just a few sequencers.

2) The end goal, I think, is that eventually there will be a sequencer in every hospital, and it will start catching tons of diseases before the patient even experiences any symptoms. It's going to be great for healthcare, not just practically; healthcare also has tons of money floating around.

I think the really interesting developments will be between the sequencing machine manufacturers. Illumina, Roche 454, and Ion Torrent are all doing great things, and keeping the competition in innovation mode is great for the industry.

I can't wait to see what comes of it all, plus my non-compete is finally over. There is a huge gap to be bridged between IT and the scientists, and I don't think very many companies have done that very well.


Yes, the problem is that Moore's law is not keeping up with the sequencing data deluge. The cost limitations of genomics are now being caused by the data analysis, not the data generation.

Personally, I think the unusual genetic history of humans is going to make it really difficult to reach the full potential of genomics. The recent population explosion and changes in selective pressures will make it impossible to really tease out the true linkages between human genes and phenotypes like disease risk.


There really isn't any crisis in the storage and analysis of sequencing data. The data isn't really that large: the sum total of all sequencer data is in the exabytes per year, but most of that data isn't retained, nor does it need to be retained after processing.

I founded the Google Cloud Genomics project, and at the time people were citing some very scary stats about the growth of sequencing data. After doing some analysis, I found most people were concerned about retaining the original raw data files from the sequencer, usually saying that they wanted to go back and reanalyze the data when the algorithms got better. But nobody wanted to pay the (commodity) storage costs to keep all that data around.

Well, the first thing to realize is that the quality scores (which represent about 75% of the sequencer data post-compression) were overly precise. Simple quantization techniques shrink the value space of quality scores tremendously with no relevant effect on the resulting variant calls.
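The quantization idea can be sketched in a few lines: collapse per-base Phred scores into a handful of representative bins so the quality track compresses far better. The bin edges below are illustrative (loosely modeled on Illumina-style 8-level binning), not an official scheme:

```python
# Illustrative quality-score quantization: map each Phred score to its
# bin's representative value. Bin boundaries here are my assumption,
# roughly in the spirit of Illumina's 8-level binning, not a spec.

BINS = [            # (low, high, representative score)
    (0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
    (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 99, 40),
]

def quantize(q: int) -> int:
    """Collapse a Phred quality score onto its bin's representative."""
    for low, high, rep in BINS:
        if low <= q <= high:
            return rep
    raise ValueError(f"quality score out of range: {q}")

print([quantize(q) for q in (3, 17, 31, 40)])  # [6, 15, 33, 40]
```

With only 8 distinct values instead of ~40, the quality stream has far lower entropy, which is where the bulk of the compression win comes from.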

The second thing to realize is that keeping tons of old BAM or FASTQ files around is typically wasteful. Few people go back and reanalyze, and even if they do, they get only marginally better results; in those cases, it's better not to spend the money to bank exabytes of raw data.

Next, once people get away from thinking they need to store exabytes of raw reads, the costs shift from "$X/year to store a bunch of archival data, plus a smaller amount for data analysis" to "nearly all the money is spent analyzing a relatively small (tens of petabytes) data set". And that shifts things from storage to CPUs. I can assure you, Moore's law is still quite ahead of sequencing data analysis: the sum of all genomic data processing is just a tiny fraction of what it takes to run a medium-sized Google service. Most processing algorithms are fairly naive and unoptimized, but when I built processing algorithms that ran on Google's BigQuery, I got results in minutes.

As to your second point, I honestly don't know what the outcome of large-scale sequencing and disease-correlation analysis will be. It seems like, as was suggested quite some time ago, sequence data is very limited in its ability to produce actionable medical data for human populations, and will continue to be an ancillary tool, rather than a silver bullet, for the foreseeable future.


Generally I thought this was a great piece. I was a little annoyed that he kept referring to "BAM" files as holding his genomic information. e.g.: "He wanted to get his own BAM file and study it."

BAM is just a format that describes the way that short chunks of DNA are aligned to a reference file of other DNA. It would be a BAM file whether you are talking about someone's genome or an experiment where the readout was a sequencing assay.
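To make the "alignment container" point concrete, here's what one record in the format actually holds. BAM is the compressed binary form of SAM; the 11 mandatory fields below are from the SAM specification, while the record itself is a made-up toy example:

```python
# One SAM alignment record (BAM stores the same fields in binary form).
# The 11 mandatory field names come from the SAM format specification;
# the record content below is a fabricated example.

FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
          "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

line = "read1\t0\tchr1\t1000\t60\t8M\t*\t0\t0\tACGTACGT\tFFFFFFFF"
record = dict(zip(FIELDS, line.split("\t")))

print(record["RNAME"], record["POS"], record["CIGAR"])  # chr1 1000 8M
```

Note that almost everything here describes *where and how* a short read maps to a reference (position, CIGAR alignment string, mapping quality), which is why the format is alignment-agnostic about whose genome, or what kind of experiment, produced the reads.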

To hear how off that sounds, consider the following: "He wanted to get his own m4a file to listen to it." "She wanted to get her own xlsx file to calculate her expenses"

Perhaps it's a small gripe, but why not just say "raw data" or "genomic sequence", either of which would be more accurate and not cause some of us to cringe!


Also, the BAM file isn't really the raw file that comes out of the sequencer; that would be FASTA or FASTQ. The BAM file is, as you stated, what you get after comparing/aligning it to another sequence such as the human reference genome.
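For anyone unfamiliar with FASTQ: each read is just four lines of text, with per-base qualities encoded as printable characters (Phred score + 33). The record below is a toy example, not data from the article:

```python
# A FASTQ record is four lines: identifier, bases, a "+" separator,
# and ASCII-encoded per-base qualities (Phred score + 33).
# Toy record for illustration only.

record = "@read1\nACGT\n+\nIIII\n"
name, seq, _, qual = record.rstrip("\n").split("\n")

# Decode the quality characters: 'I' is ASCII 73, so Phred Q40.
phred = [ord(c) - 33 for c in qual]
print(name, seq, phred)  # @read1 ACGT [40, 40, 40, 40]
```

This is the pre-alignment form: there is no reference, position, or CIGAR string anywhere, which is exactly what distinguishes it from a BAM record.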


Some people have been pushing to use BAM files for unaligned reads too - the upsides being that (a) it's compressed and (b) you know when the file is incomplete, since the BAM ending is missing. You don't get that with FASTQ. Most notably, new versions of the PacBio software produce BAM files with the sequencing reads instead of FASTQ (or h5, as previous versions did).


Yeah, but you can often derive the original FASTQ file from the BAM file anyway, so it's like raw data plus.
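The recovery is mostly mechanical, with one wrinkle: reads aligned to the reverse strand (FLAG bit 0x10) are stored reverse-complemented in SAM/BAM, so they must be flipped back. A minimal sketch on a SAM-text record (toy data, and it ignores hard-clipped bases and secondary alignments, which a real converter must handle):

```python
# Recover a FASTQ record from one SAM-text alignment record.
# Toy sketch: ignores hard clipping, secondary/supplementary reads, etc.

COMP = str.maketrans("ACGTN", "TGCAN")

def sam_to_fastq(sam_line: str) -> str:
    f = sam_line.split("\t")
    name, flag, seq, qual = f[0], int(f[1]), f[9], f[10]
    if flag & 0x10:  # reverse strand: undo the stored orientation
        seq = seq.translate(COMP)[::-1]
        qual = qual[::-1]
    return f"@{name}\n{seq}\n+\n{qual}\n"

# A reverse-strand read (FLAG 16): bases come back reverse-complemented.
print(sam_to_fastq("r1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tAACG\tIIFF"), end="")
```

In practice tools like samtools fastq do this (including the edge cases), which is what makes an aligned BAM a workable substitute for the original FASTQ.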


I'm enjoying the article, but the dancing dollar signs and other animations in the gutters are killing me. Not only are they distracting, they're slowing down my machine. Why oh why? And sadly the page also breaks Safari Reader View, with only the first page visible in that view.


Your reply is equivalent to a remark about the font size of an unreadable lorem ipsum paragraph on a site mostly presenting advanced CSS animations, viewed on a C64. But now I'm wondering whether a genetic disposition exists that's capable of steering focus, so there is that.



