One major problem with developing a "Google for the human genome" is that we don't yet understand how most of the genes (coding regions) and noncoding regions in our DNA work or interact with each other... except at a very basic level for a very limited set of genes.
There are genome browsers out there already that came out of the human genome project and work in that direction. One example: http://huref.jcvi.org/
Yeah, I was disappointed that it's not a full genome too.
It's by no means all of his SNPs, either -- each person has around 3 million actual SNPs (variations from the reference genome), and 23andme just chooses a million sites that could be the location of a SNP to look at, most of which won't actually be points of variation for most people.
So, 23andme is only looking for common SNPs you might have. If you have a rare SNP you're interested in, or if you're a researcher trying to analyze the effects of an uncommon SNP, you're out of luck with 23andme data.
I agree, but still applaud Manu's release of something many consider so private.
Even though this isn't a genome sequence, there is potential for interesting analyses if lots of people release their 23andme data. (I believe 23andme use the same SNPs for every user).
I'll go one step further and emphatically state "this isn't his entire genome."
Additionally, this data could be radically improved if his phenotype were also included. Knowing that a marker says "AA", without the correlating information of "blond hair", doesn't tell us whether "AA" is important for hair color.
What you can mine with this type of information is the correlation between the markers themselves: if rs1001 = AA, then rs2002 = {GG(85%), TT(10%), GT(5%)}. This is where community software could definitely benefit from more data.
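To make the marker-to-marker idea concrete, here is a rough Python sketch of how such a conditional table could be computed once many people have released their data. Everything in it is an assumption for illustration: the function name, the dict-per-person layout, and the toy genotypes are made up, and rs1001/rs2002 are just the placeholder IDs from the example above. It also assumes unphased genotypes, so "GT" and "TG" are treated as the same call.

```python
from collections import Counter


def normalize(genotype):
    # Assumption: genotypes are unphased, so "TG" and "GT" are the same call.
    return "".join(sorted(genotype))


def conditional_genotypes(people, given_snp, given_genotype, target_snp):
    """Among people with given_genotype at given_snp, return the fraction
    carrying each genotype at target_snp."""
    wanted = normalize(given_genotype)
    counts = Counter(
        normalize(person[target_snp])
        for person in people
        if given_snp in person
        and target_snp in person
        and normalize(person[given_snp]) == wanted
    )
    total = sum(counts.values())
    if total == 0:
        return {}  # nobody in the data set matches the condition
    return {genotype: n / total for genotype, n in counts.items()}


# Hypothetical toy data: one dict per person, mapping SNP ID -> genotype.
people = [
    {"rs1001": "AA", "rs2002": "GG"},
    {"rs1001": "AA", "rs2002": "GG"},
    {"rs1001": "AA", "rs2002": "TG"},
    {"rs1001": "AG", "rs2002": "TT"},
    {"rs1001": "AA", "rs2002": "GG"},
]

result = conditional_genotypes(people, "rs1001", "AA", "rs2002")
print(result)
```

With only a handful of people the fractions are meaningless, which is the point of the comment above: the estimates only become useful as more data is released.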
If you are interested in helping create a "Google for DNA", drop us a line at SeqCentral.com
A lot of the "aren't you afraid that somebody is going to use that against you?" remarks are reminiscent of the early days of the internet, when people were afraid to put pictures of themselves or their contact info online. There's now a $50 billion company dedicated to doing just that.
This thread is probably a good place to point to perhaps the best resource on the web for personal genomics today, the Genomes Unzipped blog http://www.genomesunzipped.org/
These are early days in personal genomics, so it's great to see others jumping in. Hopefully they all do so with some awareness. Folks like Genomes Unzipped do a great job of creating that awareness while never forgetting that there is difficult, evolving science behind our understanding.
This is a question asked out of ignorant curiosity: what, if any, are the intellectual property implications of releasing genomic information into the public domain? Does doing so preclude the patenting of as-yet-unpatented genetic sequences published in that genome?
A number of human genomes are already in the public domain via the Human Genome Project, the 1000 Genomes Project, etc. The part that needs to be resolved is the bit about genes and disease implications. A recent case overturned Myriad's patents on BRCA1 and BRCA2 [1]. On the other hand, I believe it's still OK to patent signatures corresponding to a diagnostic, etc. (not 100% sure).
Actually, more action is going on in the scientist-mode-__experimental__ branch. The code is a little rusty, but workable, and the possibilities for optimization are great.
EDIT:
BTW, any ideas how to get continuous integration working with it?
I really like your question. Personally I value stories, creations, etc. in which authors make multiple layers of "jokes" or references. Unfortunately, I'm not good enough to know what letters I'm actually changing :(.
The Personal Genome Project [personalgenomes.org] is aiming to recruit 100,000 people to publicly release their DNA sequence and medical data. The website currently has phenotype, medical history, and genotyping data for the first ten participants, who are all well-known scientists.