
I would be eternally grateful to a bio-chem person to explain two facets of DNA sequencing I've never understood. To motivate the question a short quote from Wikipedia:

"For longer targets such as chromosomes, common approaches consist of cutting (with restriction enzymes) or shearing (with mechanical forces) large DNA fragments into shorter DNA fragments. The fragmented DNA may then be cloned into a DNA vector and amplified in a bacterial host such as Escherichia coli. "

Two questions:

1) OK, we cut up the DNA and replicate it. How in the heck do we assemble the pieces together to restore the original ordering? I've got three pieces A,B,C. How do I know it was originally ABC or CBA or BAC etc? I can correctly sequence a piece ... but what I want is the sequence of all base-pairs in their original order. How on earth do they do that?

2) DNA spends most of its time wrapped up around histones. How does one get it off the histones for access to sequencing?

3) Less interesting: unless we sequence the whole chromosome, researchers must pick off a piece ... how do we know which piece to pick off, and how do we know where it's crammed around the histones?




1.

The key to this is understanding that the reads or sequences are long, and the slicing happens at somewhat random locations, such that your reads all overlap with each other (of course in reality it's more complicated than that, due to a bunch of long non-coding sequences and repeating sequences, but this will be mostly true for interesting parts of the genome).

Therefore your mental model of A, B, C is not particularly useful. I would replace it with the following example:

QWERTYUIOP ---[gets chopped up into]----> QWERT ERTY UIOP WE QWER TYU QW

Note that I used the terms "read" and "sequence" interchangeably here because the general notion is the same. But these two words refer to two different things.
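
To make the overlap idea concrete, here is a minimal Python sketch of greedy overlap merging on the QWERTYUIOP fragments above. The function names (`overlap`, `greedy_assemble`) are made up for illustration, and it assumes error-free reads with no repeated stretches, which real genomes do not give you. It is a toy, not how production assemblers work.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    unique = sorted(set(fragments))
    # Fragments fully contained in another fragment add no new information.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        best_k, best_a, best_b = 0, None, None
        for a in frags:
            for b in frags:
                if a != b:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break  # nothing overlaps any more; leftovers stay separate
        frags.remove(best_a)
        frags.remove(best_b)
        frags.append(best_a + best_b[best_k:])
    return frags


pieces = "QWERT ERTY UIOP WE QWER TYU QW".split()
print(greedy_assemble(pieces))
```

With the seven fragments from the example, the pieces merge back into the single string QWERTYUIOP. The order is recovered purely from which ends overlap, which is why A, B, C with no overlap is the wrong mental model.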

2.

Histones are proteins. You can get rid of them simply by adding proteases to the mix. Proteases specifically break down proteins and leave nucleic acid sequences intact.

A commonly used protease is Proteinase K, which you can just order online here: https://www.sigmaaldrich.com/life-science/metabolomics/enzym...

You can also buy kits for specific applications: https://www.sigmaaldrich.com/life-science/molecular-biology/...

Once the histones have been digested, the DNA can be extracted, purified, amplified, etc, at your leisure. A full treatment of this topic would fill an entire textbook, so I will spare you the details.

3.

I'm not sure what this question means, but if you can rephrase I might be able to answer.


>understanding that the reads or sequences are long

So why slice and replicate slices? All that makes more noise. The Wikipedia article seems to suggest (or I infer) that one can't merely "just read off" the base pairs in QWERTYUIOP through some super cool process and be done with it. No, I have to split QWERTYUIOP into chunks and replicate the hell out of the resulting chunks, which just re-asks my question. I mean, how is QWERTYUIOPQWERTYUIOP not a reasonable outcome?

I'm still missing something fundamental. The more chunks there are, the more permutations and combinations there are in possible re-assembly, up to all the power sets of base pairs. Granted, it may not be possible to make complex simple here. So thank you for your time and effort.


Indeed you are right, that would be much easier. DNA assembly is an insanely hard computational problem. The issue there is that it's difficult to actually build a sequencing machine that can sequence more than a few hundred base pairs before it stops.

Why that is hard depends on the approach the machine takes to sequencing. With the "sequence by synthesis" approach, the problem is that you need one or two chemical reactions per base, and any yield much lower than 100% will quickly degrade the product after a few hundred cycles.
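
A rough back-of-the-envelope for the yield argument, as a Python sketch. It assumes every cycle succeeds independently with the same probability, which is a simplification, and the 99% and 99.9% figures are illustrative numbers, not specs of any real machine. `fraction_in_phase` is a made-up helper name.

```python
def fraction_in_phase(per_cycle_yield, cycles):
    """Fraction of strands that completed every one of `cycles` steps,
    assuming each step succeeds independently with `per_cycle_yield`."""
    return per_cycle_yield ** cycles


for y in (0.99, 0.999):
    for n in (100, 300, 1000):
        print(f"yield={y:.3f} cycles={n:4d} in phase={fraction_in_phase(y, n):.3f}")
```

Under this model, even a 99% per-step yield leaves only about 5% of the strands in step after 300 cycles, so the signal for the next base is swamped by strands that have fallen behind.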

Nanopore uses a different approach and can indeed produce very long reads, with the tail of the distribution being tens of thousands of base pairs. Not sure what the bottleneck for the length is there.


Again, the key is understanding that the reads or sequences are long, hahaha. Reads are on the order of 100-1000 bp long. This is not captured by the QWERTYUIOP example.

It's also important to understand that it's impossible to obtain an error-free sequence (I am ignoring nanopore sequencing, which works differently, for the purposes of this comment), and that the assembly of all these reads is a game of probabilities.

The DNA fragment sequences are even longer than the reads.

Nobody said this was easy! Hehe.


If you're interested in how 1 works in detail, http://rosalind.info/problems/tree-view/ has a few bioinformatics coding problems. They start completely from scratch, so you can learn as you go. The reconstruction problem is there as well.


I'm gonna try this link out and see where it goes. Thank you.


For 1), the average person here might be better equipped to answer than the average biochemist! Sequences are put back together with de Bruijn graphs. Small sequences are compiled into continuous segments based on their overlap. The key piece you might be missing is that there are many, many copies of the same genome, randomly fragmented, so with luck (or careful experimental design) you'll have sufficient coverage to complete chromosomes. There are still lots of tricky regions, though.
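
For the curious, a toy Python sketch of the de Bruijn idea: break every read into k-mers, link each k-mer's first k-1 letters to its last k-1 letters, then walk a path that uses every edge once. The function names are hypothetical, the reads are the QWERTYUIOP fragments from upthread, and k=2 only works here because no letter repeats. Real assemblers use much larger k and have to cope with sequencing errors and repeats, none of which is handled below.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to its successors, one edge per distinct k-mer."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer-style walk that uses every edge exactly once.
    Assumes such a path exists."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    nodes = set(graph) | set(in_deg)
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
        min(nodes),
    )
    edges = {n: list(targets) for n, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell out the sequence along the path through the graph."""
    path = eulerian_path(de_bruijn_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


reads = "QWERT ERTY UIOP WE QWER TYU QW".split()
print(assemble(reads, k=2))
```

Because many reads cover each position, duplicate k-mers collapse into a single edge, which is where the "many copies, randomly fragmented" point comes in: enough coverage means every k-mer of the original shows up at least once.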


Biochemist here, this is correct. This falls under "bioinformatics" or "computational biology". Most biochemists are more focused on the "wet lab" part of the job, and less on the in silico fun stuff.


I'm not an expert, but I think the PCR page will help: https://en.wikipedia.org/wiki/Polymerase_chain_reaction

Basically, for PCR (enzyme-based), the DNA is sort of "unzipped" down the middle, and the base pairs match in a specific way, which is how it's duplicated.

The rest I can't answer but I hope this was as fun a read as it was for me the first time. It really is insane to think about.


Biochemist here! The DNA polymerase reaction can only take place when the gene is unwound from the histone in the first place. Therefore the PCR wiki page will not provide you with the answer... The key is the "DNA purification" prep step where you digest the histones with proteases.

See here: https://en.wikipedia.org/wiki/DNA_extraction


This is not really right. The high duplex denaturing temperature of the initial thermocycle will usually also denature histones and unwind them from the DNA. That's why thermocycle procedures for raw samples have a long initial denaturation (5-10 min), while thermocycle procedures for prepurified DNA need not have such a long initial incubation at 95 C.


Did you guys all skip the DNA extraction practical at university?

1) Proteinases are indeed not necessary.

2) The step that removes any DNA-bound proteins, and in fact denatures the vast majority of proteins, is the salting out. DNA is negatively charged. Proteins bind DNA by being positively charged. When you add a lot of salt, you add a lot of ions that compete for those ionic bonds. Proteins fall off the DNA, and will eventually be denatured.

3) The long initial incubation in some PCR protocols is mostly a relic from old times when there weren't any commercial extraction kits and contamination with residual RNA was a potential issue. A modern setup doesn't need it (but many keep the step anyway because why would you remove it when it doesn't hurt to do it anyway).


Different programmes will place more or less emphasis on different things. For myself, this was a tiny part of a 4-year undergraduate degree and I probably didn't spend any more than a day or two on the topic.

What you say does sound right, but I think the other answers still have pedagogical value. Thanks for clearing this up, I now realize I need to brush up on this topic.


That's quite possible! I hadn't really thought of that.



