Software breakthrough radically boosts the speed of nanopore DNA sequencers

doonesbury · on Dec 18, 2020

I would be eternally grateful to a bio-chem person to explain two facets of DNA sequencing I've never understood. To motivate the question a short quote from Wikipedia:

"For longer targets such as chromosomes, common approaches consist of cutting (with restriction enzymes) or shearing (with mechanical forces) large DNA fragments into shorter DNA fragments. The fragmented DNA may then be cloned into a DNA vector and amplified in a bacterial host such as Escherichia coli. "

Two questions:

1) OK, we cut up the DNA and replicate it. How in the heck do we assemble the pieces together to restore the original ordering? I've got three pieces A,B,C. How do I know it was originally ABC or CBA or BAC etc? I can correctly sequence a piece ... but what I want is the sequence of all base-pairs in their original order. How on earth do they do that?

2) DNA spends most of its time wrapped up around histones. How do does one get it off the histones for access to sequencing?

3) Less interesting: unless we sequence the whole chromosome researches must pick off a piece ... how do we know which piece to pick off and how do we know where it's crammed around the histones?

epgui · on Dec 18, 2020

1.

The key to this is understanding that the reads or sequences are long, and the slicing happens at somewhat random locations, such that your reads all overlap with each other (of course in reality it's more complicated than that, due to a bunch of long non-coding sequences and repeating sequences, but this will be mostly true for interesting parts of the genome).

Therefore your mental model of A, B, C is not particularly useful. I would replace it with the following example:

QWERTYUIOP ---[gets chopped up into]----> QWERT ERTY UIOP WE QWER TYU QW

Note that I used the terms "read" and "sequence" interchangeably here because the general notion is the same. But these two words refer to two different things.

2.

Histones are proteins. You can get rid of them simply by adding proteases to the mix. Proteases specifically break down proteins and leave nucleic acid sequences intact.

A commonly used protease is Proteinase K, which you can just order online here: https://www.sigmaaldrich.com/life-science/metabolomics/enzym...

You can also buy kits for specific applications: https://www.sigmaaldrich.com/life-science/molecular-biology/...

Once the histones have been digested, the DNA can be extracted, purified, amplified, etc, at your leisure. A full treatment of this topic would fill an entire textbook, so I will spare you the details.

3.

I'm not sure what this question means, but if you can rephrase I might be able to answer.

doonesbury · on Dec 19, 2020

>understanding that the reads or sequences are long

So why slice and replicate slices? All that makes more noise. The wikipedia seems to suggest (or I infer) that one can't merely "just read of" the base pairs in QWERTYUIOP through some super cool process and be done with it. No, I have to split QWERTYUIOP into chunks and replicate the hell out of the resulting chunks which just re-asks my question. I mean how is QWERTYUIOPQWERTYUIOP not a reasonable outcome?

I'm still missing something fundamental. The more chunks there are, the more permutations, combinations there are in possible re-assembly up to all the power sets of base pairs. Granted, it may not be possible to make complex simple here. So that thank you for your time and effort.

jakobnissen · on Dec 19, 2020

Indeed you are right, that would be much easier. DNA assembly is an insanely hard computational problem. The issue there is that it's difficult to actually build a sequencing machine that can sequence more than a few hundred base pairs before it stops.

Why that is hard depends on the approach the machine takes to sequencing. With the "sequence by synthesis" approach, the problem is that you need one or two chemical reactions per base, and any yield much lower than 100% will quickly degrade the product after a few hundred cycles.

Nanopore uses a different approach and can indeed produce very long reads, with the tail of the distribution being tens of thousands of base pairs. Not sure what the bottleneck for the length is there.

epgui · on Dec 21, 2020

Again, the key is understanding that the reads or sequences are long, hahaha. For reads, the length of the sequences are on the order of 100-1000bp in length. This is not captured by the QWERTYUIOP example.

It's also important to understand that it's impossible to obtain an error-free sequence (I am ignoring nanopore sequencing, which works differently, for the purposes of this comment), and that the assembly of all these reads is a game of probabilities.

The DNA fragment sequences are even longer than the reads.

Nobody said this was easy! Hehe.

viraptor · on Dec 18, 2020

If you're interested in how 1 works in details, http://rosalind.info/problems/tree-view/ has a few bioinformatics coding problems. Starting completely from scratch, so you can learn as you go. The reconstruction problem is there as well.

doonesbury · on Dec 18, 2020

I'm gonna try this link out and see where it goes. Thank you.

tum92 · on Dec 18, 2020

For 1), the average person here might be better equipped to answer than the average biochemist! Sequences are put back together with de bruijn graphs. Small sequences are compiled into continuous segments based on their overlap. The key piece you might be missing is that there are many many copies of the same genome, randomly fragmented so with luck (or careful experimental design) you’ll have sufficient coverage to complete chromosomes. There’s still lots of tricky regions though.

epgui · on Dec 18, 2020

Biochemist here, this is correct. This falls under "bioinformatics" or "computational biology". Most biochemists are more focused on the "wet lab" part of the job, and less on the in silico fun stuff.

lbotos · on Dec 18, 2020

I'm not an expert, but I think the PCR page will help: https://en.wikipedia.org/wiki/Polymerase_chain_reaction

basically for PCR (enzyme based) it's sort of "unzipped" down the middle and the base pairs match a specific way which is how it's duplicated.

The rest I can't answer but I hope this was as fun a read as it was for me the first time. It really is insane to think about.

epgui · on Dec 18, 2020

Biochemist here! The DNA polymerase reaction can only take place when the gene is unwound from the histone in the first place. Therefore the PCR wiki page will not provide you with the answer... The key is the "DNA purification" prep step where you digest the histones with proteases.

See here: https://en.wikipedia.org/wiki/DNA_extraction

dnautics · on Dec 18, 2020

This is not really right. The high duplex denaturing temperature of the initial thermocycle will usually also denature histones and unwind them from the dna. That's why thermocycle procedures for raw samples have a long initial denaturation (5-10 min) and thermocycle procedures for prepurified DNA need not have such a long initial incubation at 95 C.

burning_hamster · on Dec 18, 2020

Did you guys all skip the DNA extraction practical at university?

1) Proteinases are indeed not necessary.

2) The step that removes any DNA bound proteins and in fact denatures the vast majority of proteins is the salting out. DNA is negatively charged. Proteins bind DNA by being positively charged. When you add a lot of salt, you add a lot of ions that compete for those ionic bonds. Proteins fall of the DNA, and will eventually be denatured.

3) The long initial incubation in some PCR protocols is mostly a relic from old times when there weren't any commercial extraction kits and contamination with residual RNA was a potential issue. A modern setup doesn't need it (but many keep the step anyway because why would you remove it when it doesn't hurt to do it anyway).

epgui · on Dec 18, 2020

Different programmes will place more or less emphasis on different things. For myself, this was a tiny part of a 4-year undergraduate degree and I probably didn't spend any more than a day or two on the topic.

What you say does sound right, but I think the other answers still have pedagogical value. Thanks for clearing this up, I now realize I need to brush up on this topic.

epgui · on Dec 18, 2020

That's quite possible! I hadn't really thought of that.

msandford · on Dec 17, 2020

Michael Schatz has been involved in a number of interesting computational biology projects. I worked in a bioinformatics lab ~10-12 years ago and I remember using his fast short read aligner to help speed things up. Back then the nanopore devices were still just rumors.

Computers have Moore's law with a base or time constant of 12-24mo. Batteries seem to have the same with a base of about 10 years (slow Moore's law I've heard it called). It feels like genome/genetic things have the same behavior but with a base of something like 5ish years.

Real_S · on Dec 18, 2020

The cost of sequencing is dropping much faster than Moore's law.

https://www.nature.com/news/technology-the-1-000-genome-1.14...

biotinker · on Dec 18, 2020

Not really. The cost of sequencing was dropping slowly for a while, then dropped much faster than Moore's Law between 2007 and 2012. It has since dropped much more slowly.

fatboy93 · on Dec 19, 2020

Mike Schatz is awesome, and is genuinely one of the most amazing people.

I've met him on a couple of occasions and it was always amazing that he takes time off to listen to whatever ideas you've got and tries to debate them regarding their feasibility.

HPsquared · on Dec 18, 2020

Something a bit less than Moore's law... how about Lee's law?

chmaynard · on Dec 17, 2020

Link to the GitHub repository:

https://github.com/skovaka/UNCALLED

fwip · on Dec 17, 2020

This appears to be a potential alternative to selective amplification. https://www.wikiwand.com/en/Polymerase_chain_reaction#/Selec...

Instead of pre-processing the DNA by preferentially amplifying the DNA segments of interest (so that 99%+ of what you sequence is stuff you want), this software/hardware combo sequences the first N bases of each molecule, and then moves on to the next molecule if it doesn't match any expected sequences. If you're sequencing molecules over >200bp, being able to eject the molecule in the first ~10bp could be a very big speedup.

Our lab already has sub-1 day processing times for this type of analysis, but I believe the traditional sequencer machine we use (an Illumina MiSeq) is likely more expensive and more cumbersome than the portable solution described here.

alextheparrot · on Dec 17, 2020

If anyone is interested in DNA sequencing, I think the WENGAN paper [0] that came out the other day is pretty interesting too. They are using a hybrid input [1] in combination with their software package to reconstruct higher fidelity reference genomes.

The big impact from that paper is really summarized by this:

> WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 80.64 Mb (NGA50: 59.59 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb)

[0] https://www.nature.com/articles/s41587-020-00747-w

[1] Long and short reads, for example from Nanopore and Illumina, respectively

acmj · on Dec 17, 2020

HiCanu achieved that a year ago. WENGAN is a worse assembler with clearly more misassemblies. It amazes me that this level of paper can be published in Nature Biotech.

alextheparrot · on Dec 18, 2020

They even call it out in the paper, I hadn’t given it a thorough enough reading. Damn, they must’ve been a bit miffed.

I think the novelty they're pitching is the lower resources, but I wonder how much of a problem that is in practice.

acmj · on Dec 18, 2020

Several recent assemblers are faster than WENGAN. For nanopore, there are shasta and wtdbg2 (both published). For HiFi, there are Peregrine and hifiasm (both unpublished but with preprints).

I also found their phrasing here misleading: "The run time of WENGAN was at least 183 times faster than that of CANU (UL), while at the same time using less memory than other assemblers such as FLYE and SHASTA". Why not say the other way around (which is also a fact): The run time of WENGAN was at least several times slower than that of shasta, while at the same time using more memory than CANU. Come on, this is not the right way to do comparisons!

wiz21c · on Dec 17, 2020

until with have AlphaGenome (they'just beaten some protein folding problem by a laarge margin).

(I didn't look in the specifics of the problem so it may be completely wrong, but, you know, these days ...)

xyhopguy · on Dec 17, 2020

theyre not really similar problem, genome assembly is a problem that is most effectively solved in the lab and with traditional graph algorithms. protein folding fits very nicely into structured prediction and has a very quantitative way of measuring performance. Genome assembly is more qualitative, to say the least.

given a bunch of reads of short(50-300bp) or long (1000-100,000 bp) lengths (we use different algorithms for each) you need to resolve a 1D sequence that holds all of those continguous subsequences for each chromosome. Genome assembly is often described as finding a hamiltonian cycle among the data. For short reads, we use debrujin graphs to avoid the hardness of a complete hamiltonian. for long reads, MinHash has become a popular heuristic.

But this can be tricky, for instance, in species that isn't haploid -- you actually have two possible genomes that you need to correctly assemble. Sometimes the difference is a base or two, but sometimes its much longer stretches that can appear as large 'bubbles' in the assembly graph.

The assembly problem can be harder or easier depending on the organism, for instance, Wheat is notoriously hard to assemble due to the fact that it has 3 mostly (but not completely) distinct genomes. The Norway Spruce is composed of ~80B basepairs, which will put strain on even the biggest machines. Oh and its mostly repeats.

Repeats are _everywhere_, and they can be really long (sometimes on the order of millions of nucleotides). Also -- depending on the species youre assembling, the repeat content and the kind of repeats can change. You also can get different errors from the library preparation, again related to the species of origin.

Using lab techniques is particularly helpful, as we can leverage molecular and genetic information. BACs (baceterial artificial choromosomes) can be used to break the genome into smaller continguous chunks and to act as 'anchors' for the larger assembly problem. Recombination rates of different SNPs can be used to infer spatially local segments of DNA, and optical tags can be added and imaged to provide physical anchors for certain sequences. Some organisms can be bred into pure lines, with limited heterozygosity. Expressed RNA transcripts can be used in a similar manner, as they are the concatenation of ordered exons. Combining these methods is typically the best way to get a good genome assembly.

Some of the bubble resolution heuristics could probably be improved with deep learning, but getting the right data for that is probably more effective with traditional graph algorithms. Also -- you really only need a good genome assembly once, and sometimes it doesnt even need to be all that good.

Genome assembly is a _really_ fascinating area of research, and honestly, the right direction is probably to represent genomes as something other than 1D sequence that more accurately reflects biology. Deep learning is still looking to make its mark on genomics, and unfortunately its not well suited for much of the discovery efforts such as genome assembly.

jakobnissen · on Dec 19, 2020

What a great write-up. I used to work in metagenomics, where we would assemble environments with hundreds of bacterial genomes. We developed a deep-learning approach to binning the assembly, Vamb, which worked really well. In hindsight, our approach to deep learning was quite naive, so I can easily see more skilled application of DL completely outclass all existing approaches to binning. So while DL indeed would have a hard time with assembly, I'm not sure that's the case for genomics as such.

acmj · on Dec 17, 2020

No practical assemblers take assembly as a Hamiltonian problem.

xyhopguy · on Dec 17, 2020

many long read assemblers (good ones, at that) treat it as a hamiltonian problem.

acmj · on Dec 17, 2020

No, not even a single one of them. Read Gene Myers paper in 1995 or 2005. Modern OLC assemblers all follow that route which has nothing to do with the Hamilton problem. Equating overlap based assembly to a Hamilton problem is the biggest lie in the field of sequence assembly. Please stop spreading that.

breck · on Dec 17, 2020

I'd suggest the title be "...speed of nanopore DNA sequencers" as this tech doesn't apply to the majority of sequencers.

(Not to say it's not exciting: I find nanopore sequencing to be the most exciting thing in the field)

PunchTornado · on Dec 17, 2020

to me nanopore seq is not that promising. it has been here for a long time now and the benefits didn't convince a lot of customers to adopt it.

there are 2 main use cases now in precision medicine: rare disease and cancer. for both you need high precision reads, which nanopore doesn't provide.

aroch · on Dec 17, 2020

I think you're stuck in the biotech == people mindset.

On the bacterial side, it is fantastic for quickly sequcinging and closing genomes. Personally, I think the killer use case is field portable and real time sequencing of pathogens. I've worked with groups (.gov and private, defense and health related) that want to put minION + flongles to use in applications like early detection for bio terrorism and pathogen surveillance.

jakobnissen · on Dec 18, 2020

I have a colleague who is working on a portable field kit with a minION + laptop + car battery with the intent of being able to sequence and identify pathogens directly in the field, even with no electric grid.

Turns out the hardest part is the sample prep. For bioinformatics, she'll just do a simple kmer mapping against a curated database of pathogen genomes.

aroch · on Dec 18, 2020

Sample prep, particularly if you want truly long reads, is 100% the hardest part.

Funnily, there are people out there who want to teach army/marine grunts on the front line to run minIONs. Most of the work is on making the sample prep automated and idiot proof.

ivirshup · on Dec 17, 2020

Who are the customers here? The waiting list for new nanopore sequencers is quite long at the moment. Our neighboring lab has been waiting on a promethION for about a year, and only just got it because their last one bit the dust putting them on top of the list.

PunchTornado · on Dec 18, 2020

maybe I'm biased due to my work, but I see the customers as hospitals and research groups like GEL.

anon946 · on Dec 17, 2020

Should the long read lengths allow error correction to work well if there is sufficient coverage?

fwip · on Dec 17, 2020

Nanopore/pyrosequencing technology is interesting in that the class of errors that it is most susceptible to (homopolymer inaccuracies) nearly do not exist in more traditional base-by-base sequencing.
These have proven harder to correct than simple substitution errors - this is both a fault of the bias of existing tooling, and also a difficult problem in general. Roche and other companies have had a lot of smart people working on this problem.
Increasing coverage will definitely help resolve these errors, but the coverage required may be such that it's more cost effective to use a more traditional sequencer.

A homopolymer is when you have stretches of the same nucleotide, and the error is miscounting the number of them. e.g: GAAAC could be called as "GAAC" or "GAAAAC" or even "GAAAAAAAC".

dnautics · on Dec 18, 2020

What? Have you ever looked at a sanger trace with homopolymer stretches? Depending on how blotchy it is, After about 7 you might not really be sure, and it definitely gets the n wrong occasionally even with nicely resolved peaks.

I'm not defending nanopores here, frankly I'm not convinced about them yet.

fwip · on Dec 19, 2020

Sorry, I didn't mean Sanger sequencing, was comparing to Illumina sequencing.

jakobnissen · on Dec 18, 2020

Only somewhat, because the errors are systematic, and not random. Using the R9 pore flowcells, I've basically given up on getting correct consensus sequences even for influenza genomes (without manual correction, which is too labour intensive). Perhaps the new R10 pore, much better at homopolymers, will solve the problem.

dang · on Dec 17, 2020

Ok, we've added nanopores above.

ml_thoughts · on Dec 17, 2020

Link to biorxiv version: https://www.biorxiv.org/content/10.1101/2020.02.03.931923v1

Summary because the title overstates significance a bit (of a cool paper): Most sequencing today is done using Illumina machines, which basically break DNA into small parts (on the order of ~hundred letters/bases), then use fluorescence/imaging to find sequence.

This paper applies to a new technology, nanopore-based sequencing, which pulls much longer pieces of DNA (record length: 2.3 million base pairs, but most end up shorter, on the order of thousands of bases) through a microscopic molecular channel, providing real-time outputs of voltage that can be somewhat noisily mapped to nucleotide sequences (since different DNA sequences have different voltage outputs when run through a channel).

This technology is very cool, and the fact that it's real-time opens up an interesting idea: you can potentially give real-time feedback to the sequencer as the run is operating about whether a given piece of DNA it's reading is interesting to you. If it's not, the channel can spit out the piece of DNA and start reading in a new one.

So for example, let's say you want to sequence viral sequences present in a human tissue sample. Well, naively, if you just try to collect all the DNA in the sample, most of the DNA is going to be from the human genome (human genome is ~100,000x bigger than viral genomes and likely not all cells are infected). This approach aims to map the DNA as it's being read to some reference (in this case, the human genome) and avoid re-sequencing pieces of DNA mapping to the reference (by getting the channel to spit out the piece of DNA it's reading and wait for a new sequence to come in). Your sequencing results are therefore enriched in the actual sequences of interest.

In terms of the contributions of this paper, nanopore sequencing is still a growing area. This real-time aspect has been a key motivation for some time: this paper improves the feasibility of the approach by algorithmic improvements to mapping between real-time voltage readings and reference sequences. This has to be fast to be effective, since DNA is read pretty quickly and in parallel across many channels, and previous methods apparently weren't fast enough to provide real advantages.

The caveats are that this only applies to cases where you don't want to read the majority of DNA present (an important use case but not universal), nanopore sequencing still has issues with high error rates which makes it a bit less attractive than Illumina sequencing, and the amount of DNA you can read through nanopore is still less than what you can do with Illumina. So it's a cool step on the way to a future where we can do some really exciting "interactive" real-time sequencing work but it's still a part of a developing technology suite.

jgmmo · on Dec 17, 2020

Note - this is NOT about whole genome sequencing.

xiphias2 · on Dec 17, 2020

Can it be used instead of PCR Covid test for example?

myhf · on Dec 17, 2020

This would be me more applicable to looking for variants of a known gene. For example, measuring mutation rates within known covid-positive samples.

ivirshup · on Dec 17, 2020

Nanopore isn’t great for looking at generic mutation rates (many will be single nucleotide polymorphisms) due to the high error rate. It’s much better at looking for splicing patterns and epigenetic modifications. Splicing patterns could conceivably change due to mutation, but that’d be a pretty dramatic mutation.

aroch · on Dec 17, 2020

This isn’t the case anymore. Pathogen surveillance labs use Nanopore for single-base resolution variant calls to determine antibiotic susceptibility and to “fingerprint” against known and previously sequences isolates.

ivirshup · on Dec 19, 2020

Could you share some links on this? I’ve heard talks on using Nanopore for pathogen surveillance, but they were mostly about the ability to spit out reads once you knew what they were. Also, I sit (depending on restriction level) next to a nanopore lab, and they’re pretty consistent about nanopore not being good for single base resolution.

Maybe this could be a case with a lot of amplification, so you have many squiggles to infer a consensus from?

PunchTornado · on Dec 17, 2020

we already have covidseq by illumina which does this at a pretty big scale and provides survaillence and diagnostics in one test.

the issue is that PCR still is the cheapest and fastest way to get the results.