Mysterious DNA sequences, known as ‘Borgs,’ recovered from California mud (sciencemag.org)
225 points by walterbell on July 17, 2021 | 62 comments



What if this is just an artifact of the DNA sequence assembly? The soil they collect probably contains DNA from many organisms, and the sequence assembly algorithms may simply be finding patterns in noise [1].

[1] https://en.m.wikipedia.org/wiki/Apophenia


Was thinking the same and wondering how they obtained the million-base-long sequences. It turns out they used short-read sequencing with 150 or 250 bp reads and computationally assembled the long sequences [1]. While this is a traditionally valid method, the newer long-read sequencing technologies such as Oxford Nanopore or PacBio would be a more appropriate and direct method.

[1] https://www.biorxiv.org/content/10.1101/2021.07.10.451761v1....


PacBio reads don't extend to million-base lengths, and they are a considerably more expensive approach than short-read sequencing if assembly is happening anyway.

ONT is a great tool, but the goal of this report isn't necessarily to show that these molecules are 1 million bases in length so much as to show that they represent novel DNA sequences.

In summary, ONT and PacBio are neither more appropriate nor necessarily more direct methods here.


Would have to disagree. They will need to confirm these were not artifacts. Long-read sequencing is the best approach to show that the computationally assembled sequences indeed come from single contiguous molecules.


I'm not sure what kind of artifact would assemble into a series of novel ORFs but I suppose we will indeed have to disagree.


Long reads are still not as widely used, since the error rates are much higher than in the short-read shotgun approach.


Actually, for the use case of genome assemblies you can compensate for the error rate with depth of coverage. Long reads are already the state-of-the-art for assembling genomes and are allowing us to get information that is simply invisible using short reads.

https://www.biorxiv.org/content/10.1101/2021.05.26.445798v1


One would think a hybrid approach would bring the best results...?

Long reads help with alignment and arrangement, and short reads eliminate small errors of just a few base pairs.


It's definitely getting better, but errors have to be corrected computationally, and it still seems like a big challenge.


Yes. This type of error can be easily detected. If it's a contamination or an error, it will have very low read coverage (a typical sequencing project is 50-200x, meaning that if you realign and pile up the reads back onto the main assembly, you get a roughly normal distribution centered around 50-200 reads supporting a particular base or region). If you have a very low-coverage region or a sharp drop-off in coverage, then it's most likely an error.

Metagenomic assembly pipelines also do additional binning prior to assembly, based on the abundance and GC content of the clustered reads, to separate out the different taxa in the sample. How well the read clusters are separated by this heuristic is another measure of assembly quality.
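
To make that concrete, here's a toy sketch (mine, not from the paper) of the coverage sanity check, assuming you've already realigned reads to the assembly and dumped per-base depth with something like `samtools depth`:

    import numpy as np

    def flag_low_coverage(depth, window=1000, min_frac=0.2):
        # depth: per-base read depth along one contig (e.g. parsed from `samtools depth`)
        # window: window size in bases
        # min_frac: fraction of the global mean below which a window looks suspicious
        depth = np.asarray(depth, dtype=float)
        global_mean = depth.mean()              # e.g. ~50-200x for a typical project
        suspicious = []
        for start in range(0, len(depth), window):
            w = depth[start:start + window]
            if w.mean() < min_frac * global_mean:
                suspicious.append((start, start + len(w), float(w.mean())))
        return suspicious

    # Example: a sharp coverage drop in the middle of an otherwise well-supported contig
    depth = np.concatenate([np.full(5000, 120), np.full(1000, 4), np.full(5000, 110)])
    print(flag_low_coverage(depth))             # -> [(5000, 6000, 4.0)]

A chimeric join between unrelated molecules would typically show up as exactly that kind of drop at the junction.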


> If it's a contamination or an error, it will have very low read coverage (a typical sequencing project is 50-200x, meaning that if you realign and pile up the reads back onto the main assembly, you get a roughly normal distribution centered around 50-200 reads supporting a particular base or region). If you have a very low-coverage region or a sharp drop-off in coverage, then it's most likely an error.

Wouldn't sequence-specific biases (capture efficiency, amplification) result in distortions?


This could be the case. I remember a controversy a few years ago about the tardigrade genome. The problem was that another paper said that "foreign" DNA was the probable reason behind cryptobiosis in tardigrades. This particular paper found that it was most likely contamination.

Koutsovoulos, G., Kumar, S., Laetsch, D. R., Stevens, L., Daub, J., Conlon, C., Maroon, H., Thomas, F., Aboobaker, A. A., & Blaxter, M. (2016). No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proceedings of the National Academy of Sciences, 113(18), 5053–5058. https://doi.org/10.1073/pnas.1600338113


Very unlikely; the algorithms used to assemble genomes can make errors, but not to such an extent.


I bet you can calculate the probability that random short reads from a bunch of organisms can be assembled into a single 1-megabase-long sequence. And that probability is going to be low.
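
A back-of-the-envelope version of that (my own toy model, assuming i.i.d. random bases and ~10^8 reads; a real analysis would be more involved): the expected number of purely coincidental suffix/prefix overlaps of length >= k collapses very fast as k grows, so chaining thousands of spurious joins into a megabase contig is essentially impossible.

    # Expected number of coincidental suffix/prefix overlaps of length >= k among
    # n random reads (ordered pairs, both strands), under an i.i.d. uniform-base model.
    def expected_spurious_overlaps(n_reads, k):
        per_pair = (4 / 3) * 4 ** (-k)          # sum of 4^-j for j >= k
        return 2 * n_reads ** 2 * per_pair

    for k in (20, 30, 40):
        print(k, f"{expected_spurious_overlaps(1e8, k):.2g}")
    # roughly: k=20 -> 2.4e+04, k=30 -> 0.023, k=40 -> 2.2e-08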


Are there any works consumable by a layperson that explore the mathematical probabilities of evolution and genetic diversity? When you look at it from the perspective of simple combinatorics, nothing seems possible. A single gene has say 4^300 permutations. You could devote the entire mass of the known universe iterating on versions of a protein a trillion times per second for a trillion years and not even remotely scratch the surface of possible configurations.


A single neural network with 8 bit weights and 1M parameters has 256^1000000 possibilities by that logic, but we can train it fine with simple evolutionary algorithms like SPSA (Simultaneous perturbation stochastic approximation)

E.g., https://openai.com/blog/evolution-strategies/

One of the remarkable results is that convergence rate for each parameter is not strongly dependent on the number of parameters.
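
For what it's worth, SPSA itself is only a few lines. A minimal sketch (my own toy; the dimension and step sizes are arbitrary demo values, not tuned for a real network):

    import numpy as np

    def spsa_minimize(f, theta0, a=0.002, c=0.01, iters=2000, seed=0):
        # Minimal SPSA: each step needs only two evaluations of f, no matter how
        # many parameters there are, because every coordinate is perturbed at once.
        rng = np.random.default_rng(seed)
        theta = np.array(theta0, dtype=float)
        for _ in range(iters):
            delta = rng.choice([-1.0, 1.0], size=theta.shape)   # simultaneous +/-1 perturbation
            g_hat = (f(theta + c * delta) - f(theta - c * delta)) / (2 * c) * delta
            theta -= a * g_hat                                   # gradient-descent-style update
        return theta

    # Toy check on a 100-dimensional quadratic bowl.
    loss = lambda th: np.sum(th ** 2)
    theta = spsa_minimize(loss, np.ones(100))
    print(loss(np.ones(100)), "->", loss(theta))                 # 100.0 -> much closer to 0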


Sure, but we can't really explain why that works either, can we? However, I would suggest that there's a difference with evolution.

With the neural networks we train, there is actually a highly engineered process required to create the state in which a network can be trained and inferences be driven through it. The necessary combination of hardware that is able to perform and persist operations on information and the algorithms required to do so in a way to yield this outcome is an extremely complex set of pre-conditions that we wouldn't expect to find in the computing equivalent of a primordial soup.

With natural evolution, there is no obvious agency or intent behind it. Who is there to care whether or not life started on Earth, and/or who is driving the laws of nature such that the constructive, generative process of genetic evolution actually 'works' as well as it does? Seemingly nobody. Yet this process is able to create systems that operate on scales that we can only dream of. Look up YouTube videos on ATP Synthase for example. It's a nanomachine in every sense of the word. It uses the proton equivalent of a water wheel to spin a little machine that grabs a molecule of ADP, a molecule of inorganic phosphate, then literally snaps them together with mechanical leverage to make ATP. This little miracle machine that powers most of life on earth was built in literal and figurative darkness...it's so damn small light can't see it, and there was nobody there to appreciate its beauty until we came along billions of years later.

Ultimately I'm not surprised natural evolution works, I'm surprised at its speed and efficacy.


This is an insightful comment but I think you’re missing one thing in your understanding of natural evolution. Evolution has a “reward function” and it’s survival of the fittest.

As reproduction produces different variants of the same organism, some variations help the organism while others do not. Organisms with the helpful mutations will be more likely to pass those onto their offspring. Organisms with detrimental variations will be less likely to pass those variations to their kids.

It’s not a precise process like gradient descent, but when there are billions (trillions?) of organisms evolving simultaneously and independently, it makes more sense how the complexity of biology has come about.


Actually, the most remarkable thing you find is how _little_ structure you need to get evolutionary algorithms to work. I've thrown together random structures with feedback that are inherently nonlinear and chaotic, and evolutionary algorithms are able to quickly find parameters that stabilize them and optimize for the survival reward. The solutions are incredibly clever, and in my case sometimes matched designs in PhD theses that took a human decades to design by hand.

Similarly, the details of the evolutionary 'training' also tend not to matter much: just about any algorithm that prefers better instances (by whatever metric) with a very slightly higher probability will converge after a reasonable number of generations.

Exponential processes are always surprising. If you have a trait or parameter that confers only a 1% advantage in survival, it will have an effect of (1.01)^100 ≈ 2.7x after only 100 generations. After 1000 generations the effect is ~21,000x.
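
To put rough numbers on that compounding (a toy deterministic model, nothing from the article): the standard haploid selection update p' = p(1+s) / (p(1+s) + (1-p)) shows a 1% advantage sweeping from one in a million to near fixation in a couple of thousand generations.

    # How fast a 1% fitness advantage spreads under deterministic haploid selection.
    s, p = 0.01, 1e-6                    # 1% advantage, starting at one in a million
    for gen in range(1, 2001):
        p = p * (1 + s) / (p * (1 + s) + (1 - p))
        if gen in (1000, 1500, 2000):
            print(gen, round(p, 3))
    # roughly: 1000 -> 0.021, 1500 -> 0.752, 2000 -> 0.998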


I think you might be severely underestimating the level of parallelization happening in living beings. There are 100 trillion bacteria in every single human. Extrapolated to the number of bacteria on the planet, that's a whole lot of computational ability.


I think you're severely underestimating 4^300. The answer to GP's question is that it's not an iterative brute-force. But I'm not a biologist, so I can't go into any more detail...
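
Rough orders of magnitude (my own back-of-envelope, using the commonly cited ~5x10^30 estimate for the number of bacteria on Earth):

    from math import log10
    print(300 * log10(4))               # ~180.6: 4^300 is about 10^180 possible sequences
    print(log10(5e30 * 4e9 * 3.15e7))   # ~47.8: every bacterium on Earth mutating once per
                                        # second for 4 billion years samples only ~10^48 of them

So yes, it can't be iterative brute force; selection and reuse have to prune the space enormously.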


I think it's more important that it's a branch-and-prune algorithm. Many branches are created, but they are quickly pruned if they are worse or just unlucky. (Someone may be the "best" human alive, but if a piano falls on their head, they're dead and pruned.)

Branches are retried, so if some branch is an improvement and is cut by bad luck, it may be luckier a few million years later.

This does not guarantee that the "best" solution is found, but it also avoids searching all 4^300 combinations. Also, there are shorter genes: some proteins have ~50 amino acids (~150 bases), and some useful short amino acid chains have a length of 20 or even less (~60 bases or less). It's possible to start with a short version that does something slightly useful, and slowly increase the length and efficiency.


The 'branch-and-prune' aspect of evolution seems to be the primary lifeline out of the combinatorics conundrum.


Look up The Logic of Chance by Eugene Koonin. Written by an actual evolution researcher of current times instead of legacy hacks like you-know-who.

This book is even better because it actually talks a lot about horizontal gene transfer's role in prokaryotic evolution, and Borgs might actually be something involved in a similar process. Kinda prescient if you ask me!


I am a layperson, so I would be really curious who this you-know-who is.


Richard Dawkins of course!


Looked perfect so I just bought the paperback, thank you!


Three factors, I think, might be at play:

1. Parallelism. The number of all sorts of organisms going through mutations is large. Like 10^40 kind of large.

2. Time. This has been going on at a very rapid tempo for quite a while.

3. Evolutionary pressure. In every generation, harmful mutations are radically weeded out so your search space is dramatically reduced at each generation.

While the first two could be roughly estimated, the third one involves non-linearity that is very sensitive to estimation errors. So I don’t think anybody can _prove_ this is how we ended up with Angela Merkel, but it’s not implausible either, and nobody has a better idea.


Dawkins talks about this in The Blind Watchmaker, calling it cumulative selection (not sure where else this term is used). It's hard to wrap my head around, but that also makes sense given that our intuition is not adapted for such large numbers. Considering how long the time scale is, how many mutations are discarded, the driving force of survival, and probably other factors I'm forgetting, it's not that surprising we end up with a small (relative to the combinatorial possibilities) number of configurations. But again, I do still feel like I'm trying to convince myself right now, even though when I read the (better written) explanations, it makes sense in that moment.
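
A tiny sketch of cumulative selection, loosely after Dawkins's "weasel" illustration from The Blind Watchmaker (parameters are arbitrary): instead of re-drawing the whole string at random each time, keep the best mutant each generation.

    import random

    TARGET = "METHINKS IT IS LIKE A WEASEL"
    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

    def mutate(s, rate=0.05):
        return "".join(random.choice(ALPHABET) if random.random() < rate else c for c in s)

    def score(s):
        return sum(a == b for a, b in zip(s, TARGET))   # letters already correct

    current = "".join(random.choice(ALPHABET) for _ in TARGET)
    generation = 0
    while current != TARGET:
        generation += 1
        # keep the parent as a candidate so the score never goes backwards
        current = max([current] + [mutate(current) for _ in range(100)], key=score)
    print(generation)   # typically a few hundred generations, vs ~27^28 blind random tries

The point is that keeping partial successes turns an astronomically unlikely search into a quick one.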


Dennett talks about this sort of thing in broad outline in Darwin's Dangerous Idea, which is aimed at the layperson (though fairly dense and demanding). I'm not sure if it's mathematically detailed enough for what you seek, but it might be worth a look.


I'm sure the readings that other people linked will be useful, but two points to give some intuition:

1. Genes don't have to be optimal, or even close to optimal, to work. They just have to be good enough.

2. Nature loves to copy. Large segments of DNA can be copied by a number of mechanisms and randomly placed elsewhere in a genome. So once nature "discovers" (for example) a DNA-binding motif, that motif can be added to other genes, and now you have a diverse set of DNA-binding proteins, which will continue evolving on their own.


I think you'll enjoy Andreas Wagner's 2014 Arrival of the Fittest (https://amzn.com/B00INIQTA6). The title is of course a play on the (in)famous phrase "survival of the fittest".

The book is about how nature manages to actually explore the vast, vast genetic space and harvest its bounties cumulatively, while under the constraint that every "step" must be a viable organism with offspring.


I found the programmer.


I think it is very inspiring that this was found literally in the backyard of the scientist. Nature must have all sorts of novel biology like this waiting to be discovered.



> Gene sequence similarity, phylogeny, and local divergence of sequence composition indicate that many of their genes were assimilated from methane-oxidizing Methanoperedens archaea. We refer to these elements as “Borgs”.

These elements are named for the feature of being assimilated; seems like a Star Trek reference here. :)

Quote from the original paper referenced at https://news.ycombinator.com/item?id=27816108


What if genetic material has been free to move from protocells to protocells for eons before parasitic or predatory behaviour (and the necessary counteractions) arose?


I like this thought; it kinda describes biology before genes became selfish.


An entertaining read, but I have no doubt that there will be a mundane explanation eventually.


I always recommend waiting like 5 years until the dust settles and you can see past the overhype of the authors and the overhype of the university press release, but after reading about megaviruses [1], transposons [2], plasmids [3] and other stuff, this new category does not look impossible.

[1] https://en.wikipedia.org/wiki/Pandoravirus

[2] https://en.wikipedia.org/wiki/Transposable_element

[3] https://en.wikipedia.org/wiki/Plasmid


The sequencing machines were probably not designed to deal with dirt. Some small chance that it is a problem with the machines.


> The sequencing machines were probably not designed to deal with dirt.

The sequencing machines are probably designed to deal with DNA that's been extracted and purified, which is presumably what the scientists gave them from the mud, rather than shoveling mud into a sequencer. (And, once they had characterized “Borgs” from their samples, they then found a bunch more in public databases...)


Confusing name. "Borg" means "castle" or "fort" in Swedish, but it seems that the name refers to the "Borg collective" from Star Trek: The Next Generation. So it is a derivative of a derivative name. I kind of miss when new discoveries were named after Latin words; being a Spanish speaker, many of them had a sense of coherence with the language. Better than Sonic hedgehog protein, though.

update: For all the downvotes, I recommend reading "How new words are born": https://www.theguardian.com/media/mind-your-language/2016/fe... The rules are, of course, not enforced by anyone but emerge from a common understanding of the language.


Borg is short for Cyborg. Cyborg is a portmanteau of Cybernetic Organism. Cybernetic is from the Greek kybernetike, while Organism is from the Greek organismos.

You're happy now.


Huh, I didn't realize that Google's Borg and Kubernetes projects had names that were related that way.


A person who makes a comment like that isn’t going to be happy with anything.


> A person who makes a comment like that isn’t going to be happy with anything.

:)


> You're happy now.

The problem still persists. The word only makes sense as a reference to the popular TV show, since Cyborg is the actual short form of Cybernetic Organism. It would be like shortening Computer Language to "ulang": not very meaningful and probably confusing.

Language has an arbitrary history, but it is molded by evolutionary mechanisms. Adding random words with little thought makes communication more difficult.

For an English speaker, adding Tok-Tik to their vocabulary is cumbersome; adding Tik-Tok is easy. Communication is hard enough without adding new words with no rhyme or reason.


Language changes. You'll get over it. Or maybe you won't, but the rest of the world will.

The only evolutionary mechanism for language is what gets used, not what makes sense to you. Your idea that "cyborg" is somehow special because that's the one contraction that makes sense to you just means you're not participating in that particular evolutionary language branch, not that it's an invalid branch.


> Language changes.

Not as much as it should; the predominance of orthographic correction software can slow down the pace of needed change.

The rule I see for Cybernetic Organism becoming cyborg is to take a syllable from one word and another from the other. Similarly, Electronic Mail becomes email.

There is also a rule that, when reading, the first and last letters are the most important ones (something like this: https://www.dictionary.com/e/typoglycemia/).

So, I could be wrong and maybe there are other examples of words that are shortened by taking single letters from the middle of the word. English is not my mother tongue.


> So, I could be wrong and maybe there are other examples of words that are shortened by taking single letters from the middle of the word.

“Borg” derives from the already established “cyborg”, the letters are taken from the beginning (if you mean the elisions; the end if you mean the retained letters), not the middle.


Sorry you're being downvoted, I also strongly agree that new-word formation rules usually should be stricter for scientific purposes, and heuristics regarding prefixes and suffixes should apply.

There are cases that shorten words even more drastically (Richard to Rick to Dick, by rhyming, comes to mind), but all of those I know of are colloquial.

It does feel like overall standards are being lowered here.


> So, it is a derivative of a derivative name.

Most words in English (and a whole bunch of other languages) that aren't unchanged or one-step removed from Proto-Indo-European (and possibly those that are, since that's just as far back as we are kind of able to reconstruct) are either derivatives of derivatives or newer out-of-the-blue inventions (often derivatives of derivatives of inventions).

> For all the downvotes, I recommend reading "How new words are born"

But...the word you are complaining about is an example of #7 (followed by #4 for the use in question), from “cyborg”, so...your complaint about it being abnormal is undermined by your own citation.


> So, it is a derivative of a derivative name.

Well that's how language works. Do you think those Latin names came out of nowhere? Or do you think they're themselves derived from Italic languages and other sources?


Latin is known, stable, and a common ancestor for Spanish and English speakers.


So is Greek, where we got the roots of cybernetic and organism.


I think “Borg” comes from “Cyborg” which comes from “Cybernetic Organism.”

https://en.m.wiktionary.org/wiki/cyborg


> update: For all the downvotes, I recommend reading "How new words are born": https://www.theguardian.com/media/mind-your-language/2016/fe...

Right… so isn’t this ‘4. Repurposing’? They’ve done what your article describes.


In the original paper they hint that the name comes from the "assimilated" feature, which does seem like a Star Trek reference after all.

see https://news.ycombinator.com/item?id=27870406


They didn’t hint; the article says it explicitly: she was inspired by her son to use Star Trek references for naming things.


Borg... sounds swedish?

ST first contact


So borg is swedish after all!



