Mysterious DNA sequences, known as ‘Borgs,’ recovered from California mud (sciencemag.org)
225 points by walterbell on July 17, 2021 | 62 comments



What if this is just an artifact of the DNA sequence assembly? The soil they collect probably contains DNA from many organisms, and the sequence assembly algorithms may simply be finding patterns in noise [1].

[1] https://en.m.wikipedia.org/wiki/Apophenia


Was thinking the same and wondering how they obtained the million-base-long sequences. It turns out they used short-read sequencing with 150 or 250 bp reads and computationally assembled the long sequences [1]. While this is a traditionally valid method, the newer long-read sequencing technologies such as Oxford Nanopore or PacBio would be a more appropriate and direct method.

[1] https://www.biorxiv.org/content/10.1101/2021.07.10.451761v1....


PacBio reads don't extend to million-base lengths, and they are a considerably more expensive approach than short-read sequencing if assembly is happening anyway.

ONT is a great tool, but the goal of this report isn't necessarily to show that these molecules are 1 million bases in length so much as to show that they represent novel DNA sequences.

In summary, ONT and PacBio are neither more appropriate nor necessarily more direct methods here.


Would have to disagree. They will need to confirm these were not artifacts. Long-read sequencing is the best approach to show that the computationally assembled sequences indeed come from single contiguous molecules.


I'm not sure what kind of artifact would assemble into a series of novel ORFs but I suppose we will indeed have to disagree.


Long reads are still not as widely used, since the error rates are much higher than in the short-read shotgun approach.


Actually, for the use case of genome assemblies you can compensate for the error rate with depth of coverage. Long reads are already the state-of-the-art for assembling genomes and are allowing us to get information that is simply invisible using short reads.

https://www.biorxiv.org/content/10.1101/2021.05.26.445798v1


One would think a hybrid approach would bring the best results...?

Long reads help with alignment and arrangement, and short reads eliminate small errors of just a few base pairs.


It's definitely getting better, but errors have to be corrected computationally, and it still seems like a big challenge.


Yes. This type of error can be easily detected. If it's a contamination or an error, it will have very low read coverage (a typical sequencing project is 50-200x, meaning that if you realign and pile up the reads back onto the main assembly, you get a roughly normal distribution centered around 50-200 reads supporting a particular base or region). If you have a very low-coverage region or a sharp drop-off in coverage, then it's most likely an error.

Metagenomic assembly pipelines also do additional binning prior to assembly, based on the abundance and GC content of the clustered reads, to separate out the different taxa in the sample. How well the read clusters are separated by this heuristic is another measure of assembly quality.
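
To make that concrete, here's a toy sketch (mine, not from the paper) of the coverage sanity check, assuming you've already realigned reads to the assembly and dumped per-base depth with something like `samtools depth`:

    import numpy as np

    def flag_low_coverage(depth, window=1000, min_frac=0.2):
        # depth: per-base read depth along one contig (e.g. parsed from `samtools depth`)
        # window: window size in bases
        # min_frac: fraction of the global mean below which a window looks suspicious
        depth = np.asarray(depth, dtype=float)
        global_mean = depth.mean()              # e.g. ~50-200x for a typical project
        suspicious = []
        for start in range(0, len(depth), window):
            w = depth[start:start + window]
            if w.mean() < min_frac * global_mean:
                suspicious.append((start, start + len(w), float(w.mean())))
        return suspicious

    # Example: a sharp coverage drop in the middle of an otherwise well-supported contig
    depth = np.concatenate([np.full(5000, 120), np.full(1000, 4), np.full(5000, 110)])
    print(flag_low_coverage(depth))             # -> [(5000, 6000, 4.0)]

A chimeric join between unrelated molecules would typically show up as exactly that kind of drop at the junction.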


> If it's a contamination or an error, it will have very low read coverage (a typical sequencing project is 50-200x, meaning that if you realign and pile up the reads back onto the main assembly, you get a roughly normal distribution centered around 50-200 reads supporting a particular base or region). If you have a very low-coverage region or a sharp drop-off in coverage, then it's most likely an error.

Wouldn't sequence-specific biases (capture efficiency, amplification) result in distortions?


This could be the case. I remember a controversy a few years ago about the tardigrade genome. The problem was that another paper said that "foreign" DNA was the probable reason behind cryptobiosis in tardigrades. This particular paper found that it was most likely contamination.

Koutsovoulos, G., Kumar, S., Laetsch, D. R., Stevens, L., Daub, J., Conlon, C., Maroon, H., Thomas, F., Aboobaker, A. A., & Blaxter, M. (2016). No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini. Proceedings of the National Academy of Sciences, 113(18), 5053–5058. https://doi.org/10.1073/pnas.1600338113


Very unlikely; the algorithms used to assemble genomes can make errors, but not to such an extent.


I bet you can calculate the probability that random short reads from a bunch of organisms can be assembled into a single 1-megabase-long sequence. And that probability is going to be low.
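
A back-of-the-envelope version of that (my own toy model, assuming i.i.d. random bases and ~10^8 reads; a real analysis would be more involved): the expected number of purely coincidental suffix/prefix overlaps of length >= k collapses very fast as k grows, so chaining thousands of spurious joins into a megabase contig is essentially impossible.

    # Expected number of coincidental suffix/prefix overlaps of length >= k among
    # n random reads (ordered pairs, both strands), under an i.i.d. uniform-base model.
    def expected_spurious_overlaps(n_reads, k):
        per_pair = (4 / 3) * 4 ** (-k)          # sum of 4^-j for j >= k
        return 2 * n_reads ** 2 * per_pair

    for k in (20, 30, 40):
        print(k, f"{expected_spurious_overlaps(1e8, k):.2g}")
    # roughly: k=20 -> 2.4e+04, k=30 -> 0.023, k=40 -> 2.2e-08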


Are there any works consumable by a layperson that explore the mathematical probabilities of evolution and genetic diversity? When you look at it from the perspective of simple combinatorics, nothing seems possible. A single gene has say 4^300 permutations. You could devote the entire mass of the known universe iterating on versions of a protein a trillion times per second for a trillion years and not even remotely scratch the surface of possible configurations.


A single neural network with 8 bit weights and 1M parameters has 256^1000000 possibilities by that logic, but we can train it fine with simple evolutionary algorithms like SPSA (Simultaneous perturbation stochastic approximation)

E.g., https://openai.com/blog/evolution-strategies/

One of the remarkable results is that convergence rate for each parameter is not strongly dependent on the number of parameters.
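
For what it's worth, SPSA itself is only a few lines. A minimal sketch (my own toy; the dimension and step sizes are arbitrary demo values, not tuned for a real network):

    import numpy as np

    def spsa_minimize(f, theta0, a=0.002, c=0.01, iters=2000, seed=0):
        # Minimal SPSA: each step needs only two evaluations of f, no matter how
        # many parameters there are, because every coordinate is perturbed at once.
        rng = np.random.default_rng(seed)
        theta = np.array(theta0, dtype=float)
        for _ in range(iters):
            delta = rng.choice([-1.0, 1.0], size=theta.shape)   # simultaneous +/-1 perturbation
            g_hat = (f(theta + c * delta) - f(theta - c * delta)) / (2 * c) * delta
            theta -= a * g_hat                                   # gradient-descent-style update
        return theta

    # Toy check on a 100-dimensional quadratic bowl.
    loss = lambda th: np.sum(th ** 2)
    theta = spsa_minimize(loss, np.ones(100))
    print(loss(np.ones(100)), "->", loss(theta))                 # 100.0 -> much closer to 0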


Sure, but we can't really explain why that works either, can we? However, I would suggest that there's a difference with evolution.

With the neural networks we train, there is actually a highly engineered process required to create the state in which a network can be trained and inferences be driven through it. The necessary combination of hardware that is able to perform and persist operations on information and the algorithms required to do so in a way to yield this outcome is an extremely complex set of pre-conditions that we wouldn't expect to find in the computing equivalent of a primordial soup.

With natural evolution, there is no obvious agency or intent behind it. Who is there to care whether or not life started on Earth, and/or who is driving the laws of nature such that the constructive, generative process of genetic evolution actually 'works' as well as it does? Seemingly nobody. Yet this process is able to create systems that operate on scales that we can only dream of. Look up YouTube videos on ATP Synthase for example. It's a nanomachine in every sense of the word. It uses the proton equivalent of a water wheel to spin a little machine that grabs a molecule of ADP, a molecule of inorganic phosphate, then literally snaps them together with mechanical leverage to make ATP. This little miracle machine that powers most of life on earth was built in literal and figurative darkness...it's so damn small light can't see it, and there was nobody there to appreciate its beauty until we came along billions of years later.

Ultimately I'm not surprised natural evolution works, I'm surprised at its speed and efficacy.


This is an insightful comment but I think you’re missing one thing in your understanding of natural evolution. Evolution has a “reward function” and it’s survival of the fittest.

As reproduction produces different variants of the same organism, some variations help the organism while others do not. Organisms with the helpful mutations will be more likely to pass those onto their offspring. Organisms with detrimental variations will be less likely to pass those variations to their kids.

It’s not a precise process like gradient descent, but when there are billions (trillions?) of organisms evolving simultaneously and independently, it makes more sense how the complexity of biology has come about.


Actually, the most remarkable thing you find is how _little_ structure you need to get evolutionary algorithms to work. I've thrown together random structures with feedback that are inherently nonlinear and chaotic, and evolutionary algorithms are able to quickly find parameters that stabilize them and optimize for the survival reward. The solutions are incredibly clever, and in my case sometimes matched designs in PhD theses that took a human decades to design by hand.

Similarly, the details of the evolutionary 'training' also tend not to matter much: just about any algorithm that prefers better instances (by whatever metric) with a very slightly higher probability will converge after a reasonable number of generations.

Exponential processes are always surprising. If you have a trait or parameter that confers only a 1% advantage in survival, it will have an effect of (1.01)^100 ≈ 2.7x after only 100 generations. After 1000 generations the effect is ~21,000x.
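
To put rough numbers on that compounding (a toy deterministic model, nothing from the article): the standard haploid selection update p' = p(1+s) / (p(1+s) + (1-p)) shows a 1% advantage sweeping from one in a million to near fixation in a couple of thousand generations.

    # How fast a 1% fitness advantage spreads under deterministic haploid selection.
    s, p = 0.01, 1e-6                    # 1% advantage, starting at one in a million
    for gen in range(1, 2001):
        p = p * (1 + s) / (p * (1 + s) + (1 - p))
        if gen in (1000, 1500, 2000):
            print(gen, round(p, 3))
    # roughly: 1000 -> 0.021, 1500 -> 0.752, 2000 -> 0.998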


I think you might be severely underestimating the level of parallelization happening in living beings. There are 100 trillion bacteria in every single human. Extrapolated to the number of bacteria on the planet, that's a whole lot of computational ability.


I think you're severely underestimating 4^300. The answer to GP's question is that it's not an iterative brute-force. But I'm not a biologist, so I can't go into any more detail...
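
Rough orders of magnitude (my own back-of-envelope, using the commonly cited ~5x10^30 estimate for the number of bacteria on Earth):

    from math import log10
    print(300 * log10(4))               # ~180.6: 4^300 is about 10^180 possible sequences
    print(log10(5e30 * 4e9 * 3.15e7))   # ~47.8: every bacterium on Earth mutating once per
                                        # second for 4 billion years samples only ~10^48 of them

So yes, it can't be iterative brute force; selection and reuse have to prune the space enormously.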


I think it's more important that it's a branch-and-prune algorithm. Many branches are created, but they are quickly pruned if they are worse or just unlucky. (Someone may be the "best" human alive, but if a piano falls on their head, they're dead and pruned.)

Branches are retried, so if some branch is an improvement and is cut by bad luck, it may be luckier a few million years later.

This does not guarantee that the "best" solution is found, but it also avoids searching all 4^300 combinations. Also, there are shorter genes: some proteins have ~50 amino acids (~150 bases), and some useful short amino acid chains have a length of 20 or even less (~60 bases or less). It's possible to start with a short version that does something slightly useful, and slowly increase the length and efficiency.


The 'branch-and-prune' aspect of evolution seems to be the primary lifeline out of the combinatorics conundrum.


Look up The Logic of Chance by Eugene Koonin. Written by an actual evolution researcher of current times instead of legacy hacks like you-know-who.

This book is even better because it actually talks a lot about horizontal gene transfer's role in prokaryotic evolution, and Borgs might actually be something involved in a similar process. Kinda prescient if you ask me!


I am a layperson, so I would be really curious who this you-know-who is.


Richard Dawkins of course!


Looked perfect so I just bought the paperback, thank you!


Three factors, I think, might be at play:

1. Parallelism. The number of all sorts of organisms going through mutations is large. Like 10^40 kind of large.

2. Time. This has been going on at a very rapid tempo for quite a while.

3. Evolutionary pressure. In every generation, harmful mutations are radically weeded out so your search space is dramatically reduced at each generation.

While the first two could be roughly estimated, the third one involves non-linearity that is very sensitive to estimation errors. So I don’t think anybody can _prove_ this is how we ended up with Angela Merkel, but it’s not implausible either, and nobody has a better idea.


Dawkins talks about this in The Blind Watchmaker, calling it cumulative selection (not sure where else this term is used). It's hard to wrap my head around, but that also makes sense given that our intuition is not adapted for such large numbers. Considering how long the time scale is, how many mutations are discarded, the driving force of survival, and probably other factors I'm forgetting, it's not that surprising we end up with a small (relative to the combinatorial possibilities) number of configurations. But again, I do still feel like I'm trying to convince myself right now, even though when I read the (better written) explanations, it makes sense in that moment.
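
A tiny sketch of cumulative selection, loosely after Dawkins's "weasel" illustration from The Blind Watchmaker (parameters are arbitrary): instead of re-drawing the whole string at random each time, keep the best mutant each generation.

    import random

    TARGET = "METHINKS IT IS LIKE A WEASEL"
    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

    def mutate(s, rate=0.05):
        return "".join(random.choice(ALPHABET) if random.random() < rate else c for c in s)

    def score(s):
        return sum(a == b for a, b in zip(s, TARGET))   # letters already correct

    current = "".join(random.choice(ALPHABET) for _ in TARGET)
    generation = 0
    while current != TARGET:
        generation += 1
        # keep the parent as a candidate so the score never goes backwards
        current = max([current] + [mutate(current) for _ in range(100)], key=score)
    print(generation)   # typically a few hundred generations, vs ~27^28 blind random tries

The point is that keeping partial successes turns an astronomically unlikely search into a quick one.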


Dennett talks about this sort of thing in broad outline in Darwin's Dangerous Idea, which is aimed at the layperson (though fairly dense and demanding). I'm not sure if it's mathematically detailed enough for what you seek, but it might be worth a look.


I'm sure the readings that other people linked will be useful, but two points to give some intuition:

1. Genes don't have to be optimal, or even close to optimal, to work. They just have to be good enough.

2. Nature loves to copy. Large segments of DNA can be copied by a number of mechanisms and randomly placed elsewhere in a genome. So once nature "discovers" (for example) a DNA-binding motif, that motif can be added to other genes, and now you have a diverse set of DNA-binding proteins, which will continue evolving on their own.


I think you'll enjoy Andreas Wagner's 2014 Arrival of the Fittest (https://amzn.com/B00INIQTA6). The title is of course a play on the (in)famous phrase "survival of the fittest".

The book is about how nature manages to actually explore the vast, vast genetic space and harvest its bounties cumulatively, while under the constraint that every "step" must be a viable organism with offspring.


I found the programmer.


I think it is very inspiring that this was found literally in the backyard of the scientist. Nature must have all sorts of novel biology like this waiting to be discovered.



> Gene sequence similarity, phylogeny, and local divergence of sequence composition indicate that many of their genes were assimilated from methane-oxidizing Methanoperedens archaea. We refer to these elements as “Borgs”.

These elements are named for the feature of being assimilated; seems like a Star Trek reference here. :)

Quote from the original paper referenced at https://news.ycombinator.com/item?id=27816108


What if genetic material has been free to move from protocells to protocells for eons before parasitic or predatory behaviour (and the necessary counteractions) arose?


I like this thought; it kinda describes biology before genes became selfish.


An entertaining read, but I have no doubt that there will be a mundane explanation eventually.


I always recommend waiting like 5 years until the dust settles and you can see past the overhype of the authors and the overhype of the university press release, but after reading about megaviruses [1], transposons [2], plasmids [3] and other stuff, this new category does not look impossible.

[1] https://en.wikipedia.org/wiki/Pandoravirus

[2] https://en.wikipedia.org/wiki/Transposable_element

[3] https://en.wikipedia.org/wiki/Plasmid


The sequencing machines were probably not designed to deal with dirt. Some small chance that it is a problem with the machines.


> The sequencing machines were probably not designed to deal with dirt.

The sequencing machines are probably designed to deal with DNA that's been extracted and purified, which is presumably what the scientists gave them from the mud, rather than shoveling mud into a sequencer. (And, once they had characterized “Borgs” from their samples, they then found a bunch more in public databases...)


Confusing name. "Borg" means "castle" or "fort" in Swedish, but it seems that the name refers to the "Borg collective" from Star Trek: The Next Generation. So it is a derivative of a derivative name. I kind of miss when new discoveries were named after Latin words; being a Spanish speaker, many of them had a sense of coherence with the language. Better than Sonic hedgehog protein, though.

update: For all the downvotes, I recommend reading "How new words are born": https://www.theguardian.com/media/mind-your-language/2016/fe... The rules are, of course, not enforced by anyone but emerge from a common understanding of the language.


Borg is short for Cyborg. Cyborg is a portmanteau of Cybernetic Organism. Cybernetic is from the Greek kybernetike, while Organism is from the Greek organismos.

You're happy now.


Huh, I didn't realize that Google's Borg and Kubernetes projects had names that were related that way.


A person who makes a comment like that isn’t going to be happy with anything.


> A person who makes a comment like that isn’t going to be happy with anything.

:)


> You're happy now.

The problem still persists. The word only makes sense as a reference to the popular TV show, since Cyborg is the actual short form of Cybernetic Organism. It would be like shortening Computer Language to "ulang": not very meaningful and probably confusing.

Language has an arbitrary history, but it is molded by evolutionary mechanisms. Adding random words with little thought makes communication more difficult.

For an English speaker, adding Tok-Tik to their vocabulary is cumbersome; adding Tik-Tok is easy. Communication is hard enough without adding new words with no rhyme or reason.


Language changes. You'll get over it. Or maybe you won't, but the rest of the world will.

The only evolutionary mechanism for language is what gets used, not what makes sense to you. Your idea that "cyborg" is somehow special because that's the one contraction that makes sense to you just means you're not participating in that particular evolutionary language branch, not that it's an invalid branch.


> Language changes.

Not as much as it should; the predominance of orthographic correction software can slow down the pace of needed change.

The rule I see for Cybernetic Organism becoming cyborg is to take a syllable from one word and another from the other. Similarly, Electronic Mail becomes email.

There is also a rule that, when reading, the first and last letters are the most important ones (something like this: https://www.dictionary.com/e/typoglycemia/).

So, I could be wrong and maybe there are other examples of words that are shortened by taking single letters from the middle of the word. English is not my mother tongue.


> So, I could be wrong and maybe there are other examples of words that are shortened by taking single letters from the middle of the word.

“Borg” derives from the already established “cyborg”, the letters are taken from the beginning (if you mean the elisions; the end if you mean the retained letters), not the middle.


Sorry you're being downvoted, I also strongly agree that new-word formation rules usually should be stricter for scientific purposes, and heuristics regarding prefixes and suffixes should apply.

There are cases that shorten words even more drastically (Richard to Rick to Dick, by rhyming, comes to mind), but all of those I know of are colloquial.

It does feel like overall standards are being lowered here.


> So, it is a derivative of a derivative name.

Most words in English (and a whole bunch of other languages) that aren't unchanged or one-step removed from Proto-Indo-European (and possibly those that are, since that's just as far back as we are kind of able to reconstruct) are either derivatives of derivatives or newer out-of-the-blue inventions (often derivatives of derivatives of inventions).

> For all the downvotes, I recommend reading "How new words are born"

But...the word you are complaining about is an example of #7 (followed by #4 for the use in question), from “cyborg”, so...your complaint about it being abnormal is undermined by your own citation.


> So, it is a derivative of a derivative name.

Well that's how language works. Do you think those Latin names came out of nowhere? Or do you think they're themselves derived from Italic languages and other sources?


Latin is known, stable, and a common ancestor for Spanish and English speakers.


So is Greek, where we got the roots of cybernetic and organism.


I think “Borg” comes from “Cyborg” which comes from “Cybernetic Organism.”

https://en.m.wiktionary.org/wiki/cyborg


> update: For all the downvotes, I recommend reading "How new words are born": https://www.theguardian.com/media/mind-your-language/2016/fe...

Right… so isn’t this ‘4. Repurposing’? They’ve done what your article describes.


In the original paper they hint that the name comes from the "assimilated" feature, which does seem like a Star Trek reference after all.

see https://news.ycombinator.com/item?id=27870406


They didn’t hint; the article says it explicitly: she was inspired by her son to use Star Trek references for naming things.


Borg... sounds swedish?

ST first contact


So borg is swedish after all!



