
Glad we're on the same page about the multiple techniques now. Statements you made like, "Pekar et al. do some complicated phylogenetic modeling that purports to show the MRCA in humans is too recent" and "This isn't any standard molecular clock approach. It's a byzantine stack of plausible but somewhat arbitrary assumptions" made it clear there was confusion before. Their tree is based on a couple of novel modifications to established techniques. Your characterizations were inaccurate and laughable.

> It carries a different set of plausible but arbitrary assumptions though, again about the stochasticity/overdispersion and sampling rate of early spread, just less directly.

So you have problems not only with the authors' modeling, but with their base phylogeny too? Do you reject their tMRCA? Good grief.

I'm still looking forward to discussing the molecular phylogenetics of this paper sometime.




On reflection, I believe the first of my statements that you've quoted was indeed incorrect, and that I was also incorrect when I just wrote:

> Their former model [...] purports to independently establish tMRCA in humans too recent for significant cryptic spread.

Even if SARS-CoV-2 really entered humans in December, with minimal cryptic spread, that's still enough time for the two lineages to evolve in humans, since they're (sorry) just two SNPs apart. I believe Worobey knows this, and that's the reason why he emphasizes the "Separate introductions" model, since their polytomy thing--and not any question of time for cryptic spread--is their best and only argument to exclude that. So I was wrong to mention the tMRCA at all, since even perfect knowledge of that wouldn't tell us confidently how the two lineages arose.

The second of my statements seems correct to me. Not only is their argument for two introductions not a standard molecular clock approach, but it's not a molecular clock approach at all, since "Inferring" provides no support. Their only support comes from the polytomy thing in "Separate". This makes the accuracy of their epidemiological simulation highly relevant, thus the "hand-wringing" over that.

I'd note that you yourself referred me to "Separate", back in:

https://news.ycombinator.com/item?id=32258096

So why did you switch to "Inferring"? I guess we could discuss that too, but per above I don't believe that could provide significant support for two introductions into humans, and thus not for natural vs. research-related origin. Do you believe otherwise? Or do you just mean the approach is of general interest, independently of that question of origin?


> Not only is their argument for two introductions not a standard molecular clock approach, but it's not a molecular clock approach at all, since "Inferring" provides no support

Okay, let's revisit this now that some of the terminology confusion is recognized.

"Inferring the MRCA of SARS-CoV-2" introduces their phylogenies. It was produced with BEAST as described in their methods. I believe this is the model you were referring to as "Inferring." Yes?

I don't understand what you're trying to say here. If you don't understand how their phylogeny helps support their theory of multiple introductions, I don't know what to tell you. Maybe just another clarification of what you're trying to say would help.

> I'd note that you yourself referred me to "Separate", back in ... So why did you switch to "Inferring"

Because we're discussing multiple things in the same paper?


> Even if SARS-CoV-2 really entered humans in December, with minimal cryptic spread, that's still enough time for the two lineages to evolve in humans, since they're (sorry) just two SNPs apart.

This isn't the evidence the authors present. The argument isn't "there isn't enough time to go from A -> B." IIRC, I've seen similar acknowledgements that even rarer mutations have been observed in a single transmission during the course of the pandemic. They're just highly improbable.

The most direct evidence (as I see it) for B not evolving from A in humans is the unexpected lack of genetic divergence in lineage A compared to B. Lineage B should show a younger molecular clock; it doesn't.
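
To make the molecular-clock intuition concrete, here is a minimal root-to-tip regression sketch. This is not the paper's method (their trees come from BEAST); it is a much simpler stand-in. The sample days and divergences are made-up illustrative numbers, and `fit_clock` is a hypothetical helper.

```python
def fit_clock(days, divergence):
    """Least-squares line through (sampling day, substitutions from root).

    Returns (rate per day, estimated day of the MRCA). The MRCA estimate is
    the x-intercept, i.e. the day on which the fitted divergence is zero.
    """
    n = len(days)
    mean_t = sum(days) / n
    mean_d = sum(divergence) / n
    cov = sum((t - mean_t) * (d - mean_d) for t, d in zip(days, divergence))
    var = sum((t - mean_t) ** 2 for t in days)
    rate = cov / var
    intercept = mean_d - rate * mean_t
    return rate, -intercept / rate


# Made-up illustrative data, NOT values from the paper.
sample_days = [10, 20, 30, 40, 50]
subs_from_root = [2.0, 2.9, 3.4, 4.1, 4.6]

rate, mrca_day = fit_clock(sample_days, subs_from_root)
print(f"rate ~ {rate:.3f} substitutions/day, MRCA ~ day {mrca_day:.1f}")
```

Under a strict clock, expected divergence grows linearly with time since the MRCA, so a lineage with an older MRCA should show more accumulated divergence at a given sampling date.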

> I believe Worobey knows this, and that's the reason why he emphasizes the "Separate introductions" model, since their polytomy thing--and not any question of time for cryptic spread--is their best and only argument to exclude that. So I was wrong to mention the tMRCA at all, since even perfect knowledge of that wouldn't tell us confidently how the two lineages arose.

Nonsense. The tMRCA is key evidence in how the lineages arose. One of the reasons for the epi modeling was to figure out the plausible time between the primary case and index case. It shows that at most a few dozen people were infected before the genetic diversity was captured through sampling. (`Results: Minimal cryptic circulation of SARS`)

I don't think you understand their argument here, at all.

> Not only is their argument for two introductions not a standard molecular clock approach, but it's not a molecular clock approach at all, since "Inferring" provides no support

> So why did you switch to "Inferring"?

I don't understand why you're bristling and reading into the terminology here. https://plato.stanford.edu/entries/phylogenetic-inference/

Please elaborate on why you think their use of the molecular clock is novel. It's really not.

> Do you believe otherwise? Or do you just mean the approach is of general interest, independently of that question of origin?

As explained above, I think the authors provide compelling evidence of multiple introductions using solid phylogenetic inference and solid molecular epidemiology. Bottom line is that there simply isn't an alternate hypothesis which explains the available evidence, and they illustrate why.

Here's a video you might not have seen, with Pekar and Wertheim. I've cued up the portion with a great explanation of why the evidence in the MRCA and genomics is so important. If you're going to continue to try and tear down their arguments, you probably want to really get this part.

https://www.youtube.com/watch?v=TYqJCdqdkio&t=3330 (especially 1h12m45, and 1h19m)


I think I understand what Worobey and Pekar write on Twitter, though I disagree with much of it. I don't understand what you're saying, so I'm afraid we're still talking past each other.

Do you agree that there are two mostly-independent models in the paper, one described in the section titled "Inferring the MRCA of SARS-CoV-2", and another in the section titled "Separate introductions of lineages A and B"? When I write "Inferring" and "Separate", I am referring to the models described in the sections with titles beginning with those respective words.

You wrote earlier:

> His epi simulations are separate from the tree-building, with the possible exception of rooting, which he was using the output of the models to inform. Otherwise, the epi modeling which everyone is hand wringing over is really separate and doesn't end "in a simulated phylogenetic tree."

As to "Separate", I believe that's incorrect. That model begins with an SIR-type simulation, and outputs the shape (polytomy structure) of the phylogenetic tree of that simulated pandemic, which they compare against the shape of the real pandemic's phylogenetic tree. Do you disagree? If so, what do you believe is the output of that "Separate" model?

I agree that the "Inferring" model does not depend on the epidemic simulation. I don't believe the "Inferring" model provides significant support for two introductions though. I believe that's the reason why most public debate has been about "Separate".


Yeah, I think we're basically on the same page with their methodology and models now.

I didn't realize you were nicknaming the models after the titles of the results sections, so I was quite confused, especially when we both used those words in the quoted sections, so it sounded like you were referring to portions of our conversation. So yeah, talking right past each other.

No, the two models don't correspond to the results cleanly. I.e., when the authors claim "Separate introductions of lineages A and B" in the results, they provide evidence from both. (They're presenting the results of the models in support of their phylogeny.) I agree that "Inferring the MRCA of SARS-CoV-2" is pretty much independent of the epi stuff.

> As to "Separate", I believe that's incorrect. That model begins with an SIR-type simulation, and outputs the shape (polytomy structure) of the phylogenetic tree of that simulated pandemic, which they compare against the shape of the real pandemic's phylogenetic tree. Do you disagree? If so, what do you believe is the output of that "Separate" model?

I thought we were over this. We both agree that one of the results of the epi simulations was sampled genetics and a resulting tree from the simulation. That doesn't mean that their phylogeny is the direct result of their epi simulations. Their simulations are in support of their phylogeny. Their theorized phylogeny essentially existed prior to the modeling, which is why I called them separate, i.e., independent.

The `Materials and methods summary` is quite clear, especially `Phylodynamic inference and epidemic simulations`.

edit: Our thread is too deep for HN, so you might not be able to reply? I'll try to keep an eye out for new replies if you want to fork off somewhere else.

But, where's your horse in this race? You speak a lot about what you think sucks and very little about what you actually believe here.

> I agree that the "Inferring" model does not depend on the epidemic simulation. I don't believe the "Inferring" model provides significant support for two introductions though. I believe that's the reason why most public debate has been about "Separate".

Funny. My theory is that most people don't have enough knowledge of molecular genetics to make heads or tails of the paper, and so are of course silent on those results. They didn't follow the debate over the past few years, and are showing up and trying to understand something without context or the requisite knowledge.

When you say "Public debate" you need to admit you're talking about a particular part of a particular website or two where a small number of people are picking at nits and can't even address the core of the findings the authors present here.


We're making some progress, at least. I believe this site rate-limits deep threads, but doesn't cut them off entirely.

So I guess we were also talking past each other on "Separate". By "simulated phylogenetic tree", I've always meant "phylogenetic tree for one of their simulated pandemics", not a tree for the real pandemic. We also agree that Pekar's argument isn't based on the time necessary for the two lineages to evolve in humans, since at least that much difference could arise even (with p ~ 10%) in a single human-to-human transmission.

So to exclude evolution of the two lineages in humans, they needed something else. Loosely, that's the observation that (stochasticity of spread aside) we'd expect the earlier lineage A to have more and more diverse descendants than the later lineage B. Their epi model in "Separate" is a formalization of that, and if they could correctly and confidently model that spread then I believe it would be sound.

It seems like we disagree as to what forms the paper's core result, though. I'm taking my own cue from Worobey's Twitter comments, because (a) he's an author, so he presumably should know better than most, and (b) while I disagree with his conclusion, I do see the flow of his argument. In the thread that you linked and I quoted, he describes the result of that "Separate" model--which fundamentally depends on the epi stuff--as the crux of the paper. That makes sense to me.

I believe you prefer to think in terms of construction of the phylogenetic tree for the real pandemic, and to frame the question of the number of introductions in terms of the number of roots for the tree. That's in a certain sense equivalent, but it seems much less intuitive to me. The "Separate" approach makes the epidemiological assumptions explicit. Those assumptions are obviously always relevant though, so they're still relevant when you frame the problem in terms of the real tree; they're just much harder to express in the parameters (R0, serial interval, dispersion parameter k, etc.) typically used to model a pandemic.

When they built the real tree, they observed that any single root fits badly. (Per your other comment, I agree that's what they did in "Inferring" with BEAST.) More roots would fit better; but that's always true for any phylogeny unless there's a penalty for each additional root, since more roots improve all the other usual measures of fit. Without quantifying what that penalty per additional root should be, it's not possible to say whether the poor fit is because the tree really should have two roots, or for other reasons (unmodeled stochasticity of spread, imperfect sampling, etc.). It's not easy to convert those pandemic parameters into that penalty. So it makes sense to me that they didn't try, and instead switched to the SIR-type simulations in "Separate", which they're treating as their most important result.

As I've noted earlier, I don't believe it's possible to reach any confident conclusion (as to research-related vs. natural origin, the number of introductions into humans, or most of the other topics of major contention) from the evidence currently available. I'd have little objection to this paper if it were framed as exploratory work, whose speculative conclusions should not be trusted without further verification. That's not how Worobey and others have portrayed it in the popular media, though, and also not how you've initially portrayed it here.


I think it might be productive to dive in on this part:

> Loosely, that's the observation that (stochasticity of spread aside) we'd expect the earlier lineage A to have more and more diverse descendants than the later lineage B. Their epi model in "Separate" is a formalization of that, and if they could correctly and confidently model that spread then I believe it would be sound.

Yeah, that's the observation. However, you're invoking the epi model at the wrong time. If you read `Inferring the MRCA...`, all of this is already known and observed before the modeling is even run. The epi model doesn't contain these results. They constructed their SC2 tree, then brought it over to the epi model to play with it.

If you want a "formalization" of that observation, perhaps Table I will do.

The results are best read in order.

If you're trying to better understand the phylodynamic model, perhaps "Inference of Viral Evolutionary Rates from Molecular Sequences" by Drummond would be interesting.


I think you're failing to appreciate the reason why they built the "Separate" model. Their headline claim is that SARS-CoV-2 arose from two zoonotic introductions into humans. If you want to express that claim in terms of the real pandemic's tree, then the relevant tree is the tree in humans only, which would then have two roots.

The construction of such a tree inherently depends on our assumptions on the epi dynamics. For example, if you give me a hundred genomes and I propose a hundred roots, then that wouldn't usually be a very good tree; but if the disease in question were known to spread animal-to-human but not human-to-human, then that might be correct. Nothing in their "Inferring" model allows them to incorporate such obviously relevant information, so that seems like an obvious deficiency.

To put it another way, you write:

> If you read `Inferring the MRCA...`, all of this is already known and observed before the modeling is even run.

After "Inferring", I believe they know the real tree has structure that's obviously non-modal (i.e., not the most likely outcome) given any single introduction. I don't see how they'd know whether it's a p = 20% non-modal or p = 0.5% non-modal outcome without an epi model like "Separate", or some kind of ugly incorporation of the epi dynamics into BEAST that they wisely didn't attempt.

I believe that's why the authors built "Separate", and its basic form is good work. (If you don't, then why do you think they spent their time on that?) I just disagree with their parameter choices and excessive confidence in their result.

As to your other reply, I agree the 10% is a rough number, not considering mutation biases and such. That's just the probability in a single transmission though, and it's also possible (and more likely) that the two lineages formed in humans with intermediate lineages that went extinct before they could be sampled. I think we at least agree that timing alone is insufficient to exclude evolution of the two lineages in humans though, even assuming a December introduction? I'm just trying to confirm that none of the evidence you see for two introductions in "Inferring" comes from its tMRCA.
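
For what it's worth, here is a rough sketch of how a figure of that order can arise. It assumes the number of new substitutions in a single transmission is Poisson with mean `lam`; that mean is a hypothetical parameter, not a value from the paper, and the sketch ignores mutation biases, reversions, and which specific sites change.

```python
import math


def prob_at_least_two(lam: float) -> float:
    """P(N >= 2) for N ~ Poisson(lam): 1 - P(N = 0) - P(N = 1)."""
    return 1.0 - math.exp(-lam) * (1.0 + lam)


# lam values are illustrative assumptions only.
for lam in (0.25, 0.5, 1.0):
    print(f"lam={lam}: P(>=2 substitutions) = {prob_at_least_two(lam):.3f}")
```

A mean of about half a substitution per transmission gives roughly 9%, so the 10% is only as good as that assumed mean.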


ploink

Enjoy your sealioning.


Sorry; maybe I'm too stupid or lazy, but I genuinely don't get your point. Is it just that when they construct the tree in "Inferring", it looks qualitatively surprising (non-modal) given any single introduction, assuming (as I do as well) that A predates B? But we've known that for literally years now. As I understand the paper, their novel contribution is to quantify how surprising that looks, whether it's p ~ 20% surprising (which wouldn't mean much) or their claimed p ~ 0.5%. That's what they do in "Separate", and it correctly and inherently depends on the epidemiological modeling that I don't trust.

Again, in the Twitter thread that you yourself linked, Worobey says:

> This [the real polytomy structure] is something that [we] DO NOT see in ~99.5% of simulations. That is the crux of the paper.

The simulations in question are the epidemiological simulations from "Separate". You've told me to disregard Worobey's comments here; but while it's possible that Worobey has misunderstood the significance of his own paper, it seems more likely to me that you have.
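
The kind of tail probability at issue (20% vs. 0.5%) is a Monte Carlo fraction over simulated epidemics. The toy below is NOT the paper's simulation: it is a bare branching process with negative binomial offspring (mean `r0`, dispersion `k`), and the statistic (whether a lineage seeded `delay` generations later ends up larger than the earlier one) is a made-up stand-in for their polytomy test. All names and parameter values are hypothetical.

```python
import math
import random


def poisson(rng, mean):
    """Knuth's multiplication method; adequate for the modest means used here."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def final_size(rng, generations, r0, k):
    """Cases in the last generation of a branching process started by one case.

    Each case's offspring count is negative binomial, drawn as a Poisson
    whose mean is gamma-distributed with shape k and scale r0 / k.
    """
    cases = 1
    for _ in range(generations):
        cases = sum(poisson(rng, rng.gammavariate(k, r0 / k)) for _ in range(cases))
        if cases == 0:
            break
    return cases


def fraction_later_lineage_larger(runs, generations, delay, r0, k, seed=0):
    """Fraction of runs (both lineages surviving) where the later lineage is larger."""
    rng = random.Random(seed)
    hits = kept = 0
    for _ in range(runs):
        earlier = final_size(rng, generations, r0, k)
        later = final_size(rng, generations - delay, r0, k)
        if earlier == 0 or later == 0:
            continue  # condition on both lineages surviving to be sampled
        kept += 1
        hits += later > earlier
    return hits / kept if kept else float("nan")


# Illustrative parameter values only; none are taken from the paper.
print(fraction_later_lineage_larger(runs=1000, generations=5, delay=1, r0=2.5, k=0.3))
```

Re-running with different `r0` or `k` is one way to see how sensitive such a fraction is to the epidemiological assumptions.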


> (with p ~ 10%) in a single human-to-human transmission.

That math is absolute garbage. One, the odds of a C/T -> T/C double mutation in a single transmission for the clade-defining markers aren't the same as for T/C -> C/T, so at the very least you need to state an ancestral lineage to do any math like this. It also doesn't take into account the different priors for reversions, synonymous mutations, and the C-T transition bias in humans.

> When they built the real tree, they observed that any single root fits badly.

No. Go read the paper again. ("Our unconstrained rooting strongly favors a lineage B or C/C ancestral haplotype...") It's when you try and root in lineage A that things go sideways.

> I believe you prefer to think in terms of construction of the phylogenetic tree for the real pandemic, like to frame the question of number of introductions in terms of the number of roots for the tree.

> More roots would fit better; but that's always true for any phylogeny unless there's a penalty for each additional root,

No, it's not multiple roots, they just place the likely MRCA of SARS-CoV-2 in animals. ("If lineages A and B arose from separate introductions...") It's one tree. With one root. However, that root is in an animal instead of a human.

You can calculate the MRCA for any portion of the tree, including the descendants from the two+ hypothesized introductions. This MRCA is distinct from the SARS-CoV-2 MRCA. Is this what you mean by multiple roots?

> It seems like we disagree as to what forms the paper's core result, though. I'm taking my own cue from Worobey's Twitter comments

If you're trying to understand the paper's core result, read the paper, not Twitter.

The first paragraph in `Discussion` frames the crux of their argument, which is what I was trying to get across. Notice that they cite the paradox I'm trying to get you to understand, as well as citing genomic diversity as core evidence, as opposed to any argument about the exact timing of A and B samples, or the unlikelihood of multiple mutations.



