AlphaFold reveals the structure of the protein universe

COGlory · on July 28, 2022

Before my comment gets dismissed, I will disclaim I am a professional structural biologist that works in this field every day.

These threads are always the same: lots of comments about protein folding, how amazing DeepMind is, how AlphaFold is a success story, how it has flipped an entire field on it's head, etc. The language from Google is so deceptive about what they've actually done, I think it's actually intentionally disingenuous.

At the end of the day, AlphaFold is amazing homology modeling. I love it, I think it's an awesome application of machine learning, and I use it frequently. But it's doing the same thing we've been doing for 2 decades: pattern matching sequences of proteins with unknown structure to sequences of proteins with known structure, and about 2x as well as we used to be able to.

That's extremely useful, but it's not knowledge of protein folding. It can't predict a fold de novo, it can't predict folds that haven't been seen (EDIT: this is maybe not strictly true, depending on how you slice it), it fails in a number of edge cases (remember, in biology, edge cases are everything) and again, I can't stress this enough, we have no new information on how proteins fold. We know all the information (most of at least) for a proteins final fold is in the sequence. But we don't know much about the in-between.

I like AlphaFold, it's convenient and I use it (although for anything serious or anything interacting with anything else, I still need a real structure), but I feel as though it has been intentionally and deceptively oversold. There are 3-4 other deep learning projects I think have had a much greater impact on my field.

EDIT: See below: https://news.ycombinator.com/item?id=32265662 for information on predicting new folds.

ramraj07 · on July 28, 2022

Not sure if you should be reminded of how alpha fold started, it started by winning a competition thought un winnable by academics. Top labs working in protein structure prediction have fundamentally changed direction after alpha fold and are working to do the same even better.

This is not the first (or even tenth) time I’m seeing an academic trying to undermine genuine progress almost to the level of gaslighting. Comparing alphafold to conventional homology modeling is disingenuous at its most charitable interpretation.

Not sure what else to say. Structural biology has always been the weirdest field I’ve seen, the way students are abused (crystallize and publish in nature or go bust), and how every nature issue will have three structure papers as if that cures cancer every day. I suppose it warps one’s perception of outsiders after being in such a bubble?

signed, someone with a PhD in biomedical engineering, did a ton of bio work.

COGlory · on July 28, 2022

> Not sure if you should be reminded of how alpha fold started, it started by winning a competition thought un winnable by academics. Top labs working in protein structure prediction have fundamentally changed direction after alpha fold and are working to do the same even better.

Not sure what part of "it does homology modeling 2x better" you didn't see in my comment? AlphaFold scored something like 85% in CASP in 2020, in CASP 2016, I-TASSER had I think 42%? So it's ~2x as good as I-TASSER which is exactly what I said in my comment.

>This is not the first (or even tenth) time I’m seeing an academic trying to undermine genuine progress almost to the level of gaslighting. Comparing alphafold to conventional homology modeling is disingenuous at its most charitable interpretation.

It literally is homology modeling. The deep learning aspect is to boost otherwise unnoticed signal that most homology modeling software couldn't tease out. Also, I don't think I'm gaslighting, but maybe I'm wrong? If anything, I felt gaslit by the language around AlphaFold.

>Not sure what else to say. Structural biology has always been the weirdest field I’ve seen, the way students are abused (crystallize and publish in nature or go bust), and how every nature issue will have three structure papers as if that cures cancer every day. I suppose it warps one’s perception of outsiders after being in such a bubble?

What on earth are you even talking about? The vast, VAST majority of structures go unpublished ENTIRELY, let alone published in nature. There are almost 200,000 structures on deposit in the PDB.

dekhn · on July 28, 2022

What ramraj is talking about: if you go into a competitive grad program to get a PhD in structural biology, your advisor will probably expect that in 3-4 years you will: crystallize a protein of interest, collect enough data to make a model, and publish that model in a major journal. Many people in my program could not graduate until they had a Nature or Science paper (my advisor was not an asshole, I graduated with just a paper in Biochemistry).

In a sense both of you are right- DeepMind is massively overplaying the value of what they did, trying to expand its impact far beyond what they actually achieved (this is common in competitive biology), but what they did was such an improvement over the state of the art that it's considered a major accomplishment. It also achieved the target of CASP- which was to make predictions whose scores are indistinguishable from experimentally determined structures.

I don't think academics thought CASP was unwinnable but most groups were very surprised that an industrial player using 5 year old tech did so well.

lucidrains · on July 28, 2022

To add to this, the deep learning field has already moved on towards MSA-less structure prediction. None of this would be possible without building on top of the work open sourced by Deepmind.

https://www.biorxiv.org/content/10.1101/2022.07.21.500999v1 https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1

To be overly dismissive is to lack imagination.

tomp · on July 28, 2022

How do we know these "MSA-less" models aren't cheating (i.e. learning all MSAs implicitly from their training data)? If they are, they would similarly fail on any "novel" AA sequence (i.e. one without known/learned MSAs)

valarauko · on July 28, 2022

> What ramraj is talking about: if you go into a competitive grad program to get a PhD in structural biology, your advisor will probably expect that in 3-4 years you will: crystallize a protein of interest, collect enough data to make a model, and publish that model in a major journal.

All of that is generally applicable to molecular biology in general, and I don't see how the field of structural biology is especially egregious, the way ramraj is making it out to be.

flobosg · on July 28, 2022

Protein crystallization can be very difficult and there is no general solution. Kits that screen for crystal growth conditions usually help but optimization is needed in most cases. Then, that crystal must have certain properties that allow for good data acquisition at the X-ray facility. That’s another problem by itself and months or years can pass until you get a suitable protein crystal and X-ray diffraction dataset where you can model your structure.

valarauko · on July 28, 2022

I'm familiar with protein crystallization and the difficulties associated with it. What I don't agree with is the characterization of the field as especially difficult, above and beyond modern biology in general. Nor can I support the assertion that structural biology students are subject to special abuse that regular grad students are not.

> ... can be very difficult and there is no general solution

This is true of pretty much any graduate work in molecular biology.

flobosg · on July 28, 2022

> Nor can I support the assertion that structural biology students are subject to special abuse that regular grad students are not.

I didn’t say anything regarding that.

> This is true of pretty much any graduate work in molecular biology.

Just to elaborate my point: The process of protein cristallization is not understood at a level that allows the design of general and reproducible protocols. This inherent obscurity means that every new protein needs to undergo an ad hoc, heuristic, iterative process to obtain high quality crystals. This is an early methodological hurdle, at a stage where other routine procedures in biochemistry or molecular biology are usually successful.

dekhn · on July 28, 2022

I said that. We had a saying in grad school, "the very best protein structures are crystallized from postdoc tears".

valarauko · on July 28, 2022

As a current postdoc (genetics) I think postdoc tears are the fuel that academia runs on - as well as those of our significant others and kids.

valarauko · on July 28, 2022

> I didn’t say anything regarding that.

I know you didn't - this was one of the claims of ramraj I was responding to.

> The process of protein cristallization is not understood at a level that allows the design of general and reproducible protocols. This inherent obscurity means that every new protein needs to undergo an ad hoc, heuristic, iterative process to obtain high quality crystals. This is an early methodological hurdle, at a stage where other routine procedures in biochemistry or molecular biology are usually successful.

I don't disagree, though I would suggest that there's just as much grunt work, frustration, and hand wringing in other fields of molecular biology at the graduate level and above. Even if other fields have reproducible protocols established, that's not what gets papers published. With the possible exception of clinical samples, more often than not we have no clue if the analyses we're doing will yield anything, and the high risk zone is where all grad students live.

ramraj07 · on July 28, 2022

In most other sub fields you don’t get to not publish if exactly one endpoint never comes to pass. I know I didn’t have something like that, and most of my non crystallographer friends didn’t.

There’s a lot of structural biology apologists here in this thread. Happy to crap on DeepMind but not ready to take criticism of their own field.

For anyone outside of the field wanting to learn more, check out this documentary: https://en.m.wikipedia.org/wiki/Naturally_Obsessed

valarauko · on July 28, 2022

> In most other sub fields you don’t get to not publish if exactly one endpoint never comes to pass. I know I didn’t have something like that, and most of my non crystallographer friends didn’t.

How is this a problem unique to structural biology? In every subfield we're hoping to publish interesting results, and that endpoint is defined by the nature of the field. As a geneticist, in the early 90s, sequencing & characterizing a single bacterial gene would have been the focus of an ambitious PhD thesis and would yield multiple papers. Sequencing at that time period had a dozen points of failure and high risk to set as the goal for a thesis. Today, sequencing a whole genome is unlikely to yield a single publication. If you're setting the ability to crystallize as the single point of failure endpoint, that logic applies to every subfield. We all have something that could potentially derail our plans, and I fail to see how structural biology is unique in that respect.

> There’s a lot of structural biology apologists here in this thread. Happy to crap on DeepMind but not ready to take criticism of their own field.

I'm not a structural biologist - I'm a Geneticist who disagrees with your characterization of SB. The issues you've mentioned are not unique to SB, but apply to pretty much all subfields. I see grad students in general lament their life choices when their cell culture fails, their mice die, protocols just don't work, or their results just don't make sense.

flobosg · on July 28, 2022

> If you're setting the ability to crystallize as the single point of failure endpoint, that logic applies to every subfield.

I agree that there are other fields with similar issues. What baffles me is how long protein crystallization has been a problem.

I’ll use your example:

Nowadays, sequencing a gene is unlikely to yield a single publication by itself but is no early point of failure. It’s a solved problem with protocols that have been thoroughly developed and explained to boredom. New early points of failure arise (sample related, maybe?).

Nowadays, determining the structure of a protein is unlikely to yield a single publication by itself but has a clear, early, unsolved point of failure. No understandable protocol other than buying $creening plate$, fetching cat whiskers, drawing a theoretical phase diagram that tells you nothing, and pray that your crystallization tray doesn’t show a scrambled egg tomorrow or in six weeks. This has been an issue for more than fifty years and almost 200k published structures. The jump you mentioned in sequencing hasn’t happened yet in protein crystallography and might never happen because our understanding of macromolecular crystallization is lacking and thus we cannot predict proper crystallization conditions.

valarauko · on July 28, 2022

Sure, I agree that crystallization in particular has faced this particular bottleneck for a long time. The field of SB, however, has still managed to advance massively too. For example, Cryo-EM can do things we could barely imagine a decade ago.

The point I'm trying to make is that from the perspective of a grad student, no field is devoid of risk, and it's surprisingly easy to be stuck by something that's a solved problem on paper. For example, I know of a grad student that's been trying to develop a mouse line for about a year now, and has now discovered that this strain just won't work for what they have in mind - and must now recreate the mutant combinations in a different strain that's at least a year's work - if it even works. I've heard stories of entire mouse lines die, and you're back to square one - years of work lost.

The other thing that complicates some of these fields is the massive pace of innovation they're undergoing that it is very hard for an individual lab to keep up to date. Grad students are using techniques that were published less than 5 years ago, and there's no locally available expertise to tap into. What remains the same is the level of grunt work grad students and postdocs have to do, even if the techniques get more sophisticated over time.

dekhn · on July 28, 2022

I did rotations in multiple types of lab as part of my program and I can't say I ever found that students in regular molecular biology labs had nearly as hard a time as structural biologists; SB is its own class of hell. Given the number of papers published in molecular biology that turn out to be "gel was physically cut and reasssembled to show the results the authors desired" (it's much harder to cheat on a protein structure)...

valarauko · on July 28, 2022

I think this is highly subjective and that every field has its own special hells. For example, in computational biology it's a lot easier to generate results (when things actually work) but conversely it's a lot harder to convince journals. The burden of proof required to publish is sometimes ridiculously high - I had a paper spend almost 3 years in review.

flobosg · on July 28, 2022

Hear, hear. This is probably the best take.

underdeserver · on July 28, 2022

> Not sure what part of "it does homology modeling 2x better" you didn't see in my comment? AlphaFold scored something like 85% in CASP in 2020, in CASP 2016, I-TASSER had I think 42%? So it's ~2x as good as I-TASSER which is exactly what I said in my comment.

Wait, stop, I don't know anything about proteins but 84% success is not ~2x better than 42%.

It doesn't really make sense to talk about 2x better in terms of success percentages, but if you want a feel, I would measure 1/error instead (a 99% correct system is 10 times better than a 90% correct system), making AlphaFold around 3.6 times better.

palmtree3000 · on July 28, 2022

I think odds ratio ( p/(1-p) ) is the thing I'd use here. It gives the right limiting behavior (at p ~= 0, doubling p is twice as good, and at p~=1, halving 1-p is twice as good) and it's the natural way to express Bayes rule, meaning you can say "I'm twice as sure (in odds ratio terms) based on this evidence" and have that be solely a property of the update, not the prior.

camjw · on July 28, 2022

For the lazy, this would make alphafold 7.25x better than the previous tools

bscphil · on July 29, 2022

Excellent comment. I think the issue is that "better" is underspecified and needs some precisification to be useful. The metric you are using here is the proper response to the question "how many times more surprising is it when method A fails than method B?". This is in many cases what we care about. Probably, it's what we care about here. The odds ratio seems to do a good job of capturing the scale of the achievement.

On the other hand, it's not necessarily the only thing we might care about under that description. If I have a manufacturing process that is 99.99% successful (the remaining 0.01% has to be thrown out), it probably does not strike me as a 10x improvement if the process is improved to 99.999% success. What I care about is the cost to produce the average product that can be sent to market, and this "10x improvement" changes that only a very small amount.

underdeserver · on July 31, 2022

TIL, thanks for this.

teawrecks · on July 29, 2022

> AlphaFold scored something like 85% in CASP in 2020, in CASP 2016, I-TASSER had I think 42%? So it's ~2x as good as I-TASSER

As someone who doesn't know proteins, but is decent at math, I would not describe it this way. You are assuming a linear relationship between effort and value, but more often than not, effort has diminishing returns. 80dB is not 2x as loud as 40 dB. An 8K image doesn't have 2x the fidelity of a 4K image. If Toyota unveiled a new engine that was 60% efficient tomorrow, no one in their right mind would say "eh, it's just 2x better". If we came out with a CPU that could clock up to 10Ghz we wouldn't say "meh, that's just 2x what we had".

Without being able to define the relationship here, I could just as well say that 85% is 1000x better than 42%. There's just no way to put a number on it. What we can say is that we completely blew all projections out of the water.

Again, I'm not someone working with proteins, but to me it sounds as revolutionary as a 60%+ efficient engine, or a 10Ghz CPU. No one saw it coming or thought it feasible with current technology.

joe_the_user · on July 28, 2022

I think the debate between "does amazing on metric X" versus "doesn't really understand the problem" reappears many places and doesn't have any direct way to be resolved.

That's more or less because "really understands the problem" generally winds-up being a placeholder for things the system can't. Which isn't to say it's not important. One thing that is often included in "understanding" is the system knowing the limits of its approach - current AI systems have a harder time giving a certainty value than giving a prediction. But you could have a system that satisfied a metric for this and other things would pop up - for example, what kind of certainty or uncertainty are we talking about (crucial for decision making under uncertainty).

shpongled · on July 28, 2022

> Comparing alphafold to conventional homology modeling is disingenuous at its most charitable interpretation.

It's really not - have you played around with AF at all? Made mutations to protein structures and asked it to model them? Go look up the crystal structures for important proteins like FOXA1 [1], AR [2], EWSR1 [3], etc (i.e. pretty much any protein target we really care about and haven't previously solved) and tell me with a straight face that AF has "solved" protein folding - it's just a fancy language model that's pattern matching to things it's already seen solved before.

signed, someone with a PhD in biochemistry.

[1] https://alphafold.ebi.ac.uk/entry/P55317 [2] https://alphafold.ebi.ac.uk/entry/P10275 [3] https://alphafold.ebi.ac.uk/entry/Q01844

ramraj07 · on July 28, 2022

I can see the loops in these structures. I dont see the problem. It still added a structure to every embl page, and people are free to judge the predictions themselves. For all I care (ostensibly as the end customer of these structures) I don’t mind having a low confidence structure for any arbitrary protein at all. It’s only marginally less useful to actual biology than full on X-ray structures anyway.

entee · on July 29, 2022

> It’s only marginally less useful to actual biology than full on X-ray structures anyway.

I'm not sure what you're implying here. Are you saying both types of structures are useful, but not as useful as the hype suggests, or that an X-Ray Crystal (XRC) and low confidence structures are both very useful with the XRC being marginally more so?

An XRC structure is great, but it's a very (very) long way from getting me to a drug. Observe the long history of fully crystalized proteins still lacking a good drug. Or this piece on the general failure of purely structure guided efforts in drug discovery for COVID (https://www.science.org/content/blog-post/virtual-screening-...). I think this tech will certainly be helpful, but for most problems I don't see it being better than a slightly-more-than-marginal gain in our ability to find medicines.

Edit: To clarify, if the current state of the field is "given a well understood structure, I often still can't find a good medicine without doing a ton of screening experiments" then it's hard to see how much this helps us. I can also see several ways in which a less than accurate structure could be very misleading.

FWIW I can see a few ways in which it could be very useful for hypothesis generation too, but we're still talking pretty early stage basic science work with lots of caveats.

Source: PhD Biochemist and CEO of a biotech.

Gatsky · on July 28, 2022

This isn’t a good use of the term gaslighting. Accusing someone of gaslighting takes what we used to call a ‘difference of opinion’ and mutates it into deliberate and wicked psychological warfare.

Incidentally, accusing someone of gaslighting is itself a form of gaslighting.

tomjakubowski · on July 28, 2022

Well, it can be gaslighting but not always. A knowingly false accusation, repeated often enough and in a way to make the accused question their own perception of reality, would be gaslighting.

stonogo · on July 28, 2022

Not only is CASP not "unwinnable," it's not even a contest. The criteria involved are rated as "moderately difficult." Alphafold is a significant achievement but it sure as hell hasn't "revealed the structure of the protein universe," whatever that means.

Which top labs have changed direction? Because Alphafold can't predict folds, just identify ones it's seen.

dekhn · on July 28, 2022

I've directly communicated with the leaders of CASP and at DM that they should stop representing this as a form of protein folding and just call it "crystal/cryoEM structure prediction" (they filter out all the NMR structures from PDB since they aren't good for prediction). They know it's disingenuous and they do it on purpose to give it more impact than it really deserves.

I would like to correct somethign here- it does predict structures de novo and predict folds that haven't been seen before. That's because of the design of the NN- it uses sequence information to create structural constraints. If those constraints push the modeller in the direction of a novel fold, it will predict that.

To me what's important about this is that it demonstrated the obvious (I predicted this would happen eventually, shortly after losing CASP in 2000).

COGlory · on July 28, 2022

>I would like to correct somethign here- it does predict structures de novo and predict folds that haven't been seen before. That's because of the design of the NN- it uses sequence information to create structural constraints. If those constraints push the modeller in the direction of a novel fold, it will predict that.

Could you expand on this? Basically it looks at the data, and figures out what's an acceptable position in 3D space for residues to occupy, based on what's known about other structure?

I will update my original post to point out I may be not entirely correct there.

The distinction I'm trying to make is that there's a difference between looking at pre-existing data and modeling (ultimately homology modeling, but maybe slightly different) and understanding how protein folding works, being able to predict de novo how an amino acid sequence will become a 3D structure.

Also thank you for contacting CASP about this.

dekhn · on July 28, 2022

From what I can tell, the model DM built is mining subtle relationships between aligned columns of multiple sequence alignments and any structural information which is tangibly related to those sequences. Those relationships can be used to infer rough atomic distances ("this atom should be within 3 and 7 angstroms of this other atom"). A large matrix (partially filled out) of distances is output, and those distances are used as constraints in a force field (which also includes lots of prior knowledge about protein structure) and then they run simulations which attempt to minimize both the force field and constraint terms.

In principle you don't even need a physical force field- if you have enough distance information between pairs of atoms, you can derive a plausible structure by embedding the distances in R3 (https://en.wikipedia.org/wiki/Distance_geometry and https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21...

Presumably, the signal they extract includes both rich local interactions (amino acids near in sequence) and distant ones inferred through sequence/structure relationships, and the constraints could in fact push a model towards a novel fold, presumably through some extremely subtle statistical relationships to other evolutionarily related proteins that adopt a different fold.

bawolff · on July 28, 2022

> The distinction I'm trying to make is that there's a difference between looking at pre-existing data and modeling (ultimately homology modeling, but maybe slightly different) and understanding how protein folding works, being able to predict de novo how an amino acid sequence will become a 3D structure.

Your objection is that alphafold is a chinese room?

What does that matter? Either it generates useful results or it doesn't. That is the metric we should evaluate it on.

COGlory · on July 29, 2022

Because it's being presented as something that it isn't. It's a better way to analyze data that we got experimentally, and to predict how new data will fit into what we know. It's not de novo understanding, which is the holy grail and what the field is ultimately trying to accomplish. It's Tesla's adaptive cruise control being sold as full self driving. Yes, they are close things - one is an approximation of the other, but being really really good at adaptive cruise control has basically zero carryover to full self driving. FSD isn't a linear progression from adaptive cruise control, and understanding how proteins fold isn't a linear progression from AlphaFold sequence homology/homology modeling. It's not even close to the same thing, AlphaFold doesn't even move the needle for our understanding of how proteins fold, and yet it's sucking all the air out of the conversation by presenting itself like it solved this problem.

It's a really good, fancy model completely reliant on data we already have empirically (and therefore subject to all the same biases as well).

bawolff · on July 29, 2022

I'm assuming "de novo" means from first principles?

i really don't think anyone is presenting alphafold as if its a physics simulator operating from first principles.

Like obviously alphafold does not "understand". Maybe i have blinders on for being in the computer field, but i would assume that it goes without saying that a statistical deep learning AI model does not tell us how to solve the problem from first principles.

Like yes, alphafold isn't the final chapter in protein folding and that is obvious. But it seems a stretch to dismiss it on those grounds. If that's the metric we're going with then we can dismiss pretty much everything that has happened in science for the past thousand years.

> re self driving car metaphor

I think this is a bad metaphor for your purposes, because self-driving cars aren't de novo understanding, and arguably do have some carry over from things like adaptive cruise control.

mupuff1234 · on July 28, 2022

> There are 3-4 other deep learning projects I think have had a much greater impact on my field.

Don't leave us hanging... which projects?

COGlory · on July 28, 2022

1) Isonet - takes low SNR cryo-electron tomography images (that are extremely dose limited, so just incredibly blurry and frequently useless) and does two things:

* Deconvolutes some image aberrations and "de-noises" the images

* Compensates for missing wedge artifacts (missing wedge is the fact that the tomography isn't done -90° --> +90°, but usually instead -60° --> +60°, leaving a 30° wedge on the top and bottom of basically no information) which usually are some sort of directionality in image density. So if you have a sphere, the top and bottom will be extremely noisy and stretched up and down (in Z).

https://www.biorxiv.org/content/10.1101/2021.07.17.452128v1

2) Topaz, but topaz really counts as 2 or 3 different algorithms. Topaz has denoising of tomograms and of flat micrographs (i.e. images taken with a microscope, as opposed to 3D tomogram volumes). That denoising is helpful because it increases contrast (which is the fundamental problem in Cryo-EM for looking at biomolecules). Topaz also has a deep learning particle picker which is good at finding views of your protein that are under-represented, or otherwise missing, which again, normally results in artifacts when you build your 3D structure.

https://emgweb.nysbc.org/topaz.html

3) EMAN2 convolutional neural network for tomogram segmentation/Amira CNN for segmentation/flavor of the week CNN for tomogram segmentation. Basically, we can get a 3D volume of a cell or virus or whatever, but then they are noisy. To do anything worthwhile with it, even after denoising, we have to say "this is cell membrane, this is virus, this is nucleic acid" etc. CNNs have proven to be substantially better at doing this (provided you have an adequate "ground truth") than most users.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5623144/

simsla · on July 28, 2022

I asked a structural biologist friend of mine (world class lab) about the impact of alphafold.

They said it's minimal.

In most cases, having a "probably" isn't good enough. They use alphafold to get early insights, but then they still use crystallography to confirm the structure. Because at the end of the day, you need to know for sure.

bawolff · on July 28, 2022

I'm not a biologist, but that doesn't sound minimal if crystallography is expensive.

It sounds like how we model airplanes in computers, but still test the real thing - i wouldn't call the impact of computer modelling on airplane design to be minimal.

adamsmith143 · on July 28, 2022

> it can't predict folds that haven't been seen

This seems strange to me. The entire point of these types of models is to predict things on unseen data. Are you saying Deepmind is completely lying about their model?

Deepmind solved CASP, isn't the entire point of that competition to predict unseen structures?

If AlphaFold doesn't predict anything then what are you using it to do?

COGlory · on July 28, 2022

AlphaFold figures out that my input sequence (which has no structural data) is similar to this other protein that has structural data. Or maybe different parts of different proteins. It does this extremely well.

flobosg · on July 28, 2022

This is a gross misrepresentation of the method.

COGlory · on July 28, 2022

Perhaps you'd care to explain how? AlphaFold does not work on new folds. It ultimately relies on mapping sequence to structure. It does it better than anyone else, and in ways a human probably couldn't, but if you give it a brand new fold with no relation to other folds, it cannot predict it. I routinely areas of extremely low confidence many of my AlphaFold models. I work in organisms that have virtually 0 sequence identity. This is a problem I deal with every day. I wish AlphaFold worked in the way you are suggesting, but it just flat out does not.

flobosg · on July 28, 2022

> It ultimately relies on mapping sequence to structure.

So does every structural prediction method.

> if you give it a brand new fold with no relation to other folds, it cannot predict it

That will depend on the number of effective sequences, not the actual fold.

> I work in organisms that have virtually 0 sequence identity.

Then the problem is low sequence coverage, not the protein fold. On a side note, there are sensitive homology search protocols that rely very little on actual sequence identity.

bamboozled · on July 28, 2022

So then based on your counter arguments to the OP, have they mapped the entire protein universe ? Or should it say, the “already known protein universe” ?

flobosg · on July 28, 2022

Neither the protein sequence nor structure spaces have been fully explored, and the sequence set of UniProt does not represent every single extant protein. My answer is “no”.

johndfsgdgdfg · on July 28, 2022

There's hype and then there's anti-hype hype, which tries to undermine any genuine progress in a hip contrarian fashion. Eg look, I'm the only who can see the truth. There's AI hype and then there's anti-AI Gary Marcus hype, who never produces any novel criticism. It's the same banal broken record every single time put in a very self-aggrandizing manner.

DM is probably hyping it up and you are most likely hyping up your own criticism. It's a great symbiotic relationship outwardly presented as opposition.

dekhn · on July 28, 2022

No organisms have virtually 0 sequence identity. That's nonsense. Can you give an example? n Even some random million-year-isolated archae shares the majority of its genes with common bacteria.

biomcgary · on July 28, 2022

Organisms, yes. Individual genes within an organism may have no sequence identity to genes in other organisms (outside of what you would expect at random). See: https://en.wikipedia.org/wiki/Orphan_gene

dekhn · on July 28, 2022

Yes, that's what I thought. I worked with m. genitalium and we were always looking for proteins that had no homology or no existing structure (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.20...)

COGlory · on July 28, 2022

MKVLMKKESLPIVKPFDEVIIEVLQAPKEVEREVALKDGTIKKIQDYSIIVKPVSGKFESVTEKVTSKTEDGDEVVKPKKYDASELKDKVVMKLTQKAFEVLYDAWQNKEIGEGTKLKIKVTKKQNKTYFDEITVLDEKEEEETEEEAKVKPKPKLKG

dekhn · on July 28, 2022

That's a single protein not an organism.

andrewflnr · on July 29, 2022

They obviously mean organisms that have notable numbers of proteins with virtually no sequence identity. The difference is only germane to the conversation if you're looking for something to nitpick. The only point of bringing it up was that they encounter non-trivial numbers of really weird proteins.

Protostome · on Aug 2, 2022

Disclaimer: I'm a professional (computational) structural biologist. My opinion is slightly different.

The problem with the structure prediction problem is not a loss/energy function problem, even if we had an accurate model of all the forces involved we'd still not have an accurate protein structure prediction algorithm.

Protein folding is a chaotic process (similar to the 3 body problem). There's an enormous number of interactions involved - between different amino acids, solvent and more. Numerical computation can't solve chaotic systems because floating point numbers have a finite representation, which leads to rounding errors and loss of accuracy.

Besides, Short range electro static and van der waals interactions are pretty well understood and before alphafold many algorithms (like Rosetta) were pretty successful in a lot of protein modeling tasks.

Therefore, we need a *practical* way to look at protein structure determination that is akin to AlphaFold2.

panabee · on July 28, 2022

as an outsider learning more about protein folding, could you elaborate on the assertion that the sequence is (mostly) all you need (transformer/ML reference intended).

doesn't this assume the final fold is static and invariant of environmental and protein interactions?

put another way, how do we know that a protein does not fold differently under different environmental conditions or with different molecular interactions?

i realize this is a long-held assumption, but after studying scientific research for the past year, i realize many long-held assumptions aren't supported by convincing evidence.

mellosouls · on July 28, 2022

These threads are always the same: lots of comments about protein folding, how amazing DeepMind is, how AlphaFold is a success story, how it has flipped an entire field on it's head, etc.

I don't think that's necessarily so - there is a lot of justified scepticism about the wilder claims of ML in this forum; it is in fact quite difficult at times to know as an outsider to the field in question how kneejerk it is.

robbintt · on July 28, 2022

Additionally, folding doesn't focus on what matters. Generally you want to understand the active site, you already know the context (globular, membrane, embedded, conjugated) of the protein. It is interesting whether the folding could help identify active sites for further analysis. But -- I don't think alphago is identifying new active sites or improving our understanding of their nuances.

ehsankia · on July 28, 2022

Right, but even a speed up / quality increase can flip workflows on their head. Take ray tracing for example, when you speed it up by an order of magnitude, you can suddenly go from taking a break every time you want to render a scene, vs being able to iteratively work on a scene and preview it as you work.

uoaei · on July 28, 2022

I got a lot of shit (still do) when the news first broke for pushing back against the notion that AlphaFold "solved" protein folding. People really wanted to attach that word to the achievement. Thank you for providing a nuanced take on exactly why that doesn't make any sense.

eachro · on July 28, 2022

I'm curious to read more on the 3-4 other deep learning projects you mentioned that have had a larger impact on your fields. Can you share some links to those works?

zack-m · on July 28, 2022

Yup. It’s great, but there are still many aspects to unpack and work on. Hence why Rosetta is a thing.

flobosg · on July 28, 2022

Rosetta methods are also moving towards ML. Here’s an article from last week: https://www.science.org/doi/10.1126/science.abn2100

flobosg · on July 28, 2022

> AlphaFold is amazing homology modeling

If it is homology modelling, then how can it work without input template structures?

COGlory · on July 28, 2022

It has template structures. AlphaFold uses the following databases:

    BFD,
    MGnify,
    PDB70,
    PDB (structures in the mmCIF format),
    PDB seqres – only for AlphaFold-Multimer,
    Uniclust30,
    UniProt – only for AlphaFold-Multimer,
    UniRef90.

flobosg · on July 28, 2022

Those databases are used to derive the evolutionary couplings and distance matrices used by the algorithm. Several of those databases aren’t even structural ones. Furthermore, AlphaFold can function with only a MSA as an input, without retrieving a single PDB coordinate.

COGlory · on July 28, 2022

It's all about boosting signal by finding other proteins that are similar, until you get to the point that you can identify a fold to assign to a region of the protein. That's why some are structural, and some are not.

>Furthermore, AlphaFold can function with only a MSA as an input, without retrieving a single PDB coordinate.

Yes, it has a very nice model of what sequences should look like in 3D. That model is derived from experimental data. So if I give AlphaFold an MSA of a new, unknown protein fold (substantively away from any known fold), it cannot predict it.

flobosg · on July 28, 2022

> Yes, it has a very nice model of what sequences should look like in 3D.

A structural model, you would say.

> That model is derived from experimental data.

That doesn’t make it a template-based model, or a homology one.

> if I give AlphaFold an MSA of a new, unknown protein fold (substantively away from any known fold), it cannot predict it

That will depend on the number of effective sequences found to derive couplings. Domains with novel folds usually have a low number of remotely homolog sequences and for that reason the method will fail, not just because they are novel.

COGlory · on July 28, 2022

>Domains with novel folds usually have a low number of remotely homolog sequences and for that reason the method will fail, not just because they are novel.

How can you say this but not believe it's doing homology modeling?

flobosg · on July 28, 2022

Because homology search is not homology modelling. And a multiple sequence alignment is not a structural (i.e, with three-dimensional coordinates) template.

ssivark · on July 28, 2022

For someone who knows very little about this field, could you elaborate on what specific aspect of “homology modeling” AF violates/circumvents which makes you call it “homology search” instead?

flobosg · on July 28, 2022

Homology search is a method to find homologous sequences, that is, evolutionary related sequences that posess a common ancestor. This was usually done based on how identical sequences were, but newer algorithms allow to find remote homologs even when the identity between the sequences is very low. The first step in AlphaFold is to retrieve as many remotely homolog sequences as possible to generate a multiple sequence alignment (MSA) that will be used to generate the embedding.

On the other hand, homology (or comparative) modelling is a method that generates a structural model of a query sequence based on one or more experimentally solved structure of a close protein homolog. The model generation details depend on the specific protocol but, broadly speaking, spatial restraints are extracted from the template structures and mapped to the query sequence to be modelled.

Note that AlphaFold also uses a type of geometrical restraint (pairwise residue distances) in its modelling, although they are not derived from protein structures but the MSA embeddings. Both are related but are not exactly the same.

One difference between AlphaFold and homology modelling is that the latter requires templates having a certain sequence identity with the query sequence (≥30% is the rule of thumb), while the former can have in its MSA remotely homolog sequences well below any discernible identity.

wrycoder · on July 28, 2022

“Disclaim” stopped me.

Disclaim means to deny or renounce.

Uehreka · on July 28, 2022

Can we just chill on the whole “using this single word incorrectly breaks your whole argument” thing?

A lot of folks on HN end posts about a company with a sentence like “Disclaimer: I used to work for X”. This language (probably taken from contract law or something) is meant an admission of possible bias but in practice is also a signal that this person may know what they’re talking about more-so than the average person. After reading a lot of posts like this, it might feel reasonable for someone to flip the word around say something like “I need to disclaim…” when beginning a post, in order to signal their proximity to a topic or field as well as any sort of insider bias they may possess.

So sure, “I need to disclose” would’ve been the better word choice, but we all knew what GP was saying. It seems pedantic to imply otherwise.

wrycoder · on July 29, 2022

Let me translate. They said, “I will disclaim I am a professional structural biologist that works in this field every day.”

That is synonymous with saying, “I will deny I am a professional structural biologist that works in this field every day.”

The person posting is actually a structural biologist. What they stated was cognitively dissonant with the intent of their post, and that’s what stopped me.

I don’t pay attention to typos or minor usage issues, but in this case, I read two more sentences and said, “What??”

EDIT: Two more things. First, I found the post interesting and useful. I didn’t say anything about breaking the argument.

Second, “I need to disclose…” is the exact opposite of what they said.

martijnarts · on Aug 2, 2022

Looks like the downvotes say that your interpretation of this language in this context is not the most common interpretation.

alasdair_ · on July 28, 2022

>we all knew what GP was saying

I was confused initially too.

Andoryuuta · on July 29, 2022

Or to make a disclaimer.. like the OP post did?

Merriam webster[1]: " Definition of disclaim

intransitive verb 1 : to make a disclaimer ... "

[1]: https://www.merriam-webster.com/dictionary/disclaim

wrycoder · on July 29, 2022

The verb was used transitively.

Transitive verb:

2 : DENY, DISAVOW disclaimed any knowledge of the contents of the letter

Lloydksk · on July 28, 2022

I mean like whats this about AlphaFold is gone

crispyambulance · on July 28, 2022

I got a 5th grader question about how proteins are used/represented graphically that I've never been able to find a satisfying answer for.

Basically, you see these 3D representations of specific proteins as a crumple of ribbons-- literally like someone ran multi-colored ribbons though scissors to make curls and dumped it on the floor (like a grade school craft project).

So... I understand that proteins are huge organic molecules composed of thousands of atoms, right? Their special capabilities arise from their structure/shape. So basically the molecule contorts itself to a low energy state which could be very complex but which enables it to "bind?" to other molecules expressly because of this special shape and do the special things that proteins do-- that form the basis of living things. Hence the efforts, like Alphafold, to compute what these shapes are for any given protein molecule.

But what does one "do" with such 3D shapes?

They seem intractably complex. Are people just browsing these shapes and seeing patterns in them? What do the "ribbons" signify? Are they just some specific arrangement of C,H,O? Why are some ribbons different colors? Why are there also thread-like things instead of all ribbons?

Also, is that what proteins would really look like if you could see at sub-optical wavelength resolutions? Are they really like that? I recall from school the equipartition theorem-- 1/2 KT of kinetic energy for each degree of freedom. These things obviously have many degrees of freedom. So wouldn't they be "thrashing around" like rag doll in a blender at room temperature? It seems strange to me that something like that could be so central to life, but it is.

Just trying to get myself a cartoonish mental model of how these shapes are used! Anyone?

dekhn · on July 28, 2022

The ribbons and helices you see in those pictures are abstract representations of the underlying positions of specific arrangements of carbon atoms along the backbone.

There are tools such as DSSP https://en.wikipedia.org/wiki/DSSP_(hydrogen_bond_estimation... which will take out the 3d structure determined by crystallography and spit out hte ribbons and helices- for example, for helices, you can see a specific arrangement of carbons along the protein's backbone in 3d space (each carbon interacts with a carbon 4 amino acids down the chain).

Protein motion at room temperature varies depending on the protein- some proteins are rocks that stay pretty much in the same single conformation forever once they fold, while others do thrash around wildly and others undergo complex, whole-structure rearrangements that almost seem magical if you try to think about them using normal physics/mechanical rules.

Having a magical machine that could output the full manifold of a protein during the folding process at subatomic resolution would be really nice! but there would be a lot of data to process.

crispyambulance · on July 28, 2022

Thanks, awesome! So what do molecular biologists do with these 3D representations once they have them? Do they literally just see how they fit to other proteins?

dekhn · on July 28, 2022

There are many uses for structure. Personally, I find the 3d structures to be useful as a mental guide for picturing things, and certainly people do try to "dock" proteins that have complementary structures, but unfortunately, the biophysics of protein complexes suggests that the conformation change on binding is so large that the predicted structures aren't super-helpful.

Certainly, in a corpo like mine (Genentech/Roche) protein structures have a long history of being used in drug discovery- not typically a simple "dock a ligand to a protein" but more for constructing lab experiments that help elucidate the actual mechanistic biology going on. That is only a tiny part of a much larger process to work on disease targets to come up with effective treatments. Genentech is different from most pharma in that their treatments are themselves typically proteins, rather than small molecules.

edmcnulty101 · on July 28, 2022

How bad is our understanding of force fields?

It seems like that's the basic principle to understand.

dekhn · on July 28, 2022

I think many people would say that in principle, you could make a QM force field with an accurate enough basis function that an infinitely long simulation would recapitulate the energy landscape of a protein, and that information could be used to predict the kinetically accessible structures the protein adopts.

In practice, the force fields are well understood but to be computationally efficient, they have to approximate just about everything. Examples: since number of inter-atom distance pairs goes up with N**2 atoms, you need to have tricks to avoid that and instead scale around n log n or even n if you can do it. When I started, we just neglected atoms more than 9 angstrom apart, but for highly charged molecules like DNA, that leads to errors in the predicted structure. Next, typically the force fields avoid simulating polarizability (the ability of an atom's electron cloud to be drawn towards another atom with opposite charge), also because expensive. They use simplified spring models (lterally hooke's equation) for bond lengths, bond angles. The torsions (the angle formed by 4 atoms in a row) haev a simplified form. The interatomic relationships are not handled in a principled way, instead treating atoms as mushy spheres....

After having made major contributions in this area, I don't think that improvements to force fields are going to be the most effective investment in time and energy. There are other bits of data that can get us to accurate structures with less work.

edmcnulty101 · on July 28, 2022

That's interesting. Didn't realize that. It sounds like we're just working around slower computation speed.

In an fantasy world if we had infinite computation speed/space we'd be able to just model the force field and predict from there.

dekhn · on July 29, 2022

Yes, that's a fantasy world. I explored this using the Exacycle system at Google and we did actually do a couple things that nobody else could have at the time, but even that extraordinary amount of computing power really is tiny. The problem is the "force field" isn't just the enthalpic contributions I listed above, but also depends intimately on much more subtle entropic details- things like the cost of rearranging water into a more ordered structure have to be paid for. Estimating those is very expensive- far worse than just enumerating over large numbers of proteins "in vacuo", and probably cannot be surmounted, unless quantum computing somehow becomes much better.

Instead, after spending an ordinate amount of Google's revenue on extra energy, I recommended that Google instead apply machine learning to protein structure prediction and just do a better job of extracting useful structural information (note: this was around the time CNNs were in vogue, and methods like Transformers didn't exist yet) from the two big databases (all known proteins/their superfamily alignments, and the PDB).

Note that this conclusion was a really hard one for me since I had dedicated my entire scientific career up to that point in attempting to implement that fantasy world (or a coarse approximation of it), and my attempts at having people develop better force fields (ones that didn't require as much CPU time) using ML weren't successful. What DeepMind did was, in some sense, the most parsimonious incremental step possible to demonstrate their supremacy, which is far more efficient. Also, once you have a trained model, inference is nearly free compared to MD simulations!

edmcnulty101 · on July 29, 2022

That's interesting. Thanks for the info. They're getting better at Quantum. It's going to be fascinating to see the future of this field and all the potential medicine waiting to be discovered and the lifespan improvements and just sheer biological discoveries. It feels almost like the new panning for gold. :) It's pretty crazy to see how human advancement parallels computing power in so many areas.

beowulfey · on July 28, 2022

A structure is bascially another tool for producing hypotheses. In my case, I often use structures to predict effects of genetic lesions. If your protein has a clearly defined active site, you can get a rough sense of where on the enzyme that active site is relative to other mutations. Often residues that are distant in sequence end up right next to each other in the folded structure, so certain residues can have unexpected roles.

It gives a picture of the enzyme as a machine, and lets you look at specific parts and say “this residue is probably doing this job in the whole system”.

buntsai · on July 28, 2022

Often the ribbons (alpha-helices and beta=sheets) form "protein domains". Canonically, these are stable, folded structures with conserved shapes and functions that serve as the building blocks of proteins, like lego pieces. These protein domains can be assembled in different ways to form proteins of different function. Different protein domains that have the same evolutionary origin have conserved structure even when the underlying amino acid sequence, or DNA sequence has changed beyond recognition over millions of years of evolution. In other words, molecular biologists use structure as a proxy for function. Looking at how the same protein domains works in different proteins in different species can give us clues as to how a protein might work in human biology or disease.

thfuran · on July 28, 2022

Basically, the shape of the protein determines how it interacts with other things. So knowing the structure enables better prediction of how the pathways it is involved in work and how other things (say, potential drugs) would affect that pathway.

comicjk · on July 28, 2022

All of the loops and swirls are summary representations of known atomic positions: really, knowing a protein structure means knowing the position of every atomic nucleus, relative to the nuclei, down to some small resolution, and assuming a low temperature.

The atoms do wiggle around a bit at room temperature (and even more at body temperature), which means that simulating them usefully typically requires sampling from a probability distribution defined by the protein structure and some prior knowledge about how atoms move (often a potential energy surface fitted to match quantum mechanics).

There are many applications of these simulations. One of the most important is drug design: knowing the structure of the protein, you can zoom in on a binding pocket and design a set of drug molecules which might disable it. Within the computer simulation, you can mutate a known molecule into each of your test molecules and measure the change in binding affinity, which tells you pretty accurately which ones will work. Each of these simulations requires tens of millions of samples from the atomic probability distribution, which typically takes a few hours on a GPU given a good molecular dynamics program.

Alop3x · on July 28, 2022

If you want something that leaves a little less to the imagination, check out https://en.wikipedia.org/wiki/Staphylococcus_aureus_alpha_to... . It looks just like what it does: drill a giant hole in cell membranes.

Some proteins have 3D structures that look like abstract art only because we don't have an intuitive understanding of what shape and amino acids are necessary to convert chemical A to chemical B, which is the main purpose of many enzymes in the body. If you look at structural proteins or motor proteins, on the other hand, their function is clear from their shape.

There are a lot of other things you can do with the shape. If it has a pore, you can estimate the size and type of small molecule that could travel through it. You can estimate whether a binding site is accessible to the environment around it. You can determine if it forms a multimer or exists as a single unit. You can see if protein A and protein B have drastically different shapes given similar sequences, which might have implications for its druggability or understanding its function.

e_i_pi_2 · on July 28, 2022

https://alphafold.ebi.ac.uk/entry/W6KDG8

The ribbon shape for GFP is a very cool barrel thing

flobosg · on July 28, 2022

One of my favorites are some antifreeze proteins:

* https://www.rcsb.org/structure/1m8n

* https://iiif.elifesciences.org/lax/05142%2Felife-05142-fig1-...

bckr · on July 28, 2022

> Are people just browsing these shapes and seeing patterns in them

That's one approach.

The thing to understand is that proteins form "binding sites": areas that are more likely to attract other particular regions of proteins or other molecules, or even atoms. Think about hemoglobin. The reason it holds onto oxygen atoms is because it has binding sites.

Binding sites are great because they represent more freedom to do things than molecules typically have. Normal chemistry consists of forming strong electronic bonds between atoms, or forming rigid lattices/crystals.

Binding sites allow molecules to do things like temporarily attach to each other and let each other go under certain circumstances, for instance when another binding site is active/inactive. This can happen through "conformation change", where a molecule bound/unbound on some binding site makes the protein change shape slightly. This is how proteins can act like machines.

> What do the "ribbons" signify

Different regions of the protein have different sequences of amino acids. Amino Acids have somewhat different shapes from each other. The ribbons are actually broader than the spindles (or threads), and less flexible. Not sure about the different colors, maybe someone else can fill in.

> Also, is that what proteins would really look like if you could see at sub-optical wavelength resolutions?

Not really, it's an abstraction. They're big molecules, so if you look closely they're made of atoms, which are (kinda, sorta not really, quantum stuff) spherical.

> So wouldn't they be "thrashing around" like rag doll in blender at room temperature?

Yes, but the attractions between the different parts of the molecule keeps it somewhat under control. So more like an undulating little creature, jellyfish perhaps.

> It seems strange to me that something like that could be so central to life

Yep, gotta remember that it's all statistical. These things are getting made, do their job, breaking, and getting degraded some insane number of times per second. Swarm behavior, sort of.

biofox · on July 28, 2022

Short answer is that the ribbon representation is a visual simplification based on known structures -- they are actually composed of atoms.

They certainly do "thrash around", but that thrashing is constrained by the bonds that are formed, which greatly limits the degrees of freedom. Here's a short video of a simulation to demonstrate:

https://www.youtube.com/watch?v=fggqPtaZj8g

joshuahedlund · on July 28, 2022

I've been going through MIT's online Introduction to Biology course[0] that answers some of your questions here with regards to the shapes and what they signify - specifically the "Proteins and Protein Structure" lessons in the second unit, although some of the previous lectures are helpful setup as well - really interesting and engaging stuff, taught by Eric Lander (who ended up being one of the CRISPR pioneers featured in Isaacson's latest book)

[0]https://learning.edx.org/course/course-v1:MITx+7.00x+2T2022/...

crispyambulance · on Aug 1, 2022

That's cool, I just happened to have picked up a used copy of the text on which the course based... "Molecular Biology of the Cell" -- the huge grey book. Geez, there's a lot of material in there!

Back in the day, I had steered away from chemistry in college because I didn't like to memorize stuff. Now I realize I missed out on some amazing knowledge.

paraph1n · on July 28, 2022

> I recall from school the equipartition theorem-- 1/2 KT of kinetic energy for each degree of freedom. These things obviously have many degrees of freedom. So wouldn't they be "thrashing around" like rag doll in a blender at room temperature?

It's funny you say that, because the first image on the English Wikipedia page for Equipartition Theorem[1] is an animation of the thermal motion of a peptide.

[1]: https://en.wikipedia.org/wiki/Equipartition_theorem

dekhn · on July 28, 2022

BTW, in terms of protein dynamics, before you even think about the thrashing around- 1.2kt at room temperature is enough to form and break hydrogen bonds in real time (around 1-2kcal) so presumably, protein h-bonds are breaking and reforming spontaneously at scale.

roncesvalles · on July 28, 2022

Your "now what?" question is legitimate and reminiscent of reactions after the completion of the Human Genome Project.

Just like having a human genome sequence, this is not a magic key that solves all problems of biology but a building block for use by researchers. An investigator may look up the folded structure of a protein and use that information to glean certain context-specific insights from it such as how exactly two interacting proteins interact mechanically.

The other significant benefit is that this frees up resources that were spent having to figure out the structure in other ways. It's an efficiency improvement.

kdkirsch · on July 29, 2022

Watch this video on DNA polymerase [1]. Obviously it’s an illustration, but I think it helps answer you question because cartoons are great. (MD, not PhD biologist)

[1] https://youtu.be/sKe3UgH1AKg

cletus · on July 28, 2022

The ability for another molecule (probably another protein) to "react" or interact with the protein depends not only on the chemistry but also the shape. An otherwise compatible sequence of atoms might not be able to react because it and the binding site are just incompatibly shaped.

This is hugely important for developing drugs and vaccines.

To see the effect of this look no further than prions. Prions are the exact same protein that are folded in weird ways. Worse, they can "transmit" this misfolded shape to other otherwise normal proteins. Prions behave differently just because of the different shape and can lead to disease. This is exactly what Mad Cow's Disease (BSE) is.

What we get taught in high school about chemistry is incredibly oversimplified.

One example of this I like is the geometry of a water molecule. When we first learn about atoms, we learn the "solar system" model (aka Bohr). The reality is instead that we have 3D probability distributions of where electrons might be. These clouds are in pairs. I believe this is to do with the inverted wavefunction really we're getting beyond my knowledge of quantum mechanics here so that's just a guess.

Well those clouds additionally form valence shells. We learn about these and how atoms want to form completely valence shells. So Oxygen has 8 electrons ie 4 pairs of electrons. When bonding with 2 hydrogen atoms we end up with a weird geometry of ~104.5 degrees between the two hydrogen atoms because of how these pairs interact. The naive assumption might expect that the two hydrogen atoms are 180 degree apart.

So back to proteins, you may have learned about hydrogen bonds. This affects molecular shape because when a hydrogen atom shares an electron, it is often positively charged. That positive charge pushes away other positive charges. This is the realy difficulty in protein folding because with a molecule of thousands of atoms and weird geometry you may find distant parts of the molecule interacting with hydrogen bonds.

So a single cell consists of thousands (IIRC) of different proteins. Figuring out those interactions is important but incredibly difficult.

ghkbrew · on July 28, 2022

In addition to /u/dekhn 's excellent description, this phenomenon is referred to as a protein's "secondary structure" [0]

[0] https://en.m.wikipedia.org/wiki/Protein_secondary_structure

jarenmf · on July 28, 2022

This is probably one of the best applications of AI in science in terms of impact so far. I can't think of any other problem with the same potential impact.

EDIT: grammar

axg11 · on July 28, 2022

AlphaFold is the best counterpoint to tech cynics.

One of the largest public tech companies in the world funded a multi-year scientific project, executed the research flawlessly and moved forward an entire scientific field. They then went on to openly release the code _and_ data, working with a publicly funded organization (EMBL-EBI) to ensure researchers across the globe can easily access the outputs.

I'm not arguing that every tech company is a net positive for humanity. Google itself isn't perfect. Google + DeepMind is setting a bloody high bar though.

edmcnulty101 · on July 28, 2022

Elon Musk moved EVs to the mainstream. Starlink. Has the vision to go to Mars.

Amazon basically put malls out of business which are hugely environmentally destructive.

Bill Gates is doing stuff too I think.

Big tech does some good things.

dirtybird04 · on July 28, 2022

>Amazon basically put malls out of business which are hugely environmentally destructive.

I'm all for tech doing good but this is a hugeeeee stretch

edmcnulty101 · on July 29, 2022

I thought it was a stretch as I typed it :P BUT I hate malls so it improved my life.

calebkaiser · on July 28, 2022

This is definitely one of the most exciting spaces in AI right now. Another somewhat-related startup is PostEra (medicinal chemistry for drug discovery via AI) https://postera.ai/about/

t00 · on July 28, 2022

You are right and when thinking about it I can see 2 problems which I hope in the future can have even more impact:

1. Using AI to determine the most efficient methods of doing mathematical expressions, transformations and computation algorithms - division, square root, maybe traveling salesman - these which take relatively high amount of CPU cycles to compute and are used everywhere. If inputs and outputs can be assigned to it, AI can eventually build a transformation which can be reproduced using a silicon.

2. Physics phenomena in general, not only organic protein, can be measured and with sufficient ability to quantize them to inputs and experimentally obtained outputs to train the network, we could in theory establish new formulas or constants and progress the understanding of the Universe.

lrhegeba · on July 28, 2022

the groundworks, at least partially, happen as you typed this: https://www.nature.com/articles/d41586-021-01627-2

323 · on July 28, 2022

AI translate has probably a bigger worldwide impact so far.

jebarker · on July 28, 2022

jarenmf said "in science" - but it is an interesting question how much automated translation has helped scientists translate papers from other languages.

hijodelsol · on July 28, 2022

It even goes both ways - it allows non-native English speakers to publish their work in correct technical/scientific English with far less barriers.

dalbasal · on July 28, 2022

Can someone put AlphaFold's problem space into perspective for me?

Why is protein folding important? Theoretical importance? Can we do something with protein folding knowledge? If so, what?

I've been hearing about AlphaFold from the CS side. There they seem to focus on protein folding primarily as an interesting space to apply their CS efforts.

axg11 · on July 28, 2022

If we knew:

(a) the structure of every protein (what DeepMind is doing here)

(b) how different protein structures interact (i.e. protein complexes - DeepMind is working on this but not there yet)

Then we could use those two building blocks to design new proteins (drugs) that do what we want. If we solve those two problems with very high accuracy, we can also reduce the time it takes to go from starting a drug discovery programme to approved medicine.

Obtaining all protein structures and determining how they interact is a key step towards making biology more predictable. Previously, solving the structure of a protein was very time consuming. As a result, we didn’t know the structure for a majority of proteins. Now that it’s much faster, downstream research can move faster.

Caveat: we should remember that these are all computational predictions. AlphaFold’s predictions can be wrong and protein structures will still need to be validated. Having said that, lots of validation has already occurred and confidence in the predictions grows with every new iteration of AlphaFold.

jcranmer · on July 28, 2022

> Then we could use those two building blocks to design new proteins (drugs) that do what we want. If we solve those two problems with very high accuracy, we can also reduce the time it takes to go from starting a drug discovery programme to approved medicine.

Drugs are usually not proteins, but instead small molecules that are designed to help or interfere with the operation of proteins instead.

axg11 · on July 28, 2022

That is only true because of our current tools and capabilities. With improved manufacturing techniques and AlphaFold++ I think biologics will dominate. Even still, there are ~2000 approved biologics [0].

[0] - https://purplebooksearch.fda.gov/advanced-search

bckr · on July 28, 2022

Yep, proteins are so much more flexible / precise than small molecules. Also we can get the body to produce them. Think mRNA vaccines.

lamename · on July 28, 2022

How are the predictions validated? Waiting for the old fashioned way for... very difficult crystal structure experiments? Or something else?

f38zf5vdt · on July 28, 2022

Most of them are not, just estimations based on previous results given sequences with known structure.

Every couple years there is a massive competition called CASP where labs submit previously unresolved protein structures derived from experimental EM, x-ray crystallography, or NMR studies and other labs attempt to predict these structures using their software. AlphaFold2 absolutely destroyed the other labs in the main contest (regular monomeric targets, predominantly globular) for structure resolution two years ago, in CASP 14.

https://predictioncenter.org/casp14/zscores_final.cgi

The latest contest, CASP15, is currently underway and expected to end this year. As with all ML, the usual caveats apply to the models Google generated -- the dangers of overfitting to existing structures, artifacts based on the way the problem was modelled, etc

flobosg · on July 28, 2022

> very difficult crystal structure experiments?

Apart from X-ray crystallography there are other methods for structure determination such as nuclear magnetic resonance (NMR) or cryo-electron microscopy (cryo-EM). The latter has seen a dramatic improvement in resolution over the last decade.

RandomLensman · on July 28, 2022

If the predictions are generally good enough, could also skip the validation and directly try to get a desired effect or reaction. That isn't strictly speaking validating the structure, but depending on the use case might be easier to just go for an outcome - really a question of application and cost efficiency.

siver_john · on July 28, 2022

I mean nothing is stopping you from skipping validation with pre-alphafold techniques and say for drug discovery to already do drug screening using the predicted structure. It's just the drug screening software is already error prone so you are still going to have to do some validation. However having an idea on a potential structure means that you can do other techniques that are simpler to validate it that are less expensive/time consuming (I'm thinking of things similar to FRET).

Another idea is these may come into play for anti-verification, so if you are drug screening against a known structure. You could potentially use these more flawed structures of proteins you don't want to target but may be similar, and try to reduce the drug's efficacy at binding them. Or something to that effect. All of that is fun ideas that are currently being explored in that space but we'll see where it takes us.

misnome · on July 28, 2022

For a lot of X-ray crystallography cases, some of the difficulty is working out with no prior information, the actual structure from the collected data. This makes a lot of that... much easier because with https://en.wikipedia.org/wiki/Molecular_replacement something that is "close, but not correct" can be used to bootstrap the actual structure from.

fabiospampinato · on July 28, 2022

You are basically made of proteins, which are basically folded sequences of amino acids, proteins are molecular machines that are the fundamental building block of animals, plants, bacteria, fungi, viruses etc.

So yeah the applications are enormous, from medicine to better industrial chemical processes, from warfare to food manufacturing.

jebarker · on July 28, 2022

> proteins are molecular machines

Does that imply proteins have some dynamics that need to be predicted too? I remember seeing animations of molecular machines that appeared to be "walking" inside the body - are those proteins or more complex structures?

Alop3x · on July 28, 2022

Reference: https://en.wikipedia.org/wiki/Kinesin , https://en.wikipedia.org/wiki/Dynein , https://en.wikipedia.org/wiki/Myosin

They are called motor proteins because they convert chemical energy into kinetic energy. In the case of kinesin, it forms a dimer (two copies of itself bind together to form the two "legs") and also binds to light chains (accessory proteins that modulate its behavior) so that it can walk along filaments and drag cargo around your cells. They are both proteins and more complex structures because multiple proteins are interacting, as well as binding small molecules and catalyzing them into chemical products, all to produce the motion.

gilleain · on July 28, 2022

Yes, very much so. Even for proteins that seems like they are just scaffolding for a catalytic centre can have important dynamics.

A classic example is haemoglobin, that 'just' binds to oxygen at the iron in the middle of the haem. Other binding sites remote from the oxygen binding one can bind to other molecules - notably carbon dioxide. The 'Bohr effect' mechanism is outlined here : https://en.wikipedia.org/wiki/Bohr_effect#Allosteric_interac...

Even at the lowest level, there is some evidence that ligand binding can affect the structure of the backbone of the protein. For example, peptide plane flipping https://en.wikipedia.org/wiki/Peptide_plane_flipping although I'm not sure where the research is on this nowadays.

jebarker · on July 28, 2022

Fascinating. I'm a mathematician turned AI researcher by day. The more I learn about biology the more I feel that our human engineering (hard and soft) is trivially simple compared to what evolution has already created and left for us to study.

gilleain · on July 28, 2022

Heh, I'm somewhat the opposite, as I went from a degree in biochemistry into bioinformatics (protein folds and structure) and have an amateur interest in maths.

I often recommend the book 'Cats' Paws and Catapults' by Steven Vogel, which discusses the differences between human and biological engineering. There are very different constraints, as well as the more obvious difference between intelligently directed design (by humans) and evolutionary design.

It's not totally true that our engineering is simpler than biological systems. After all, we can now manipulate matter down to the atomic level. There have been attempts to move towards self-assembling systems, or even self-repairing ones. Not quite on the same scale or sophistication as biological systems, of course.

Is a modern computer chip more complex than a virus? Is a large connected system of computers more complex than a mycelial network? Are there more parts connected in more ways in a nuclear power station as there are in a simple bacterial cell?

I think it is definitely worth considering the similarities and differences.

jebarker · on July 28, 2022

Definitely interesting comparisons.

I think my sense that biology is more complex than human engineering is that our engineering seems much more homogeneous. Computer chips are certainly very dense but the individual number of different component types are very small. Biology on the other hand seems to have a huge number of fundamental building blocks, e.g. proteins, with different behaviors. I suppose that maybe that's the wrong comparison though and software is where our engineering diversity is present.

It may well just be the case that my lack of familiarity with biology makes it seem more complex to me.

ncmncm · on July 28, 2022

One person can, at least in principle, understand every last thing in a nuclear reactor. A few elements have roles at different levels of abstraction. And of course they have computers in them that invariably do stuff radically simpler than they could be doing.

But nobody will ever understand everything about a natural cell. Levels are an ad hoc phenomenon adhered to in varying degrees in certain places to contain variation. But few elements have only one role at one identifiable level, and you can never be sure one doesn't have another. And huge amounts of apparatus might radically change behavior in a subtly different environment.

gilleain · on July 28, 2022

You're right in focusing on the diversity of the parts. After all, a large pile of sand has a lot of parts, but they are all (to a decent approximation) the same. Therefore (?) they have the same small set of interactions, and the whole system has a 'small' set of states. (However, see 'self-organized criticality').

Software is where the diversity is, for computers. A cells systems are a set of interlocking networks of DNA, RNA, protein, small molecules, and signalling. As you say, these diverse parts then interact in diverse ways.

Another aspect is the large number of 'levels' in biological systems - humans have around seven; atoms, molecules, organelles, cells, tissues, organs, whole system. The interactions within and between these levels is a further source of complexity.

Balgair · on July 29, 2022

May career path took me from the very hard sciences to the biological ones. Though bio is experimentally trivial compared to the complexity of astrophysics, bio theory is millennia harder.

One of my favorite quotes about biology goes something like:

"An alien spaceship lands in your yard. You are kindly invited in to study and tour the ship. The technology is clearly billions of years more advanced than our own. Not decades more, not millennia more, not megayears more, but clearly gigayears more.

This is biology"

Sorry, can't find the exact quote and citation right now!

flobosg · on July 29, 2022

Bert Hubert, “Our amazing immune system”: https://berthub.eu/articles/posts/immune-system/

Balgair · on July 29, 2022

Thank you!

fabiospampinato · on July 28, 2022

Yes proteins can be dynamic, and multiple proteins can come together to form bigger structures (like a cell's membrane for example), and they can very much interact with other proteins and molecules in complex ways.

I think AlphaFold gives us more of a static picture that doesn't tell us much about how the protein will interact with the world. Predicting these dynamic aspects, protein-to-protein interactions, and ultimately an entire cell, is probably what they'll like to do next.

I don't know if the walking buddy is a single protein, it may be a handful of different ones joined together.

feet · on July 28, 2022

As others have already mentioned, proteins are the machinery of the cell. They perform an immense array of functions and they must fold in a certain way to perform these functions. This is part of what's known as the structure-function relationship.

Misfolded proteins are contributors to numerous pathological conditions and the more we can understand about how and why this folding happens, the better we can treat these conditions.

Another aspect is that while we can at least partially determine the primary structure (the amino acid sequence) of proteins from DNA and RNA, we don't necessarily know their secondary or tertiary structures (3 dimensional conformation). This is a key piece of the puzzle for figuring out how these proteins do their proteiny things and how they interact with other proteins and even how they form quaternary structures with other proteins (an assembly of multiple proteins that perform some function, many pores are assemblies like this). Once we know these structures and understand how they work on a structural and chemical level, we can manipulate them far more easily.

In order to do rational drug design, which is designing a drug for a specific target or active site on a protein, we need to understand these structures. Working to solve protein folding is a key step in treating disease states and understanding how cells work on a fundamental level. The impact is hard to understate.

turkeygizzard · on July 28, 2022

My understanding is that protein folding is a major cost bottleneck in drug design.

Researchers can come up with candidate molecule formulas that might work as good drugs, but the problem is that these proteins organize/fold themselves physically in a hard-to-predict way. And how they fold directly affects their properties as drugs.

If AlphaFold can accurately predict folding, it’ll allow researchers to prioritize drug candidates more accurately which will reduce research time and costs. Supposedly the major pharmaceutical companies can spend up to billions when designing a single drug. Optimistically, predicting protein folding better will allow for much more rapid and cheaper drug development

ray__ · on July 28, 2022

I love AlphaFold, but this is a big misconception. The biggest cost bottle neck in drug development and design, by orders of magnitude, is associated with assaying (and potentially reducing) off-target binding or toxicity and assaying (and potentially increasing) efficacy. Determining a protein structure empirically with cryoEM, NMR, or crystallography will generally cost less than $1M (sometimes far less), which is tiny compared to the many millions or billions of dollars that get poured into clinical trials for a single drug. AF2 is useful in some basic research cases but isn't really that useful for traditional drug design and development.

A machine learning approach for predicting toxicity would have a far greater impact on public health than AF2 does.

sanxiyn · on July 28, 2022

My understanding is that protein folding is not a bottleneck in drug design.

Yes, once you identified a target protein, its structure is useful to selectively target it. But the main bottleneck is identifying such targets. In other words, the main difficulty is to figure out what to hit, not how to hit it, and protein folding mostly helps with how at the moment.

epups · on July 28, 2022

Proteins are what makes everything in a cell work. They are produced as a "linear" structure that must fold into a proper shape to execute its function, such as acting as a pore that only lets a specific chemical through the cell membrane.

The importance here is to figure out potential targets for treatments that take into account particularities of certain proteins. That could produce better drugs with less side effects.

pbw · on July 28, 2022

The genome, all of our DNA combined, is just a bunch of 1D strings like "cgtattctgcttgta". Those strings encode proteins, which fold up into a 3D shape once created. This 3D shape is what determines what the protein actually does inside the cell. Without understanding protein folding we don't understand what the DNA actually does.

flobosg · on July 28, 2022

This might be an interesting resource for you: https://pdb101.rcsb.org/

Freebytes · on July 28, 2022

The applications and importance has been discussed, but let me explain why what we are doing right now does not work which will also emphasize the importance of this.

At this time, we create drugs, test them on animals, and see what the side effects and results actually are. We are very limited in our capabilities and basically throw mud at the wall and see what sticks. This would allow us to try potential drug candidates without so much randomness.

green-eclipse · on July 28, 2022

There are a million articles and podcasts explaining exactly your question. Those will be better than HN responses. I suggest you take 15 seconds to Google it.

dev_tty01 · on July 28, 2022

Yes there are a million articles. That is why asking a question here on HN is useful. The HN community more often than not offers intelligent insight as well as curated recommended links for learning more about a topic. Yes, the signal-to-noise ratio isn't perfect on HN, but it is a lot better than random Google searches. If nothing else, it often leads to enough basic understanding so that someone can then perform more refined and therefore more productive Google searches. I appreciate the basic questions and the generous nature of many HN commenters who offer thoughtful responses.

empiricus · on July 28, 2022

The endgame of protein folding is nanotechnology. Life is also nanotechnology. So basically the end game is to take full control of life: change existing life forms, design new life forms, new ecosystems. Eat the earth first, then the universe. Not yet there, but getting much closer.

matheusmoreira · on July 28, 2022

Proteins are the molecular machines of all living beings on this planet. They do almost everything. We need to understand their structure and how that structure changes when they perform some function. Essentially we have to reverse engineer nature's machines.

Biology has traditionally studied these things from the outside in, top down. Advances in biology are correlated with advances in microscope technology. We went from seeing with the naked eye to being able to magnify tissue to see cells and even what's inside of them.

Studying biology from the inside out, bottom up is significantly harder. You have chemistry which is hard science. On top of that you have biochemistry which is equally rigorous. Above that level of abstraction things start to get a lot less well defined. We go from organic chemistry (carbon, hydrogen, oxygen, etc.) to more complex things found in life (aminoacids, carbohydrates, lipids) to polymers of those things (proteins, polysaccharides, triglycerides) to a point where they get so big and complex their tridimensional structure starts influencing how they interact with other molecules then we start seeing aggregates (quaternary proteins) of these complex things!

Important proteins like ATP synthase are well understood (it's a friggin' turbine like in a hydroelectric power plant!) but there are so many others haven't received the same attention. Bioinformatics technology is changing that. The human genome's been mapped. Now it's time to map the proteome.

andrepd · on July 28, 2022

https://en.m.wikipedia.org/wiki/Protein_folding

epups · on July 28, 2022

AlphaFold is a phenomenal tool that demonstrates how AI can already outclass humans for certain tasks. It is a prime example of a problem space where conventional approaches are simply inferior, and that AI is not just a fancy name but can be extraordinarily powerful.

hackernewds · on July 28, 2022

This reads like something created by an AI. And the mere fact that I can't be sure if it is or not now is testament to the statement this is making

jcims · on July 28, 2022

Just for shits & giggles.

Prompt:

Write a short abstract for an article on why AlphaFold isn't just a tremendous technological achievement, but an example of how Artificial Intelligence will allow humanity to continue to achieve technological progress in the future:

Response:

When DeepMind’s AlphaFold was announced as the winner of the CASP13 protein folding prediction competition, it was hailed as a major breakthrough in both artificial intelligence (AI) and biology.

While AlphaFold is undoubtedly a remarkable achievement, it is also a clear example of how AI can be used to solve complex problems that have traditionally been seen as beyond the reach of machines.

In particular, AlphaFold represents a significant advance in the ability of AI systems to learn from limited data. Unlike previous AI approaches to protein folding, which relied on large datasets of known protein structures, AlphaFold was able to produce accurate predictions with only a small number of known protein structures to work from.

This ability to learn from limited data is likely to be increasingly important as we face problems that are too complex for us to fully understand and collect all the data we would need to solve them using traditional methods.

In the future, AI systems like AlphaFold will become increasingly important in helping us to make progress on a wide range of pressing challenges, from developing new medicines to tackling climate change.

beanjuice · on July 28, 2022

To add to this:

Prompt: Write a hackernews comment about why artificial intelligence will never be profitable

Response: Artificial intelligence will never be profitable because the cost of developing and maintaining the hardware and software is too high. Also, the market for AI is still too small and the demand is not high enough to justify the costs.

TremendousJudge · on July 28, 2022

Almost perfect, needs a bit more snark

btdmaster · on July 28, 2022

Prompt: Write a hackernews comment about why artificial intelligence will never be profitable

Response: Artificial intelligence will never be profitable because the cost of developing and maintaining the hardware and software is too high. Also, the market for AI is still too small and the demand is not high enough to justify the costs.

Add snark to your response:

Prompt: Write a hackernews comment about why artificial intelligence will never be profitable

Response: Artificial intelligence will never be profitable because the world has limited resources, and I can make more money selling cocaine than selling AI.

From GPT-NeoX 20B

TremendousJudge · on July 28, 2022

Well, it's not wrong. But I think that comment would get flagged, it parses too obviously an off-topic joke

jcims · on July 28, 2022

Prompt:

The following exchange is from HackerNews. The first long-winded comment is from an AI evangelist, extolling the virtues of artificial intelligence with a starry eyed perspective that considers only the value and not the cost. The second comment is in reply to the first, and is a very short and blunt takedown of the previous comment and the perspective of its author:

AIRocksMaaan:

Response:

AI is definitely the future. Just think about all of the possibilities that it can bring! For instance, it can help us create more efficient algorithms, it can help us automate repetitive tasks, and it can even help us make better decisions by providing us with more accurate data.

Not to mention, it can also help us create new and innovative products and services that can improve our quality of life.

AI definitely has a lot of potential and I think we should definitely explore it further.

threethirty:

This is all bullshit.

epups · on July 28, 2022

It's a new tool, AlphaFool

desindol · on July 28, 2022

It's the superstition and the amount of parasite words.

nitwit005 · on July 28, 2022

> demonstrates how AI can already outclass humans for certain tasks

I'm not sure how clear the edge over humans in this case is. There were some attempts at machine assisted human solving like Foldit that did produce results: https://en.wikipedia.org/wiki/Foldit#Accomplishments

sabujp · on July 28, 2022

Many thanks to Deepmind for releasing predicted structures of all known protein monomers. What I'd like next is for Alphafold (or some other software) to be able to show us multimeric structures based on the single monomer/subunit predictions and protein-protein interactions (i.e. docking). For example the one I helped work on back in my structural biology days was the circadian clock protein KaiC : https://www.rcsb.org/structure/2GBL, that's the "complete" hexameric structure that shows how each of the subunits pack. The prediction for the single monomer that forms a hexamer is very close to the experimental https://alphafold.ebi.ac.uk/entry/Q79PF4 and in fact shows the correct structure of AA residues 500 - 519 which we were never able to validate until 12 years later (https://www.rcsb.org/structure/5C5E) when we expressed those residues along with another protein called KaiA which we knew binds to the "top" CII terminal (AAs 497-519) of KaiC. If we would have had this data then, it would have allowed us to not only make better predictions about biological function and protein-protein interactions but would have helped better guide future experiments.

What we can do with this data now is use methods such as cryo-em to see the "big picture", i.e. multi-subunit protein-protein interactions where we can plug in the Alphafold predicted structure into the cryo-em 3d density map and get predicted angstrom level views of what's happening without necessarily having to resort to slower methods such as NMR or x-ray crystallography to elucidate macromolecular interactions.

A small gripe about the alphafold ebi website: it doesn't seem to show the known experimental structure, it just shows "Experimental structures: None available in PDB". For example the link to the alphafold structure above should link to the 2GBL, 1TF7, or any of the other kaic structures from organism PCC7942 at RCSB. This would require merging/mapping data from RCSB with EBI and at least doing some string matching, hopefully they're working on it!

arolihas · on July 28, 2022

You might be interested in https://www.biorxiv.org/content/10.1101/2021.10.04.463034v2

alphabetting · on July 28, 2022

Obtaining this dataset prior to alphafold would have cost on the order of $200 trillion. https://twitter.com/wintonARK/status/1552653527670857729

Anyone knowledgeable know if this estimate is accurate? Insane if true

green-eclipse · on July 28, 2022

It's impossible to really put a number on it, because the task itself was impossible. PHDs and the field's top scientists simply couldn't figure out many complicated protein structures after years of attempts, and the fact that there's so many (200M+) mean that the problem space is vast.

dekhn · on July 28, 2022

It doesn't make any sense on multiple levels. This is a computational prediction and there was no computational alternative- for many of these proteins would never have had a structure solved even if you spent the money. They are just taking $cost_per_structure_solved * number_of_remaining_structures and assuming that things scale linearly like that.

Note that crystallographers are now using these predicftions to bootstrap models of proteins they've struggled to work with, which indicates the level of trust in the structural community for these predictions is pretty high.

shauryamanu · on July 28, 2022

Even if that's exaggerated, it might have taken significant time to reach to this stage. Probably on the order of >50 years.

knbknb · on Aug 2, 2022

Off the top of my head:

(200 trillion cost) / (200 million structures predicted) = 1 million per structure.

That reflects the personnel cost (5 Yr PHP scholarship, PostDoc/Prof mentorship; inverstment+depreciation for the lab equipment). All this to crystallize 1 structure and characterize its folding behavior.

I don't know if this calculation is too simplistic, just coming up with something.

bifftastic · on July 28, 2022

How do they know their structures are correct?

lrem · on July 28, 2022

Disclaimer: I work in Google, organizationally far away from Deep Mind and my PhD is in something very unrelated.

They can't possibly know that. What they know is that their guesses are very significantly better than the previous best and that they could do this for the widest range in history. Now, verifying the guess for a single (of the hundreds of millions in the db) protein is up to two years of expensive project. Inevitably some will show discrepancies. These will be fed to regression learning, giving us a new generation of even better guesses at some point in the future. That's what I believe to be standard operating practice.

A more important question is: is today's db good enough to be a breakthrough for something useful, e.g. pharma or agriculture? I have no intuition here, but the reporting claims it will be.

f38zf5vdt · on July 28, 2022

The press release reads like an absurdity. It's not the "protein universe", it's the "list of presumed globular proteins Google found and some inferences about their structure as given by their AI platform".

Proteins don't exist as crystals in a vacuum, that's just how humans solved the structure. Many of the non-globular proteins were solved using sequence manipulation or other tricks to get them to crystallize. Virtually all proteins exist to have their structures interact dynamically with the environment.

Google is simply supplying a list of what it presumes to be low RMSD models based on their tooling, for some sequences they found, and the tooling is based itself on data mostly from X-ray studies that may or may not have errors. Heck, we've barely even sequenced most of the DNA on this planet, and with methods like alternative splicing the transcriptome and hence proteome has to be many orders of magnitude larger than what we have knowledge of.

But sure, Google has solved the structure of the "protein universe", whatever that is.

dekhn · on July 28, 2022

People have been making grand statements about the structure of the protein universe for quite some time (I've seen a fair number of papers on this, such as https://faseb.onlinelibrary.wiley.com/doi/abs/10.1096/fasebj... and https://faseb.onlinelibrary.wiley.com/doi/abs/10.1096/fasebj... from a previous collaborator of mine).

Google didn't solve the structure of the protein universe (thank you for saying that). But the idea of the protein structure universe is fairly simple- it's a latent space that allows for direct movement over what is presumably the rules of protein structures along orthogonal directions. It would encompass all the "rules" in a fairly compact and elegant way. Presumably, superfamilies would automagically cluster in this space, and proteins in different superfamilies would not.

lrem · on July 28, 2022

I recognize your superior knowledge in the topic and assume you're right.

But you also ignore where we're at in the standard cycle:

https://phdcomics.com/comics/archive_print.php?comicid=1174

;)

f38zf5vdt · on July 28, 2022

That's exactly what this is, but it's embarrassing that it's coming from somewhere purported to be a lab. Any of the hundreds or more of labs working in protein structure prediction for the past 50 years could have made this press release at any time and said, "look, we used a computer and it told us these are the structures, we solved the protein universe!"

It's not to diminish the monumental accomplishment that was the application of modern machine learning techniques to outpace structure prediction in labs, but other famous labs have already moved to ML predictions and are competitive with DeepMind now.

gfodor · on July 28, 2022

> but other famous labs have already moved to ML predictions and are competitive with DeepMind now.

it seems obvious this was going to happen, because https://github.com/deepmind/alphafold

VikingCoder · on July 28, 2022

> but other famous labs have already moved to ML predictions and are competitive with DeepMind now.

That's great! AlphaFold DB mas made 200 million structure predictions available for everyone. How many structure predictions have other famous labs made available for everyone?

f38zf5vdt · on July 28, 2022

As many as you wanted to throw at them, considering the vast majority are open source and could be run on your own server cluster. CASP15 is ongoing so by the end of the year we will know how much absolute progress has been made by others.

Google has the advantage of the biggest guns here: the fastest TPUs with the most memory in the biggest clusters, so running inference with a massive number of protein sequences is much easier for them.

dekhn · on July 28, 2022

Many teams have been predicting structures for enormous numbers of proteins for some time now. It's just that nobody had any confidence in the predictions.

gilleain · on July 28, 2022

edit: I should have read the post first! What do you mean 'only globular proteins'? They say they have predictions for all of UniProt...

---------------

Yes, the idea of a 'protein universe' seems like it should at least encompass 'fold space'.

For example, WR Taylor : https://pubmed.ncbi.nlm.nih.gov/11948354/

I think the rough estimate was that there were around 1000 folds - depending on how fine-grained you want to go.

Absolutely agree, though, that a lot of proteins are hard to crystalise (i understand) due to being trans-membrane or just the difficulty of getting the right parameters for the experiment.

f38zf5vdt · on July 28, 2022

I don't think non-globular proteins are well represented by the predictions. All our predictions for proteins are based on proteins we were able to crystallize, so my guess is that even if many of them aren't globular proteins the predictions themselves are made from the foundations of structures we do have, which are predominantly globular proteins and it's presumed that the inference treats folding as if they were globular and crystallized (non-dynamic). X-ray crystallography and fitting to electron density maps itself is a bit of an art form.

For example for transmembrane proteins, there is a gross under-representation of structures derived from experimental evidence, so we would expect that whatever your algorithm is "solving" is going to have a much higher degree of error than globular proteins, and likely artifacts associated with learning from much more abundant globular proteins.

edit: As an example, "Sampling the conformational landscapes of transporters and receptors with AlphaFold2". AF2 was able to reproduce the alternative conformations of GPCRs, but only with non-default settings. With default settings there is clear evidence of overfitting.

> Overall, these results demonstrate that highly accurate models adopting both conformations of all eight protein targets could be predicted with AF2 by using MSAs that are far shallower than the default. However, because the optimal MSA depth and choice of templates varied for each protein, they also argue against a one-size-fits-all approach for conformational sampling.

gilleain · on July 28, 2022

Fair point. I guess if their training data is biased towards existing known structures (via xray or nmr or whatever) then there is the risk of incorrect predictions.

At a guess, the core packing in non-globular proteins might be different? Also the distribution of secondary structure might also vary between classes. Might be worth someone studying how much structural constraints depend on fold (if they have not already).

luma · on July 28, 2022

Same as any other prediction I'd presume. Run it against a known protein and see how the answer lines up. Predict the structure of an unknown protein, then use traditional methods (x-ray crystallography, maybe STEM, etc) to verify.

gilleain · on July 28, 2022

As a simple example, one measure used to compare a predicted structure against a reference is the RMSD (root mean square deviation).

https://en.m.wikipedia.org/wiki/Root-mean-square_deviation_o...

The lower the RMSD between two structures, the better (up to some limit).

iandanforth · on July 28, 2022

"Verify" is almost correct. The crystallography data is taken to be "ground truth" and the predicted protein structure from AlphaFold is taken to be a good guess starting point. Then other software can produce a model that is a best fit to the ground truth data starting from the good guess. So even if the guess is wrong in detail it's still useful to reduce the search space.

christudor · on July 28, 2022

This is exactly right.

christudor · on July 28, 2022

This video goes some way to explaining how they know the structures are correct: https://www.youtube.com/watch?v=vXZzftX03VY

tomrod · on July 28, 2022

This is the right line of questioning.

As we solve viewability into the complex coding of proteins, we need to be right. Next, hopefully, comes causal effect identification, then construction ability.

If medicine can use broad capacity to create bespoke proteins, our world becomes both weird and wonderful.

seydor · on July 28, 2022

they don't but they are more correct than what others have predicted. Some of their predictions can be compared with structures determined with x-ray crystallography

cupofpython · on July 28, 2022

did they come up with their structures independently of the x-ray crystallography, or was that part of a ML dataset for predicting structure

unlikelymordant · on July 28, 2022

The casp competition that they won consists of a bunch of new proteins, the structures of which havnt been published. So the test set is for brand new proteins in that case.

cupofpython · on July 28, 2022

nice, very cool then

__rito__ · on July 28, 2022

They won a decades-long standing challenge predicting the protein structures of a much smaller (yet significantly quite large) set of proteins using a model (AlphaFold).

Then they use the model to predict more.

Although we don't know if they are correct, these structures are the best (or the least bad) we have for now.

ArnoVW · on July 28, 2022

We know the structure of some proteins. It's not that it's impossible to measure, it's just very expensive. This is why having a model that can "predict" it is so useful.