It also proved that deep learning models are a valid approach to bioinformatics - for all its flaws and shortcomings, AlphaFold predicts the structure of an arbitrary protein in minutes on commodity hardware, whereas previous approaches were, well, this: https://en.wikipedia.org/wiki/Folding@home
A gap between biological research and biological engineering is that, for bioengineering, the size of the potential solution space and the time and resources required to narrow it down are fundamental drivers of the cost of creating products - it turns out that getting a shitty answer quickly and cheaply is worth more than getting the right answer slowly.
AlphaFold and Folding@home attempt to solve related, but essentially different, problems. As I already mentioned here, protein structure prediction is not fully equivalent to protein folding.
Yeah, this is what I mean by "a shitty answer fast" - structure prediction isn't a canonical answer, but it's a good enough approximation for good enough decision-making to make a bunch of stuff viable that wouldn't be otherwise.
I agree with you, though - they're two different answers. I've done a bunch of work in the metagenomics space, and you very quickly get outside the areas where AlphaFold can really help, because nothing you're dealing with is similar enough to already-characterized proteins for the algorithm to have enough to draw on. At that point, an actual solution for protein folding that doesn't require a supercomputer would make a difference.
> this is what I mean by "a shitty answer fast" - structure prediction isn't a canonical answer
A proper protein structural model is an all-atom representation of the macromolecule at its global minimum energy conformation, and it is the expected end result of the folding process; both are equivalent and thus equally canonical. The “fast” part, i.e., the decrease in computational time, comes mostly from the heuristics used for conformational space exploration. Structure prediction skips most of the folding pathway/energy funnel, but ends up at the same point as a completed folding simulation.
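To make the "skip the pathway, same endpoint" idea concrete, here's a toy Python sketch: the 1D "landscape" and its parameters are made up for illustration and stand in for a real all-atom energy function, so nothing here resembles actual folding, but direct minimization and a long noisy trajectory both settle into the same minimum.

```python
# Toy illustration (not real folding): on a smooth 1D "energy landscape",
# direct minimization and a long noisy trajectory end at the same minimum.
# A real all-atom force field and conformational space are vastly harder;
# this only sketches the "skip the pathway, same endpoint" idea.
import numpy as np

def energy(x):
    # stand-in for an all-atom potential: single minimum near x ≈ 2
    return (x - 2.0) ** 2 + 0.1 * np.sin(5 * x)

def grad(x, h=1e-5):
    return (energy(x + h) - energy(x - h)) / (2 * h)

# "Structure prediction" analog: jump straight toward the minimum.
x_pred = 0.0
for _ in range(2000):
    x_pred -= 0.01 * grad(x_pred)

# "Folding simulation" analog: overdamped Langevin dynamics from an unfolded start.
rng = np.random.default_rng(0)
x_sim, dt, kT = -3.0, 0.01, 0.05
for _ in range(50_000):
    x_sim += -grad(x_sim) * dt + np.sqrt(2 * kT * dt) * rng.normal()

print(f"minimization endpoint: {x_pred:.2f}, simulation endpoint: {x_sim:.2f}")
```

Both end up at roughly the same place; the simulation just spends tens of thousands of steps getting there.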
> At that point, an actual solution for protein folding that doesn't require a supercomputer would make a difference.
Or getting more representative sequences and enough variants from additional metagenomic surveys, for example. Of course, this might not be easily achievable.
> ends up at the same point as a completed folding simulation.
Well, that's the hope, at least.
> Or more representative sequences and enough variants by additional metagenomic surveys, for example. Of course, this might not be easily achievable.
For sure, but for ostensibly profit-generating enterprises, it's pretty much out of the picture.
I think the reason an actual computational solution for folding is interesting is that the existing set of experimentally verified protein structures is limited to proteins we could isolate and crystallize (which is also the training set for AlphaFold, so that's where its predictions are strongest, and even within that it only catches certain conformations of the proteins). So even if you can get a large set of metagenomic surveys and a large sample of protein sequences, the limitations of the methods for experimentally verifying protein structures mean we're restricted to a certain section of the protein landscape. A general-purpose, computationally tractable method for simulating protein folding under various conditions could be a solution for those cases where we can't actually physically "observe" the structure directly.
Most proteins don't fold to their global energy minimum - they fold to a collection of kinetically accessible states. Many proteins fail to reach the global minimum because of intermediate barriers surrounding states that are easily reached from the unfolded state.
Attempting to predict structures with mechanisms that simulate the physical folding process wastes immense amounts of energy and time sampling very uninteresting areas of conformational space.
You don't want to use a supercomputer to simulate folding; it can be done with a large collection of embarrassingly parallel machines much more cheaply and effectively (the pattern is sketched below). I proposed a number of approaches on supercomputers and was repeatedly told no because the codes didn't scale to the full supercomputer, and supercomputers are designed and built for codes that scale really well on non-embarrassingly-parallel problems. This is the reason I left academia for Google - to use their idle cycles to simulate folding (and to do protein design, which also works best with embarrassingly parallel processing).
As far as I can tell, only extremely small and simple proteins (like ribonuclease) fold to somewhere close to their global energy minimum.
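For what it's worth, here's a toy sketch of that embarrassingly parallel pattern: many completely independent trajectories on a made-up energy function, no communication between them, and a cheap reduce at the end to keep the best result. The energy function and the simulated-annealing moves are placeholders, not a real force field or the actual Folding@home/Google setup.

```python
# Toy sketch of the embarrassingly parallel pattern: many independent
# trajectories, no communication between them, keep the lowest-energy result.
# The "energy" here is a stand-in, not a real force field.
import math
import random
from multiprocessing import Pool

def energy(x):
    return (x - 2.0) ** 2 + 0.3 * math.sin(5 * x)

def one_trajectory(seed):
    """One independent simulated-annealing run; shares nothing with the others."""
    rng = random.Random(seed)
    x = rng.uniform(-10, 10)
    e = energy(x)
    for step in range(20_000):
        temp = 1.0 * (1 - step / 20_000) + 1e-3   # simple linear cooling schedule
        x_new = x + rng.gauss(0, 0.1)
        e_new = energy(x_new)
        if e_new < e or rng.random() < math.exp((e - e_new) / temp):
            x, e = x_new, e_new
    return e, x

if __name__ == "__main__":
    with Pool() as pool:   # scales to however many cores/machines you have
        results = pool.map(one_trajectory, range(32))
    best_e, best_x = min(results)
    print(f"best of 32 independent runs: x = {best_x:.2f}, energy = {best_e:.3f}")
```

Each call to one_trajectory could just as well run on a different machine; nothing needs to talk to anything else until the trivial reduce at the end, which is exactly why this workload doesn't want a tightly coupled supercomputer.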
Except, you know, if you're trying to understand the physical folding process...
There are lots of enhanced sampling methods out there that get at the physical folding process without running just vanilla molecular dynamics trajectories.
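To give a flavour of what "enhanced sampling" means here, below is a toy replica-exchange (parallel tempering) sketch on a 1D double well. Metropolis Monte Carlo stands in for MD, and the potential and temperatures are invented; real methods (replica exchange MD, metadynamics, umbrella sampling) are far more involved.

```python
# Toy replica-exchange (parallel tempering) sketch on a 1D double well:
# a hot replica crosses the barrier easily, and periodic swaps let the cold
# replica visit both wells far sooner than a single vanilla trajectory would.
import math
import random

def energy(x):
    return (x * x - 1.0) ** 2 * 5.0   # double well with minima at x = ±1

def mc_step(x, beta, rng):
    x_new = x + rng.gauss(0, 0.2)
    if rng.random() < math.exp(min(0.0, -beta * (energy(x_new) - energy(x)))):
        return x_new
    return x

rng = random.Random(0)
betas = [5.0, 0.5]                     # cold and hot replicas (1/kT)
xs = [-1.0, -1.0]
cold_visits = {"left": 0, "right": 0}

for step in range(50_000):
    xs = [mc_step(x, b, rng) for x, b in zip(xs, betas)]
    if step % 100 == 0:                # attempt a replica swap
        delta = (betas[0] - betas[1]) * (energy(xs[0]) - energy(xs[1]))
        if rng.random() < math.exp(min(0.0, delta)):
            xs.reverse()
    cold_visits["right" if xs[0] > 0 else "left"] += 1

print(cold_visits)   # cold replica samples both wells thanks to the swaps
```

On its own, the cold replica would essentially never cross the barrier in this many steps; the swaps with the hot replica are what let it see both wells.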
> It also proved that deep learning models are a valid approach to bioinformatics
A lot of bioinformatics tools using deep learning appeared around 2017-2018. But rather than being big breakthroughs like AlphaFold, most of them were just incremental improvements to various technical tasks in the middle of a pipeline.
And since a lot of those tools were only incremental improvements, they disappeared again, imho - what's the point of 2% higher accuracy when you need a GPU you don't have?
Not many DL-based tools get regularly applied in genomics these days, from what I see. Maybe Tiara for 'high level' taxonomic classification, DeepVariant in some papers for SNP calling, and that's about it? Some interesting gene prediction tools are coming up, like Tiberius. AlphaFold, of course.
Lots of papers but not much day-to-day usage from my POV.
Most Oxford Nanopore basecallers use DL these days. And if you want a high-quality de novo assembly, DL-based methods are often used for error correction and final polishing.
There are a lot of differences between the cutting-edge methods that produce the best results, the established tools the average researcher is comfortable using, and whatever you are allowed to use in a clinical setting.
AlphaFold doesn’t work for engineering though. Getting a shitty answer ends up being worse than useless.
It seems to really accelerate the productivity of researchers investigating biomolecules, or molecules very similar to existing biomolecules. But not de novo stuff.
That's just not true. In a lot of cases in engineering there are 10,000,000 possibilities, and deep learning shows you 100 potentially more promising ones to double-check, and that's worth huge amounts of money (a toy version of that screening loop is sketched below).
In a lot of cases deep learning can simulate complex systems at a precision that is more than good enough, where the problem would otherwise not be tractable (as is the case with AlphaFold), and again this is especially valuable if you can double-check the output.
Ofc, in the field of language and vision and in a lot of other fields, deep learning is straight up the only solution.
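Purely to illustrate the shape of that screening loop: the "model" below is a stand-in linear scorer rather than a trained network, and the candidate set is synthetic and scaled down to a million rows to keep memory modest.

```python
# Toy version of the screening loop: score a huge candidate pool with a cheap
# surrogate, keep the top 100, and spend the expensive verification
# (assay, physics simulation, expert review) only on those.
# `surrogate_score` stands in for a trained DL model; it is not one.
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.normal(size=(1_000_000, 8))        # 1M candidates, 8 features each

def surrogate_score(x):
    # stand-in for model.predict(x): fast, approximate, vectorised
    return x @ np.array([0.9, -0.4, 0.2, 0.0, 0.7, -0.1, 0.3, 0.5])

def expensive_check(x):
    # stand-in for the slow ground-truth evaluation you can only afford ~100 times
    return surrogate_score(x) + rng.normal(scale=0.5, size=len(x))

scores = surrogate_score(candidates)                # cheap: one pass over everything
top100 = np.argpartition(scores, -100)[-100:]       # indices of the 100 best-looking ones
verified = expensive_check(candidates[top100])      # expensive: only 100 calls' worth
best = top100[np.argmax(verified)]
print(f"screened {len(candidates):,} candidates, verified 100, picked #{best}")
```

The economics live in the last few lines: the cheap pass touches everything, the expensive check only touches the survivors.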
Eh, in many cases for actual customer-facing commercial work, they're sticking remarkably close to stuff that's in genbank/swissprot/etc - well-characterized molecules and pathways, because working with genuinely de novo stuff is difficult and expensive. In those cases, AlphaFold works fine - it always requires someone to actually look at the results and see whether they make sense or not, but also "the part of the solution space where the tools work" is often a deciding factor in which approach is chosen.