Shameless self-promotion: I wrote one of the more cited papers in the field [0], back in 2016.
A key challenge: very few labs have enough data.
Something I view as a key insight: a lot of labs are doing absurdly labor intensive exploratory synthesis without clear hypotheses guiding their work. One of our more useful tasks turned out to be interactively helping scientists refine their experiments before running them.
Another was helping scientists develop hypotheses for _why_ reactions were occurring, because they hadn't been able to build principled models of which properties were predictive of reaction formation.
Going all the way to synthesis is nice, but there's a lot of lower hanging fruit involved in making scientists more effective.
This is true. Getting datasets with the necessary quality and scale for molecular ML is hard and uncommon. Experimental design is also a huge value add, especially given the enormous search space (estimates suggest there are more possible drug-like structures than there are stars in the universe). The challenge is figuring out how to do computational work in a tight marriage with the lab work to support and rapidly explore the hypotheses generated by the computational predictions. Getting compute and lab to mesh productively is hard. Teams and projects have to be designed to do so from the start to derive maximum benefit.
Also shameless plug: I started a company to do just that, anchored to generating custom million-to-billion point datasets and using ML to interpret and design new experiments at scale.
> A key challenge: very few labs have enough data.
It is also getting harder, not easier, to get.
I am working right now on a retrosynthesis project. Our external data provider is raising prices while removing functionality, and no one bats an eye. At the same time, our own data is considered a business secret and therefore impossible to share.
As someone who does NLP research where the code, data and papers are typically free, this drives me insane.
Experienced chemists can look at a molecule diagram and have an intuition as to its activity and similarity to other known molecules. It's like most of science and math: most discoveries begin with intuition and are demonstrated rigorously afterwards. I believe Poincaré said something to this end.
I was implying that you still need a human to make the final decision. AI can be a valuable aid in both fields. Doctors can't just let the AI do all the work, in the same way synthetic chemists can't blindly trust the AI to spit out correct and feasible results. Research time is expensive and thus the effort needs to be evaluated, and usually the intuition of said chemists trumps that of the AI.
True. But perhaps you can eliminate 9 out of 10 chemists, and replace them by an AI that generates ideas. Then use the 1 chemist to validate those ideas.
Not to generate ideas, there's always more ideas than resources in chemistry.
Mainly to do more automated routines than ever.
9 out of 10 chemists aren't that great at the bench anyway.
Everyone would probably benefit from getting them in front of a computer full-time to leverage their training in a way, and freeing up the bench space to those who can really make the most of it.
Not the focus of the article, but analytical chemists need to do a lot of the detection work themselves to be high-performing, just like radiologists do.
The brain is incredibly good at pattern matching while not necessarily being able to articulate why it came to a decision. Organic chemistry has these types of relations in spades. Take crystallization, for example. You can kinda brute force it; there are only a few dozen realistic solvents to try, but that's just single-solvent systems. Then there are binary and ternary solvent systems. Then there are heating/cooling profiles, antisolvent addition, all kinds of things. Hundreds or thousands of possible experiments.
You might just decide that a compound "needs" isopropanol/acetone, plus a bit of water, because something vaguely similar you encountered years ago crystallized well. You often start with some educated guesses and refine based on what you see.
But there's often no clear hypothesis, no single physical law the system obeys.
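To make the combinatorics concrete, here's a toy enumeration of that kind of crystallization screen. The solvent list, profiles, and antisolvent options are illustrative placeholders, not a recommended screen; even this stripped-down version already produces well over a thousand conditions.

    # Toy illustration of how a crystallization screen blows up combinatorially.
    from itertools import combinations

    solvents = ["water", "methanol", "ethanol", "isopropanol", "acetone",
                "acetonitrile", "ethyl acetate", "THF", "toluene", "heptane"]  # a few dozen in practice
    cooling_profiles = ["fast quench", "slow ramp", "stepped hold"]
    antisolvent_options = [None, "heptane", "water"]

    conditions = []
    for n in (1, 2, 3):  # single, binary, ternary solvent systems
        for solvent_mix in combinations(solvents, n):
            for profile in cooling_profiles:
                for antisolvent in antisolvent_options:
                    conditions.append((solvent_mix, profile, antisolvent))

    print(len(conditions))  # 1575 conditions from just 10 solvents and a handful of profiles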
> Something I view as a key insight: a lot of labs are doing absurdly labor intensive exploratory synthesis without clear hypotheses guiding their work.
This lets you stumble over unknown unknowns. Taylor et al discovered high-speed steel by ignoring the common wisdom and doing a huge number of trials, arriving at a material and treatment protocol that improved over the then-state-of-the-art tool steels by an order of magnitude or more. The treatment mechanism was only understood 50-60 years later.
I don't know if I have missed the big thing here (I was supposed to do exactly the same flowery thing described there, but for crystals, around 2019-2020), but the graphic with the autoencoder is roughly what people did in 2018 (Gomez-Bombarelli, https://pubs.acs.org/doi/10.1021/acscentsci.7b00572); I think the review cited reproduces it. Also notice: it has the MLP in the middle, the performance of which was/is not really helping either, especially if your model should actually produce novel stuff, i.e. extrapolate.
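For readers who haven't seen that architecture, here is a rough PyTorch sketch of the idea: a SMILES autoencoder with the property-predictor MLP sitting on the latent code. The dimensions, layer choices, and character-level encoding are illustrative placeholders, not the paper's exact setup.

    import torch
    import torch.nn as nn

    VOCAB, MAX_LEN, LATENT = 35, 120, 196  # placeholder sizes

    class MolVAE(nn.Module):
        def __init__(self):
            super().__init__()
            self.encoder = nn.GRU(VOCAB, 256, batch_first=True)
            self.to_mu = nn.Linear(256, LATENT)
            self.to_logvar = nn.Linear(256, LATENT)
            self.decoder_rnn = nn.GRU(LATENT, 256, batch_first=True)
            self.decoder_out = nn.Linear(256, VOCAB)
            # The MLP "in the middle": predicts a property from the latent code,
            # which is what lets you optimize properties in latent space.
            self.property_mlp = nn.Sequential(
                nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, 1)
            )

        def forward(self, x):  # x: (batch, MAX_LEN, VOCAB) one-hot encoded SMILES
            _, h = self.encoder(x)
            mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
            dec_in = z.unsqueeze(1).repeat(1, MAX_LEN, 1)
            recon = self.decoder_out(self.decoder_rnn(dec_in)[0])
            prop = self.property_mlp(z)
            return recon, mu, logvar, prop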
Finally: every kid can draw up novel structures. Then: how do you actually fabricate these (in the case of real novel chemistry and not some building-block stuff)? No one has a clue!
I for myself have decided that, for now (with the data at hand and non-AlphaFold budgets), the 2 key areas where you can actually help computational chemistry are:
> every kid can draw up novel structures. Then: how do you actually fabricate these (in the case of real novel chemistry and not some building-block stuff)? No one has a clue!
Yep. I worked at a biotech startup in the early/mid 2000s.
We had a 2-pronged approach to finding small molecule drugs: 1) traditional medicinal chemistry based on simple SAR (structure-activity relationships) and 2) predictive modeling (before ML was hot).
The traditional med chemists were, in my opinion, rightfully skeptical of the suggestions coming out of the predictive modeling group ("That's a great suggestion, but can you tell me how to synthesize it?").
As one of my co-workers said to me: "The predictions made by the modeling group range from pretty bad to ... completely worthless."
It's possible that things have gotten better, though, as I haven't done that type of work since about 2008.
It's gotten better. The future looks like generative or screening models that feed structures to ADMET/retrosynthesis models which close a feedback loop by penalizing/rewarding the first-pass models. Then there's machine-human pair design flows that are really promising, basically ChemDraw with the aforementioned models providing real-time feedback and suggestions. It wouldn't surprise me if in a decade you could seed generative models with natural language, receive a few structures, alter them in a GUI with feedback, and then publish a retrosynthetic proposal + ADMET estimations with the candidate structure.
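For concreteness, here's a toy sketch of that feedback loop. The generator, ADMET, and retrosynthesis model objects are hypothetical placeholders; the only point is the generate -> score -> reward structure that closes the loop.

    # Toy sketch of a generative design loop closed by ADMET/retrosynthesis feedback.
    # All model objects are hypothetical placeholders with assumed interfaces.
    def design_loop(generator, admet_model, retro_model, n_rounds=10, batch_size=256):
        for _ in range(n_rounds):
            smiles_batch = generator.sample(batch_size)                  # propose candidate structures
            admet_scores = admet_model.score(smiles_batch)               # predicted ADMET liabilities
            synth_scores = retro_model.route_feasibility(smiles_batch)   # can we plausibly make it?
            # Combine into a single reward that penalizes hard-to-make or toxic candidates.
            rewards = [0.5 * a + 0.5 * s for a, s in zip(admet_scores, synth_scores)]
            generator.update(smiles_batch, rewards)                      # e.g. a policy-gradient / fine-tune step
        return generator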
With high-throughput screening and automation, even small/medium-sized players can start building internal databanks for multi-objective models.
> The traditional med chemists were, in my opinion, rightfully skeptical of the suggestions coming out of the predictive modeling group ("That's a great suggestion, but can you tell me how to synthesize it?").
Start by plugging it into askcos.mit.edu/retro/, then do your job?
> As one of my co-workers said to me: "The predictions made by the modeling group range from pretty bad to ... completely worthless."
Workers feeling threatened by technology think the technology is bad or worthless, news at 11.
Chemical synthesis is much more than just retrosynthesis. Even if it were, you cannot just plug in a novel, previously unsynthesized molecule and expect any reasonable results. Furthermore, the tool is based on Reaxys, which in turn uses already established reaction routes and conditions from literature and patents. Good luck optimizing yields for something you don't even know how to synthesize, let alone what conditions to use.
> Chemical synthesis is much more than just retrosynthesis.
Of course, but we are talking about chemical discovery, after which you want to test whether the compound's theoretical capabilities work on cells. Yields are not yet a concern!
> expect any reasonable results
No, it won't do all the work, but it will give direction and suggestions for which pathways could be used.
> Workers feeling threatened by technology think the technology is bad or worthless, news at 11.
I appreciate the sentiment and I think it's understandable to think that. In this particular case, however, my co-worker was one of the smartest / most talented people I've worked with. I can assure you that he did not feel threatened in any way. His comment was sardonic, but not borne of insecurity.
To be fair, the members of the modeling group were also quite talented. They were largely derived from one of the more famous physical/chemical modeling groups at one of the HYPS schools. But even they acknowledged that on a good day, the best they could do was offer suggestions / ideas to the medicinal chemists.
In fact, one of the members of the modeling group said this to me once (paraphrasing): The medicinal chemists are the high-priests of drug discovery. We can help, but they run the show.
As mentioned by someone who responded to my original comment, the usefulness of ML/modeling has likely gotten much better over the past 10 - 15 years.
> Finally: every kid can draw up novel structures. Then: how do you actually fabricate these (in the case of real novel chemistry and not some building-block stuff)? No one has a clue!
I personally have a clue, and the entire field of organic chemistry has a clue: given enough time and money, most reasonable structures can be synthesized (and QED + SAScore + etc. plus a human filter is often enough to weed out the problem compounds that will be unstable or hard to make). Actually, even some of the state-of-the-art synthesis prediction models are able to predict decent routes if the compounds are relatively simple [0].
The issue is that in silico activity/property prediction is often not reliable enough for the effort of designing and executing a synthesis to be worth it, especially because, as the molecules become more dissimilar to known compounds with the given activity, the predictions typically become less reliable. In the end, what would happen is that you just spend 3 months of your master student's time on a pharmacological dead end. Conversely, some of the "novel predictions" of ML pipelines including de novo structure generation can be very close to known molecules, which makes the measured activity somewhat of a triviality. [1]
For this reason, it makes sense to spend the budget on building block-based "make on demand" structures that will have 90% fulfillment, that will take 1-2 months from placed order to compound in hand, and that will be significantly cheaper per compound, because you can iterate faster. Recent work around large-scale docking has shown that this approach seems to work decently for well-behaved systems. [2] On the other hand, some truly novel frameworks are not available via the building-block approach, which can also be important for IP.
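For concreteness, here is a minimal sketch of that kind of first-pass QED/SA filter using RDKit (the sascorer module lives in RDKit's contrib directory). The thresholds and example SMILES are illustrative, not recommended cutoffs, and a chemist still reviews whatever survives.

    import sys, os
    from rdkit import Chem
    from rdkit.Chem import QED, RDConfig

    # SA score ships as an RDKit contrib module rather than a core import.
    sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
    import sascorer

    def passes_filter(smiles, min_qed=0.5, max_sa=4.0):
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            return False
        # QED ~ drug-likeness (0-1, higher is better); SA score runs ~1 (easy) to 10 (hard).
        return QED.qed(mol) >= min_qed and sascorer.calculateScore(mol) <= max_sa

    candidates = ["CCO", "c1ccccc1C(=O)NC2CC2"]          # placeholder generated structures
    keep = [s for s in candidates if passes_filter(s)]    # then the human filter takes over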
More fundamentally, of course you are correct, and I agree with you: having a lot of structures is in itself not that useful. Getting closer to physically more meaningful and fundamental processes, and speeding them up to the extent possible, can yield far more transparent and reliable activity predictions and genuine novelty.
There's a lot that can be learned with building block-based experiments. If you do a building block-based experiment, then train a model, then predict new compounds, the models do generalize meaningfully outside the original set of building blocks into other sets of building blocks (including variations on different ways of linking the building blocks). Granted, that's not the "fully novel scaffold" test, but it suggests that there should be some positive predictive value on novel scaffolds.
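As a toy version of that kind of check: hold out entire building-block sets so the model is scored only on compounds assembled from blocks it never saw in training. The data below is random placeholder data; the grouped split is the only point being illustrated.

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit
    from sklearn.ensemble import RandomForestRegressor

    X = np.random.rand(1000, 128)                               # placeholder fingerprints for assembled compounds
    y = np.random.rand(1000)                                    # placeholder measured activities
    building_block_set = np.random.randint(0, 20, size=1000)    # which block set each compound came from

    # Hold out whole building-block sets rather than random compounds.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_idx, test_idx = next(splitter.split(X, y, groups=building_block_set))

    model = RandomForestRegressor(n_estimators=200).fit(X[train_idx], y[train_idx])
    print("held-out block-set R^2:", model.score(X[test_idx], y[test_idx]))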
We've done work in this area and will be publishing some results later in the year.
Don't mean to sound cynical here, but is AI actually gonna change chemical discovery? Or is AI gonna "change chemical discovery" in the same way that it will "make radiologists obsolete", or "revolutionize healthcare with Watson", or "put millions of truckers out of business"? There's certainly a lot of marketing, technical talent, and hope behind modern "AI", but I'm not really aware of any major part of our economy that's been THAT changed by it.
Chemist and blogger Derek Lowe has expressed his opinion [1] that, while significant and worth celebrating, AlphaFold addresses just one part of the long complex process of drug discovery. In one example [2] he cites a biotech startup that saved ~1 month in time and lab work and then discusses the many steps that remain before that drug candidate could become an actual drug.
OP describes what could potentially be AlphaFold with small molecules instead of with just proteins, and specifically calls it “chemical discovery” and not “drug discovery” more broadly. I think it’s fair to cheer for such a big advance while recognizing it wouldn’t solve everything.
It seems like AI is good at finding what's popular (video views/ad clicks), but it doesn't seem to be very good at science/engineering (drug discovery/self driving cars/radiology).
Have a look at how driving a taxi has changed. (And I include the likes of Uber here.)
The hard part for humans used to be knowing all the roads in a city and selecting the best route quickly and reliably. Almost any adult can do the actual second-to-second driving reasonably competently in almost any city on the planet.
Now, we have outsourced the 'hard' part to Google Maps. But we are still far from a machine that drives in arbitrary locations on the globe. As far as I know, Waymo has the most mature system currently in development, but requires absurd amounts of precise mapping data for any location they want to drive in.
And let's not even talk about the even more 'trivial' task of chatting to the passengers.
> [...], but it doesn't seem to be very good at science/engineering (drug discovery/self driving cars/radiology).
Technology has already automated huge chunks of science and engineering. We just don't call any of the already-solved chunks by the name of AI anymore.
Wild. I was first introduced to heuristic search in drug discovery at a biotech company in 1993.
Support vector networks were searching for good shapes by 1997, maybe before then, but the vocabulary we used to describe the search technique was different.
A huge computational hurdle was protein folding. We had brute-force searches for plausible shapes, lots of supervision by the chemists, weeks per iteration on the best workstations we could get. $250,000+ SGI Onyx, then DEC sent over an Alpha workstation.
I have an AI chemistry project that appears to work on my test data, but I had to put it on the backburner because I simply can't (couldn't) find a 3070/80 anywhere! I stopped looking 6 months ago, does anybody know a reliable place where I can snag one?
If you don't mind having a gaming PC (stupid RGB lights and fans) then Build Redux [1] is not price gouging. You can basically just think of the GPU markup as a builder fee. They are incredibly slow to ship but the lead time is something like a month which beats your 6 months. At this point, it would take bitcoin dropping below 10k for the market not to be silly.
You know what, this might actually be the guy. Good tip. It's a bummer to pay for a Windows license I don't want, but the fee for that is less than the markup I'd pay to some scalper just to get the card.
I wonder why you've not teamed up with some uni-lab, they usually have some hardware and also some free labor if you share credits. A single GPU also means, whatever your idea, it probably would have finished already on a CPU ;).
If you're willing to pay scalper prices, there's a relatively consistent availability. If you're looking for closer to MSRP, they're in short supply and your choices are either waiting lists or racing other people for online restocks. I got a 3080 a month ago only after waiting on EVGA's wait list for just over a year.
Why do you need a 3070/3080 specifically? If it's to run something like Tensorflow or CUDA code more generally, could you do it with an older card, or the more available 3060s?
This is a little disingenuous, that is like saying scarcity doesn't exist since you can buy almost anything at any price. There is always an implied "in a reasonable budget".
Yeah, I guess I could go down that route, I've used the AWS Gx instances for projects before, but this dataset would justttt fit in memory for a 3080, which really simplified the rest of the code, and the speed of iteration, and, at the end of the day, quite frankly I just want one. I'll do more weird stuff if I don't have the meter of how much I'm paying to Jeff Bezos running in my head every time I run an experiment.
The price is great, the downside is that you do not know whether the server owner reads your data. The "jobs" run in a Docker container on somebody's machine.
A similar thing is happening in inorganic chemistry. On https://materialsproject.org probably the majority of materials are 'discovered' computationally, by using DFT simulations to determine if and how stable any crystal structure would be. Using this database, AI can be used to find suitable materials for any application. [1]
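For example, here is a hedged sketch of that kind of stability screen using pymatgen's Materials Project client; the API key, chemical system, and metastability window are placeholders, and the exact client interface may differ between pymatgen versions.

    from pymatgen.ext.matproj import MPRester
    from pymatgen.analysis.phase_diagram import PhaseDiagram

    with MPRester("YOUR_API_KEY") as mpr:                        # placeholder key
        entries = mpr.get_entries_in_chemsys(["Li", "Fe", "O"])  # all computed Li-Fe-O entries

    pd = PhaseDiagram(entries)
    for entry in entries:
        e_hull = pd.get_e_above_hull(entry)        # 0 eV/atom means on the convex hull (stable)
        if e_hull < 0.025:                         # loose metastability window (illustrative)
            print(entry.composition.reduced_formula, round(e_hull, 3))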
Isn't AI chemical discovery essentially machine brute forcing solutions to see which one fits/works? I think more attention needs to be put into designing better software solutions than merely putting more computational power into problem solving.
You could say that it’s optimized brute forcing. It seems to me that machine learning applied to combinatorial search problems of this nature cut down the search space massively by recognizing patterns of combinations that have a high probability of being good, and then traversing those paths, similar to AlphaGo. This is a completely naive take, I’d like to hear other thoughts on this.
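A toy illustration of that "optimized brute forcing" idea: a cheap surrogate model ranks a huge candidate pool, only the top-scoring slice gets the expensive evaluation, and the surrogate is retrained on the new results. Everything here (features, objective, model choice) is a placeholder.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def expensive_evaluation(x):
        return -np.sum((x - 0.5) ** 2)           # stand-in for a simulation or wet-lab assay

    pool = np.random.rand(100_000, 16)            # the full candidate space, featurized
    X_seen = pool[:50]                            # a small initial random sample
    y_seen = np.array([expensive_evaluation(x) for x in X_seen])

    for _ in range(5):                            # a few rounds of model-guided selection
        surrogate = GradientBoostingRegressor().fit(X_seen, y_seen)
        scores = surrogate.predict(pool)
        top = pool[np.argsort(scores)[-20:]]      # only evaluate the most promising candidates
        X_seen = np.vstack([X_seen, top])
        y_seen = np.concatenate([y_seen, [expensive_evaluation(x) for x in top]])

    print("best found:", y_seen.max())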
Could you explain what advantages there are to having the software work better as opposed to just throwing more computational power at it?
I know this doesn't necessarily apply, but if the solution space for certain niche problems is so small that we can just drown the problem in compute, I couldn't care less that the algo was N^2 or whatever, or that the UI was less than ideal. Maybe I'm not thinking deeply enough about what you mean by "better software solutions".
If I run a batch job and it spits out 10,000 compounds that I can try to a certain effect, then it becomes a filtering problem where I can apply humans and do more traditional science, and if it's feasible to just try everything in parallel, that option is nice as well. It feels like how you got to those 10,000 compounds doesn't matter much.
Looking forward to hearing just how wrong my simplistic view is.
Ah yeah so I was guessing here that we're talking about a subset that is solvable with the current methods -- as in within reach enough to be solvable by relatively naive methods
[0] https://www.nature.com/articles/nature17439