AI is changing chemical discovery (thegradient.pub)
125 points by andreyk on Feb 14, 2022 | hide | past | favorite | 65 comments



Shameless self promotion: I wrote one of the more cited papers in the field [0], back in 2016.

A key challenge: very few labs have enough data.

Something I view as a key insight: a lot of labs are doing absurdly labor intensive exploratory synthesis without clear hypotheses guiding their work. One of our more useful tasks turned out to be interactively helping scientists refine their experiments before running them.

Another was helping scientists develop hypotheses for _why_ reactions were occurring, because they hadn't been able to build principled models of which properties predict reaction formation.

Going all the way to synthesis is nice, but there's a lot of lower hanging fruit involved in making scientists more effective.

[0] https://www.nature.com/articles/nature17439


This is true. Getting datasets with the necessary quality and scale for molecular ML is hard and uncommon. Experimental design is also a huge value add, especially given the enormous search space (estimates suggest there are more possible drug-like structures than there are stars in the universe). The challenge is figuring out how to do computational work in a tight marriage with the lab work to support and rapidly explore the hypotheses generated by the computational predictions. Getting compute and lab to mesh productively is hard. Teams and projects have to be designed to do so from the start to derive maximum benefit.

Also shameless plug: I started a company to do just that, anchored to generating custom million-to-billion point datasets and using ML to interpret and design new experiments at scale.


> A key challenge: very few labs have enough data.

It is also getting harder, not easier, to get.

I am working right now on a retro synthesis project. Our external data provider is raising prices while removing functionality, and no one bats an eye. At the same time our own data is considered a business secret and therefore impossible to share.

As someone who does NLP research where the code, data and papers are typically free, this drives me insane.


Are you using NLP to guide what molecules are probably worthwhile to try and synthesize?


A bit. But my main project was to use NLP to identify failed reactions in old lab notebooks to use as negative training data.


Question: How are labs doing the exploratory work without a clear hypothesis? Are they essentially doing some version of brute force?


Experienced chemists can look at a molecule diagram and have an intuition as to its activity and similarity to other known molecules. It's like most of science and math: most discoveries begin with intuition and are demonstrated rigorously afterwards. I believe Poincaré said something to this effect.


Ok, so these experienced chemists can be replaced by AI now?


In the same way radiologists can be replaced by AI. So, no.


Radiologists have a high responsibility of detecting the right things.

Chemists can just try out things.

I don't think you can compare the two.


I was implying that you still need a human to make the final decision. AI can be a valuable aid in both fields. Doctors can't just let the AI do all the work in the same way synthetic chemists can't blindly trust the AI to spit out correct and feasible results. Research time is expensive and thus the effort needs to be evaluated, and usually the intuition of said chemists trumps that of the AI.


True. But perhaps you can eliminate 9 out of 10 chemists, and replace them by an AI that generates ideas. Then use the 1 chemist to validate those ideas.


And that's why I want to build me a robot.

Not to generate ideas, there's always more ideas than resources in chemistry.

Mainly to do more automated routines than ever.

9 out of 10 chemists aren't that great at the bench anyway.

Everyone would probably benefit from getting them in front of a computer full-time to leverage their training in a way, and freeing up the bench space to those who can really make the most of it.


Not the focus of the article, but analytical chemists need to do a lot of proper detecting themselves to be high-performing just like the radiologists do.


The brain is incredibly good at pattern matching while not necessarily being able to articulate how it came to a decision. Organic chemistry has these types of relations in spades. Take crystallization, for example. You can kinda brute force it; there are only a few dozen realistic solvents to try, but that's a single-solvent system. Then there are binary and ternary solvent systems. Then there are heating/cooling profiles, antisolvent addition, all kinds of things. Hundreds or thousands of possible experiments.

You might just decide that a compound "needs" isopropanol/acetone, plus a bit of water, cause something vaguely similar you encountered years ago crystallized well. You often start with some educated guesses and refine based on what you see.

But there's often no clear hypothesis, no single physical law the system obeys.
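The combinatorics hinted at above can be made concrete with a quick back-of-the-envelope count (the solvent and condition counts are illustrative, not a real screening protocol):

```python
from math import comb

# Illustrative numbers only: ~two dozen realistic solvents, plus a few
# process variables per crystallization attempt.
N_SOLVENTS = 24
COOLING_PROFILES = 4     # e.g. fast, slow, stepped, crash-cool
ANTISOLVENT_OPTIONS = 5  # none + four antisolvents

def screen_size(max_solvents_per_system: int) -> int:
    """Count experiments when crossing 1..k-solvent systems with conditions."""
    systems = sum(comb(N_SOLVENTS, k)
                  for k in range(1, max_solvents_per_system + 1))
    return systems * COOLING_PROFILES * ANTISOLVENT_OPTIONS

print(screen_size(1))  # 480 single-solvent experiments
print(screen_size(3))  # 46,480 once binary and ternary systems are allowed
```

Even this crude count jumps by two orders of magnitude the moment multi-solvent systems are allowed, which is why intuition-driven pruning matters.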


I'm trying to get a startup off the ground that tackles this.

Would love to chat more with you about this.


Me too, also tech nomad. I'll email you


> Something I view as a key insight: a lot of labs are doing absurdly labor intensive exploratory synthesis without clear hypotheses guiding their work.

This lets you stumble over unknown unknowns. Taylor et al. discovered high-speed steel by ignoring the common wisdom and doing a huge number of trials, arriving at a material and treatment protocol that improved over the then-state-of-the-art tool steels by an order of magnitude or more. The treatment mechanism was only understood 50-60 years later.


I don't know if I have missed the big thing here (I was supposed to do exactly the same flowery thing described there for crystals around 2019-2020), but the graphic with the autoencoder is roughly what people did in 2018 (Gomez-Bombarelli, https://pubs.acs.org/doi/10.1021/acscentsci.7b00572); I think the review cited reproduces it. Also notice: it has the MLP in the middle, the performance of which was/is not really helping either - especially if your model should actually produce novel stuff, i.e. extrapolate.

Finally: every kid can draw up novel structures. Then: how do you actually fabricate these (in the case of real novel chemistry and not some building-block stuff). No one has a clue!

I for myself have decided that for now (with the data at hand and non-AlphaFold budgets) the two key areas where you can actually help computational chemistry are:

- creating really robust and generally applicable ML-MD-potentials, potentially using graphs https://arxiv.org/abs/2106.08903 (or a traditional approach: https://www.nature.com/articles/s41467-020-20427-2). Facebook is also working in this area: https://pubs.acs.org/doi/10.1021/acscatal.0c04525

- and approximating exchange-correlation functionals (Google, and some folks at Oxford who got steamrolled by the DeepMind PR machine: https://arxiv.org/pdf/2102.04229.pdf): https://www.science.org/doi/10.1126/science.abj6511

If anyone can tell me how those generative models spit out graphs which look like reality (actually this is imho part of AlphaFold), wake me up.


> every kid can draw up novel structures. Then: how do you actually fabricate these (in the case of real novel chemistry and not some building-block stuff). No one has a clue!

Yep. I worked at a biotech startup in the early/mid 2000s.

We had a 2-pronged approach to finding small molecule drugs: 1) traditional medicinal chemistry based on simple SAR (structure-activity relationships) and 2) predictive modeling (before ML was hot).

The traditional med chemists were, in my opinion, rightfully skeptical of the suggestions coming out of the predictive modeling group ("That's a great suggestion, but can you tell me how to synthesize it?").

As one of my co-workers said to me: "The predictions made by the modeling group range from pretty bad to ... completely worthless."

It's possible that things have gotten better, though, as I haven't done that type of work since about 2008.


It's gotten better. The future looks like generative or screening models that feed structures to ADMET/retrosynthesis models which close a feedback loop by penalizing/rewarding the first-pass models. Then there's machine-human pair design flows that are really promising, basically ChemDraw with the aforementioned models providing real-time feedback and suggestions. It wouldn't surprise me if in a decade you could seed generative models with natural language, receive a few structures, alter them in a GUI with feedback, and then publish a retrosynthetic proposal + ADMET estimations with the candidate structure.

With high-throughput screening and automation, even small/medium-sized players can start building internal databanks for multi-objective models.
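A minimal sketch of that feedback loop, with toy stand-ins throughout: `generate`, `score_admet`, and `score_synthesizability` are hypothetical placeholders, not real models:

```python
import random

def generate(rng, n=8):
    """Hypothetical generative model: proposes candidate structure IDs."""
    return [f"mol_{rng.randrange(10_000)}" for _ in range(n)]

def score_admet(mol):
    """Toy deterministic stand-in for an ADMET model (0..1)."""
    return (sum(map(ord, mol)) % 100) / 100

def score_synthesizability(mol):
    """Toy deterministic stand-in for a retrosynthesis-feasibility model."""
    return (sum(i * ord(c) for i, c in enumerate(mol)) % 100) / 100

def design_loop(rounds=5, seed=0):
    """Generate, score on both objectives, keep the best candidate.
    A real loop would feed the scores back to update the generator."""
    rng = random.Random(seed)
    best, best_score = None, -1.0
    for _ in range(rounds):
        for mol in generate(rng):
            # Multi-objective reward: both properties must look acceptable.
            s = min(score_admet(mol), score_synthesizability(mol))
            if s > best_score:
                best, best_score = mol, s
    return best, best_score

print(design_loop())
```

The `min` of the two scores is one crude way to express "penalize a candidate that fails either model"; a real pipeline would use a learned or weighted aggregate.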


> The traditional med chemists were, in my opinion, rightfully skeptical of the suggestions coming out of the predictive modeling group ("That's a great suggestion, but can you tell me how to synthesize it?").

Start by plugging it into askcos.mit.edu/retro/, then do your job?

> As one of my co-workers said to me: "The predictions made by the modeling group range from pretty bad to ... completely worthless."

Workers feeling threatened by technology think the technology is bad or worthless, news at 11.


Chemical synthesis is much more than just retrosynthesis. Even if it weren't, you cannot just plug in a novel, previously unsynthesized molecule and expect any reasonable results. Furthermore, the tool is based on Reaxys, which in turn uses already established reaction routes and conditions from literature and patents. Good luck optimizing yields for something you don't even know how to synthesize, let alone what conditions to use.


> Chemical synthesis is much more than just retrosynthesis.

Of course, but we are talking about chemical discovery - after which you want to test whether the compound's theoretical capabilities hold up on cells. Yields are not yet a concern!

> expect any reasonable results

No, it won't do all the work, but it will give direction and suggestions for which pathways could be used.


> Workers feeling threatened by technology think the technology is bad or worthless, news at 11.

I appreciate the sentiment and I think it's understandable to think that. In this particular case, however, my co-worker was one of the smartest / most talented people I've worked with. I can assure you that he did not feel threatened in any way. His comment was sardonic, but not borne of insecurity.

To be fair, the members of the modeling group were also quite talented. They largely came out of one of the more famous physical/chemical modeling groups at one of the HYPS schools. But even they acknowledged that on a good day, the best they could do was offer suggestions / ideas to the medicinal chemists.

In fact, one of the members of the modeling group said this to me once (paraphrasing): The medicinal chemists are the high-priests of drug discovery. We can help, but they run the show.

As mentioned by someone who responded to my original comment, the usefulness of ML/modeling has likely gotten much better over the past 10 - 15 years.


> Finally: every kid can draw up novel structures. Then: how do you actually fabricate these (in the case of real novel chemistry and not some building-block stuff). No one has a clue!

I personally have a clue, and the entire field of organic chemistry has a clue: given enough time and money, most reasonable structures can be synthesized (and QED + SAScore + etc. followed by a human filter is often enough to weed out the problem compounds that will be unstable or hard to make). Even some of the state-of-the-art synthesis prediction models are able to predict decent routes if the compounds are relatively simple. [0]

The issue is that in silico activity/property prediction is often not reliable enough for the effort of designing and executing a synthesis to be worth it, especially because, as the molecules get more dissimilar to known compounds with the given activity, the predictions typically become less reliable. In the end, what would happen is that you just spend three months of your master's student's time on a pharmacological dead end. Conversely, some of the "novel predictions" of ML pipelines including de novo structure generation can be very close to known molecules, which makes the measured activity somewhat of a triviality. [1]

For this reason, it makes sense to spend the budget on building-block-based "make on demand" structures that will have 90% fulfillment, take 1-2 months from placed order to compound in hand, and be significantly cheaper per compound, because you can iterate faster. Recent work on large-scale docking has shown that this approach works decently for well-behaved systems. [2] On the other hand, some truly novel frameworks are not available via the building-block approach, which can also matter for IP.

More fundamentally, of course you are correct, and I agree with you: having a lot of structures is in itself not that useful. Getting closer to physically more meaningful and fundamental processes and speeding them up to the extent possible can generate way more transparent reliable activity and novelty.

[0] https://www.sciencedirect.com/science/article/pii/S245192941...

[1] http://www.drugdiscovery.net/2019/09/03/so-did-ai-just-disco...

[2] https://www.nature.com/articles/s41586-021-04175-x.pdf
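The building-block approach at its simplest can be sketched like this; the fragment strings and the naive string substitution are purely illustrative (real enumeration would apply reaction SMARTS via a cheminformatics toolkit such as RDKit):

```python
from itertools import product

# Hypothetical fragments: two cores with an attachment point and three
# substituents (illustrative strings, not validated chemistry).
cores = ["c1ccc([*:1])cc1", "c1ccnc([*:1])c1"]
substituents = ["N(C)C", "N1CCOCC1", "NC(C)C"]

def enumerate_library(cores, substituents):
    """Cross every core with every substituent, as a make-on-demand
    catalogue does; feasibility comes from the known coupling chemistry."""
    return [core.replace("[*:1]", sub)
            for core, sub in product(cores, substituents)]

library = enumerate_library(cores, substituents)
print(len(library))  # 2 cores x 3 substituents = 6 virtual products
```

With realistic catalogues (thousands of cores, tens of thousands of substituents) the same cross product yields billions of make-on-demand candidates, which is exactly what makes the 90%-fulfillment model possible.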


There's a lot that can be learned from building-block-based experiments. If you run a building-block-based experiment, train a model on it, and then predict new compounds, the models do generalize meaningfully outside the original set of building blocks into other sets (including variations on different ways of linking the building blocks). Granted, that's not the "fully novel scaffold" test; however, it suggests there should be some positive predictive value on novel scaffolds.

We've done work in this area and will be publishing some results later in the year.


Don't mean to sound cynical here, but is AI actually gonna change chemical discovery? Or is AI gonna "change chemical discovery" in the same way that it will "make radiologists obsolete", or "revolutionize healthcare with Watson", or "put millions of truckers out of business"? There's certainly a lot of marketing, technical talent, and hope behind modern "AI", but I'm not really aware of any major part of our economy that's been THAT changed by it.


Chemist and blogger Derek Lowe has expressed his opinion [1] that, while significant and worth celebrating, AlphaFold addresses just one part of the long complex process of drug discovery. In one example [2] he cites a biotech startup that saved ~1 month in time and lab work and then discusses the many steps that remain before that drug candidate could become an actual drug.

OP describes what could potentially be AlphaFold with small molecules instead of with just proteins, and specifically calls it “chemical discovery” and not “drug discovery” more broadly. I think it’s fair to cheer for such a big advance while recognizing it wouldn’t solve everything.

[1]: https://www.science.org/content/blog-post/more-protein-foldi... "More Protein Folding Progress - What's It Mean?"

[2]: https://www.science.org/content/blog-post/alphafold-exciteme... "AlphaFold Excitement"


Advertising and media? Robots bid on the ads that billions see every day, and decide what videos billions of people watch.


It seems like AI is good at finding what's popular (video views/ad clicks), but it doesn't seem to be very good at science/engineering (drug discovery/self driving cars/radiology).


This reminds me of Moravec's paradox (https://en.wikipedia.org/wiki/Moravec%27s_paradox). Basically, tasks that seem hard for humans are easy for computers and vice versa.

Have a look at how driving a taxi has changed. (And I include the likes of Uber here.)

The hard part for humans used to be knowing all the roads in a city and selecting the best route quickly and reliably. Almost any adult can do the actual second-to-second driving reasonably competently in almost any city on the planet.

Now, we have outsourced the 'hard' part to Google Maps. But we are still far from a machine that drives in arbitrary locations on the globe. As far as I know, Waymo has the most mature system currently in development, but requires absurd amounts of precise mapping data for any location they want to drive in.

And let's not even talk about the even more 'trivial' task of chatting to the passengers.

> [...], but it doesn't seem to be very good at science/engineering (drug discovery/self driving cars/radiology).

Technology has already automated huge chunks of science and engineering. We just don't call any of the already-solved chunks "AI" anymore.
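The route-selection step that used to be the taxi driver's hard-earned skill is, for a computer, a textbook shortest-path problem. A minimal Dijkstra sketch over a made-up road graph (node names and travel times are invented):

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm: the 'hard' part of taxi driving for humans,
    routine for a computer once the road graph exists."""
    dist = {start: 0}
    prev = {}
    pq = [(0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    # Walk back from the goal to reconstruct the route.
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    return [start] + path[::-1], dist[goal]

# Toy city: edge weights are travel minutes (made-up numbers).
roads = {
    "airport": [("downtown", 25), ("ring_road", 10)],
    "ring_road": [("suburb", 8), ("downtown", 12)],
    "downtown": [("hotel", 5)],
    "suburb": [("hotel", 20)],
}
print(shortest_path(roads, "airport", "hotel"))
```

The second-to-second driving, by contrast, still resists this kind of clean formalization, which is Moravec's paradox in miniature.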


Wild. I was first introduced to heuristic search in drug discovery at a biotech company in 1993.

Support vector networks were searching for good shapes by 1997, maybe before then, but the vocabulary we used to describe the search technique was different.

A huge computational hurdle was protein folding. We had brute-force searches for plausible shapes, lots of supervision by the chemists, weeks per iteration on the best workstations we could get. $250,000+ SGI Onyx, then DEC sent over an Alpha workstation.

It's come a long way.


I have an AI chemistry project that appears to work on my test data, but I had to put it on the backburner because I simply can't (couldn't) find a 3070/80 anywhere! I stopped looking 6 months ago, does anybody know a reliable place where I can snag one?


If you don't mind having a gaming PC (stupid RGB lights and fans) then Build Redux [1] is not price gouging. You can basically just think of the GPU markup as a builder fee. They are incredibly slow to ship but the lead time is something like a month which beats your 6 months. At this point, it would take bitcoin dropping below 10k for the market not to be silly.

[1] - https://buildredux.com/


Isn't it Ethereum driving the GPU shortage? I thought Bitcoin was not really profitable without an ASIC rig.


That's correct. Saying BTC was lazy of me but I meant that basically the price of all things crypto need to be quartered due to the correlations.


You know what, this might actually be the guy. Good tip. It's a bummer to pay for a Windows license I don't want, but the fee for that is less than the markup I'd pay to some scalper just to get the card.


If you don't want Windows, you can remove it!


It is probably about the principle.


Given that the price goes down $109 with Windows removed, I'd suspect you're not buying a license in this case.


Have you considered renting cloud GPUs by the minute? You can rent some pretty powerful GPUs from cloud providers (e.g. https://cloud.google.com/compute/docs/gpus)


If you are really on a tight budget, but have time to fiddle, you can try to make something work on top of Google's colab, too.


I wonder why you've not teamed up with some uni lab; they usually have some hardware and also some free labor if you share credit. A single GPU also means that, whatever your idea is, it probably would have finished on a CPU already ;).


Good idea with the uni lab!

About CPU vs GPU: things seldom work out on first try. A GPU gives you much lower latency for trying out a series of ideas.


If you're willing to pay scalper prices, there's a relatively consistent availability. If you're looking for closer to MSRP, they're in short supply and your choices are either waiting lists or racing other people for online restocks. I got a 3080 a month ago only after waiting on EVGA's wait list for just over a year.

Why do you need a 3070/3080 specifically? If it's to run something like Tensorflow or CUDA code more generally, could you do it with an older card, or the more available 3060s?


I don't know if it has enough RAM for your purposes, but I also gave up and just bought an entire computer to get a 3080, and got it in 3 weeks before Christmas: https://skytechgaming.com/product/shiva-amd-ryzen-5-5600x-nv...


> because I simply can't (couldn't) find a 3070/80 anywhere

This is false.

You can easily find thousands of brand new Nvidia 3070/3080 GPUs online.

The problem is, you wish to pay MSRP. Supply and demand doesn’t work in your favor here.


This is a little disingenuous; that is like saying scarcity doesn't exist because you can buy almost anything at some price. There is always an implied "within a reasonable budget".


In this situation, yes.

In general: price competition might be effectively outlawed in some circumstances.

Eg you can't really buy a replacement kidney for any amount of money (outside of Iran).


What about moving to the cloud? Something like CoLab?


Yeah, I guess I could go down that route, I've used the AWS Gx instances for projects before, but this dataset would justttt fit in memory for a 3080, which really simplified the rest of the code, and the speed of iteration, and, at the end of the day, quite frankly I just want one. I'll do more weird stuff if I don't have the meter of how much I'm paying to Jeff Bezos running in my head every time I run an experiment.


Colab GPUs have more VRAM than a 3080, let alone a 3070.


just bought one with a PC from falcon northwest. Not sure about buying it standalone but they sell them with their PCs


vast.ai, nothing else is remotely as cheap


The price is great, the downside is that you do not know whether the server owner reads your data. The "jobs" run in a Docker container on somebody's machine.


Yeah. Use fresh ssh keys, and don't train with sensitive data. If that's an option for you, vast is great.


A similar thing is happening in inorganic chemistry. On https://materialsproject.org probably the majority of materials are 'discovered' computationally, by using DFT simulations to determine if and how stable any crystal structure would be. Using this database, AI can be used to find suitable materials for any application. [1]

[1] https://scholar.google.com/scholar?q=materials%20project%20h...


Isn't AI chemical discovery essentially a machine brute-forcing solutions to see which one fits/works? I think more attention needs to be put into designing better software solutions than merely putting more computational power into problem solving.


You could say that it's optimized brute forcing. It seems to me that machine learning applied to combinatorial search problems of this nature cuts down the search space massively by recognizing patterns of combinations that have a high probability of being good, and then traversing those paths, similar to AlphaGo. This is a completely naive take; I'd like to hear other thoughts on this.
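In miniature, "optimized brute force" looks something like the following sketch: a cheap surrogate model ranks the whole candidate pool, and only the top few get the expensive evaluation (in reality a lab assay or a long simulation). All numbers and scoring functions here are toy values:

```python
# Cheap, slightly-wrong surrogate: thinks the optimum is at 3.1.
def surrogate(x):
    return -(x - 3.1) ** 2

# Ground truth we can rarely afford: true optimum is at 3.0.
def expensive_eval(x):
    return -(x - 3.0) ** 2

candidates = [i / 10 for i in range(100)]  # 100 candidate "molecules"
ranked = sorted(candidates, key=surrogate, reverse=True)
shortlist = ranked[:5]                     # expensively test only the top 5%
best = max(shortlist, key=expensive_eval)
print(best)
```

The surrogate only needs to be good enough that its shortlist contains the true winners; it does not need to be exactly right, which is why imperfect ML models can still slash the number of experiments.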


Could you explain what advantages there are to having the software work better as opposed to just throwing more computational power at it?

I know this doesn't necessarily apply, but if the solution space for certain niche problems is so small that we can just drown the problem in compute, I couldn't care less that the algo was N^2 or whatever, or that the UI was less than ideal. Maybe I'm not thinking deeply enough about what you mean by "better software solutions".

If I run a batch job and it spits out 10,000 compounds that I can try for a certain effect, then it becomes a filtering problem where I can apply humans and do more traditional science; and if it's feasible to just try everything in parallel, that option is nice as well. It feels like how you got to those 10,000 compounds doesn't matter much.

Looking forward to hearing just how wrong my simplistic view is.


Computer power grows ~N but molecule complexity grows something like ~N!
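The linear-versus-factorial mismatch in the comment above is easy to make concrete (the linear-budget model is a deliberate oversimplification):

```python
from math import factorial

# If the compute budget grows roughly linearly in n while the space of
# combinations grows like n!, the reachable fraction collapses immediately.
for n in (5, 10, 15, 20):
    coverage = n / factorial(n)
    print(f"n={n:2d}  space={factorial(n):>20,}  coverage={coverage:.1e}")
```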


Ah yeah, so I was guessing here that we're talking about a subset that is solvable with current methods - as in, within reach enough to be solved by relatively naive approaches.


To give the reader some perspective, it has been estimated that the pharmacologically active chemical space (i.e. the number of molecules) is 1060!

Not sure if they mean 1060 factorial or 10 to the 60.


> Chemistry is what we turn to when we are looking for a new superconductor, a vaccine

I am not sure that is very true at all.



