- They created an entirely new dataset of ~5000 molecules, all hand-labeled by perfume experts.
- They held a competition (presumably Kaggle or a similar platform) to classify this dataset, and used the results as a strong baseline.
- Their GNNs get results comparable to (slightly better than, but not statistically significantly so) the winning random forest model of the competition.
The embeddings show promise, but I'm curious why they omitted a simple "fully connected layer" on the Morgan bit descriptors as a baseline classifier. Seems like that would outperform the random forest.
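For anyone wondering what that baseline would look like, here is a minimal sketch. It is not the paper's method. It assumes the Morgan bit vectors have already been computed with a cheminformatics toolkit, and it uses made-up 8-bit vectors and a single made-up odor label purely for illustration; the function names are hypothetical. A real version would need one output unit per odor label.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_dense_layer(X, y, epochs=500, lr=0.5, seed=0):
    """Fit one dense layer with a sigmoid output using per-sample gradient descent."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.01, 0.01) for _ in range(len(X[0]))]
    b = 0.0
    for _ in range(epochs):
        for bits, label in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, bits)) + b)
            err = p - label  # gradient of the log loss w.r.t. the pre-activation
            w = [wi - lr * err * xi for wi, xi in zip(w, bits)]
            b -= lr * err
    return w, b


def predict(w, b, bits):
    """Return the predicted probability that the label applies."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, bits)) + b)


# Toy, made-up "fingerprints" (assumption: real ones are far longer bit vectors).
X = [
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 1],
]
y = [1, 1, 0, 0]  # 1 = has the (hypothetical) odor label, 0 = does not

w, b = train_dense_layer(X, y)
print([round(predict(w, b, bits), 2) for bits in X])
```

A single dense layer with a sigmoid is just logistic regression, so whether it would actually beat a random forest on the real dataset is an open question.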
Graph convolutional networks are really cool and widely applicable. The fastest intro to the field for me was actually not a paper or blog post, but the docs of PyTorch Geometric. The definition of their message passing framework [0] gets you into the right frame of mind, after which there are well-documented and cited implementations of various papers which you can reuse [1].
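To give a feel for the message passing idea without any library, here is a rough pure-Python sketch of one round. It does not use the PyTorch Geometric API; the function name, the sum aggregation, and the "add to own features" update are simplifying assumptions, and a real layer would apply learned weights at both steps. The molecule is the heavy-atom skeleton of ethanol (C-C-O) with one-hot atom features.

```python
def message_passing_step(features, edges):
    """One round: each node sums its neighbours' feature vectors and adds them to its own."""
    dim = len(next(iter(features.values())))
    aggregated = {node: [0.0] * dim for node in features}
    for src, dst in edges:
        # The "message" from src to dst is simply src's current feature vector.
        for i, value in enumerate(features[src]):
            aggregated[dst][i] += value
    return {
        node: [own + agg for own, agg in zip(features[node], aggregated[node])]
        for node in features
    }


# Node features: [is_carbon, is_oxygen] for the atoms C(0)-C(1)-O(2).
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
bonds = [(0, 1), (1, 2)]
# Bonds are undirected, so messages flow in both directions.
edges = bonds + [(b, a) for a, b in bonds]

updated = message_passing_step(features, edges)
print(updated)  # {0: [2.0, 0.0], 1: [2.0, 1.0], 2: [1.0, 1.0]}

# A simple readout: sum the node vectors into one molecule-level vector.
molecule_vector = [sum(col) for col in zip(*updated.values())]
print(molecule_vector)  # [5.0, 2.0]
```

After one round each atom "knows" about its direct neighbours; stacking rounds widens that neighbourhood, and the readout vector is what a classifier would then consume.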
Neat. Can you point to any particularly compelling applications? I'm looking into a graph representation for something myself and this looks incredibly helpful.
The applications that speak to me most are those involving predicting properties of molecules, and also properties of biochemical networks, though I appreciate that’s not what many others would find compelling! Sorry not to be of more help.
>>> it should be possible to directly predict the end sensory result of an input molecule, even without knowing the intricate details of all the systems involved
Maybe we're missing the most interesting aspect. Olfactory Receptor Genes in humans comprise ~1% of the total genome. The benefit here is in understanding how environmental changes trigger beneficial mutations and enhance sensory features.
I doubt the olfactory complex evolves via simple mutations directly on the receptors, but rather via other DNA constructs, like retrotransposons, that can quickly (and badly) replicate genes.
If anybody could answer questions like that definitively, it would be a great advance. There is fairly strong evidence that olfaction evolution occurs by gene duplication followed by selection.
This is quite similar to a project that we did in college as part of an introduction to data science course. The professor built a sensor using an array of different smoke detectors, then pumped air over different liquids (coffee, Coke, OJ, etc.) and through the sensor array, capturing the signal strength in a text document. We used different classification techniques to determine the composition of unknown liquids.
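A project like that can be approximated with something as simple as a nearest-centroid classifier. The sketch below is only an illustration: the readings are made up, the three-sensor layout and function names are assumptions, and it is not necessarily one of the techniques used in that course.

```python
import math


def centroid(rows):
    """Average each sensor channel across a list of readings."""
    return [sum(col) / len(col) for col in zip(*rows)]


def fit(training):
    """Compute one mean reading per labelled liquid."""
    return {label: centroid(rows) for label, rows in training.items()}


def classify(centroids, reading):
    """Return the label whose centroid is closest (Euclidean distance) to the reading."""
    return min(centroids, key=lambda label: math.dist(centroids[label], reading))


# Made-up signal strengths from a hypothetical three-sensor array.
training = {
    "coffee": [[0.9, 0.2, 0.4], [0.8, 0.3, 0.5]],
    "cola": [[0.2, 0.7, 0.1], [0.3, 0.8, 0.2]],
    "orange juice": [[0.1, 0.2, 0.9], [0.2, 0.1, 0.8]],
}

centroids = fit(training)
print(classify(centroids, [0.85, 0.25, 0.4]))  # coffee
```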
I know I may regret saying this* but have they considered vaporising/heating/burning the air sample to produce a spectrograph and then running image recognition to compare it to known smells? A bit like how you can produce a spectrograph of an mp3 and get an almost instantaneous hit if the track is previously known. Subtract out the known smells and the remainder is an unknown. Find similarly shaped molecules and return their likely traits (musky, floral, almonds etc.) with a visual keyword cloud.
It is two different things – you either recognize a scent OR you identify a scent. Why not speed things up by running it as a two step process? Recognize = Rapid results, Identify = Best Guess at what it might smell like. I’m not sure how a human determines that there is a smell of gasoline, freshly cut grass and a hint of something else/unknown in the air rather than thinking hmmm, I do not recognize a scent that has aspects of cut grass AND gasoline AND an unknown therefore the whole scent is ‘unknown’.
*non-experts that go ‘gosh, why didn’t those idjuts with multiple PhDs think of that’
What you're describing is GC-MS, which indeed returns a spectrographic signature that is used to identify a molecule. As noted by @Terr_, that tells you what a molecule is made of but not what shape it is. That being said, the smaller the molecule the easier it is to accurately predict what shape it will be. When computational methods become questionable, you move on to NMR (and a whole host of other techniques).
The real issue here is that olfaction involves a small molecule interacting with multiple different proteins for your body to "read" it. We're not so great at the whole predicting drug-protein interaction thing just yet - some closely related fields include virtual screening, molecular dynamics, protein crystallography, X-ray diffraction, and cryoEM (among others).
I think a spectrograph tells you more about the constituent elements in the sample, instead of the various shapes of various molecules that existed before you heated it.
I won't claim it's impossible, but... olfaction boils down to a small molecule protein interaction, similar to drug discovery. The scales in question are far too small for typical imaging devices; these objects are far smaller than the wavelength of visible light. CryoEM and X-ray diffraction are used in these domains, but don't apply in the manner you appear to have in mind. I suppose CryoEM technically counts as "clever use of a camera" though.
Your comment led me down an interesting rabbit hole. Stumbled across this which might count - "Photoacoustic spectroscopy has become a powerful technique to study concentrations of gases at the part per billion or even part per trillion levels."
https://chem.libretexts.org/Bookshelves/Physical_and_Theoret...
When I was a teenager interested in chemistry 30 years ago, the question that bothered me all the time was whether it is possible to predict physical characteristics such as color, phase diagrams, and - yes - smell from chemical composition and structure. I tried to do that with pen and paper, but did not get much farther than "acids smell acidic, alkalis smell alkaline, and salts largely do not smell unless they easily dissociate". I quickly realized that the most interesting part of this problem is in organic compounds, but that was well beyond my reach. Thinking about it now, I am wondering about the choice of “variables”. To me it looks like we are trying to describe complex smells as combinations of other, probably also complex, smells. Is that the right basis? If I were researching this I would try to find the basis elements: either chemical compounds with the simplest structure, like benzol, or compounds that each trigger a minimal number of receptors, with different and disjoint sets of receptors at that. Is there any research in that area?
It's a little surprising to me (not necessarily bad, just unexpected) that Google is researching this.
Yes, it's good for companies to do some R&D, and sometimes the R part of the R&D gets pretty theoretical, which is usually a good sign that a company is trying to really innovate.
But usually also there's some indirect way to tie it back to some kind of application that could possibly somehow make the company money in the long term. Otherwise, you're just a for-profit company spending money on something because it's interesting.
So what's the application here? Are there novel ideas or techniques here that can be applied to other AI problems? Is there some kind of application for smell in a Google product?
Google invests fairly heavily into research and there is a fair amount of freedom among engineers and researchers to do projects with Google resources, even when there is no direct product application.
For example, I ran an idle-cycle-harvesting service at Google called Exacycle that ran problems like protein folding, protein design, drug discovery, telescope discovery, and more. The only pushback I got was to run problems where the results could reasonably be considered "useful" (scientifically).
One way to think about it is that many of the people with power at Google really like science and have the resources to support it. Once you've built things like TPUs, it would be a waste not to dedicate some amount of their resources to problems that people wouldn't be able to address.
Another way to think about it is that these things have indirect effects: even if Google didn't want to make some sort of product with smell (like a phone with a built-in GC-MS?), publishing this work gets the attention of scientists, who will then read the paper and consider Google Cloud as a place they'd like to do their work.
The backbone of this work is graph convolutional neural networks. These have widespread applicability in predicting molecular properties (useful to Google Health), building recommendation systems (useful in many obvious ways to Google), and modelling phenomena in social networks, to name a few off the top of my head.
So while the scent identifying doesn’t seem directly relevant, nailing the underlying techniques is valuable.
Google Cloud wants to be known as the best place for all kinds of machine learning. Google may not want to get into perfume manufacturing, but it wants Unilever (?) to run all their ML research on their cloud. This kind of research strengthens that part of their brand.
If we're feeling positive about the world, since it's Friday:
Google wants to help the world by reducing our reliance on farming animals for meat, and thought that research like this would help humanity's nascent artificial meat endeavours.
From the second paragraph: "Solving the odor prediction problem would aid in discovering new synthetic odorants,". Reading between the lines: a cheaper way to produce more realistic industrial aromatics.
My chemical intuition says that yes, it could be done. However, I imagine psychedelic states are much more neurologically complex, and the molecules more chemically complex, than the analogous experiences vis-à-vis scent (although we certainly do tie smells to memories and other secondary and tertiary experiences).
Great question, one could use existing known chemicals as a starting point. There could be a potential to use fMRI readings on a model organism in realtime to generate data.
Compelling. I wonder what else this could be applied to in addition to psychedelics? Anti-anxiety and other sensory affecting drugs?
If you wanna get Black Mirror-esque, perhaps a Soma-like medication from Brave New World (essentially pacifies/zombifies you by creating endless bliss) could be made. Or the "bliss" drug episode of Doctor Who.