fMRI-to-image with contrastive learning and diffusion priors (stability.ai)
146 points by tmabraham on July 18, 2023 | 64 comments



Wasn't there something similar on HN a few months ago, where the top comment talked about how it's not as impressive as it sounds [0]? The main issue is that this type of methodology is pulling from a pool of images, not literally reconstructing the seen image directly from the brain.

> I immediately found the results suspect, and think I have found what is actually going on. The dataset it was trained on was 2770 images, minus 982 of those used for validation. I posit that the system did not actually read any pictures from the brains, but simply overfitted all the training images into the network itself. For example, if one looks at a picture of a teddy bear, you'd get an overfitted picture of another teddy bear from the training dataset instead.

> The best evidence for this is a picture(1) from page 6 of the paper. Look at the second row. The buildings generated by 'mind reading' subjects 2 and 4 look strikingly similar, but not very similar to the ground truth! From manually combing through the training dataset, I found a picture of a building that does look like that, and by scaling it down and cropping it exactly in the middle, it overlays rather closely(2) on the output that was ostensibly generated for an unrelated image.

> If so, at most they found that looking at similar subjects lights up similar regions of the brain; putting Stable Diffusion on top of it serves no purpose. At worst it's entirely cherry-picked coincidences.

> 1. https://i.imgur.com/ILCD2Mu.png

> 2. https://i.imgur.com/ftMlGq8.png

[0] https://news.ycombinator.com/item?id=35012981


Our model generates CLIP image embeddings from fMRI signals and those image embeddings can be used for retrieval (using cosine similarity for example) or passed into a pretrained diffusion model that takes in CLIP image embeddings and generates an image (it's a bit more complicated than that but that's the gist, read the blog post for more info).

So we are doing both reconstruction and retrieval.

The reconstruction achieves SOTA results. The retrieval demonstrates that the image embeddings contain fine-grained information: it's not just identifying that the picture contains a teddy bear and having the diffusion model generate some random teddy bear picture.

I think the zebra example really highlights that. The image embedding generated matches the exact zebra image that was seen by the person. If the model could only say "it's a zebra picture", it wouldn't be able to do that. So the model must be picking up on fine-grained info present in the fMRI signal.
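To make the retrieval side concrete, here's a rough sketch of cosine-similarity retrieval against a gallery of CLIP image embeddings (the names, shapes, and random tensors are purely illustrative, not our actual code):

    import torch
    import torch.nn.functional as F

    def retrieve(brain_embedding, gallery_embeddings, top_k=5):
        # brain_embedding:    (d,)   embedding predicted from fMRI in CLIP image space
        # gallery_embeddings: (n, d) precomputed CLIP image embeddings
        query = F.normalize(brain_embedding.unsqueeze(0), dim=-1)  # (1, d)
        gallery = F.normalize(gallery_embeddings, dim=-1)          # (n, d)
        sims = query @ gallery.T                                   # (1, n) cosine similarities
        return sims.topk(top_k, dim=-1).indices.squeeze(0)         # indices of the best matches

    # Toy usage: random tensors standing in for real embeddings
    best_matches = retrieve(torch.randn(768), torch.randn(10_000, 768))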

The blog post has more information and the paper itself has even more information so please check it out! :)


So what's the output if I show a completely novel image to the subject? E.g. a picture of my armpit covered in blue paint?


Why are you building this, and what kind of ethical considerations have you taken, if any?


I'm curious what answers you would find acceptable? I'm not being snarky - I genuinely struggle with this line of thinking. People seem to find "if I don't then someone else will" to be an unacceptable answer but it seems to me to be fairly central.

There's an inevitability about most scientific discoveries (there are notable exceptions, but they are few), and unless we're talking about something with a capital outlay in the trillions of dollars, it's going to happen whether we like it or not - short of a global totalitarian state capable of deep scrutiny of all research.


>People seem to find "if I don't then someone else will" to be an unacceptable answer but it seems to me to be fairly central.

Because you can use this as a cop-out for truly heinous work, e.g. gain-of-function research, autonomous weapons, chemical weapons, etc. It's not a coherent worldview for someone who actually cares about doing good.


I think you've hit upon some interesting examples. Maybe the way to look at this is cost vs "benefit" (in the broadest sense of the word).

When research has an obvious and immediate negative outcome that's a cost. The difficulty/expense of the research is also a cost.

The "benefit" would be the incentive to know the outcome. This may be profit, military advantage, academic kudos etc.

Maybe the problem with the type of research being discussed here is that there isn't necessarily any agreement that the outcome is negative. For many people, I suspect this will remove a lot of the weight on the "cost" side of things.

I'm not making a specific point here - I'm actually trying to work this out in my head as I write.


> I think you've hit upon some interesting examples. Maybe the way to look at this is cost vs "benefit" (in the broadest sense of the word).

This is obviously a better framework to be in.

"If I don't do it someone else will" is really fraught and that's why people reject it.

So one would really need to ask whether there is a net benefit to having a "mind reading" system out in the world. In fact, I find it hard to think of positive use cases that aren't just dwarfed by the possibility of Orwellian/panopticon-type hellscapes.


> In fact I find it hard to think of positive use cases

Firstly - forcing people to think of positive use-cases up front is a terrible way to think about science. Most discoveries would have failed this test.

Secondly - can you really not? Off the top of my head:

a) Research tools for psychology and other disciplines

b) Assistive devices for the severely disabled

c) An entirely new form of human-computer interface with many possible areas of application


As I mentioned, do any of those outweigh the possibility that some three-letter agency might start mass-scanning US citizens for what amounts to thoughtcrime? The fundamental idea of privacy would cease to exist.


That's a very big leap. If we're at the stage where a three letter agency can put you in an fMRI machine, then we're probably also at the stage where they can beat you with a rubber hose until you confess.

My point is that there's already a wide variety of things a future draconian state can do. This doesn't seem to move the dial very much.


I'm not suggesting I have some ability to judge whatever the answer is; I'm just curious, because TFA didn't include a lot of detail on this point except some vague bullet points at the end.


The underlying NSD dataset used in the three prominent (and impressive) recent papers on this topic (including the one linked here) is a bit problematic because it invites exactly this (classification/identification rather than reconstruction): it only has 80 categories and was not recorded with reconstruction in mind.

Reconstruction is the primary and difficult aim, and it is what you want and expect when people talk about "mind reading". Classification from brain activity has long been solved and is not difficult; it is almost trivial with modern data sizes and quality. With 80 categories and data from higher visual areas you could even use an SVM as the basic classifier, plus some method for recovering a similar blob shape from the activity (V1-V3 are map-like), and get good results.
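To give a sense of how low that bar is, here is a rough sketch of such a classification baseline (the arrays below are random stand-ins; real NSD preprocessing is of course more involved):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    # Stand-in data: one row of higher-visual-area voxel responses per trial,
    # one of 80 category labels per trial (chance level = 1/80 = 1.25%).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((2000, 1000))   # trials x voxels
    y = rng.integers(0, 80, size=2000)      # category label per trial

    clf = LinearSVC(C=0.01, max_iter=5000)  # simple linear decoder, one-vs-rest over 80 classes
    print(cross_val_score(clf, X, y, cv=5).mean())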

If you don't check whether you are just doing classification, you can easily get too-good-to-be-true results. With these newer methods relying on pretrained features, this classification shortcut can hide deep inside the model and can easily be missed.

The community is currently discussing to what extent this applies to these newer papers (start with the original post): https://twitter.com/ykamit/status/1677872648590864385?s=20

One thing they showed is that the 80 categories of that data collapse to just 40 clusters in the semantic space.

(Kamitani has been working on the reconstruction question for a long time and knows all these traps quite well.)

The deeprecon dataset proposed as an alternative has been around for a few years and has been used in multiple reconstruction papers. It has many more classes, out-of-distribution "abstract" images, and no class overlap between train and test images, so it's quite suitable for proving that a method is actually doing reconstruction. But it's also an order of magnitude smaller than the NSD data used for the newer reconstruction studies. If you modify the 80-class NSD data to remove the train-test class overlap, the two diffusion methods tested there do not work as well, but they still look like they do some reconstruction.

On deeprecon the two tested diffusion methods fail at reconstructing the abstract OOD images (which NSD does not have), something previous reconstruction methods could do.


Yes, there was. However, this is a different paper, describing a different method, applied to a different dataset, with different results.

As the abstract says, "In particular, MindEye can retrieve the exact original image even among highly similar candidates indicating that its brain embeddings retain fine-grained image-specific information. This allows us to accurately retrieve images even from large-scale databases like LAION-5B. We demonstrate through ablations that MindEye's performance improvements over previous methods result from specialized submodules for retrieval and reconstruction, improved training techniques, and training models with orders of magnitude more parameters."

Note that LAION-5B has five billion images.


> To achieve the goals of retrieval and reconstruction with a single model trained end-to-end, we adopt a novel approach of using two parallel submodules that are specialized for retrieval (using contrastive learning) and reconstruction (using a diffusion prior).

You can think of contrastive learning here as two separate models that take different inputs and produce vectors of the same length as outputs. This is achieved by training both models on pairs of training data (in this case fMRI scans and the observed images).

What the LAION-5B work shows is that they did a good enough job of this training that the models are really good at creating similar vectors for nearly any image and fMRI pair.

Then, they make a prior model which basically says “our fMRI vectors are essentially image vectors with an arbitrary amount of randomness in them (representing the difference between the contrastive learning models). Let’s train a model to learn to remove that randomness, then we have image vectors.”
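A minimal sketch of that kind of symmetric contrastive loss over fMRI/image embedding pairs (CLIP-style; the names and temperature value are illustrative, not taken from the paper):

    import torch
    import torch.nn.functional as F

    def clip_style_loss(fmri_emb, image_emb, temperature=0.07):
        # Matched fMRI/image pairs should be the most similar items in the batch,
        # in both the fMRI->image and image->fMRI directions.
        fmri_emb = F.normalize(fmri_emb, dim=-1)    # (batch, d)
        image_emb = F.normalize(image_emb, dim=-1)  # (batch, d)
        logits = fmri_emb @ image_emb.T / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2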

So yes, this is an impressive result at first glance and not some overfitting trick.

It’s also sort of bread and butter at this point (replace fMRI with “text” and that’s just what Stable Diffusion is).

There'll be lots of these sorts of results coming out soon.


This is mostly correct, except that there is only one model. This model takes an fMRI and predicts 2 outputs. The first is specialized for retrieval and the second can be fed into a diffusion model to reconstruct images.

You can see the comparison in performance between LAION-5B retrieval and actual reconstructions in the paper. When retrieving from a large enough database like LAION-5B, we can get images that are quite similar to the seen images in terms of high-level content, but not so similar in low-level details (relative positions of objects, colors, textures, etc.). Reconstruction with diffusion models does much better in terms of low-level metrics.


How is contrastive learning done with one model, exactly?

I agree only one is used in inference, but two are needed for training (otherwise how do you calculate a meaningful loss function?). Notice in the original CLIP paper, there's an image encoder and a text encoder, even though only the text encoder is used during inference. [0]

[0] https://arxiv.org/pdf/2103.00020.pdf


There are two submodules in our model (a contrastive submodule and a diffusion prior submodule), but they still form one model because they are trained end-to-end. In the final architecture that we picked there is a common backbone that maps from fMRIs to an intermediate space. Then there is an MLP projector that produces the retrieval embeddings and a diffusion prior that produces the Stable Diffusion embeddings.

Both the prior and the MLP projector make use of the same intermediate space, and the backbone + projector + prior are all trained end-to-end (the contrastive loss on the projector output and the MSE loss on the prior outputs are simply added together).

We found that this works better than first training a contrastive model then freezing it and training a diffusion prior on its outputs (similar to CLIP + DALLE-2). That is, the retrieval objective improves reconstruction and the reconstruction objective slightly improves retrieval.
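In rough pseudocode, the wiring and combined loss look something like this (module internals omitted; this is a sketch of the idea, not our actual implementation):

    import torch.nn as nn
    import torch.nn.functional as F

    class MindEyeSketch(nn.Module):
        # Shared backbone feeding a retrieval head (contrastive loss) and a
        # diffusion prior (MSE loss); everything is trained end-to-end.
        def __init__(self, backbone, projector, prior):
            super().__init__()
            self.backbone = backbone    # fMRI voxels -> intermediate space
            self.projector = projector  # MLP -> retrieval embedding
            self.prior = prior          # diffusion prior -> CLIP image embedding

        def forward(self, voxels):
            h = self.backbone(voxels)
            return self.projector(h), self.prior(h)

    def training_loss(model, voxels, clip_image_emb, contrastive_loss_fn):
        retrieval_emb, predicted_clip_emb = model(voxels)
        loss_contrastive = contrastive_loss_fn(retrieval_emb, clip_image_emb)
        loss_mse = F.mse_loss(predicted_clip_emb, clip_image_emb)
        return loss_contrastive + loss_mse  # both objectives backpropagate through the backbone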


If it's still retrieving an image and not reconstructing it, that's decently fine as long as the dataset is large enough, but that's not how diffusion models generally work, and I'd have expected the model to map the fMRI data to a wholly new image.


Please read the paper. Or at least the blog post. It's really quite readable.

They explain that they've done both retrieval and reconstruction, and have lots of pictures showing examples of each.

https://medarc-ai.github.io/mindeye/


If you can retrieve an image using a latent vector, it’s trivial to reconstruct it (decently well) with a diffusion model.


They tested themselves both on retrieval and reconstruction.


That one was a bit like not hotdogs: https://www.youtube.com/watch?v=ACmydtFDTGs


This is SO COOL. I'd guess (I did analysis for an fMRI lab for a year so I'm not a pro but not totally talking out of my orifice) that detecting images like this is among the easier things you could do (it probably wouldn't be so easy to do things like "guess the words I'm thinking of") and I suspect other sensory stuff might be harder but I have little knowledge there.

One of the biggest issues with any attempt to extract information from an fMRI scan is resolution, both spatial and temporal - this study used 1.8mm voxels, which covers a TON of neurons (also recall that fMRI measures blood flow, not neuron activity - we just count on those things being correlated). Temporally, fMRI sampling frequencies are often <1 Hz. I didn't see that they mentioned a specific frequency, but they showed images to the subject for 3 seconds at a time, so I'd guess that's designed to ensure you get at least a frame or three while the subject is looking at the image. You can sort of trade voxel size for sample frequency - so you can get more voxels, or more samples, but not both. So detecting things that happen quickly (like, say, moving images or speech) would probably be quite hard (even if you could design an ai thingey that could do it, getting the raw data at the resolution you'd need is not currently possible with existing scanners).
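For a rough sense of scale, a back-of-the-envelope calculation (the cortical neuron density below is an assumed order of magnitude, not a number from the study):

    voxel_side_mm = 1.8
    voxel_volume_mm3 = voxel_side_mm ** 3              # ~5.8 mm^3 per voxel
    assumed_neurons_per_mm3 = 50_000                   # assumed rough order of magnitude for cortex
    print(voxel_volume_mm3 * assumed_neurons_per_mm3)  # ~290,000 neurons per voxel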

Also, not all brain functions are as clearly localized as vision - the visual cortex areas in the back of the brain map pretty directly to certain kinds of visual stimulus, while other kinds of stimulus and activity are much less localized (there isn't a clear "lighting up" of an area). You can get better resolution if you only scan part of the brain (e.g. the visual cortex) (I don't know if that's what they did for this study), but that's obviously only useful for activity happening in a small part of the brain.

ANYWAY SO COOL!!! I wonder if you could use this to draw people's faces with a subject who is imagining looking at a face? fMRI police sketch? How do brains even work!?


Yeah, using data from a 7T MRI with higher spatial resolution definitely helps!

The fMRI dataset includes signal from the whole brain but we only use the data from the visual cortex for this study.


Are you able to extract an image showing the screen in the fMRI machine, as the subject can see it in between pictures?


We did have a face reconstruction project planned. It is on the back-burner for now. That one will be based on something like the Celeb-A dataset instead of the Natural Scenes Dataset (images from MS-COCO) used here.


Human communication will change dramatically once useful invasive brain-computer interfaces are available.

People will suddenly realize that the reason language is primarily serial is simply due to the fact that it must be conveyed by a series of sounds. There will likely be a new type of visual language used via BCI "telepathy". It may have some ordering but will not rely so heavily on serializing information, since the world is quite multidimensional.


Indeed, it reminds me of the movie Arrival (and the short story upon which it's based) where the heptapods are able to show a complete sentence and story within one glyph. I thought it was interesting just how much the movie focused on linguistics, which is rare to see in Hollywood films.

Something else that's interesting about language is that it's just a compressive medium for thoughts; I think of a concept, then I put it into words (compression) that you the listener then have to interpret and understand (decompression) and then fit your brain state to the new data you've received. It's overall a very lossy medium compared to what brains can do. It would be much easier to beam the thoughts and images and videos in my mind directly to you.

Unless you or I have aphantasia, of course.


Right and I would go so far as to say that most types of intelligence are a type of functional compression also.

There's definitely room for direct transfer of concrete unrolled information. But at the same time we would still need some forms of abstraction in many cases.

I think the biggest issue with the compression of natural language is that the loss is different for each person, since everyone's "codec" varies. In other words, people often interpret language in different ways.

But suppose that humans or AIs or AI-enhanced humans could have exactly the same base dictionary or interpretive network or "codec" or whatever for a (visual or word-based) language. Then we could get away from many of the disputes and misunderstandings that arise purely from different interpretations.


I wonder what the limits are to such a universal codec. From what I've gathered about synaesthesia (e.g. from V.S. Ramachandran, or Galton earlier), it varies quite significantly between persons. I believe it's said that some 3% of people have aphantasia for instance. That means entire modalities would be excluded for some in a latent space glyph language. Unless, I suppose, one could find ways of stimulating the synaesthetic connections artificially too.


In a popular sci-fi novel (avoiding spoilers), the alien race has transparent skulls, and their visible thoughts are broadcast to anyone within visual range.

It does seem more efficient than sound.


I want to know the scifi

You may base64 it to spoiler-proof it.


I think it's "VGhlIFRocmVlLUJvZHkgUHJvYmxlbQ==". The aliens cannot lie to each other (they don't even have the idea of a lie), because their thoughts are transparent to each other.
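If you'd rather decode that locally than paste it into some website:

    import base64
    print(base64.b64decode("VGhlIFRocmVlLUJvZHkgUHJvYmxlbQ==").decode())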


Sounds like a recipe for conflict.


It would certainly be a very violent shift towards a very different societal equilibrium.

Few, if any, people who currently have power would be able to absorb the sheer amount of hitherto hidden distrust or resentment that their subordinates harbor towards them.

Interestingly, there might be two very different end stages.

Either a very open society where people at the top are selected to be non-narcissistic and stoic, or a very closed and oppressive society where the absolute ruler is kept in power by a bunch of truly zombified and obedient warriors whose loyalty is real and unshakeable, and who will kill anyone whose brain entertains any rebel ideas too much.


For context, early vision is easier to map than you might expect.

Here's a radiograph of the primary visual cortex created in 1982 by projecting a pattern onto a macaque's retina: https://web.archive.org/web/20100814085656im_/http://hubel.m...

An injection of radioactive sugar lets you see where the neurons were firing away and metabolizing the sugar.

(https://pubmed.ncbi.nlm.nih.gov/7134981/)


But can brain activity be mapped anywhere near this precisely with fMRI? I doubt it. But yes, it is cool that the brain keeps the spatial proportions of reality in its map! Very unlike a latent space.


In some ways, absolutely not - precision is a huge challenge with an indirect method like fMRI - but this example is over a decade old now: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3130346/

Fig. 4 shows the letter M on the cortical surface, where the stimulus accounted for the effects of foveal magnification (foveal vision gets more cortical space). Keep in mind that we now, in theory, have stronger magnets, better head coils (the part that picks up the image information), and better sequences (the software that manipulates the magnets to produce the images), so we could do even better these days.


Some important points from the article, under the limitations section:

- Each participant in the dataset spent up to 40 hours in the MRI machine to gather sufficient training data.

- Models were trained separately for every participant and are not generalizable across people.

- Image limitations: MindEye is limited to the kinds of natural scenes used for training the model. For other image distributions, additional data collection and specialized generative models would be needed.

- … Non-invasive neuroimaging methods like fMRI not only require participant compliance but also full concentration on following instructions during the lengthy scan process. …


For anyone who hasn’t been in an MRI: 40 hours is a lot.

Those things are tight; not “look where you are going” tight, but you absolutely need to tell people beforehand that they will feel very uncomfortable inside, remind them that they can get out at any time, and show them how because they will not like being in there.

I wouldn't spend 20 minutes in one if it were not important. I'd seriously push back on an hour. 40 hours is something I'd only do if it were absolutely necessary.


Awesome. Predicting words from fMRI has been around for a while, and the visual cortex can be mapped well.

That said, and coming from a background in neuroimaging 20 years ago, what's the applicability? MRI hasn't gotten that much more cost-effective for more widespread use. Magnets are expensive.


Yeah, the first thing that comes to mind (har har) when I see this is that we'd be better off trying to develop better scanning technology. You can't exactly walk around town with an MRI strapped to your skull.


We think it could be useful for clinical research and maybe even diagnostics. For example, you could imagine that a person with depression (or other neurological disorders) may have a different perception of the same image than a healthy person does. With the much higher fidelity that both more powerful MRI machines and better generative AI tools can provide, this may now be a very promising direction for future research.


I work in pediatrics and am an academic investigating MRI of kids with various diseases. When I saw this work, I did wonder about us being better able to functionally map where things are going wrong in the pathways of neurodisability. I wondered if this would have applications in being able to do that - for example, being able to say that someone could process an image. Do you think it could have this type of application? One thing which would be a deal-breaker at the moment is the amount of time participants spend in the scanner. But if we wanted to (for example) see if a child could perceive simple objects, would that be doable, do you think?


People with disabilities could benefit greatly from this.


As long as they want to talk about London buses, steam trains, surfing, and football.


Reading a suspect's mind.


What if a copyrighted image or video can be recovered from your brain using external tech like this? That's not fair to the rights holders; what we really need is technology that can clean such illegal memories from the brain - a brainwasher, if you will.


Aside from helping those with disabilities, what is stopping authorities from using this as a lie detector? I assume the tech isn't quite there yet.


Not yet.

> Models were trained separately for every participant and are not generalizable across people.


Ah right, so the models are subject to some serious overfitting then. Good proof of concept, but not useful in practice yet.


Techniques like this give me hope that some day we will be able to objectively diagnose mental illness, and monitor the efficacy of treatment.


As someone suffering from intrusive thoughts I do not look forward to a future where other people can see what I sometimes see in my head.


As a person who is normal by any measuring standard, I do not look forward to a future where other people can see what I sometimes see in my head.

You'd be quite surprised.


I think the method of merging the pipelines via img2img should use ControlNet. It might need to be finetuned specifically for this, although existing ControlNet models might work fine.

This is exactly what you'd want to use ControlNet for - mapping semantic information onto the perceived structure.


Yes, we've been looking into ControlNet as well, and I think there is one recent fMRI-to-image paper that has also tried ControlNet. Maybe we'll use ControlNet in MindEye v2 :)


Yes, ControlNet will be used in the next version. For this one we couldn't get it working in time.


They should dose people with DMT in the fMRI and run it through the model.


There was also DreamDiffusion recently. https://arxiv.org/abs/2306.16934


This is cool. We are heading towards a future like the one shown in Inception, where an idea can be planted.


The best reconstructions I've seen so far are from Jack Gallant's and Alex Huth's labs (at least from what was shown publicly at SfN).


This is mind reading; we are in the future.


Found the sucker.



