New algorithm unlocks high-resolution insights for computer vision (news.mit.edu)
183 points by zerojames 9 months ago | 21 comments

The paper's actual page feels like a clearer explanation to me.


What an amazing idea :)

They reproject the input images and run the low-res network multiple times. Then they use an approach similar to NeRF to merge the knowledge from those reprojected images into a super-resolution result.

So in a way, this is quite similar to how modern Pixel phones can take a burst of frames and merge them into a final image that has a higher resolution than the sensor. Except that they run useful AI processing in between and then do the super-resolution merge on the results.
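To make the "run the low-res network on shifted inputs, then merge" idea concrete, here is a minimal linear toy, not the paper's method. It assumes the "network" is just average pooling over a 1D signal, and it merges the views with least squares instead of a learned NeRF-like model. The names `view_matrix` and `fit_high_res` are made up for illustration.

```python
import numpy as np

def view_matrix(n, shift, factor):
    """Rows average `factor` consecutive samples of a length-n signal,
    starting at offset `shift` (a stand-in for a low-res model on a shifted input)."""
    starts = range(shift, n - factor + 1, factor)
    A = np.zeros((len(starts), n))
    for row, start in enumerate(starts):
        A[row, start:start + factor] = 1.0 / factor
    return A

def fit_high_res(views, n, factor):
    """Find a high-res signal whose shifted, pooled views match every observation.

    `views` maps shift -> low-res observation. The system is underdetermined,
    so lstsq returns the minimum-norm solution among all consistent signals.
    """
    A = np.vstack([view_matrix(n, s, factor) for s in views])
    b = np.concatenate([views[s] for s in views])
    high_res, *_ = np.linalg.lstsq(A, b, rcond=None)
    return high_res

n, factor = 32, 4
truth = np.sin(np.linspace(0, 2 * np.pi, n))          # hidden high-res signal
views = {s: view_matrix(n, s, factor) @ truth for s in range(factor)}
estimate = fit_high_res(views, n, factor)              # length-32 reconstruction
```

Each single view only has 7 or 8 samples, but together the shifted views constrain a length-32 estimate. The remaining ambiguity is where a real method needs a prior or a learned upsampler.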

Also similar to temporal antialiasing https://en.wikipedia.org/wiki/Temporal_anti-aliasing .

Perhaps similar in some ways to how big cats' eyes reflect the light back from behind the retina (i.e. back through it for a second pass) to capture more light. I'm sure I heard that on a nature documentary ...

Very interesting. I'm curious how people follow a train of thought like that to a successful idea. So many great algorithms are based on small twists.

It is interesting indeed. One wonders if the researchers of this particular bit of work made it mandatory to go for walks at lunch and think about how their vision chunked/filtered the information it was receiving. Interesting that they "perturb" the image to get some noise involved. I'll need to read it over again.

Nature is such a good source of inspiration. The "perturb" approach reminded me of [fixational eye movement][1], but maybe that's only a clear link in retrospect.

[1]: https://en.wikipedia.org/wiki/Fixation_(visual)

This seems like it could have been inspired by how human vision works.

The training technique used here (fitting something similar to a NeRF to different views of the same image) is pretty similar to this paper which uses a similar technique to denoise (instead of upscale) output features: https://arxiv.org/abs/2401.02957

It's not that clear why they are downsampling and then upsampling again. Why not do all the work at the original resolution?

Apparently, the issue is that some vision algorithms only output a low-res representation and that needs to be upsampled to match the original?

>It's not that clear why they are downsampling and then upsampling again. Why not do all the work at the original resolution?

For NNs, this is pretty much a compute efficiency thing. Working on the original resolution directly is more compute intensive.
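A back-of-the-envelope illustration, assuming a ViT-style backbone (the thread doesn't name one): a 224x224 image with 16x16 patches yields a 14x14 feature grid, and self-attention cost grows with the square of the token count. The helper names below are hypothetical.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Tokens in a square image split into square, non-overlapping patches."""
    return (image_size // patch_size) ** 2

def attention_pairs(tokens: int) -> int:
    """Token pairs compared by full self-attention (quadratic in token count)."""
    return tokens * tokens

low_res = num_tokens(224, 16)    # 196 tokens, i.e. a 14x14 feature map
full_res = num_tokens(224, 1)    # 50176 tokens, one per pixel

# How many times more pairwise comparisons per-pixel features would need.
cost_ratio = attention_pairs(full_res) // attention_pairs(low_res)
print(low_res, full_res, cost_ratio)  # 196 50176 65536
```

So under these assumptions, per-pixel features from the same architecture would cost tens of thousands of times more attention work, which is why upsampling the cheap low-res features afterwards is attractive.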

Correct, s/some/vast majority of. Ex. major video conference software ML blur algos run at like 100x100 - the weird edge is much more about resolution of input/output than ML.

Is a learned downsampler a form of inverse crime? https://arxiv.org/abs/math-ph/0401050

Don't think that's applicable in this case. This "FeatUp" technique does not feed its output back into the model in any way.

Rather, it's just producing a higher-resolution output by taking multiple passes over the input image (subtly shifting the input image before each pass), each producing a slightly different low-resolution feature map.

Each of these low-resolution feature maps represents contributions from different areas of the input image. "FeatUp" can then create a higher-resolution feature map, "simply" by taking the value from the pass with the most appropriate input shift.

A very rough sketch:

     Input Image:  abcdefgh

Create multiple low resolution feature maps using your model, shifting the input image, a few pixels each pass:

     Pass 1:  abcdefgh   --> ACEG
     Pass 2:  bcdefgh    --> BDFH

Now take all the low resolution feature passes and combine into a single higher resolution version:

     FeatUp:  ACEG,BDFH -->  ABCDEFGH
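The same rough sketch as runnable Python, purely as a toy. Here `low_res_model` is a hypothetical stand-in that keeps every second character and upper-cases it. The actual FeatUp method learns how to combine the passes rather than interleaving them like this.

```python
def low_res_model(signal: str, stride: int = 2) -> str:
    """Hypothetical low-res 'model': one upper-case feature per `stride` inputs."""
    return signal[::stride].upper()

def featup_toy(signal: str, stride: int = 2) -> str:
    """Run the low-res model once per input shift, then interleave the results."""
    # One pass per shift: shift 0 sees "abcdefgh", shift 1 sees "bcdefgh", ...
    passes = [low_res_model(signal[shift:], stride) for shift in range(stride)]
    merged = []
    for i in range(len(passes[0])):      # the unshifted pass is the longest
        for features in passes:
            if i < len(features):        # shifted passes may be one item shorter
                merged.append(features[i])
    return "".join(merged)

print(featup_toy("abcdefgh"))  # ABCDEFGH
```

With `stride=2` this reproduces the two passes above (`ACEG` and `BDFH`) and merges them back into `ABCDEFGH`.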

I wonder what you'd get if you did something similar on the latent space in a diffusion model, before decoding to an image.

This looks like it could be useful. Remote sensing uses feature extraction tools. Being able to upsample again would make the data a lot easier to view and interpret. Nice work.

Seems pretty scary that their demo video shows medical images having their resolution 'increased'. Does this add anything to the original images?

I'm really not sure what your concern is?

They do this based on semantics, for data that doesn't contain the detail itself. You can get more information out of pixelated data if you know what the semantics are.

The search space is much, much smaller if you only optimize for blood cells than for everything. If this adds a chance of seeing things you couldn't see before, it adds value.

It could mean doing a cheap analysis with low res and doing a high res and much more expensive one when you detect something. Like being in a rural area and traveling to the big city after you found something.

Overall, the chances are that more people get help, not fewer.

The images aren't being upscaled, I don't think. Rather, the features from low-resolution representations are being upscaled. The images are the originals, shown so you can see how the upscaled features still line up with reality.

What does the learned downsampler add? Isn't the output of the algorithm the upsampled features that's fed into the downsampler?

What can this do for satellite imagery?

