There's definitely value in providing this functionality for photographs taken in the present.
But I think the real value -- and this is definitely in Google's favor -- is providing this functionality for photos you have taken in the past.
I have probably 30K+ photos in Google Photos that capture moments from the past 15 years. There are quite a lot of them where I've taken multiple shots of the same scene in quick succession, and it would be fairly straightforward for Google to detect such groupings and apply the technique to produce synthesized pictures that are better than the originals. It already does something similar for photo collages and "best in a series of rapid shots." They surface without my having to do anything.
I got really good compression using this technique with JPEG XL. I'm sure there's a good reason why it works so well, but it's been a long time and I don't remember why.
Could it be that JPEG also exploits repetition at the wavelength of the width of a single picture, so to speak? E.g. with 4 pictures side by side that have the same black dot in the center, can all 4 dots be encoded with a single sine wave (simplifying a lot here..) that has peaks at each dot?
The tiled/stacked approach others mention is good, and probably the best one. You could also try an uncompressed format (even just uncompressed PNG) or something simple like RLE, then 7zip the results together, since 7zip is the only archive format I'm aware of that does inter-file (as opposed to intra-file) compression.
Unfortunately lossless video compression won't help here, as it compresses frames individually in lossless mode.
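If anyone wants to try the 7zip route, here's a minimal sketch (it assumes the `7z` command-line tool is installed; the directory and file names are placeholders):

```python
import subprocess
from pathlib import Path

# Collect the losslessly-encoded stills (PNG here) and pack them into a single
# solid archive. "-ms=on" turns on solid mode, which lets 7-Zip find repetition
# across files rather than compressing each one independently.
frames = sorted(str(p) for p in Path("shots").glob("*.png"))
subprocess.run(["7z", "a", "-ms=on", "shots.7z", *frames], check=True)
```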
Not so. Gzip’s window is very small - 32K in the original gzip, iirc - which meant even identical copies of a 33KB file would not help each other.
Iirc it was bzip2 that bumped that up to 1MB, and there are now compressors with larger windows - but files have also grown; it’s not a solved problem for compression utilities.
It is solved for backup, though - restic and a few others will do that across a backup set with no “window size” limit.
…. And all of that is only true for lossless, which does not include images or video.
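A quick way to see the window-size effect from Python's standard library (the exact sizes are approximate and depend on the data):

```python
import os, zlib, lzma

block = os.urandom(1_000_000)      # ~1 MB of incompressible data
doubled = block + block            # two identical copies back to back

# zlib/gzip use a 32 KB window, so the second copy can't reference the first:
print(len(zlib.compress(doubled, 9)))   # roughly 2 MB - no cross-copy savings

# lzma/xz use a dictionary of several MB, so the repeat is found:
print(len(lzma.compress(doubled)))      # roughly 1 MB - second copy is nearly free
```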
Philosophically, yes. But some photo-editing techniques rely on data that is not backfillable and must be recorded at capture time. And even in cases where there is no functional impediment to applying it against historical photos, sometimes there is product gatekeeping to contend with.
I would really like to see the proof that it's impossible to design a reversible state machine that won't cycle. But even if you do prove that, you would also have to prove that if the laws of physics are reversible that the universe is reversible.
The current best theory and understanding of the evolution of the universe is that it will reach maximum entropy (heat death). There is no cycling when this happens. Can you cite what theory or new discovery you have come across that somehow challenges the heat death hypothesis?
> ..fairly straightforward for Google to detect such groupings and apply the technique to produce synthesized pictures that are better than the originals.
Wouldn't an operation like this require some kind of fine-tuning? Or do diffusion models have a way of using images as context, the way one would provide context to an LLM?
I think simpler algorithms (e.g. image histograms) can get you a long way. Regardless of the mechanism, Google Photos already has the capability to detect similar images, which is used to generate animated gifs.
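As a rough illustration of the histogram idea with OpenCV (the helper name and bin counts are made up for this sketch, nothing Google-specific):

```python
import cv2

def histogram_similarity(path_a: str, path_b: str) -> float:
    """Correlate the HSV color histograms of two images; ~1.0 means near-duplicates."""
    hists = []
    for path in (path_a, path_b):
        hsv = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        hists.append(cv2.normalize(h, h).flatten())
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)

# Shots of the same scene taken in quick succession score close to 1.0.
print(histogram_similarity("shot_001.jpg", "shot_002.jpg"))
```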
When you think about it, the only thing that's weird about this hypothetical conversation is the context of it being about (purported) photographs.
We expect images that look like photographs — at least when taken by amateurs — to be the result of a documentary process, rather than an artistic one. They might be slightly filtered or airbrushed, but they won't be put together from whole cloth.
But amateur photography is actually the outlier, in the history of "capturing memories"!
If you imagine yourself before the invention of photography, describing your vacation to an illustrator you're commissioning to create some woodblock-print artwork for a set of Christmas cards you're having made up, the conversation you've laid out here is exactly how things would go. They'd ask you to recount what you saw, do a sketch, and then you'd give feedback and iterate together with them, to get a final visual down that reflects things the way you remember them, rather than the way they were, per se.
This is an interesting point. Usually people claim technology goes inexorably forward, yet here we are, merrily destroying trust in the most objective method we have to record the past!
Photographs haven't been able to be trusted since almost the beginning -- trusted as an image of a real scene, that is.
Indeed, people viewing photographs have always been able to be manipulated by presentation as fact something that is not true -- you dress up smart, in borrowed clothes, when you're really poor; you stand with a person you don't know to indicate association; you get photographed with a dead person as if they're alive; you use a back drop or set; et cetera.
This is a bit of a stretch, but the end results from either manipulation technique would be comparable if they were meant to skew the truth the same way. That sounds stupid as shit when I read it back, though, and I'm not entirely sure why.
I think a use case for AI image manipulation could be more like: I need a picture where I'm poor but wearing smart borrowed clothes, standing with an unassociated associate and a dead person posed as alive, in front of a backdrop, with the only source image being a selfie of someone else that incidentally caught half of me way in the background.
The intents or use cases for these two (lacking a better term) kinds of manipulation aren't really comparable here. The purpose of AI image generation is, well, images generated by AI. It could technically generate images that misrepresent info, but that's more of a side effect, reached in a totally different way than staging a scene in an actual photo. It seems like using manipulation to stage misleading photos would be done primarily for the purpose of deceptive activities or subversive fuckery.
Agreed. My point was that trusting images ('seeing is believing') has always been at issue, even though we might imagine it is a new thing. The scale of the issue is different -- phenomenally so -- but it's not a category difference. Many people were convinced by the fairy hoaxes based on image manipulation in the early 20th century (~1917). They fell for it hook, line, and sinker; images made with ML weren't needed.
FB AI, make a series of posts about me climbing mount everest, meeting dalai lama, curing cancer, bringing peace to ukraine, changing my name to Melon Tusk, announcing running for president and adopting a dog named Molly
But see, that's the sort of thing that would give it away.
You got to shoot for something just attainable enough to sound credible, while still being at the "enviable" end of the spectrum.
"FB AI, make a series of pictures of my first 3 months at Goldman Sachs in 2021. Include me shaking hands with the VP of software as I receive a productivity award for making them $1m in a week. Include a group photo of me and 12 other people (all C execs and my VP must be there). Crosspost all to LinkedIn, with notifications muted."
"Ok done"
"ChatGPT, take my existing CV and replace entries from 2021 onwards with a job as Head of Performance Monitoring at Goldman Sachs, reporting to VP of software. Include several projects with direct CEO and CFO involvement. Crosspost changes to LinkedIn."
This actually feels like it could be an incredibly valuable post-production tool in film and TV, once they get it working consistently across multiple frames.
Not only for more flexibility in "uncropping" after shooting (there was a tree/wall in the way), but this could basically be the holy grail solution for converting 4:3 to widescreen without cutting off content on the top and bottom.
I already use Photoshop Generative Fill for uncropping videos, but it only works for fixed camera shots. Photoshop just added a feature where you can drag the video file in and do the uncrop in one step.
The problem I'm solving is converting videos from widescreen to vertical and sometimes you need some extra height.
Mind if I ask why you'd need to do that? That's a huge amount of the frame being generated artificially, especially if you're talking cinema-aspect-ratio widescreen.
If you're trying to convert widescreen content so it looks good on TikTok, Reels, or Shorts, then for the most part you can crop a vertical chunk from the centre of the frame and pan if necessary to keep the action in frame. Sometimes, though, the shot is too wide and you can't crop that vertical chunk out, so you crop it as vertical as you can and then add something to the top and bottom to fill out the frame; otherwise you have a whole shot that isn't vertical and it breaks the flow of the video.
I can see it working great for some stuff, but with more artistic work wouldn't you ultimately face the issue that the framing might not be very good if you just artificially extend it?
It definitely needs to be applied judiciously on a shot-by-shot basis.
There have been quite a few 4:3-to-widescreen conversions that were done using the original film that was actually shot in widescreen and cropped for TV.
Sometimes, the wider shot makes perfect sense. Sometimes, they keep the original cropped one but cut off top/bottom. Sometimes it's a combination of the two. It all depends on what's being framed -- two people in a car usually benefits from cropping (nobody needs the bottom third of the frame occupied by the car's hood), while a close-up on someone's face usually benefits from extending the sides (otherwise it's an uncomfortable mega-close up that cuts off their mouth).
But having the flexibility to extend horizontally opens up those artistic possibilities.
Also getting everyone smiling with their eyes open at the same time. Phone cameras could record a group photo for five or ten seconds and use the best expression from different times for each person.
Or you take a single picture of a group in front of a monument, but the monument gets cut off. As I understand it you could find pictures of the monument online, run the model, and have a picture with the group and the entire monument.
Probably google can even do this automatically - I would not be surprised if I get suggestions to fix images with cut off buildings via Google Photos in the future! Would be so cool.
I’ve done this manually in Photoshop more times than I can count.
Usually more automated solutions only hold up to light scrutiny, but that’s rapidly changed in the past year. Sitting here after this year, I’m a little miffed about it. Oh well.
That's "Top Shot" which is the entire frame. The feature I'm referring to would adjust multiple faces in a frame by selecting the same faces/sections from different frames to a single target frame
I swear that's what was announced, but I assume you're right because you actually know the term Top Shot, and I had no memory of that.
So your memory is probably better than mine. :)
I just remember some demo of a family shot where it automatically opened a little boy's eyes by using another photo. And another demo of auto-combining images so that you could take a lot of photos of a busy tourist place and automatically remove all the people.
You don’t even need to take the photo, with enough images of each family member and images of a tourist destination you can just automatically construct a photo of everyone together at the location, saving the costs and carbon footprint of getting everyone together.
And then why demand "photos" of family excursions at all, when it is just an AI imagining how things probably were happening at the time, or would have happened? We should just stick to our own imperfect memory.
I'd imagine in the future we could have services such as this one:
> In exchange for a small fee and a 35-minute suggestion session, get you and your family implanted with memories of a beautiful vacation that'll last you a lifetime, for a fraction of the cost of an actual one.
I have been working on a holographic camera, but the ultra-cheap pinhole cameras I chose for the array have two issues: the exposure can't be controlled and the lenses are poorly aligned. I can calibrate away most lens aberrations with OpenCV, but some of the outliers have so much cropping that I am discarding 75% of my good pixels to get a coherent result. I was considering using NeRFs to reproject the ideal camera angles, but COLMAP is not very tolerant of brightness fluctuations and NeRF training is relatively slow (considering my goal is video). This would be a nice solution to my problem, because I have a comprehensive set of angles to pull context from.
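For what it's worth, the OpenCV calibration step looks roughly like this; the checkerboard size and paths are placeholders for whatever the actual rig uses:

```python
import glob
import cv2
import numpy as np

# Inner-corner count of the printed checkerboard target (placeholder values).
pattern = (9, 6)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Estimate intrinsics and distortion for one camera, then undistort its frames.
_, K, dist, _, _ = cv2.calibrateCamera(obj_points, img_points,
                                       gray.shape[::-1], None, None)
undistorted = cv2.undistort(cv2.imread("frame.png"), K, dist)
```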
So is the weather just hallucinated then? We're just making up memories and calling them real? And advertising this blatantly, calling rainy days sunny and sunny days rainy? My god, I hate this so much.
Not even a discussion about if this might be harmful or what the risks are or anything, just plain old "THIS FAKE MOMENT WAS REAL AND YOU'LL BELIEVE IT"?!
I really have a hard time with this. Wow I'm upset, more than I expected. The tech is fine yeah but the marketing is just deeply upsetting.
Seems like the real utility of this technique will be as a way to vastly improve the temporal stability of a variety of generative video techniques. For example, if you are trying to use a video as a base for a new generative video: take the first frame of your video and run it through SD with the ControlNet of your choice. Then take that initial image and run it through this process to fit a new base model, and use that to generate your second frame. Now you can feed that second frame back into the model, and rinse and repeat, always using the past few frames to inform the latest.
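Structurally, something like this; the two callables are hypothetical stand-ins for an SD + ControlNet pass and for the per-scene fine-tuning step, not real library APIs:

```python
from typing import Callable, List

def stylize_video(frames: List, prompt: str,
                  stylize_seed: Callable, fit_reference_model: Callable,
                  context: int = 3) -> List:
    """Generate a stylized frame, refit on the recent outputs, repeat."""
    out = [stylize_seed(frames[0], prompt)]          # first frame via SD + ControlNet
    for frame in frames[1:]:
        model = fit_reference_model(out[-context:])  # condition on the last few outputs
        out.append(model.fill(frame, prompt))        # fill/restyle the next frame
    return out
```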
an interesting use case for this once the compute is there is to basically allow for ai powered digital zoom-out. it could work by instructing the user to take several pictures around the target, and then you take regular pictures of your subject.
then, as you like, you can do an "ai zoom out" to get zoomed out pictures, no longer constrained by your lens or distance.
I imagine this will be included relatively soon, just like how panoramas were once a niche thing that became much easier to do with some good ui/ux. pretty much any modern phone can do them without having to struggle with lining up photos and whatnot.
one thing that does greatly concern me about the demo/site is that they have "authentic" and "recover" as terms. the result here is not authentic, nor has anything been "recovered." it's an illusion at best. I personally don't like how they portray the new image as equivalent to what the lens would have framed in the original picture. it's not, as they show themselves in the later portion (near the end) with the text sign. seriously irresponsible framing (pun intended) for what's otherwise very cool tech.
My wife and I have been using the Pixel phones since Pixel 6 and we love the camera. Great pictures! But the best features are google photos, auto-tagging, recommending collages, walking down memory lane.
Then you can magic-erase tourists from pictures and pick a better shot from a picture you took on the fly....
You add this "authentic image completion" to my kids pics, and it's game over...
I definitely agree, Pixel has been at the forefront of computational photography and editing since its inception. Things like night photography that we take for granted now, I remember when Pixel 2 first introduced it and it was honestly mind blowing.
What's so magical about that I/O? I get the point of improving the quality of a picture. But editing the picture so that it includes things that didn't really happen... why even care besides trying to impress others?
What does "didn't really happen" mean? That girl was standing there with the balloons. She just happened to be slightly off frame, and that moment is now gone and you'll never get a properly framed photo of it.
It's like re-coloring an old black and white photo, or photoshopping out a photo bomber from the background.
Looking at the paper for this specifically, it's a simple one. Wouldn't take someone who knew what they were doing more than a couple days to implement. But generally, i agree.
Question: does this model only do outpainting, or does it also do super-resolution? Could this model be fed all the frames of a really awful security-camera video, in order to then synthesize a high-resolution still image of a suspect?
For the last 2-3 years, on an almost weekly basis, I have been blown away by the progress made in AI. Huge steps forward. It actually happened twice in the last 24 hours alone.
Lex Fridman's realistic 3D avatar interviewing Mark Zuckerberg in a generated space (two floating heads).
Interesting to me how it illustrates philosophical questions on the nature of reality, the projection of personality, the 'problem of other minds', and such.
Somewhat covertly, deep down I wish that humans' desire for pretty-looking pictures will fade away over time, due to the ubiquity of pretty-looking pictures produced by automatic post-processing. And ultimately their liking of pretty people and shiny new stuff in general. I don't want to sound negative or pedantic, I just would like people to prefer inner beauty in the broader sense.
I'd love to see a combo of this Google tech and AI upscaling do the same for Babylon 5. They had shot the actors in widescreen format, but the CGI spaceships were only rendered in 4:3 and the files have been lost.
This requires other pictures of the environment to use to infer what should fill in the gaps, which will not exist for every shot in those series. (TOS and TNG were already rereleased in 1080p, though.) I suppose you could use outpainting to construct the rest of the scene in one frame, and use that as the reference for other frames in the same shot.
Agreed, I also suspect this. Since they don't release anything, most of their "fantastic" papers are probably just BS made to let people think they are still relevant.
The current advancement in Generative AI is a bit scary, in my opinion. May I be pessimistic?
This, and the new demo I saw from WhatsApp around persona-based AI, can really alter someone's perception and memories. I don't think we are considering how it can really impact our understanding of our feelings, perception, memories and mindfulness.
If you take a picture of reality and alter it with Gen AI to do something else and change the moment, what is the new reality? After a while, we might question if it was real or not, and then that might just become the new reality.
In my opinion, GenAI is truly transformational as well as scary, as it can alter our perception. I wonder if anyone else feels this way.
but when you take a picture that captures a personal moment and some software, without your consent, alters it with some generative stuff, what would that lead to?
I disagree with lying to yourself. For people who are not mindful and aware, this will severely impact their perception.
> but when you take a picture that captures a personal moment and some software, without your consent, alters it with some generative stuff, what would that lead to?
I mean, do you not look at the photo after you take it? Even if you don't, you were there and saw the original scene. If your memory fails you, it's on you. If you didn't take an accurate picture, it's on you. Check next time.
If anything meaningful is added, it'll be very noticeable, if it's not meaningful, then what does it matter?
Cameras already do a lot of corrections that don't represent reality.
Hell, our perception of colors is different from everyone else's.
Can we cryptographically sign a photo in a way that shows it was generated in a particular place? I'm thinking of some sort of beacon in a location that allows you to say this person was here, at least. I'm not sure if it's possible to go beyond presence and indicate anything else about the situation?
I hesitate to say it, but a blockchain is probably part of the solution.
There is this trend going on where hardware vendors are increasingly locking down their hardware, and this could be a part of the solution you are looking for. Not everyone will be happy about it, however.
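As a sketch of just the signing half of that idea; the hard part -- proving location and unaltered optics -- is hand-waved into a hypothetical tamper-resistant "beacon" key:

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Hypothetical beacon/secure-element key that never leaves trusted hardware.
beacon_key = Ed25519PrivateKey.generate()
beacon_pub = beacon_key.public_key()

photo = open("photo.jpg", "rb").read()
claim = b"lat=48.8584,lon=2.2945,ts=2023-09-28T12:00:00Z"  # asserted by the beacon

signature = beacon_key.sign(photo + claim)

# Verification fails (raises InvalidSignature) if the photo or the claim changed.
beacon_pub.verify(signature, photo + claim)
```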
I have a ton of potato definition videos along with matching high res photos from my childhood that were made by one of those cheap CF-card cameras at the time. Would be cool if this could restore those shitty video frames based on the reference photos as well.
Mix this with some NeRF/Gaussian splatting or other 3D rendering, and we have photos that you can re-frame after they're shot. No more selfie sticks; perhaps use both cameras at once to capture more of a scene for infill.
Some will say "but that isn't a real photo of what was there", but our memories of what was in a photo or a scene aren't perfect anyway.
Does anyone else notice that the example images they provided look like they included their test data in the training set?
E.g., the picture with the couch where they cut out the dog in it. How should the network know that there was a dog on the couch? The only explanation is: It knows the reference image.
You give it a bunch of reference images, then another image with some rectangle removed, and it will fill in the rectangle with information from the reference image.
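For comparison, a plain inpainting call with diffusers looks like the sketch below; what the paper adds is a fine-tuning pass on the reference images before this step, which is where the scene-specific detail (the dog) comes from. The model name and file paths are illustrative, not from the paper.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to("cuda")

target = Image.open("target.png")   # photo with a rectangle masked out
mask = Image.open("mask.png")       # white where pixels should be generated
result = pipe(prompt="a photo of a living room with a couch",
              image=target, mask_image=mask).images[0]
result.save("filled.png")
```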
There was a subreddit called something like r/bubbling where people would edit pictures of women in bikinis and actually cover more of the image, but in such a way that your brain was fooled into completing the image and seeing a nude woman. I thought it was a technical marvel, however creepy.
Is this similar to what GoPro cameras do to remove the selfie stick? They use video content from adjacent frames to remove the pole and fill in the pixels. I get that the approach here can use imagery that's framed completely differently.
Cool tech as others have said, but of course, for thee but not for me with Google, unless I missed a link to a GitHub repo. (That's why OpenAI is called OpenAI - not open source, but at least open access!)
How much VRAM (or system RAM) would you need to run this, and how much processing time does it take to process the reference images and let it generate the fills?
Assuming you are asking about a generative AI way, you could use photos of your new wife to train a LoRA with kohya-ss, then with A1111 you could do an img2img repaint using the ControlNet extension to make sure you get a similar pose. With enough experimentation you could probably get at least one decent result.
At least that's what comes to mind with the things I know you can run offline.
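If you'd rather script it than click through A1111, roughly the same idea with diffusers might look like this; the model IDs, LoRA path, and prompt token are placeholders, and this is a sketch rather than a tested recipe:

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")
pipe.load_lora_weights("path/to/person_lora")   # LoRA trained on the new person

source = Image.open("original.jpg")             # the photo to repaint
pose = Image.open("openpose_map.png")           # pose map extracted from the source
result = pipe(prompt="photo of <token> standing outdoors",
              image=source, control_image=pose, strength=0.7).images[0]
result.save("repainted.jpg")
```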
I know someone who did something similar. He remarried, then went back and cropped or deleted ten years of Facebook photos to make it look like he never had a previous relationship, just ten years of boys' nights.
He even has a picture up of him from his wedding day… standing alone in a tux.
The kind of stuff the OP is doing -- changing the composition to reflect a picture that could have been taken -- is one thing. But what you are asking feels Stalin-esque to me. A picture is a record of a point in time and you can’t change the past.
Sure you can, just as you can change people's memories and implant false ones. Hell, in this dystopia we're headed towards, it'll probably be a subscription service where you can rewrite 5 bad memories a month for $29/mo
This is why my family portrait is going to be a painting. In the future, you can just paint in the new generations and we can all be in the same frame together.
Cool tech, but plastering "authentic" all over this kind of generated photography is really disingenuous and just rubs me the wrong way. I get that it's informed by real details from other photos, but that's not what authentic means.
If I buy an "authentic Rolex" and receive a Chinese Rolex clone that's built similarly based on observations of a real Rolex, I'm going to feel scammed and very upset. And I'm much more protective of my memories than I would be of a watch.
Yeah, I think the first example is bad. This shouldn't be used for the photos you took. What's the purpose of having a photo if it isn't the real moment you captured? I could understand the usage in marketing or event photography, but for memories with your loved ones (as the first example tries to show) it just doesn't make sense to me.
Two anecdotes:
1. A friend of mine met his favourite author (he traveled from one continent to another for a signing event). When he shook hands with the author, a friend took a photo. A lady (still hated by us!) stepped into the middle and blocked the shot. Maybe an AI or a talented person could remove her, use an existing photo of the author, and rebuild the picture... but why? What's the purpose of that?
2. A few months ago, during the pandemic, I scanned all the printed pictures of my grandparents with my phone. After scanning like 200 of them, I checked one and zoomed in: the stupid app had applied some AI to make it look better and it was just worse. I don't care if it looks better to the untrained eye: my grandparents didn't look like that. I now have stupid, horrible versions of the scanned photos, where my grandparents appear with smooth skin and weird eyes.
I totally agree with 2. I'm less sure on 1. Imagine it's perfect - it would be an accurate representation of what was really there. The real photo is a snapshot of a very specific time that doesn't represent the broader context of what happened.
A different angle, if a friend had painted the encounter instead, it wouldn't be exact but it would be a snapshot of a memory.
I'm not hugely arguing in favour of it but I think there's different scales here, from cameras doing "merge pictures half a second apart so people have their eyes open" to "totally change their face".
I would argue that authentic is a relative term, and actually helped me understand the product more easily. IMO, it is “authentic” because, compared to other image fills, it tries to fill in the data using real data from other photos.
IDK, when I think authentic, I think "genuine", and no image generation is genuine by definition. this is not a bad thing necessarily, but it's important to frame these things correctly.
ultimately we ought to think about what we are referring to. if we are talking about a photograph taken by someone, the authenticity ultimately comes from the combination of the photographer and the camera used. so when you think of a genuine photo in this scenario, you expect it to be fundamentally taken by the user with a particular camera to create a particular photograph. you can use devices to take a photo without pressing the button, such as a timer, but the photographer and camera are both fundamental to the authenticity of the image. if the camera is no longer entirely involved in the generation of the photograph, I would say that it is no longer genuine.
"Reference-driven", as described in the article, is more appropriate, but alas it is verbose. normally such pedantry bores me, but in this case it's pretty central to what is being presented.
It's "authentic" in the same way that when you see something labeled authentic it makes you more likely to question if it's actually what it says it is because authentic thing don't need such labels plastered on them.
Regardless, I'm pretty sure "reconstructed" is the honest word to use.
They really need to not use the term "authentic" to name this.
They also need to be very, very careful when introducing capability to falsify photographic images convincingly.
Using the term "authentic" for this (and how do they even know what's an authentic memory?) doesn't sound like being very, very careful. It sounds like being gratuitously reckless.
"Realistic" is the wrong word, since that's what infill models are already doing, and the word is already used for that. You'd have to find something that differentiates between plausibly realistic and contextual realistic infill.
Seems like it's only a matter of degree, given that modern cell phone cameras take image bursts and combine them into a single output image. Filling in details in a scene from other photos taken at the same time doesn't really seem that different to me. And seeing that photography has never really been capturing real life exactly, is it really that big a deal? Look at Ansel Adams - he heavily edited his "real-life" photographs, and changed them over the years as he made subsequent prints.
(Disclaimer: work for Google but have nothing to do with this project.)
> plastering "authentic" all over this kind of generated photography is really disingenuous
No more so than "virtual," which used to mean "true." Or "literal," which used to be the opposite of "figurative." It's just another word being used auto-antonymically.
The point is that it's not based on hallucination -- it's generated out of the authentic details provided from other images.
There's definitely a middle ground here that we perhaps don't have a good word for. E.g. what do we call a painting made by an artist who sat in front of the scene they depicted, vs. a painting made by an artist from their imagination? There's certainly some sense in which the first one was an "authentic" scene.
Yeah, except it's still absolutely vulnerable to hallucination. Look at the last set of images on the "Limitations" page. The algorithm knows that there's a sign with text there, and it uses the original image to get the right letters in, but it randomly reorders the letters rather than following the source image. "Real" and "authentic" are extremely misleading here.
That said, props to them for calling out the limitations so clearly. I really appreciate it when people are up front with the problems like that.
Something…seems fishy? Like the example with the guy next to the robot figure. Their model happened to predict exactly the same type of figure?! Diffusion models are not omnipotent…
That's the entire point. It didn't "happen" to predict exactly the same type of figure. It used the context photos to know what type of figure it should render.
You might be getting a bit confused because here the training process has to happen every time you use it, whereas in most AI applications you only perform inference for actual use.