VideoGigaGAN: Towards detail-rich video super-resolution (videogigagan.github.io)
329 points by CharlesW 8 months ago | 230 comments



Video quality seems really good, but limitations are quite restrictive "Our model encounters challenges when processing extremely long videos (e.g. 200 frames or more)".

I'd say most videos in practice are longer than 200 frames, so a lot more research is still needed.


At 24fps that's not even 10 seconds. Calling it extremely long is kinda defensive.
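
To make the arithmetic concrete, here is a minimal sketch that converts the paper's 200-frame figure into seconds at a few common frame rates. The function name `clip_seconds` is made up for illustration; only the 200-frame limit and the 24 fps rate come from the thread.

```python
def clip_seconds(frames: int, fps: float) -> float:
    """Duration in seconds of a clip with the given frame count and frame rate."""
    return frames / fps


# 200 frames is the limit quoted from the paper; the frame rates are common values.
for fps in (24, 30, 60):
    print(f"200 frames at {fps} fps = {clip_seconds(200, fps):.2f} s")
```

At 24 fps this comes to roughly 8.3 seconds, which matches the "not even 10 seconds" estimate.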


The average shot length in a modern movie is around 2.5 seconds (down from 12 seconds in the 1930s).

For animations it's around 15 seconds.


Sure, but that represents a lot of fast cuts balanced out by a selection of significantly longer cuts.

Also, it's less likely that you'd want to upscale a modern movie, which is more likely to be higher resolution already, as opposed to an older movie which was recorded on older media or encoded in a lower-resolution format.


Huh, I thought this couldn't be true, but it is. The first time I noticed annoyingly fast cuts was World War Z, for me it was unwatchable with tons of shots around 1 second each.


The first time I noticed how bad the fast cuts we see in most movies are was when I watched Children of Men by Alfonso Cuarón, who often uses very long takes for action scenes:

https://en.wikipedia.org/wiki/Children_of_Men#Single-shot_se...


So sad they didn’t keep to the idea of the book. Anyone who hasn’t read the book should; it bears no resemblance to the movie aside from the name.


It's offtopic, but this is very good advice. As near as I can tell, there aren't any real similarities between the book and the movie; they're two separate zombie stories with the same name, and honestly I would recommend them both for wildly different reasons.


> there aren't any real similarities between the book and the movie; they're two separate zombie stories with the same name

Funny - this is also a good description of I Am Legend.


And similarly, I, Robot, which is much more enjoyable when you realize it started as an independent murder-mystery screenplay that had Asimov’s works shoehorned in when both rights were bought in quick succession. I love both the movie and the collection of short stories, for vastly different reasons.

https://www.cbr.com/i-robot-original-screenplay-isaac-asimov...


Will Smith, a strange commonality in this tiny subgenre.


I didn’t rate the film really, but loved the book. Apparently it is based on, or takes its style from, real first-hand accounts of WW2.


Its style is based on the oral history approach used by Studs Terkel to document aspects of WW2 - building a big picture by interleaving lots of individual interviews.


Making the movie or a documentary series like that would have been awesome.


I know two movies where the book is way better, Jurassic Park and Fight Club. I thought about putting spoilers in a comment to this one but I won't.


The Lost World is also a great book. It explores a lot of interesting stuff the film completely ignores, like the idea that the raptors are only rampaging monsters because they had no proper upbringing, having been born in the lab with no mama or papa raptor to teach them social skills.


But hey, at least we finally got the motorcycle chase (kind of) in "Jurassic World"! (It's my favourite entry in the series, BTW.)


Disagree, Jurassic Park was an amazing movie on multiple levels, the book was just differently good, and adapting it to film in the exact format would have been less interesting (though the ending was better in the book.)


I totally forgot the book ending! So much better.

I think, like the motorcycle chase that they borrowed from The Lost World in Jurassic World, they also have a scene with those tiny dinosaurs pecking someone to death.


Also The Godfather. No Country for Old Men I wouldn’t say is better, but it is fantastic.


Loved the audiobook


Batman Begins was already in 2005 basically just a feature length trailer - all the pacing was completely cut out.


Yes, Nolan improves on that in later movies, but he used to abuse it.

Another movie of his that commits this crime nonstop is The Prestige.


Yeah, the average may also be getting driven (e: down) by the basketball scene in Catwoman


[watches scene] I think you mean the average shot length is driven down.


The textures of objects need to maintain consistency across much larger time frames, especially at 4k where you can see the pores on someone's face in a closeup.


Off topic: the clarity of pores and fine facial hair on Vision Pro when watching on a virtual 120-foot screen is mindblowing.


I'm sure if you really want to burn money on compute you can do some smart windowing in the processing and use it on overlapping chunks and do an OK job.


I believe the relevant data point when considering applicability is the median shot length to give an idea of the length of the majority of shots, not the average.

It reminds me of the story about the Air Force making cockpits to fit the elusive average pilot, which in reality fit none of their pilots...
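
A small sketch of why the median matters here. The shot lengths below are invented purely for illustration (many fast cuts plus two long takes) and are not measurements from any film; the 200-frame limit at 24 fps is the only figure taken from the thread.

```python
from statistics import mean, median

# Hypothetical shot lengths in seconds: mostly fast cuts, plus two long takes.
shot_lengths = [1.0, 1.5, 1.5, 2.0, 2.0, 2.5, 3.0, 45.0, 90.0]

# The ~200-frame limit from the paper, expressed in seconds at 24 fps.
limit_seconds = 200 / 24

avg = mean(shot_lengths)
med = median(shot_lengths)

# Fraction of shots short enough to fit inside one 200-frame window.
share_within_limit = sum(s <= limit_seconds for s in shot_lengths) / len(shot_lengths)

print(f"mean={avg:.1f}s median={med:.1f}s fit_in_limit={share_within_limit:.0%}")
```

With this made-up data the mean is pulled above the limit by the two long takes, while the median and the share of shots that fit show that most shots would be processable on their own.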


People won't be upscaling modern movies though.


10 seconds is what, about a dozen cuts in a modern movie? Much longer has people pulling out their phones.


:( "Our model encounters challenges when processing >200 frame videos"

:) "Our model is proven production-ready using real-world footage from Taken 3"

https://www.youtube.com/watch?v=gCKhktcbfQM


Freal. To the degree that I compulsively count seconds on shots until a show/movie has a few shots over 9 seconds; then they "earn my trust" and I can let it go. I'm fine.


The Wright Brothers' first powered flight lasted 12 seconds

Source: https://www.nasa.gov/history/115-years-ago-wright-brothers-m....


Our invention works best except for extremely long flight times of 13 seconds


I guess one can break videos into 200-frame chunks and process them independent of each other.


Not if there isn't coherency between those chunks


Easily solved, just overlap by ~40 frames and fade the upscaled last frames of chunk A into the start of chunk B before processing. Editors do tricks like this all the time.
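
As a rough sketch of the bookkeeping this would need, the function below computes overlapping frame ranges. The name `chunk_windows` and its defaults are assumptions for illustration: 200 frames is the limit quoted from the paper and 40 frames is the overlap suggested above. It says nothing about whether the upscaler's output would actually be coherent across the overlap.

```python
def chunk_windows(total_frames: int, chunk: int = 200, overlap: int = 40):
    """Return (start, end) frame ranges, end exclusive.

    Consecutive ranges share `overlap` frames, so each new chunk starts
    `chunk - overlap` frames after the previous one.
    """
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be non-negative and smaller than chunk")
    if total_frames <= 0:
        return []

    step = chunk - overlap
    windows = []
    start = 0
    while True:
        end = min(start + chunk, total_frames)
        windows.append((start, end))
        if end >= total_frames:
            break
        start += step
    return windows


# A 500-frame clip split into 200-frame chunks with a 40-frame overlap.
print(chunk_windows(500))
```

Each chunk after the first only contributes 160 new frames, so the overlap adds about 25% more compute for the same footage.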


Decent editors may try that once, but they will give up right away because it will only work by coincidence.


There has to be a way where you can do it intelligently in chunks and reduce noise along the chunk borders.

Moreover I imagine that further research and power will do a lot, smarter, and quicker.

Don't forget people had Toy Story-comparable games within a decade or so after it was originally rendered at 1536x922.


Or upscale every 4th frame for consistency. Upscaling in between frames should be much easier.


And now you end up with 40 blurred frames for each transition.


'before processing'


At 30fps, which is not high, that would mean chunks of less than 7 seconds. Doable but highly impractical to say the least.


It's not so much that it would be impractical (video streaming, like HLS or MPEG-DASH, requires chunking videos into pieces of roughly this size), but you'd lose the inter-frame consistency at segment boundaries, and I suspect the resulting video would flicker at the transitions.

It could work for TV or movies if done properly at the scene transition time though.


7s is pretty alright, I've seen HLS chunks of 6 seconds, that's pretty common I think.


6s was adopted as the "standard" by Apple [0].

For live streaming it's pretty common to see 2 or 3 seconds (reduces broadcast delay, but with some caveats).

0: https://dev.to/100mslive/introduction-to-low-latency-streami...


You will probably have to have some overhang of time to get the state space to match enough to minimize flicker in between fragments.


You could probably mitigate this by using overlapping clips and fading between them. Pretty crude but could be close to unnoticeable, depending on how unstable the technique actually is.
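
A minimal sketch of the crude fade described here, assuming two independently upscaled clips whose last and first `overlap` frames show the same source frames. Frames are modelled as flat lists of pixel values to keep the example dependency-free; `crossfade` and `join_clips` are hypothetical names, and a linear ramp is just one possible blend.

```python
def crossfade(tail_a, head_b):
    """Linearly blend two equally long lists of frames covering the same overlap.

    The weight on clip B rises from 0 on the first overlap frame to 1 on the last.
    """
    if len(tail_a) != len(head_b):
        raise ValueError("overlap regions must have the same number of frames")
    n = len(tail_a)
    blended = []
    for i, (frame_a, frame_b) in enumerate(zip(tail_a, head_b)):
        w = i / (n - 1) if n > 1 else 0.5
        blended.append([(1 - w) * a + w * b for a, b in zip(frame_a, frame_b)])
    return blended


def join_clips(clip_a, clip_b, overlap):
    """Concatenate two clips, cross-fading over their shared `overlap` frames."""
    if overlap == 0:
        return clip_a + clip_b
    return (
        clip_a[:-overlap]
        + crossfade(clip_a[-overlap:], clip_b[:overlap])
        + clip_b[overlap:]
    )


# Two 5-frame, single-pixel clips: one all black (0.0), one all white (1.0).
clip_a = [[0.0] for _ in range(5)]
clip_b = [[1.0] for _ in range(5)]
joined = join_clips(clip_a, clip_b, overlap=3)
print([frame[0] for frame in joined])
```

A per-pixel fade like this hides hard jumps but cannot make two different hallucinated textures agree, so the overlap frames may look soft or ghosted if the chunks diverge much.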


Perhaps a second pass that focuses on smoothing out the frames where the clips are joined.


Maybe they could do a lower framerate and then use a different AI tool to interpolate something smoother.


Fascinating how researchers put out amazing work and then claim that videos consisting of more than 200 frames are "extremely long".

Would it kill them to say that the method works best on short videos/scenes?


Tale as old as time, in graphics papers it's "our technique achieves realtime speeds" and then 8 pages down they clarify that they mean 30fps at 640x480 on an RTX 4090.


Break into chunks that overlap by, say, a second, upscale separately and then blend to reduce sudden transitions in the generated details to gradual morphing.

The details changing every ten seconds or so is actually a good thing; the viewer is reminded that what they are seeing is not real, yet still enjoying a high resolution video full of high frequency content that their eyes crave.


If you're using this for existing material you just cut into <=8 second chunks, no big deal. Could be an absolute boon for filmmakers, otoh a nightmare for privacy because this will be applied to surveillance footage.


Still potentially useful - predict the next k frames with a sliding window throughout the video.

But idk how someone can write "extremely long videos" with a straight face when meaning seconds.

Maybe "long frame sequences"


It's good enough for "enhance, enhance, enhance" situations.


If I am understanding the limitations section of the paper correctly, the 200-frame figure depends on the scene; it may be worse or better.


Wonder what happens if you run it piece-wise on every 200 frames. Perhaps it glitches in the interface.


Well there goes my dreams of making my own Deep Space Nine remaster from DVDs.


I think it encounters memory leaks and memory usage goes through the roof.


Unless they can predict a 2 hour movie in 200 frames.


This is great for entertainment (and hopefully the main application), but we need clear marking of such type of videos before hallucinated details are used as "proofs" of any kind by people not knowing how this works. Software video/photography on smartphones is already using proprietary algorithms that "infer" non-existent or fake details, and this would be at an even bigger scale.


Funny to think of all those scenes in TV and movies when someone would magically "enhance" a low-resolution image to be crystal clear. At the time, nerds scoffed, but now we know they were simply using an AI to super-scale it. In retrospect, how many fictional villains were condemned on the basis of hallucinated evidence? :-D


Enemy of the State (1998) was prescient, that had a ridiculous example of "zoom and enhance" where they move the camera, but they hand-waved it as the computer "hypothesizing" what the missing information might have been. Which is more or less what gaussian splat 3D reconstructions are doing today.


Like Ryan Gosling appearing in a building https://petapixel.com/2020/08/17/gigapixel-ai-accidentally-a...


Or white Obama https://twitter.com/Chicken3gg/status/1274314622447820801

I don't see the word "bias" appear anywhere in this work :'(


Yeah I was curious about that baby. Do they know how it looks, or just guess? What about the next video with the animals. The leaves on the bush, are they matching a tree found there, or just generic leaves perhaps from the wrong side of the world?

I guess it will be like people pointing out bird sounds in movies, that those birds don't exist in that country.


The video of the owl is a great example of doing a terrible job without the average Joe noticing.

The real owl has fine light/dark concentric circles on its face. The app turned it into gray because it does not see any sign of the circles. The real owl has streaks of spots. The app turned them into solid streaks because it saw no sign of spots. There's more where this came from, but basically only looks good to someone who has no idea what the owl should look like.


Is this considered a reincarnation of the 'rest of the owl' meme


This is great. I look forward to when cell phones run this at 60fps. It will hallucinate wrong, but pixel perfect moons and license plate numbers.


Just get a plate with 'AAAAA4' and blame everything on 'AAAAAA'


Even better, get NU11 and have it go to this poor guy: https://www.wired.com/story/null-license-plate-landed-one-ha...


So that’s why I don’t get toll bills.


I look forward to VR 360 degree videos using something like this to overcome their current limitations, assuming the limit is on the capture side.


Is anyone else concerned at the societal effects of technology like this? In one of the examples they show a young girl. In the upscale example it's quite clearly hallucinating makeup and lipstick. I'm quite worried about tools like this perpetuating social norms even further.


Aside from your point: It does look like she is wearing lipstick tho, to me. More likely lip balm. Her (unaltered) lips have specular highlights on the tops that suggest they're wet or have lip balm to me. As for the makeup, not sure there. Her cheeks seem rosy in the original, and not sure what you're referring to beyond that. Perhaps her skin is too clear in the AI version, suggesting some type of foundation?

I know nothing of makeup tho, just describing my observations.


From Plato's dialogue Phaedrus 14, 274c-275b:

Socrates: I heard, then, that at Naucratis, in Egypt, was one of the ancient gods of that country, the one whose sacred bird is called the ibis, and the name of the god himself was Theuth. He it was who invented numbers and arithmetic and geometry and astronomy, also draughts and dice, and, most important of all, letters.

Now the king of all Egypt at that time was the god Thamus, who lived in the great city of the upper region, which the Greeks call the Egyptian Thebes, and they call the god himself Ammon. To him came Theuth to show his inventions, saying that they ought to be imparted to the other Egyptians. But Thamus asked what use there was in each, and as Theuth enumerated their uses, expressed praise or blame, according as he approved or disapproved.

"The story goes that Thamus said many things to Theuth in praise or blame of the various arts, which it would take too long to repeat; but when they came to the letters, "This invention, O king," said Theuth, "will make the Egyptians wiser and will improve their memories; for it is an elixir of memory and wisdom that I have discovered." But Thamus replied, "Most ingenious Theuth, one man has the ability to beget arts, but the ability to judge of their usefulness or harmfulness to their users belongs to another; and now you, who are the father of letters, have been led by your affection to ascribe to them a power the opposite of that which they really possess.

"For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them. You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom, for they will read many things without instruction and will therefore seem to know many things, when they are for the most part ignorant and hard to get along with, since they are not wise, but only appear wise."


No, I'm not concerned. When an AI is trained on a largely raw, uncurated set of low-quality data (eg most of the public internet), it's going to miss subtle distinctions some humans might prefer that it make. I'm confident that pretty quickly the majority of the general public using such AIs will begin to intuitively understand this. Just as they have developed a practical, working understanding of other complex technology's limitations (such as auto-complete algorithms). No matter how good AI gets, there will always be some frontier boundary where it gets something wrong. My evidence is simply that even smart humans trying their best occasionally get such subtle distinctions wrong. However, this innate limitation doesn't mean that an AI can't still be useful.

What I am concerned about is that AI providers will keep wasting time and resources trying to implement band-aid "patches" to address what is actually an innate limitation. For example, exception processing at the output stage fails in ways we've already seen, such as AI photos containing female popes or an AI lying to deny that HP Lovecraft had a childhood pet (due to said pet having a name that was crudely rude 100 years ago but racist today). The alternative of limiting the training data to include only curated content fails by yielding a much less useful AI.

My, probably unpopular, opinion is that when AI inevitably screws up some edge case, we get more comfortable saying, basically, "Hey, sometimes stupid AI is gonna be stupid." The honest approach is to tell users upfront: when quality or correctness or fitness for any given purpose is important, you need to check every AI output because sometimes it's gonna fail. Just like auto-pilots, auto-correct and auto- everything else. As impressive as AI can sometimes be, personally, I think it's still lingering just below the threshold of "broadly useful" and, lately, the rate of fundamental improvement is slowing. We can't really afford to be squandering limited development resources or otherwise nerfing AI's capabilities to pursue ultimately unattainable standards. That's a losing game because there's a growing cottage industry of concern trolls figuring out how to get an AI to generate "problematic" output to garner those sweet "tsk tsk" clicks. As long as we keep reflexively reacting, those goalposts will never stop moving. Instead, we need to get off that treadmill and lower user expectations based on the reality of the current technology and data sets.


I am not at all.

We seem to have a culture of completely paranoid people now.

When the internet came along every conversation was not dominated by "but what about people knowing how to build bombs???" the way most AI conversation flips to these paranoid AI doomer scenarios.


> AI lying to deny that HP Lovecraft had a childhood pet

GPT4 told me with no hesitation.


Ah, interesting. Originally, it would answer that question correctly. Then it got concern trolled in a major media outlet and some engineers were assigned to "patch it" (ie make it lie). Then that lie got highlighted some places (including here on HN), so I assume since then some more engineers got assigned to unpatch the patch.

I'll take that as supporting my point about the folly of wasting engineering time chasing moving goalposts. :-)


I just tested it on Copilot. It starts responding and then at some point deletes the whole text and replies with:

"Hmm… let’s try a different topic. Sorry about that. What else is on your mind?"


I don't think it's hallucinating too much.

The nails have nail polish in the original, and the lips also look like they have at least lip gloss or a somewhat more muted lipstick.


Seems to be stock footage, is it surprising makeup would be involved?


I don't know, it's a mirror, right? It's up to us to change really. Besides, failures like the one you point out make subtle stereotypes and biases more conspicuous, which could be a good thing.


It's interesting that the output of the genAI will inevitably get fed into itself. Both directly and indirectly by influencing humans who generate content that goes back into the machine. How long will the feedback loop take to output content reflecting new trends? How much new content is needed to be reflected in the output in a meaningful way. Can more recent content be weighted more heavily? Such interesting stuff!


Precisely: tools don't have morality. We have to engage in political and social struggle to make our conditions better. These tools can help but they certainly wont do it for us, nor will they be the reason why things go bad.


looks pretty clearly like she has makeup/lipstick on in the un-processed video to me.


Yes, but if you mention that here, you’ll get accused of wokeism.

More seriously, though, yes, the thing you’re describing is exactly what the AI safety field is attempting to address.


> is exactly what the AI safety field is attempting to address

Is it though? I think it's pretty obvious to any neutral observer that this is not the case, at least judging based on recent examples (leading with the Gemini debacle).


Yes, avoiding creating societally-harmful content is what the Gemini "debacle" was attempting to do. It clearly had unintended effects (e.g: generating a black Thomas Jefferson), but when these became apparent, they apologized and tried to put up guard rails to keep those negative effects from happening.


> societally-harmful content

Who decides what is "societally-harmful content"? Isn't literally rewriting history "societally-harmful"? The black T.J. was a fun meme, but that's not what the alignment's "unintended effects" were limited to. I'd also say that if your LLM condemns right-wing mass murderers, but "it's complicated" with the left-wing mass murderers (I'm not going to list a dozen of other examples here, these things are documented and easy to find online if you care), there's something wrong with your LLM. Genocide is genocide.


This isn't the un-determinable question you've framed it as. Society defines what is and isn't acceptable all the time.

> Who decides what is "societally-harmful theft"?

> Who decides what is "societally-harmful medical malpractice"?

> Who decides what is "societally-harmful libel"?

The people who care to make the world a better place and push back against those that cause harm. Generally a mix of de facto industry standard practices set by societal values and pressures, and de jure laws established through democratic voting, legislature enactment, and court decisions.

"What is "societally-harmful driving behavior"" was once a broad and undetermined question but nevertheless it received an extensive and highly defined answer.


> The people who care to make the world a better place and push back against those that cause harm.

This is circular. It's fine to just say "I don't know" or "I don't have a good answer", but pretending otherwise is deceptive.


Read the entire comment before replying, please. I'm not interested in lazy comments like this, and they're not appropriate for HN.


> Who decides what is "societally-harmful content"?

Are you stupid, or just pretending to be?


What Gemini was doing -- what it was explicitly forced to do by poorly considered dogma -- was societally harmful. It is utterly impossible that these were "unintended"[1], and were revealed by even the most basic usage. They aren't putting guardrails to prevent it from happening, they quite literally removed instructions that explicitly forced the model to do certain bizarre things (like white erasure, or white quota-ing).

[1] - Are people seriously still trying to argue that it was some sort of weird artifact? It was blatantly overt and explicit, and absolutely embarrassing. Hopefully Google has removed everyone involved with that from having any influence on anything for perpetuity as they demonstrate profoundly poor judgment and a broken sense of what good is.


I didn't say the outcome wasn't harmful. I said that the intent of the people who put it in place was to reduce harm, which is obvious.


Yeah, I don’t think there’s such thing as a “neutral observer” on this.


An LLM should represent a reasonable middle of the political bell curve where Antifa is on the far left and Alt-Right is on the far right. That is what I meant by a neutral observer. Any kind of political violence should be considered deplorable, which was not the case with some of the Gemini answers. Though I do concede that right wingers cooked up questionable prompts and were fishing for a story.


> An LLM should represent a reasonable middle of the political bell curve where Antifa is on the far left and Alt-Right is on the far right. That is what I meant by a neutral observer.

This is a bad idea.

Equating extremist views with those seeking to defend human rights blurs the ethical reality of the situation. Adopting a centrist position without critical thought obscures the truth since not all viewpoints are equally valid or deserve equal consideration.

We must critically evaluate the merits of each position (anti-fascists and fascists are very different positions indeed) rather than blindly placing them on equal footing, especially as history has shown the consequences of false equivalence perpetuate injustice.


All of this is political. It always is. Where does the LLM fall on trans rights? Where does it fall on income inequality? Where does it fall on tax policy? "Any kind of political violence should be considered deplorable" - where's this fall on Israel/Gaza (or Hamas/Israel)? Does that question seem non-political to you? 50 years ago, the middle of American politics considered homosexuality a mental disorder - was that neutral? Right now if you ask it to show you a Christian, what is it going to show you? What _should_ it show you? Right now, the LLM is taking a whole bunch of content from across society, which is why it turns back a white man when you ask it for a doctor - is that neutral? It's putting lipstick on an 8-year-old, is that neutral? Is a "political bell curve" with "antifa on the left" and "alt-right on the right" neutral in Norway? In Brazil? In Russia?


Speaking as somebody from outside the United States, please keep the middle of your political bell curve away from us.


Nobody mentioned wokeism except you.


Wonder how long until Hollywood CGI shops have these types of models running as part of their post-production pipeline. Big blockbusters often release with ridiculously broken CGI due to crunch (Black Panther's third act was notorious for looking like a retro video-game), adding some extra generative polish in those cases is a no-brainer.


Once AI tech gets fully integrated, the entire Hollywood rendering pipeline will go from rendering to diffusing.


Once AI tech gets fully integrated, the movie industry will cease to exist.


Hollywood has incredible financial and political power. And even if fully AI generated movies reach the same quality (both visually and story wise) as current ones, there’s a lot of value in the shared experience of watching the same movies as other people, that a complete collapse of the industry seems highly unlikely to me.


What quality? Current industry movies are, for lack of better term, inbred. Sound too loud, washed out rigid color scheme, keeping attention of the audience captive at all costs. They already exclude large, more sensitive, part of population that hates all of this despite the shared experience. And AI is exceptionally good at further inbreeding to the extreme.

While of course it isn't impossible for any industry to reinvent itself, and movies as an art form won't die, I'm having doubts about where it's going.


> that a complete collapse of the industry seems highly unlikely to me.

Unlikely in the next 10 years or the next 100?


I wouldn’t have any confidence in any predictions I make 100 years into the future even if we didn’t have the current AI developments.

With that said, I’m pretty confident that the movie industry will exist in 10 years (maybe heavily transformed, but still existing and still pretty big). If it’s still a big part of current pop culture by then (vs obviously on its way out) then I’d expect a collapse of it to require a change that is not a result of AI proliferation, but something else entirely.


My point is that many talk about AI as though it's not going to evolve or get better. It's a mindset of "We don't need to talk about this because it won't happen tomorrow".

Realistically, AI being able to replace Hollywood is something that could happen in 20-50 years. That's within most people's lifetime.


A few years if not less.

They will have huge budgets for compute and the makers of compute will be happy to absorb those budgets.

Cloud production was already growing but this will continue to accelerate it imho


Wasn't Hollywood an early adopter of advanced AI video stuff, w.r.t. de-aging old famous actors?


Bingo. Except it looked like magic because the tech was so expensive and only available to them.

Limited access to the tech added some mystique to it too.

Just like digital cameras created a lot more average photographers, it pushed photography to a higher standard than just having access to expensive equipment.


yeah and the only reason we don't see more of it is that it was prohibitively expensive for all but basically Disney.

the compute budgets for basic run of the mill small screen 3D rendering and 2D compositing is already massive compared to most other businesses of a similar scale. the industry has been under paying their artists for decades too.

I'm willing to bet that as soon as unreal or adobe or whoever comes out with a stable diffusion like model that can be consistent across a feature length movie, they'll stop bothering with artists altogether.

why have an entire team of actual people in the loop when the director can just tell the model what they want to see? why shy away from revisions when the model can update colour grade or edit a character model throughout the entire film without needing to re-render?


> generative polish

I don't think we're far away from models that are able to take video input of an almost finished movie and add the finishing touches.

Eg. make the lighting better, make the cgi blend in better, hide bits of set that ought to have been out of shot, etc.


A couple of months.


It's impressive, but still looks kinda bad?

I think the video of the camera operator on the ladder shows the artifacts the best. The main camera equipment is no longer grounded in reality, with the fiddly bits disconnected from the whole and moving around. The smaller camera is barely recognizable. The plant in the background looks blurry and weird, the mountains have extra detail. Finally, the lens flare shifts!

Check out the spider too, the way the details on the leg shift is distinctly artificial.

I think the 4x/8x expansion (16x/64x the pixels!) is pushing the tech too far. I bet it would look great at <2x.
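
The pixel arithmetic behind that last point, as a tiny sketch. The helper names are made up; the 4x/8x factors come from this comment and the 128x128 input size is the one the demo clips use.

```python
def pixel_multiplier(scale: int) -> int:
    """How many output pixels exist per input pixel when each side is scaled by `scale`."""
    return scale * scale


def upscaled_size(width: int, height: int, scale: int) -> tuple[int, int]:
    """Output resolution after scaling both dimensions by `scale`."""
    return width * scale, height * scale


for scale in (2, 4, 8):
    w, h = upscaled_size(128, 128, scale)
    print(f"{scale}x: 128x128 -> {w}x{h}, {pixel_multiplier(scale)}x the pixels")
```

At 8x, only one output pixel in 64 corresponds to a pixel that was actually captured; the rest has to be inferred by the model.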


>I think the 4x/8x expansion (16x/64x the pixels!) is pushing the tech too far. I bet it would look great at <2x.

I believe this applies to every upscale model released in the past 8 years, yet undeterred by this scientists keep pushing on, sometimes even claiming 16x upscaling. Though this might be the first one that is pretty close to holding up at 4x in my opinion, which is not something I've seen often.


I think the hand running through the wheat (?) is pretty good, object permanence is pretty reasonable especially considering the GAN architecture. GANs are good at grounded generation--this is why the original GigaGAN paper is still in use by a number of top image labs. Inferring object permanence and object dynamics is pretty impressive for this structure.

Plus, a rather small data set: REDS and Vimeo-90k aren't massive in comparison to what people speculate Sora was trained on.


It seems likely that our brains are doing something similar.

I remember being able to add a lot of detail to the monsters that I could barely make out amidst the clothes piled up on my bedroom floor.


I wonder if you could specialise a model by training it on a whole movie or TV series, so that instead of hallucinating from generic images, the model generates things it has seen closer-up in other parts of the movie.

You'd have to train it to go from a reduced resolution to the original resolution, then apply that to small parts of the screen at the original resolution to get an enhanced resolution, then stitch the parts together.
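
The "apply to small parts, then stitch" step could be laid out roughly as below. This is only the tiling geometry, with a hypothetical `tile_boxes` helper and an arbitrary 512-pixel tile size; it does not attempt the training or the upscaling itself, and a real pipeline would probably want overlapping tiles to hide seams.

```python
def tile_boxes(width: int, height: int, tile: int):
    """Return (left, top, right, bottom) boxes, right/bottom exclusive.

    The boxes cover the whole frame without overlap; tiles on the right and
    bottom edges are cropped to the frame size.
    """
    boxes = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


def scaled_box(box, scale: int):
    """Where a source tile lands in the stitched, upscaled frame."""
    return tuple(v * scale for v in box)


# A 1920x1080 frame cut into 512-pixel tiles, each destined for 4x upscaling.
boxes = tile_boxes(1920, 1080, 512)
print(len(boxes), boxes[-1], scaled_box(boxes[-1], 4))
```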


Finally, we get to know whether the Patterson bigfoot film is authentic.


I can't wait for the next explosion in "bigfoot" videos: wildlife on the moon, people hiding in shadows, plants, animals, and structures completely out of place.

The difference will be that this time the images will be crystal clear, just hallucinated by a neural network.


No public model available yet? Would love to test and train it on some of my datasets.


I'm curious as to how well this works when upscaling from 1080p to 4K or 4K to 8K.

Their 128x128 to 1024x1024 upscales are very impressive, but I find the real artifacts and weirdness are created when AI tries to upscale an already relatively high definition image.

I find it goes haywire, adding ghosting, swirling, banded shadowing, etc. as it whirlwinds into hallucinations from too much source data, since the model is often trained to turn really small/compressed video into "almost HD" video.


This looks great. However, it will be interesting to see how it handles things like rolling shutter or video wipes/transitions. Also, in all of the sample videos the camera is locked down and not moving, or moving just ever so slightly (the ants and the car clips). It looks like they took time to smooth out any excessive camera shake.

Integrating this with Adobe's object tracking software (in Premiere/After Effects) may help.


The video comparison examples, while impressive, were basically unusable on mobile Safari because they launched in full screen view and broke the slider UI.


Yeah, and in my case they immediately went fullscreen again the moment I dismissed them, hijacking the browser.


Videos autoplay in full screen as I scroll on mobile. Impressive tech, but it could use a better mobile presentation.


Yup, same here (iPhone Safari). They go fullscreen and can't dismiss them (they expand again) unless I try it very fast a few times.


Terrible viewing experience


I am personally much more interested in frame rate upscalers. A proper 60Hz just looks much better than anything else. I would also really, really like to see a proper 60Hz animation upscale. Anything in that space just sucks. But in the rare cases where it works, it really looks next level.


Frame-rate upscaling is fine for video, but for animation it's awful.

I think it's almost inherently so, because of the care that an artist takes in choosing keyframes, deforming the action, etc.


This just sounds like the AI is not good enough yet. I mean, it's pretty clear now that there is nothing stopping AI from coming close to, or sometimes even exceeding, human artists. A big problem here is good training material.


Baffling to me that you think that AI art is capable of "exceeding" the human training material.


I didn't say that. I said that AI is capable of sometimes exceeding human artists. That is not the same thing as saying AI is exceeding the best human artist. If your training material is of high quality, it shouldn't be impossible to exceed human artists some or even most of the time, i.e. to produce better material than average or good artists.


Have you tried DAIN?


This is amazing and all but at what point do we reach the point of there is no more “real” data to infer from low resolution? In other words, there is all sorts of information theory research on the amount of unique entropy in a given medium, and even with compression there is a limit. How does that limit relate to work like this? Is there a point at which we can say we know it’s inventing things beyond some scaling constant x because of information theory research?


I'm not sure information theory deals with this question.

Since this isn't lossless decompression, the point of having no "real" data is already reached. It _is_ inventing things, and the only relevant question is how plausible the invented things are; in other words, if the video also existed in higher resolution, how close would it actually look to the inferred version. It seems obvious that this metric increases as a function of the amount of information in the source, but I would guess the exact relationship is a very open question.


> This is amazing and all but at what point do we reach the point of there is no more “real” data to infer from low resolution?

The start point. Upscaling is by definition creating information where there wasn't any to begin with.

Nearest neighbor filtering is technically inventing information, it's just the dumbest possible approach. Bilinear filtering is slightly smarter. This approach tries to be smarter still by applying generative AI.
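To make the first two rungs of that ladder concrete, here is a minimal sketch on a single row of pixels, using linear interpolation as the 1-D case of bilinear. The sample values are made up.

```python
import numpy as np

row = np.array([0.0, 10.0, 20.0, 30.0])  # four low-res samples
scale = 2

# Nearest neighbour: every sample is simply repeated.
nearest = np.repeat(row, scale)

# Linear interpolation (1-D bilinear): each new sample is a weighted
# average of the two original samples it falls between.
src_pos = np.arange(len(row))
dst_pos = np.linspace(0, len(row) - 1, len(row) * scale)
linear = np.interp(dst_pos, src_pos, row)
```

Both outputs have twice as many samples, but every new value is a fixed function of its neighbours. Neither can produce a value outside the range of the input, which is the gap a generative model tries to fill with plausible guesses.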


That point is the starting point.

There is plenty of real information: that's what the model is trained on. That information ceases to be real the moment it is used by a model to fill in the gaps of other real information. The result of this model is a facade, not real data.


This seems technically very impressive, but it does occur to my more pragmatic side that I probably haven't seen videos as blurry as the inputs for ~ 10 years. I'm sure I'm unaware of important use cases, but I didn't realize video resolution was a thing we needed to solve for these days (at least inference for perceptive quality).


What exactly does this do? They have examples with a divider in the middle that you can move around and one side says "input" and the other "output". However, no matter where I move the slider, both sides look identical to me. What should I be focusing on exactly to see a difference?


It has clearly just loaded incorrectly for you (or you need glasses desperately). The effect is significant.


Tried again, same result. This is what I get: https://imgur.com/CvqjIhy

(And I already have glasses, thank you).


That's an error in your browser. It's not supposed to look like that.


Have we reached peak image sensor size? Would it still make sense to shoot in full frame when you can just upscale?


If you want to use your image for anything that needs to be factual (i.e. surveillance, science, automation) the up-scaling adds nothing---it's just guessing at what is probably there.

If you just want the picture to be pretty, this is probably cheaper than a bigger sensor.


How does it compare to fractal video compression [1]?

[1] https://www.computer.org/csdl/proceedings-article/dcc/1992/0...


Very impressive demonstration but a terrible mobile experience. For the record, I am using iOS Safari.


I was hoping this would be an open-source video upscale until I saw it was from Adobe.


Would be neat to see this on much older videos (maybe WW2 era) to see how it improves details.


That is essentially what Peter Jackson did for the 2018 film They Shall Not Grow Old: https://www.imdb.com/title/tt7905466/

They used digital upsampling techniques and colorization to make World War One footage into high resolution. Jackson would later do the same process for the 2021 series Get Back, upscaling 16mm footage of the Beatles taken in 1969: https://www.imdb.com/title/tt9735318/

Both of these are really impressive. They look like they were shot on high resolution film recently, instead of fifty or a hundred years ago. It appears that what Peter Jackson and his team did meticulously at great effort can now be automated.

Everyone should understand the limitations of this process. It can't magically extract details from images that aren't there. It is guessing and inventing details that don't really exist. As long as everyone understands this, it shouldn't be a problem. Like, we don't care that the cross-stitch on someone's shirt in the background doesn't match reality so long as it's not an important detail. But if you try to go Blade Runner/CSI and extract faces from reflections of background objects, you're asking for trouble.


You mean _invents_ details.


You mean _infers_ details.


What's the distinction?


Or, extracts from its digital rectum?


*logit


Glad to see that Adobe is still investing in alias-free convolutions (as in StyleGAN3), and this time they know how to fill in the lost high-frequency features.

I always thought that alias-free convolutions can produce much more natural videos


Curious how this compares to Topaz which is the current industry leader in the field.


Something I've been thinking about recently is a more scalable approach to video super-resolution.

The core problem is that any single AI will learn how to upscale "things in general", but won't be able to take advantage of inputs from the source video itself. E.g.: a close-up of a face in one scene can't be used elsewhere to upscale a distant shot of the same actor.

Transformers solve this problem, but with quadratic scaling, which won't work any time soon for a feature-length movie. Hence the 10 second clips in most such models.
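A back-of-the-envelope version of the scaling argument, assuming full self-attention over every token in the clip and a made-up figure of 256 tokens per frame (the ratio does not depend on that number):

```python
def attention_pairs(num_frames, tokens_per_frame=256):
    # Full self-attention compares every token with every other token,
    # so the work grows with the square of the sequence length.
    # tokens_per_frame is a hypothetical value for illustration only.
    n = num_frames * tokens_per_frame
    return n * n

clip = attention_pairs(10 * 24)             # 10-second clip at 24 fps
movie = attention_pairs(2 * 60 * 60 * 24)   # 2-hour movie at 24 fps
ratio = movie // clip
```

The movie is 720 times longer than the clip, so under these assumptions the attention cost is 720 squared, or about half a million times larger.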

Transformers provide "short term" memory, and the base model training provides "long term" memory. What's needed is medium-term memory. (This is also desirable for Chat AIs, or any long-context scenario.)

LoRA is more-or-less that: Given input-output training pairs it efficiently specialises the base model for a specific scenario. This would be great for upscaling a specific video, and would definitely work well in scenarios where ground-truth information is available. For example, computer games can be rendered at 8K resolution "offline" for training, and then can upscale 2K to 4K or 8K in real time. NVIDIA uses this for DLSS in their GPUs. Similarly, TV shows that improved in quality over time as the production company got better cameras could use this.

This LoRA fine-tuning technique obviously won't work for any single movie where there isn't high-resolution ground truth available. That's the whole point of upscaling: improving the quality where the high quality version doesn't exist!

My thought was that instead of training the LoRA fine-tuning layers directly, we could train a second order NN that outputs the LoRA weights! This is called a HyperNet, which is the term for neural networks that output neural networks. Simply put: many differentiable functions are twice (or more) differentiable, so we can minimise a minimisation function... training the trainer, in other words.

The core concept is to train a large base model on general 2K->4K videos, and then train a "specialisation" model that takes a 2K movie and outputs a LoRA for the base model. This acts as the "medium term" memory for the base model, tuning it for that specific video. The base model weights are the "long term" memory, and the activations are its "short term" memory.
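A toy sketch of the shapes involved, not a working upscaler. All names and sizes are hypothetical, the "hypernetwork" is a single untrained linear map, and the video embedding is random noise standing in for features extracted from the whole movie.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, d_video = 16, 16, 2, 8

# Frozen base-model layer: the "long term" memory.
W_base = rng.standard_normal((d_out, d_in))

# Hypothetical hypernetwork: maps a per-video summary vector to the
# two low-rank LoRA factors. It would be trained; here it is random.
H_a = rng.standard_normal((rank * d_in, d_video)) * 0.01
H_b = rng.standard_normal((d_out * rank, d_video)) * 0.01

def lora_from_video(video_embedding):
    A = (H_a @ video_embedding).reshape(rank, d_in)
    B = (H_b @ video_embedding).reshape(d_out, rank)
    return A, B

def specialised_layer(x, video_embedding):
    # Base weights plus a low-rank update generated for this video:
    # the "medium term" memory.
    A, B = lora_from_video(video_embedding)
    return (W_base + B @ A) @ x

video_embedding = rng.standard_normal(d_video)  # stand-in for movie features
x = rng.standard_normal(d_in)                   # activations: "short term" memory
y = specialised_layer(x, video_embedding)
```

With an all-zero embedding the update vanishes and the layer falls back to the base model, which is roughly the behaviour you would want for a video the specialisation model knows nothing about.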

I suspect (but don't have access to hardware to prove) that approaches like this will be the future for many similar AI tasks. E.g.: specialising a robot base model to a specific factory floor or warehouse. Or specialising a car driving AI to local roads. Etc...


I find it interesting how it changed the bokeh from an octagon to a circular bokeh.


interesting, which scene is this?


The third image in the carousel, with the beer getting poured.


No code?


Ok, how do I download it and use it though???


I wonder how well this works with compressed video. The low res input video looks to be raw uncompressed.


Wow, the results are amazing. Maintaining temporal consistency was just the beginning part. Very cool.


Can this take a crappy phone video of an object and convert that into a single high resolution image?


That's known as multi-frame super-resolution.

https://paperswithcode.com/task/multi-frame-super-resolution
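The basic intuition, as a heavily idealised 1-D sketch: if two low-res frames sample the same scene at different sub-pixel offsets, interleaving them recovers detail neither frame has alone. This assumes the shift is known exactly and ignores blur and noise, all of which real methods have to estimate or handle.

```python
import numpy as np

scene = np.arange(16.0)          # the "true" high-res row (made-up values)

# Two low-res captures of the same row, offset by one high-res pixel,
# e.g. from slight hand shake between frames.
frame_a = scene[0::2]
frame_b = scene[1::2]

# Shift-and-add: place each frame's samples at their known offsets.
merged = np.empty_like(scene)
merged[0::2] = frame_a
merged[1::2] = frame_b
```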


We need to input UFO vids into this ASAP to get a better guess as to what some of those could be.


When do I have that in my Nvidia Shield? I would pay $$$ to have that in real-time ;)


Are you using your Shield as an HTPC? In that case you can use the upscaler built into a TV. I prefer my LG C2 upscale (particularly the frame interpolation) compared to most Topaz AI upscales.


How long until we can have this run real-time in the browser? :D


I need to learn how to use these new models


hmm is there something more specifically for lecture videos? I'm tired of watching lectures in 480p...


is there code/model available to try out?


Another boon for the porn industry


History demonstrates that what's good for porn is generally good for society.


Why, so they can restore old videos? I can't see much demand for that.


There are a lot of old porn videos out there which have become commercially worthless because they were recorded at low resolutions (e.g. 320x240 MPEG, VHS video, 8mm film, etc). Being able to upscale them to HD resolutions, at high enough quality that consumers are willing to pay for it, would be a big deal.

(It doesn't hurt that a few minor hallucinations aren't going to bother anyone.)


"I can't see much" - that's the demand


Ok then?


Show me the code


The first demo on the page alone shows that it is a huge failure. It clearly changes the expression of the person.

Yes, it is impressive, but it's not what you want to actually "enhance" a movie.


It doesn't change the expression - the animated gifs are merely out of sync.

This appears to happen because they begin animating as soon as they finish loading, which happens at different times for each side of the image.


Reloading can get them in sync. But, it seems to stop playback of the "left" one if you drag the slider completely left, which makes it easy to get desynced again.


I agree that it's not perfect, though it does appear to be SoTA. Eventually something like this will just be part of every video codec. You stream a 480p version and let the TV create the 4K detail.


Why would you ever do that?

If you have the high res data you can actually compress the details which are there and then recreate them. No need to have those be recreated, when you actually have them.

Downscaling the images and then upscaling them is pure insanity when the high res images are available.


So streaming services can save money on bandwidth


That's absurd. I think anybody is aware that it is far superior to e.g. compress in the frequency domain than to down sample your image. If you don't believe me just compare a JPEG compressed image with the same image of the same size compressed with down sampling. You will notice a literal night and day difference.

Down sampling is a bad way to do compression. It makes no sense to do NN reconstruction on that if you could have compressed that image better and reconstructed from that data.


An image downscaled and then upscaled to its original size is effectively low-pass filtered where the degree of edge preservation is dictated by the kernel used in both cases.

Are you saying low-pass filtering is bad for compression?
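A quick way to check the low-pass claim on a 1-D signal, assuming the simplest kernels (box averaging for the downscale, nearest neighbour for the upscale) and made-up frequencies:

```python
import numpy as np

n = 64
t = np.arange(n)
low = np.sin(2 * np.pi * 2 * t / n)           # 2 cycles: coarse structure
high = 0.5 * np.sin(2 * np.pi * 24 * t / n)   # 24 cycles: fine detail
signal = low + high

down = signal.reshape(-1, 4).mean(axis=1)     # 4x downscale, box kernel
up = np.repeat(down, 4)                       # 4x upscale, nearest neighbour

# Compare how much of each frequency component survives the round trip.
spec_in = np.abs(np.fft.rfft(signal))
spec_out = np.abs(np.fft.rfft(up))
kept_low = spec_out[2] / spec_in[2]
kept_high = spec_out[24] / spec_in[24]
```

Here `kept_low` stays close to 1 while `kept_high` drops to a small fraction, which is the low-pass behaviour described above. Other kernels change how sharp the cutoff is, not the overall picture.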


The word is "blur." Low-pass filtering is blurring.

Is blurring good for compression? I don't know what that means. If the image size (not the file size) is held constant, a blurry image and a clear image take up exactly the same amount space in memory.

Blurring is bad for quality. Our vision is sensitive to high-frequency stuff, and low-pass filtering is by definition the indiscriminate removal of high-frequency information. Most compression schemes are smarter about the information they filter.


> Is blurring good for compression? I don't know what that means.

Consider lossless RLE compression schemes. In this case, would data with low or high variance compress better?

Now consider RLE against sets of DCT coefficients. See where this is going?

In general, having lower variance in your data results in better compression.
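A tiny illustration of the RLE point, with made-up pixel rows:

```python
from itertools import groupby

def rle(values):
    # Lossless run-length encoding: a list of (value, run_length) pairs.
    return [(v, len(list(g))) for v, g in groupby(values)]

# High variance: noisy fine detail, no two neighbours equal.
sharp = [0, 9, 0, 9, 1, 8, 0, 9, 2, 7, 0, 9]
# Low variance: illustrative values for the same row after heavy smoothing.
blurred = [4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5]

sharp_runs = rle(sharp)
blurred_runs = rle(blurred)
```

The sharp row needs one pair per sample, so RLE makes it bigger; the blurred row collapses to two pairs.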

> Our vision is sensitive to high-frequency stuff

Which is exactly why we pick up HF noise so well! Post-processing houses are very often presented with the challenge of choosing just the right filter chain to maximize fidelity under size constraint(s).

> low-pass filtering is by definition the indiscriminate removal of high-frequency information

It's trivial to perform edge detection and build a mask to retain the most visually-meaningful high frequency data.


Do you seriously think down sampling is superior to JPEG?


No. I never made this claim. My argument is pedantic.


Are you saying that when Netflix streams a 480p version of a 4k movie to my TV they do not perform downsampling?


Yes. Down sampling only makes sense if you store per pixel data, which is obviously a dumb idea. You get a stream for 480p which contains frames which were compressed from the source files, or the 4k version. At some point there might have been down sampling involved, but you never actually get any of that data, you get the compressed version of those.


Not sure if I’m being dumb, or if it’s you not explaining it clearly: if Netflix produced low resolution frames from high resolution (4k to 480p), and if these 480p frames are what my TV is receiving - are you saying it’s not downsampling, and my TV would not benefit from this new upsampling method?


Your TV never receives per pixel data. Why would you use a NN to enhance the data which your TV has constructed instead of enhancing the data it actually receives?


OK, I admit I don’t know much about video compression. So what does my TV receive from Netflix if it’s not pixels? And when my TV does “upsampling” (according to the marketing) what does it do exactly?


It receives information about the spatial frequency content of the image. If you're unfamiliar, it's definitely worth looking into the specifics of how this works, as it's quite impressive! Here are a few relevant Wikipedia articles, and a Computerphile video:

https://en.wikipedia.org/wiki/JPEG#JPEG_codec_example

https://en.wikipedia.org/wiki/Discrete_cosine_transform

https://www.youtube.com/watch?v=Q2aEzeMDHMA
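For a feel of what "frequency content" means here, a minimal 1-D sketch of the DCT step. The pixel values are made up, and zeroing the top coefficients is only a crude stand-in for the quantisation a real codec performs.

```python
import numpy as np

N = 8
n = np.arange(N)
k = n.reshape(-1, 1)

# Orthonormal DCT-II basis: row k is a cosine with k half-cycles.
C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
C[0, :] = np.sqrt(1.0 / N)

block = np.array([52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0])

coeffs = C @ block     # encoder: pixels -> frequency coefficients
kept = coeffs.copy()
kept[4:] = 0.0         # drop the 4 highest frequencies
approx = C.T @ kept    # decoder: inverse DCT back to pixels
```

The decoder only ever sees `kept`. The pixels it displays are reconstructed from those coefficients, and the reconstruction error is exactly the energy in the discarded ones.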


I think you're missing the point of this paper—the precise thing it's showing is upscaling previously downscaled video with minimal perceptual differences from ground truth.

So you could downscale, then compress as usual, and then upscale on playback.

It would obviously be quite attractive to be able to ship compressed 480p (or 720p etc) footage and be able to blow it up to 4K at high quality. Of course you will have higher quality if you just compress the 4K, but the file size will be an order of magnitude larger.


Why would you not enhance the compressed data?


In our hypothetical example, the compressed 4k data or the compressed 480p data? You would enhance the compressed 480p—that's what the example is. You would probably not enhance the 4K, because there's very little benefit to increasing resolution beyond 4K.


Or low connectivity scenarios that push more local processing.

I think it's a bit unimaginative to see no use cases for this.


There is no use case, because it is a stupid idea. Downscaling then reconstructing is a stupid idea for exactly the same reasons why downscaling for compression is a bad idea.

The issue isn't NN reconstruction, but that you are reconstructing the wrong data.


if the nn is part of the codec, you can choose to only downscale the regions that get reconstructed correctly.


Why would you not let the NN work on the compressed data? That is actually where the information is.


that's like asking why you don't train a llm on gzipped text. the compressed data is much harder to reason about


Meh.

I think upscaling framerate would be more useful.


TVs already do this… and it's basically a bad thing


There's lots of videos where there isn't high res data available


Totally irrelevant to the discussion, which is explicitly about streaming services delivering in lower resolutions than they have available.


Streaming services already deliver in lower resolution than they have available based on network conditions. Good upscaling would let you save on bandwidth and deliver content more easily to people in poor network conditions. The tradeoff would be that details in the image wouldn't be exactly the same as the original - but, presumably, nobody would notice this so it would be fine.


Why would you not enhance the compressed data with a neural network? That is where the information actually is.


Satellite data would benefit from that.


Why would someone ever take a 40Mbps (compressed) video and downsample it so it can be encoded at 400Kbps (compressed) but played back with nearly the same fidelity / with similar artifacts to the same process at 50x data volume? The world will never know.

You're also ignoring the part where all lossy codecs throw away those same details and then fake-recreate them with enough fidelity that people are satisfied. Same concept, different mechanism.

Look up what 4:2:0 means vs 4:4:4 in a video codec and tell me you still think it's "pure insanity" to rescale.

Or, you know, maybe some people have reasons for doing things that aren't the same as the narrow scope of use-cases you considered, and this would work perfectly well for them.


>Why would someone ever take a 40Mbps (compressed) video and downsample it so it can be encoded at 400Kbps (compressed) but played back with nearly the same fidelity

Because you can just not downscale them and compress them in the frequency domain and encode them in 200Kbps? This is pretty obvious, seriously do you not understand what JPEG does? And why it doesn't do down sampling?

Do you seriously believe downscaling outperforms compressing in the frequency domain?


Yes, absolutely. Psychovisual encoding can only do so much within the constraints of H.264/265.

Throwing away 3/4 (half res) or 15/16 (quarter res) of the data, encoding to X bitrate and then decoding+upscaling looks far better than encoding to the same X bitrate with full resolution.

For high bitrate, native resolution will of course look better. For low bitrate, the way H.26? algorithms work ends up turning high resolution into a blocky ringing mess to compensate, vs lower resolution where you can see the content, just fuzzily.

Go get Tears of Steel raw 4K video (Y4M I think it's called). Scale it down 4x and encode it with ffmpeg HEVC veryslow at CRF 30. Figure out the bitrate, then cheat - use two-pass veryslow HEVC encoding to get the best possible quality native resolution at the same bitrate as your 4x downscaled version. You're aiming for two files that are about the same size. Somehow I couldn't convince the codec to go low enough to match, so I had the low-res version about 60% of the high-res version filesize. Now go and play them both back at 4K with just whatever your native upscale is - bilinear, bicubic, maybe NVIDIA Shield with its AI Upscaling.

Go do that, then tell me you honestly think the blocky, streaky, illegible 4K native looks better than the "soft" quarter-res version.


4:2:0 which is used in all common video codecs is down scaling the color data.
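Roughly what that looks like in terms of sample counts, as a sketch with random data. Averaging each 2x2 block is an assumption here; real encoders differ in filtering and chroma siting.

```python
import numpy as np

h, w = 4, 8
rng = np.random.default_rng(0)
Y = rng.integers(0, 256, (h, w))    # luma: kept at full resolution
Cb = rng.integers(0, 256, (h, w))   # chroma planes before subsampling (4:4:4)
Cr = rng.integers(0, 256, (h, w))

def subsample_420(plane):
    # Average each 2x2 block, halving chroma resolution on both axes.
    return plane.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

Cb420, Cr420 = subsample_420(Cb), subsample_420(Cr)

samples_444 = Y.size + Cb.size + Cr.size
samples_420 = Y.size + Cb420.size + Cr420.size
```

Each chroma plane ends up with a quarter of its samples, so 4:2:0 carries half the raw samples of 4:4:4 before any other compression happens.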


Scaling color data is a different technique than down sampling. Again, all I am saying is that for a very good reason you do not stream pixel data or compress movies by storing data that was down sampled.


Your video codec should never create a 480p version at all. Downsampling is incredibly lossy. Instead stream the internal state of your network directly, effectively using the network to decompress your video. Train a new network to generate this state, acting as a compressor. This is the principle of neural compression.

This has two major benefits:

1. You cut out the low resolution half of your network entirely. (Go check out the architecture diagram of the original post.)

2. Your encoder network now has access to the original HD video, so it can choose to encode the high-frequency details directly instead of generating them afterwards.



Not really, DLAA and the current incarnation of DLSS are temporal techniques, meaning all of the detail they add is pulled from past frames. That's an approach which only really makes sense in games where you can jitter the camera to continuously generate samples at different subpixel offsets with each frame.

The OP has more in common with the defunct DLSS 1.0, which tried to infer extra detail out of thin air rather than from previous frames, without much success in practice. That was like 5 years ago though so maybe the idea is worth revisiting at some point.


That’s a good idea; it would save a lot of bandwidth and could be used to buffer drops while keeping the quality.


A better idea would obviously be to enhance the compressed data.



