Video quality seems really good, but limitations are quite restrictive "Our model encounters challenges when processing extremely long videos (e.g. 200 frames or more)".
I'd say most videos in practice are longer than 200 frames, so a lot more research is still needed.
Sure, but that represents a lot of fast cuts balanced out by a selection of significantly longer cuts.
Also, it's less likely that you'd want to upscale a modern movie, which is more likely to be higher resolution already, as opposed to an older movie which was recorded on older media or encoded in a lower-resolution format.
Huh, I thought this couldn't be true, but it is. The first time I noticed annoyingly fast cuts was World War Z, for me it was unwatchable with tons of shots around 1 second each.
The first time I noticed how bad the fast cuts are we see in most movies was when I watched Children of Men by Alfonso Cuarón, who often uses very long takes for action scenes:
So sad they didn’t keep to the idea of the book. Anyone who hasn’t read this book should; it bears no resemblance to the movie aside from the name.
It's offtopic, but this is very good advice. As near as I can tell, there aren't any real similarities between the book and the movie; they're two separate zombie stories with the same name, and honestly I would recommend them both for wildly different reasons.
And similarly, I, Robot, which is much more enjoyable when you realize it started as an independent murder-mystery screenplay that had Asimov’s works shoehorned in when both rights were bought in quick succession. I love both the movie and the collection of short stories, for vastly different reasons.
Its style is based on the oral history approach used by Studs Terkel to document aspects of WW2 - building a big picture by interleaving lots of individual interviews.
The Lost World is also a great book. It explores a lot of interesting stuff the film completely ignores. Like that the raptors are only rampaging monsters because they had no proper upbringing, having been born in the lab with no mama or papa raptor to teach them social skills.
Disagree, Jurassic Park was an amazing movie on multiple levels, the book was just differently good, and adapting it to film in the exact format would have been less interesting (though the ending was better in the book.)
I think, like the motorcycle chase that they borrowed from The Lost World in Jurassic World, they also have a scene with those tiny dinosaurs pecking someone to death.
The textures of objects need to maintain consistency across much larger time frames, especially at 4k where you can see the pores on someone's face in a closeup.
I'm sure if you really want to burn money on compute you can do some smart windowing in the processing and use it on overlapping chunks and do an OK job.
I believe the relevant data point when considering applicability is the median shot length to give an idea of the length of the majority of shots, not the average.
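A quick sketch of why the median is the more useful number here. The shot lengths below are made up for illustration (lots of fast cuts plus a couple of long takes), and the 24 fps frame rate is an assumption; only the 200-frame limit comes from the paper quote above.

```python
from statistics import mean, median

# Hypothetical shot lengths in seconds: many fast cuts plus a few long takes.
shot_lengths_s = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 4.0, 45.0, 90.0]

FPS = 24           # assumed frame rate
FRAME_LIMIT = 200  # limit quoted from the paper

mean_s = mean(shot_lengths_s)
median_s = median(shot_lengths_s)
limit_s = FRAME_LIMIT / FPS

# Share of shots short enough to be processed in one pass.
share_within_limit = sum(s <= limit_s for s in shot_lengths_s) / len(shot_lengths_s)

print(f"mean {mean_s:.2f}s, median {median_s:.2f}s, limit {limit_s:.2f}s")
print(f"{share_within_limit:.0%} of shots fit within the limit")
```

With data shaped like this, the mean lands well above the limit while most individual shots fit under it, which is the gap the median exposes.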
It reminds me of the story about the Air Force making cockpits to fit the elusive average pilot, which in reality fit none of their pilots...
Freal. To the degree that I compulsively count seconds on shots until a show/movie has a few shots over 9 seconds, then they "earn my trust" and I can let it go. I'm fine.
Easily solved, just overlap by ~40 frames and fade the upscaled last frames of chunk A into the start of chunk B before processing. Editors do tricks like this all the time.
It's not so much that it would be impractical (video streaming, like HLS or MPEG-DASH, requires chunking videos into pieces of roughly this size) but you'd lose the inter-frame consistency at segment boundaries, and I suspect the resulting video would flicker at the transitions.
It could work for TV or movies if done properly at the scene transition time though.
You could probably mitigate this by using overlapping clips and fading between them. Pretty crude but could be close to unnoticeable, depending on how unstable the technique actually is.
Tale as old as time, in graphics papers it's "our technique achieves realtime speeds" and then 8 pages down they clarify that they mean 30fps at 640x480 on an RTX 4090.
Break into chunks that overlap by, say, a second, upscale separately and then blend to reduce sudden transitions in the generated details to gradual morphing.
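For what it's worth, the chunk-and-blend idea is simple to sketch. This is not anything from the paper, just a minimal illustration: frames are stood in for by flat lists of pixel values, the function names are made up, and the blend is a plain linear crossfade applied after each chunk has been upscaled separately. Whether that hides the flicker in practice is exactly the open question.

```python
def chunk_ranges(total_frames, chunk_len, overlap):
    """Return (start, end) frame ranges, end exclusive, overlapping by `overlap`."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be non-negative and smaller than chunk_len")
    step = chunk_len - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + chunk_len, total_frames)
        ranges.append((start, end))
        if end == total_frames:
            break
        start += step
    return ranges


def crossfade(tail_a, head_b):
    """Linearly blend the overlapping frames of chunk A into those of chunk B."""
    n = len(tail_a)
    blended = []
    for i, (frame_a, frame_b) in enumerate(zip(tail_a, head_b)):
        w = (i + 1) / (n + 1)  # weight of chunk B rises across the overlap
        blended.append([(1 - w) * a + w * b for a, b in zip(frame_a, frame_b)])
    return blended


def stitch(chunks, overlap):
    """Concatenate separately upscaled chunks, crossfading each overlap region."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        if overlap > 0:
            out[-overlap:] = crossfade(out[-overlap:], chunk[:overlap])
        out.extend(chunk[overlap:])
    return out


# 500 frames, 200-frame chunks (the paper's stated limit), 40-frame overlap.
print(chunk_ranges(500, 200, 40))
```

Each chunk would be passed through the upscaler on its own before `stitch` is called; the cost is that every overlapping frame gets upscaled twice.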
The details changing every ten seconds or so is actually a good thing; the viewer is reminded that what they are seeing is not real, yet still enjoying a high resolution video full of high frequency content that their eyes crave.
If you're using this for existing material you just cut into <=8 second chunks, no big deal. Could be an absolute boon for filmmakers, otoh a nightmare for privacy because this will be applied to surveillance footage.
This is great for entertainment (and hopefully the main application), but we need clear marking of such type of videos before hallucinated details are used as "proofs" of any kind by people not knowing how this works. Software video/photography on smartphones is already using proprietary algorithms that "infer" non-existent or fake details, and this would be at an even bigger scale.
Funny to think of all those scenes in TV and movies when someone would magically "enhance" a low-resolution image to be crystal clear. At the time, nerds scoffed, but now we know they were simply using an AI to super-scale it. In retrospect, how many fictional villains were condemned on the basis of hallucinated evidence? :-D
Enemy of the State (1998) was prescient, that had a ridiculous example of "zoom and enhance" where they move the camera, but they hand-waved it as the computer "hypothesizing" what the missing information might have been. Which is more or less what gaussian splat 3D reconstructions are doing today.
Yeah I was curious about that baby. Do they know how it looks, or just guess? What about the next video with the animals. The leaves on the bush, are they matching a tree found there, or just generic leaves perhaps from the wrong side of the world?
I guess it will be like people pointing out bird sounds in movies, that those birds don't exist in that country.
The video of the owl is a great example of doing a terrible job without the average Joe noticing.
The real owl has fine light/dark concentric circles on its face. The app turned it into gray because it does not see any sign of the circles. The real owl has streaks of spots. The app turned them into solid streaks because it saw no sign of spots. There's more where this came from, but basically only looks good to someone who has no idea what the owl should look like.
Is anyone else concerned at the societal effects of technology like this? In one of the examples they show a young girl. In the upscale example it's quite clearly hallucinating makeup and lipstick. I'm quite worried about tools like this perpetuating social norms even further.
Aside from your point: It does look like she is wearing lipstick tho, to me. More likely lip balm. Her (unaltered) lips have specular highlights on the tops that suggest they're wet or have lip balm to me. As for the makeup, not sure there. Her cheeks seem rosy in the original, and not sure what you're referring to beyond that. Perhaps her skin is too clear in the AI version, suggesting some type of foundation?
I know nothing of makeup tho, just describing my observations.
Socrates: I heard, then, that at Naucratis, in Egypt, was one of the ancient gods of that country, the one whose sacred bird is called the ibis, and the name of the god himself was Theuth. He it was who invented numbers and arithmetic and geometry and astronomy, also draughts and dice, and, most important of all, letters.
Now the king of all Egypt at that time was the god Thamus, who lived in the great city of the upper region, which the Greeks call the Egyptian Thebes, and they call the god himself Ammon. To him came Theuth to show his inventions, saying that they ought to be imparted to the other Egyptians. But Thamus asked what use there was in each, and as Theuth enumerated their uses, expressed praise or blame, according as he approved or disapproved.
"The story goes that Thamus said many things to Theuth in praise or blame of the various arts, which it would take too long to repeat; but when they came to the letters, "This invention, O king," said Theuth, "will make the Egyptians wiser and will improve their memories; for it is an elixir of memory and wisdom that I have discovered." But Thamus replied, "Most ingenious Theuth, one man has the ability to beget arts, but the ability to judge of their usefulness or harmfulness to their users belongs to another; and now you, who are the father of letters, have been led by your affection to ascribe to them a power the opposite of that which they really possess.
"For this invention will produce forgetfulness in the minds of those who learn to use it, because they will not practice their memory. Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them. You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom, for they will read many things without instruction and will therefore seem to know many things, when they are for the most part ignorant and hard to get along with, since they are not wise, but only appear wise."
No, I'm not concerned. When an AI is trained on a largely raw, uncurated set of low-quality data (eg most of the public internet), it's going to miss subtle distinctions some humans might prefer that it make. I'm confident that pretty quickly the majority of the general public using such AIs will begin to intuitively understand this. Just as they have developed a practical, working understanding of other complex technology's limitations (such as auto-complete algorithms). No matter how good AI gets, there will always be some frontier boundary where it gets something wrong. My evidence is simply that even smart humans trying their best occasionally get such subtle distinctions wrong. However, this innate limitation doesn't mean that an AI can't still be useful.
What I am concerned about is that AI providers will keep wasting time and resources trying to implement band-aid "patches" to address what is actually an innate limitation. For example, exception processing at the output stage fails in ways we've already seen, such as AI photos containing female popes or an AI lying to deny that HP Lovecraft had a childhood pet (due to said pet having a name that was crudely rude 100 years ago but racist today). The alternative of limiting the training data to include only curated content fails by yielding a much less useful AI.
My, probably unpopular, opinion is that when AI inevitably screws up some edge case, we get more comfortable saying, basically, "Hey, sometimes stupid AI is gonna be stupid." The honest approach is to tell users upfront: when quality or correctness or fitness for any given purpose is important, you need to check every AI output because sometimes it's gonna fail. Just like auto-pilots, auto-correct and auto- everything else. As impressive as AI can sometimes be, personally, I think it's still lingering just below the threshold of "broadly useful" and, lately, the rate of fundamental improvement is slowing. We can't really afford to be squandering limited development resources or otherwise nerfing AI's capabilities to pursue ultimately unattainable standards. That's a losing game because there's a growing cottage industry of concern trolls figuring out how to get an AI to generate "problematic" output to garner those sweet "tsk tsk" clicks. As long as we keep reflexively reacting, those goalposts will never stop moving. Instead, we need to get off that treadmill and lower user expectations based on the reality of the current technology and data sets.
We seem to have a culture of completely paranoid people now.
When the internet came along every conversation was not dominated by "but what about people knowing how to build bombs???" the way most AI conversation flips to these paranoid AI doomer scenarios.
Ah, interesting. Originally, it would answer that question correctly. Then it got concern trolled in a major media outlet and some engineers were assigned to "patch it" (ie make it lie). Then that lie got highlighted some places (including here on HN), so I assume since then some more engineers got assigned to unpatch the patch.
I'll take that as supporting my point about the folly of wasting engineering time chasing moving goalposts. :-)
I don't know, it's a mirror, right? It's up to us to change really. Besides, failures like the one you point out make subtle stereotypes and biases more conspicuous, which could be a good thing.
It's interesting that the output of the genAI will inevitably get fed into itself. Both directly and indirectly by influencing humans who generate content that goes back into the machine. How long will the feedback loop take to output content reflecting new trends? How much new content is needed to be reflected in the output in a meaningful way. Can more recent content be weighted more heavily? Such interesting stuff!
Precisely: tools don't have morality. We have to engage in political and social struggle to make our conditions better. These tools can help but they certainly won't do it for us, nor will they be the reason why things go bad.
> is exactly what the AI safety field is attempting to address
Is it though? I think it's pretty obvious to any neutral observer that this is not the case, at least judging based on recent examples (leading with the Gemini debacle).
Yes, avoiding creating societally-harmful content is what the Gemini "debacle" was attempting to do. It clearly had unintended effects (e.g: generating a black Thomas Jefferson), but when these became apparent, they apologized and tried to put up guard rails to keep those negative effects from happening.
Who decides what is "societally-harmful content"? Isn't literally rewriting history "societally-harmful"? The black T.J. was a fun meme, but that's not what the alignment's "unintended effects" were limited to. I'd also say that if your LLM condemns right-wing mass murderers, but "it's complicated" with the left-wing mass murderers (I'm not going to list a dozen of other examples here, these things are documented and easy to find online if you care), there's something wrong with your LLM. Genocide is genocide.
This isn't the un-determinable question you've framed it as. Society defines what is and isn't acceptable all the time.
> Who decides what is "societally-harmful theft"?
> Who decides what is "societally-harmful medical malpractice"?
> Who decides what is "societally-harmful libel"?
The people who care to make the world a better place and push back against those that cause harm. Generally a mix of de facto industry standard practices set by societal values and pressures, and de jure laws established through democratic voting, legislature enactment, and court decisions.
"What is "societally-harmful driving behavior"" was once a broad and undetermined question but nevertheless it received an extensive and highly defined answer.
What Gemini was doing -- what it was explicitly forced to do by poorly considered dogma -- was societally harmful. It is utterly impossible that these were "unintended"[1], and were revealed by even the most basic usage. They aren't putting guardrails to prevent it from happening, they quite literally removed instructions that explicitly forced the model to do certain bizarre things (like white erasure, or white quota-ing).
[1] - Are people seriously still trying to argue that it was some sort of weird artifact? It was blatantly overt and explicit, and absolutely embarrassing. Hopefully Google has removed everyone involved with that from having any influence on anything for perpetuity as they demonstrate profoundly poor judgment and a broken sense of what good is.
An LLM should represent a reasonable middle of the political bell curve where Antifa is on the far left and Alt-Right is on the far right. That is what I meant by a neutral observer. Any kind of political violence should be considered deplorable, which was not the case with some of the Gemini answers. Though I do concede that right wingers cooked up questionable prompts and were fishing for a story.
> An LLM should represent a reasonable middle of the political bell curve where Antifa is on the far left and Alt-Right is on the far right. That is what I meant by a neutral observer.
This is a bad idea.
Equating extremist views with those seeking to defend human rights blurs the ethical reality of the situation. Adopting a centrist position without critical thought obscures the truth since not all viewpoints are equally valid or deserve equal consideration.
We must critically evaluate the merits of each position (anti-fascists and fascists are very different positions indeed) rather than blindly placing them on equal footing, especially as history has shown the consequences of false equivalence perpetuate injustice.
All of this is political. It always is. Where does the LLM fall on trans rights? Where does it fall on income inequality? Where does it fall on tax policy? "Any kind of political violence should be considered deplorable" - where's this fall on Israel/Gaza (or Hamas/Israel)? Does that question seem non-political to you? 50 years ago, the middle of American politics considered homosexuality a mental disorder - was that neutral? Right now if you ask it to show you a Christian, what is it going to show you? What _should_ it show you? Right now, the LLM is taking a whole bunch of content from across society, which is why it turns back a white man when you ask it for a doctor - is that neutral? It's putting lipstick on an 8-year-old, is that neutral? Is a "political bell curve" with "antifa on the left" and "alt-right on the right" neutral in Norway? In Brazil? In Russia?
Wonder how long until Hollywood CGI shops have these types of models running as part of their post-production pipeline. Big blockbusters often release with ridiculously broken CGI due to crunch (Black Panther's third act was notorious for looking like a retro video-game), adding some extra generative polish in those cases is a no-brainer.
Hollywood has incredible financial and political power. And even if fully AI generated movies reach the same quality (both visually and story wise) as current ones, there’s so much value in the shared experience of watching the same movies as other people that a complete collapse of the industry seems highly unlikely to me.
What quality? Current industry movies are, for lack of better term, inbred. Sound too loud, washed out rigid color scheme, keeping attention of the audience captive at all costs. They already exclude large, more sensitive, part of population that hates all of this despite the shared experience. And AI is exceptionally good at further inbreeding to the extreme.
While of course it isn't impossible for any industry to reinvent itself, and movies as an art form won't die, I'm having doubts about where it's going.
I wouldn’t have any confidence in any predictions I make 100 years into the future even if we didn’t have the current AI developments.
With that said, I’m pretty confident that the movie industry will exist in 10 years (maybe heavily transformed, but still existing and still pretty big). If it’s still a big part of current pop culture by then (vs obviously on its way out) then I’d expect a collapse of it to require a change that is not a result of AI proliferation, but something else entirely.
My point is that many talk about AI as though it's not going to evolve or get better. It's a mindset of "We don't need to talk about this because it won't happen tomorrow".
Realistically, AI being able to replace Hollywood is something that could happen in 20-50 years. That's within most people's lifetime.
Bingo. Except it looked like magic because the tech was so expensive and only available to them.
Limited access to the tech added some mystique to it too.
Just like digital cameras created a lot more average photographers, it pushed photography to a higher standard than just having access to expensive equipment.
yeah and the only reason we don't see more of it is that it was prohibitively expensive for all but basically Disney.
the compute budgets for basic run of the mill small screen 3D rendering and 2D compositing are already massive compared to most other businesses of a similar scale. the industry has been underpaying their artists for decades too.
I'm willing to bet that as soon as unreal or adobe or whoever comes out with a stable diffusion like model that can be consistent across a feature length movie, they'll stop bothering with artists altogether.
why have an entire team of actual people in the loop when the director can just tell the model what they want to see? why shy away from revisions when the model can update colour grade or edit a character model throughout the entire film without needing to re-render?
I think the video of the camera operator on the ladder shows the artifacts the best. The main camera equipment is no longer grounded in reality, with the fiddly bits disconnected from the whole and moving around. The smaller camera is barely recognizable. The plant in the background looks blurry and weird, the mountains have extra detail. Finally, the lens flare shifts!
Check out the spider too, the way the details on the leg shift is distinctly artificial.
I think the 4x/8x expansion (16x/64x the pixels!) is pushing the tech too far. I bet it would look great at <2x.
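The pixel arithmetic behind that, as a tiny sketch. The 128x128 input size is taken from the paper's 128x128 to 1024x1024 examples; the helper name is made up.

```python
def upscaled_size(width, height, factor):
    """Return the output size and the multiple of the original pixel count."""
    out_w, out_h = width * factor, height * factor
    return out_w, out_h, (out_w * out_h) // (width * height)


for factor in (2, 4, 8):
    out_w, out_h, pixel_multiple = upscaled_size(128, 128, factor)
    print(f"{factor}x: 128x128 -> {out_w}x{out_h}, {pixel_multiple}x the pixels")
```

Because the factor applies to both axes, the number of pixels the model has to fill in grows with the square of the factor.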
>I think the 4x/8x expansion (16x/64x the pixels!) is pushing the tech too far. I bet it would look great at <2x.
I believe this applies to every upscale model released in the past 8 years, yet undeterred by this scientists keep pushing on, sometimes even claiming 16x upscaling. Though this might be the first one that is pretty close to holding up at 4x in my opinion, which is not something I've seen often.
I think the hand running through the wheat (?) is pretty good, object permanence is pretty reasonable especially considering the GAN architecture. GANs are good at grounded generation--this is why the original GigaGAN paper is still in use by a number of top image labs. Inferring object permanence and object dynamics is pretty impressive for this structure.
Plus, a rather small data set: REDS and Vimeo-90k aren't massive in comparison to what people speculate Sora was trained on.
I wonder if you could specialise a model by training it on a whole movie or TV series, so that instead of hallucinating from generic images, the model generates things it has seen closer-up in other parts of the movie.
You'd have to train it to go from a reduced resolution to the original resolution, then apply that to small parts of the screen at the original resolution to get an enhanced resolution, then stitch the parts together.
I can't wait for the next explosion in "bigfoot" videos: wildlife on the moon, people hiding in shadows, plants, animals, and structures completely out of place.
The difference will be that this time the images will be crystal clear, just hallucinated by a neural network.
I'm curious as to how well this works when upscaling from 1080p to 4K or 4K to 8K.
Their 128x128 to 1024x1024 upscales are very impressive, but I find the real artifacts and weirdness are created when AI tries to upscale an already relatively high definition image.
I find it goes haywire, adding ghosting, swirling, banded shadowing, etc as it whirlwinds into hallucinations from too much source data since the model is often trained to work with really small/compressed video into an "almost HD" video.
This looks great; however, it will be interesting to see how it handles things like rolling shutter or video wipes/transitions. Also, in all of the sample videos the camera is locked down and not moving, or moving just ever so slightly (the ants and the car clips). It looks like they took time to smooth out any excessive camera shake.
Integrating this with Adobe's object tracking software (in Premiere/After Effects) may help.
The video comparison examples, while impressive, were basically unusable on mobile Safari because they launched in full screen view and broke the slider UI.
I am personally much more interested in frame rate upscalers. A proper 60Hz just looks much better than anything. Also would really, really like to see a proper 60Hz anime upscale. Anything in that space just sucks. But in the rare cases when it works, it really looks next level.
This just sounds like the AI is just not good enough yet. I mean it's pretty clear now that there is nothing stopping AI from producing work close to, or sometimes even exceeding, that of human artists. A big problem here is good training material.
I didn't say that. I said that AI is capable of sometimes exceeding human artists. That is not the same thing as saying AI is exceeding the best human artist. If your training material is of high quality it shouldn't be impossible to exceed human artists some or even most times, i.e. produce better material than the average or good artists.
This is amazing and all but at what point do we reach the point of there is no more “real” data to infer from low resolution? In other words there are all sorts of information theory research on the amount of unique entropy on a given medium and even with compression there is a limit. How does that limit relate to work like this? Is there a point at which it can say we know it’s inventing things beyond x scaling constant because of information theory research?
I'm not sure information theory deals with this question.
Since this isn't lossless decompression, the point of having no "real" data is already reached. It _is_ inventing things, and the only relevant question is how plausible the invented things are; in other words, if the video also existed in higher resolution, how close would it actually look to the inferred version. Seems obvious that this metric increases as a function of the amount of information from the source, but I would guess the exact relationship is a very open question.
> This is amazing and all but at what point do we reach the point of there is no more “real” data to infer from low resolution?
The start point. Upscaling is by definition creating information where there wasn't any to begin with.
Nearest neighbor filtering is technically inventing information, it's just the dumbest possible approach. Bilinear filtering is slightly smarter. This approach tries to be smarter still by applying generative AI.
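A minimal sketch of the two classic filters, for comparison. Images are plain nested lists of grayscale values here, and the pixel-centre alignment convention in the bilinear version is one common choice among several, so treat the details as assumptions.

```python
def upscale_nearest(img, factor):
    """Repeat each source pixel factor x factor times."""
    return [
        [img[y // factor][x // factor] for x in range(len(img[0]) * factor)]
        for y in range(len(img) * factor)
    ]


def upscale_bilinear(img, factor):
    """Interpolate linearly between the four nearest source pixels."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h * factor):
        # Map the output pixel centre back into source coordinates, clamped at edges.
        sy = min(max((y + 0.5) / factor - 0.5, 0.0), h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * factor):
            sx = min(max((x + 0.5) / factor - 0.5, 0.0), w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


img = [[0, 100], [100, 200]]
print(upscale_nearest(img, 2)[0])
print(upscale_bilinear(img, 2)[0])
```

Nearest neighbor only ever copies values it has seen. Bilinear already produces in-between values that were never in the source, but they are always bounded by the neighbouring pixels; a generative model drops even that constraint.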
There is plenty of real information: that's what the model is trained on. That information ceases to be real the moment it is used by a model to fill in the gaps of other real information. The result of this model is a facade, not real data.
This seems technically very impressive, but it does occur to my more pragmatic side that I probably haven't seen videos as blurry as the inputs for ~ 10 years. I'm sure I'm unaware of important use cases, but I didn't realize video resolution was a thing we needed to solve for these days (at least inference for perceptive quality).
What exactly does this do? They have examples with a divider in the middle that you can move around and one side says "input" and the other "output". However, no matter where I move the slider, both sides look identical to me. What should I be focusing on exactly to see a difference?
If you want to use your image for anything that needs to be factual (e.g. surveillance, science, automation) the up-scaling adds nothing---it's just guessing at what is probably there.
If you just want the picture to be pretty, this is probably cheaper than a bigger sensor.
They used digital upsampling techniques and colorization to make World War One footage into high resolution. Jackson would later do the same process for the 2021 series Get Back, upscaling 16mm footage of the Beatles taken in 1969: https://www.imdb.com/title/tt9735318/
Both of these are really impressive. They look like they were shot on high resolution film recently, instead of fifty or a hundred years ago. It appears that what Peter Jackson and his team did meticulously at great effort can now be automated.
Everyone should understand the limitations of this process. It can't magically extract details from images that aren't there. It is guessing and inventing details that don't really exist. As long as everyone understands this, it shouldn't be a problem. Like, we don't care that the cross-stitch on someone's shirt in the background doesn't match reality so long as it's not an important detail. But if you try to go Blade Runner/CSI and extract faces from reflections of background objects, you're asking for trouble.
Glad to see that Adobe is still investing in the alias-free convolutions (as in StyleGAN3), and this time they know how to fill the lost high frequency features.
I always thought that alias-free convolutions can produce much more natural videos
Something I've been thinking about recently is a more scalable approach to video super-resolution.
The core problem is that any single AI will learn how to upscale "things in general", but won't be able to take advantage of inputs from the source video itself. E.g.: a close-up of a face in one scene can't be used elsewhere to upscale a distant shot of the same actor.
Transformers solve this problem, but with quadratic scaling, which won't work any time soon for a feature-length movie. Hence the 10 second clips in most such models.
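To put a rough number on the quadratic scaling: a back-of-envelope sketch assuming full self-attention over every token in the video. The frame rate and tokens-per-frame figures are made-up placeholders, and real models use windowed or factorised attention that changes the constants, so only the ratio is meaningful.

```python
def attention_pairs(num_frames, tokens_per_frame):
    """Number of query-key pairs for full self-attention over all tokens."""
    n = num_frames * tokens_per_frame
    return n * n


FPS = 24                 # assumed frame rate
TOKENS_PER_FRAME = 256   # hypothetical tokenisation, for illustration only

clip = attention_pairs(10 * FPS, TOKENS_PER_FRAME)             # 10-second clip
movie = attention_pairs(2 * 60 * 60 * FPS, TOKENS_PER_FRAME)   # 2-hour movie

print(f"frames: {2 * 60 * 60 // 10}x more, attention pairs: {movie // clip}x more")
```

The tokens-per-frame figure cancels out of the ratio: 720 times the frames means 720 squared times the attention work.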
Transformers provide "short term" memory, and the base model training provides "long term" memory. What's needed is medium-term memory. (This is also desirable for Chat AIs, or any long-context scenario.)
LoRA is more-or-less that: Given input-output training pairs it efficiently specialises the base model for a specific scenario. This would be great for upscaling a specific video, and would definitely work well in scenarios where ground-truth information is available. For example, computer games can be rendered at 8K resolution "offline" for training, and then can upscale 2K to 4K or 8K in real time. NVIDIA uses this for DLSS in their GPUs. Similarly, TV shows that improved in quality over time as the production company got better cameras could use this.
This LoRA fine-tuning technique obviously won't work for any single movie where there isn't high-resolution ground truth available. That's the whole point of upscaling: improving the quality where the high quality version doesn't exist!
My thought was that instead of training the LoRA fine-tuning layers directly, we could train a second order NN that outputs the LoRA weights! This is called a HyperNet, which is the term for neural networks that output neural networks. Simply put: many differentiable functions are twice (or more) differentiable, so we can minimise a minimisation function... training the trainer, in other words.
The core concept is to train a large base model on general 2K->4K videos, and then train a "specialisation" model that takes a 2K movie and outputs a LoRA for the base model. This acts as the "medium term" memory for the base model, tuning it for that specific video. The base model weights are the "long term" memory, and the activations are its "short term" memory.
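Roughly the shape of what I mean, as a toy sketch in plain Python. Everything here is hypothetical: the `HyperNet` class is just an untrained random linear map, the "summary vector" stands in for whatever features you'd extract from a whole movie, and a real version would use a deep learning framework and train end to end. It only shows how the pieces plug together: the hypernetwork emits low-rank factors A and B, and the base weight W is specialised as W + B @ A.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_apply(w, a, b, scale=1.0):
    """Return W + scale * (B @ A) without modifying W."""
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


class HyperNet:
    """Hypothetical linear hypernetwork: video summary vector -> LoRA factors."""

    def __init__(self, summary_dim, d_out, d_in, rank, seed=0):
        rng = random.Random(seed)
        self.d_out, self.d_in, self.rank = d_out, d_in, rank
        n_outputs = rank * d_in + d_out * rank  # flattened A followed by B
        self.proj = [[rng.gauss(0.0, 0.01) for _ in range(summary_dim)]
                     for _ in range(n_outputs)]

    def __call__(self, summary):
        flat = [sum(p * s for p, s in zip(row, summary)) for row in self.proj]
        split = self.rank * self.d_in
        a = [flat[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        b_flat = flat[split:]
        b = [b_flat[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return a, b  # A is rank x d_in, B is d_out x rank


# "Long term" memory: one base-model weight matrix (identity as a placeholder).
base_w = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# "Medium term" memory: LoRA factors produced from a per-movie summary vector.
hyper = HyperNet(summary_dim=3, d_out=4, d_in=4, rank=1)
a, b = hyper([0.2, -0.5, 1.0])
tuned_w = lora_apply(base_w, a, b)
```

The appeal is the parameter count: for this 4x4 layer the rank-1 update needs 8 numbers instead of 16, and the gap widens quickly for realistic layer sizes.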
I suspect (but don't have access to hardware to prove) that approaches like this will be the future for many similar AI tasks. E.g.: specialising a robot base model to a specific factory floor or warehouse. Or specialising a car driving AI to local roads. Etc...
Are you using your Shield as an HTPC? In that case you can use the upscaler built into a TV. I prefer my LG C2 upscale (particularly the frame interpolation) compared to most Topaz AI upscales.
There are a lot of old porn videos out there which have become commercially worthless because they were recorded at low resolutions (e.g. 320x240 MPEG, VHS video, 8mm film, etc). Being able to upscale them to HD resolutions, at high enough quality that consumers are willing to pay for it, would be a big deal.
(It doesn't hurt that a few minor hallucinations aren't going to bother anyone.)
Reloading can get them in sync. But, it seems to stop playback of the "left" one if you drag the slider completely left, which makes it easy to get desynced again.
I agree that it's not perfect, though it does appear to be SoTA. Eventually something like this will just be part of every video codec. You stream a 480p version and let the TV create the 4K detail.
If you have the high res data you can actually compress the details which are there and then recreate them. No need to have those be recreated, when you actually have them.
Downscaling the images and then upscaling them is pure insanity when the high res images are available.
That's absurd. I think everybody is aware that compressing in the frequency domain is far superior to downsampling your image. If you don't believe me, just compare a JPEG-compressed image with the same image reduced to the same file size by downsampling. You will notice a literal night and day difference.
Down sampling is a bad way to do compression. It makes no sense to do NN reconstruction on that if you could have compressed that image better and reconstructed from that data.
An image downscaled and then upscaled to its original size is effectively low-pass filtered where the degree of edge preservation is dictated by the kernel used in both cases.
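A rough 1D illustration of that claim, assuming a box filter for the downscale and nearest-neighbour repetition for the upscale (both are crude kernels picked for brevity, not what a real scaler would use). The test signal mixes a slow and a fast sine; after the round trip the slow one survives almost intact and the fast one is strongly attenuated.

```python
import numpy as np

n = 256
t = np.arange(n)
LOW_BIN, HIGH_BIN = 2, 100  # cycles per 256 samples, chosen for illustration

signal = (np.sin(2 * np.pi * LOW_BIN * t / n)
          + np.sin(2 * np.pi * HIGH_BIN * t / n))

factor = 4
# Downscale: average each group of 4 samples (box kernel).
down = signal.reshape(-1, factor).mean(axis=1)
# Upscale back to the original length: repeat each sample 4 times.
up = np.repeat(down, factor)

before = np.abs(np.fft.rfft(signal))
after = np.abs(np.fft.rfft(up))

low_ratio = after[LOW_BIN] / before[LOW_BIN]     # close to 1: low frequency kept
high_ratio = after[HIGH_BIN] / before[HIGH_BIN]  # well below 1: high frequency suppressed
```

With a box kernel some of the high-frequency energy also aliases into other bins rather than disappearing, which is the "degree of edge preservation is dictated by the kernel" part.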
Are you saying low-pass filtering is bad for compression?
The word is "blur." Low-pass filtering is blurring.
Is blurring good for compression? I don't know what that means. If the image size (not the file size) is held constant, a blurry image and a clear image take up exactly the same amount of space in memory.
Blurring is bad for quality. Our vision is sensitive to high-frequency stuff, and low-pass filtering is by definition the indiscriminate removal of high-frequency information. Most compression schemes are smarter about the information they filter.
> Is blurring good for compression? I don't know what that means.
Consider lossless RLE compression schemes. In this case, would data with low or high variance compress better?
Now consider RLE against sets of DCT coefficients. See where this is going?
In general, having lower variance in your data results in better compression.
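To make the RLE point concrete, here is a toy run-length encoder applied to a low-variance row and a high-variance row of the same length. The two rows are invented for the example; the high-variance one is built so that no two neighbouring values are equal.

```python
from itertools import groupby


def rle_encode(values):
    """Lossless run-length encoding: a list of (value, run_length) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def rle_decode(pairs):
    """Inverse of rle_encode."""
    return [v for v, count in pairs for _ in range(count)]


# Low variance: two flat regions, e.g. a blurred or heavily quantised row.
smooth_row = [10] * 32 + [200] * 32

# High variance: every neighbour differs (step of 37 modulo 256).
noisy_row = [(i * 37) % 256 for i in range(64)]

smooth_runs = rle_encode(smooth_row)  # 2 runs
noisy_runs = rle_encode(noisy_row)    # 64 runs, one per sample
```

Both rows hold 64 values, but the smooth one collapses to two pairs while the noisy one gains nothing from RLE. The same effect applies to quantised DCT coefficients, where long runs of zeros are what make the encoding cheap.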
> Our vision is sensitive to high-frequency stuff
Which is exactly why we pick up HF noise so well! Post-processing houses are very often presented with the challenge of choosing just the right filter chain to maximize fidelity under size constraint(s).
> low-pass filtering is by definition the indiscriminate removal of high-frequency information
It's trivial to perform edge detection and build a mask to retain the most visually-meaningful high frequency data.
Yes. Downsampling only makes sense if you store per-pixel data, which is obviously a dumb idea. You get a stream for 480p which contains frames which were compressed from the source files, or the 4K version. At some point there might have been downsampling involved, but you never actually get any of that data; you get the compressed version of it.
Not sure if I’m being dumb, or if it’s you not explaining it clearly: if Netflix produced low resolution frames from high resolution (4k to 480p), and if these 480p frames are what my TV is receiving - are you saying it’s not downsampling, and my TV would not benefit from this new upsampling method?
Your TV never receives per pixel data. Why would you use a NN to enhance the data which your TV has constructed instead of enhancing the data it actually receives?
OK, I admit I don’t know much about video compression. So what does my TV receive from Netflix if it’s not pixels? And when my TV does “upsampling” (according to the marketing) what does it do exactly?
It receives information about the spatial frequency content of the image. If you're unfamiliar, it's definitely worth looking into the specifics of how this works, as it's quite impressive! Here's a few relevant Wikipedia articles, and a Computerphile video:
I think you're missing the point of this paper—the precise thing it's showing is upscaling previously downscaled video with minimal perceptual differences from ground truth.
So you could downscale, then compress as usual, and then upscale on playback.
It would obviously be quite attractive to be able to ship compressed 480p (or 720p etc) footage and be able to blow it up to 4K at high quality. Of course you will have higher quality if you just compress the 4K, but the file size will be an order of magnitude larger.
In our hypothetical example, the compressed 4k data or the compressed 480p data? You would enhance the compressed 480p—that's what the example is. You would probably not enhance the 4K, because there's very little benefit to increasing resolution beyond 4K.
There is no use case, because it is a stupid idea. Downscaling then reconstructing is a stupid idea for exactly the same reasons why downscaling for compression is a bad idea.
The issue isn't NN reconstruction, but that you are reconstructing the wrong data.
Streaming services already deliver in lower resolution than they have available, based on network conditions. Good upscaling would let you save on bandwidth and deliver content more easily to people in poor network conditions. The tradeoff would be that details in the image wouldn't be exactly the same as the original - but, presumably, nobody would notice this so it would be fine.
Why would someone ever take a 40Mbps (compressed) video and downsample it so it can be encoded at 400Kbps (compressed) but played back with nearly the same fidelity / with similar artifacts to the same process at 50x data volume? The world will never know.
You're also ignoring the part where all lossy codecs throw away those same details and then fake-recreate them with enough fidelity that people are satisfied. Same concept, different mechanism.
Look up what 4:2:0 means vs 4:4:4 in a video codec and tell me you still think it's "pure insanity" to rescale.
Or, you know, maybe some people have reasons for doing things that aren't the same as the narrow scope of use-cases you considered, and this would work perfectly well for them.
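For anyone who doesn't want to look up the 4:2:0 versus 4:4:4 point, a small count of samples per frame shows what it means. The helper name and the 4K frame size are just for illustration: in 4:2:0 the two chroma planes are stored at half resolution in both directions, so the colour data has already been "rescaled" before the codec does anything else.

```python
def samples_per_frame(width, height, chroma_w_div, chroma_h_div):
    """Total stored samples: one full-resolution luma plane plus two chroma planes."""
    luma = width * height
    chroma = 2 * (width // chroma_w_div) * (height // chroma_h_div)
    return luma + chroma


WIDTH, HEIGHT = 3840, 2160  # example 4K frame

full_444 = samples_per_frame(WIDTH, HEIGHT, 1, 1)  # chroma at full resolution
sub_420 = samples_per_frame(WIDTH, HEIGHT, 2, 2)   # chroma halved horizontally and vertically
```

4:2:0 stores exactly half as many samples as 4:4:4, and each chroma plane holds a quarter of the luma sample count.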
>Why would someone ever take a 40Mbps (compressed) video and downsample it so it can be encoded at 400Kbps (compressed) but played back with nearly the same fidelity
Because you can just not downscale them and compress them in the frequency domain and encode them in 200Kbps? This is pretty obvious, seriously do you not understand what JPEG does? And why it doesn't do down sampling?
Do you seriously believe downscaling outperforms compressing in the frequency domain?
Yes, absolutely. Psychovisual encoding can only do so much within the constraints of H.264/265.
Throwing away 3/4 (half res) or 15/16 (quarter res) of the data, encoding to X bitrate and then decoding+upscaling looks far better than encoding to the same X bitrate with full resolution.
For high bitrate, native resolution will of course look better. For low bitrate, the way H.26? algorithms work ends up turning high resolution into a blocky, ringing mess to compensate, vs lower resolution where you can see the content, just fuzzily.
Go get Tears of Steel raw 4K video (Y4M I think it's called). Scale it down 4x and encode it with ffmpeg HEVC veryslow at CRF 30. Figure out the bitrate, then cheat - use two-pass veryslow HEVC encoding to get the best possible quality native resolution at the same bitrate as your 4x downscaled version. You're aiming for two files that are about the same size. Somehow I couldn't convince the codec to go low enough to match, so I had the low-res version about 60% of the high-res version filesize. Now go and play them both back at 4K with just whatever your native upscale is - bilinear, bicubic, maybe NVIDIA Shield with its AI Upscaling.
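The arithmetic behind those fractions, plus the bits-per-pixel budget at a fixed bitrate, as a quick sketch. The 400 kbps and 24 fps figures are placeholder values for the example, not measurements.

```python
from fractions import Fraction


def discarded_fraction(scale):
    """Share of pixels dropped when both dimensions are divided by `scale`."""
    return 1 - Fraction(1, scale * scale)


def bits_per_pixel(bitrate_bps, width, height, fps):
    """Average bit budget per pixel per frame at a given bitrate."""
    return Fraction(bitrate_bps, width * height * fps)


BITRATE_BPS = 400_000  # placeholder "X bitrate"
FPS = 24               # placeholder frame rate

half_res_loss = discarded_fraction(2)     # 3/4
quarter_res_loss = discarded_fraction(4)  # 15/16

native_bpp = bits_per_pixel(BITRATE_BPS, 3840, 2160, FPS)
quarter_bpp = bits_per_pixel(BITRATE_BPS, 960, 540, FPS)
budget_ratio = quarter_bpp / native_bpp   # 16
```

At the same bitrate the quarter-res encode has 16 times the bit budget per pixel, which is why the codec isn't forced into heavy blocking to hit the target.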
Go do that, then tell me you honestly think the blocky, streaky, illegible 4K native looks better than the "soft" quarter-res version.
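If you want to try it, the experiment could be set up roughly like this. This is only a sketch: the file names and the target bitrate are placeholders, the script just builds and prints the ffmpeg command lines without running them, and it assumes an ffmpeg build with libx265 on a Unix-like system (for `/dev/null`).

```python
import shlex


def build_commands(source, target_kbps):
    """Return (low_res, pass1, pass2) ffmpeg argument lists. Nothing is executed."""
    common = ["ffmpeg", "-y", "-i", source,
              "-c:v", "libx265", "-preset", "veryslow"]

    # Quarter-resolution encode at constant quality (CRF 30).
    low_res = common + ["-vf", "scale=iw/4:ih/4", "-crf", "30", "low_res.mkv"]

    # Native-resolution two-pass encode at the bitrate measured from low_res.mkv.
    rate = ["-b:v", f"{target_kbps}k"]
    pass1 = common + rate + ["-x265-params", "pass=1", "-an", "-f", "null", "/dev/null"]
    pass2 = common + rate + ["-x265-params", "pass=2", "-an", "native.mkv"]
    return low_res, pass1, pass2


# "tears_of_steel_4k.y4m" and 1000 kbps are hypothetical stand-ins.
low_res, pass1, pass2 = build_commands("tears_of_steel_4k.y4m", 1000)
for command in (low_res, pass1, pass2):
    print(shlex.join(command))
```

Run the first command, read the resulting bitrate off `low_res.mkv`, and substitute it for the placeholder before running the two passes.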
Scaling color data is a different technique than down sampling. Again, all I am saying is that for a very good reason you do not stream pixel data or compress movies by storing data that was down sampled.
Your video codec should never create a 480p version at all. Downsampling is incredibly lossy. Instead stream the internal state of your network directly, effectively using the network to decompress your video. Train a new network to generate this state, acting as a compressor. This is the principle of neural compression.
This has two major benefits:
1. You cut out the low resolution half of your network entirely. (Go check out the architecture diagram of the original post.)
2. Your encoder network now has access to the original HD video, so it can choose to encode the high-frequency details directly instead of generating them afterwards.
Not really, DLAA and the current incarnation of DLSS are temporal techniques, meaning all of the detail they add is pulled from past frames. That's an approach which only really makes sense in games where you can jitter the camera to continuously generate samples at different subpixel offsets with each frame.
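A toy 1D version of the jitter idea, under heavy simplifying assumptions: the scene is static, each low-res frame is a point sample of the true signal, and the jitter pattern is a fixed cycle of four subpixel offsets. Real temporal upscalers also have to handle motion, which this ignores entirely.

```python
import numpy as np

SCALE = 4  # low-res frames have 1/4 of the samples

# Ground-truth "scene" at full resolution (static across frames).
rng = np.random.default_rng(0)
scene = rng.normal(size=64)


def render_low_res(scene, jitter):
    """Point-sample the scene at every SCALE-th position, shifted by `jitter`."""
    return scene[jitter::SCALE]


# Accumulate one frame per jitter offset into a full-resolution history buffer.
history = np.zeros_like(scene)
for jitter in range(SCALE):
    frame = render_low_res(scene, jitter)
    history[jitter::SCALE] = frame

single_frame = render_low_res(scene, 0)
```

After four jittered frames the history buffer holds every full-resolution sample, so the added detail is real data from past frames rather than a guess. A video file has no equivalent of re-rendering with a different offset, which is why this trick is mostly limited to games.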
The OP has more in common with the defunct DLSS 1.0, which tried to infer extra detail out of thin air rather than from previous frames, without much success in practice. That was like 5 years ago though so maybe the idea is worth revisiting at some point.
I'd say most videos in practice are longer than 200 frames, so lot more research is still needed.