Something I've been thinking about recently is a more scalable approach to video super-resolution.
The core problem is that any single model will learn how to upscale "things in general", but can't take advantage of information from the source video itself. E.g.: a close-up of a face in one scene can't be used to improve a distant shot of the same actor elsewhere in the film.
Transformers solve this in principle, but attention scales quadratically with context length, which won't work any time soon for a feature-length movie. Hence the ~10-second clips most such models are limited to.
Transformers provide "short-term" memory, and the base model's training provides "long-term" memory. What's needed is medium-term memory. (This is also desirable for chat AIs, or any long-context scenario.)
LoRA (low-rank adaptation) is more or less that: given input-output training pairs, it efficiently specialises the base model for a specific scenario. This would be great for upscaling a specific video, and would definitely work well wherever ground-truth information is available. For example, computer games can be rendered at 8K "offline" to produce training pairs, and the specialised model can then upscale 2K to 4K or 8K in real time; NVIDIA's DLSS works along these lines on their GPUs. Similarly, a TV show whose picture quality improved over its run (as the production company got better cameras) could use the later, sharper seasons as ground truth for upscaling the earlier ones.
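For concreteness, here is a minimal sketch of the LoRA idea in PyTorch (the class name, rank, and alpha values are my own placeholders): the base layer's weights are frozen and only a small low-rank correction is trained per video or per game.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # "long-term" memory stays fixed
        self.A = nn.Linear(base.in_features, rank, bias=False)   # down-projection
        self.B = nn.Linear(rank, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.B.weight)        # start as a no-op so training begins at the base model
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))
```

Only `A` and `B` are trained, so the per-video "specialisation" is a tiny fraction of the base model's size.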
This LoRA fine-tuning technique obviously won't work for a movie where no high-resolution ground truth exists. That's the whole point of upscaling: improving the quality precisely when the high-quality version doesn't exist!
My thought was that instead of training the LoRA fine-tuning layers directly, we could train a second-order network that outputs the LoRA weights. This is called a hypernetwork: a neural network whose output is the weights of another neural network. Simply put: the whole pipeline stays differentiable, so the loss can be backpropagated through the generated weights into the network that generated them... training the trainer, in other words.
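A minimal sketch of what that second-order network might look like, assuming the source movie has been summarised into a single embedding vector (the MLP shape and names like `video_embedding` are assumptions, not a worked implementation):

```python
import torch
import torch.nn as nn

class LoRAHyperNet(nn.Module):
    """Maps a per-video embedding to the LoRA matrices for one target layer."""
    def __init__(self, embed_dim: int, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.rank = rank
        self.in_features = in_features
        self.out_features = out_features
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, rank * (in_features + out_features)),
        )

    def forward(self, video_embedding):
        # video_embedding: 1-D tensor of shape (embed_dim,) summarising the whole movie
        flat = self.mlp(video_embedding)
        A = flat[: self.rank * self.in_features].view(self.rank, self.in_features)
        B = flat[self.rank * self.in_features:].view(self.out_features, self.rank)
        return A, B   # applied to the frozen base layer as W + B @ A
```

In practice you'd want one such head per adapted layer (or a shared trunk with per-layer heads), but the shape of the idea is the same.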
The core concept is to train a large base model on general 2K->4K video, and then train a "specialisation" model that takes a 2K movie and outputs a LoRA for the base model. That LoRA acts as the "medium-term" memory, tuning the base model for that specific video; the base model's weights are its "long-term" memory, and its activations are its "short-term" memory.
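Put together, a training step might look something like the sketch below. It assumes the base upscaler exposes a hook for injecting the generated LoRA matrices; `encoder`, `base_upscaler`, the `lora=` argument, and the L1 loss are all placeholders, not a definitive design.

```python
import torch
import torch.nn.functional as F

def training_step(base_upscaler, hypernet, encoder, lowres_clip, highres_clip, optimizer):
    # Summarise the low-res source into a per-video embedding (medium-term context).
    video_embedding = encoder(lowres_clip)
    # Generate the per-video LoRA from that embedding.
    A, B = hypernet(video_embedding)
    # Run the frozen base upscaler with the generated LoRA injected.
    prediction = base_upscaler(lowres_clip, lora=(A, B))
    loss = F.l1_loss(prediction, highres_clip)
    optimizer.zero_grad()
    loss.backward()   # gradients flow through A and B into the hypernet, not the base weights
    optimizer.step()  # the optimizer only holds encoder/hypernet parameters
    return loss.item()
```

At inference time the hypernet runs once per movie to produce the LoRA, and the specialised base model then upscales the whole film with no further training.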
I suspect (but don't have the hardware to prove it) that approaches like this will be the future for many similar AI tasks. E.g.: specialising a robotics base model to a specific factory floor or warehouse, or specialising a driving model to local roads. Etc...