It would be interesting to see how far you could get using deepfakes as a method for video call compression.
Train a model locally ahead of time and upload it to a server, then whenever you have a call scheduled the model is downloaded in advance by the other participants.
Now, instead of having to send video data, you only have to send a representation of the facial movements so that the recipients can render it on their end. When the tech is a little further along, it should be possible to get good quality video using only a fraction of the bandwidth.
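Very roughly, I'd imagine something like the sketch below. All the names (`FaceRenderer`, `extract_keypoints`) are made up, the detector and generative model are hand-waved as stubs, and the byte counts are back-of-the-envelope only:

```python
import numpy as np

FRAME_KEYPOINTS = 68   # e.g. a standard facial-landmark set

def extract_keypoints(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a real landmark detector running on the sender's side."""
    return np.zeros((FRAME_KEYPOINTS, 2), dtype=np.int16)

class FaceRenderer:
    """Stand-in for the pre-shared generative model of one person's face."""
    def render(self, keypoints: np.ndarray) -> np.ndarray:
        return np.zeros((720, 1280, 3), dtype=np.uint8)

def sender_side(frame: np.ndarray) -> bytes:
    # 68 keypoints x 2 int16 coordinates = 272 bytes per frame
    return extract_keypoints(frame).tobytes()

def receiver_side(payload: bytes, model: FaceRenderer) -> np.ndarray:
    keypoints = np.frombuffer(payload, dtype=np.int16).reshape(-1, 2)
    return model.render(keypoints)
```

At 30 fps that's on the order of 8 kB/s of keypoint data, versus hundreds of kb/s or more for conventional video.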
This is a minor plot point in Vernor Vinge's excellent SF novel A Fire Upon the Deep.
One of the premises of the novel's universe is that computational power is generally absurdly plentiful, but communications bandwidth over interstellar distances is not. Most communications are in plain text (modeled after USENET) but in some cases, "evocations" are used to extrapolate video and audio from an ultra-compressed data stream.
The trouble, of course, is that it's not very obvious what aspects of the image you're seeing are real, and what aspects were dreamed up by the system doing the extrapolating.
This is a main premise of the Fear the Sky trilogy as well, but solved in a different way. Machines representing various political factions from the home planet are uploaded with AIs that mimic them emotionally and politically for all intents and purposes. I really enjoyed this book.
Eh, I personally enjoyed the series, but I wouldn't recommend anything beyond book 1. Book 2 is OK. Book 3 really spoiled the series for me because of the inconsistent behavior of the main character. (Keeping it vague to avoid spoilers.)
> it's not very obvious what aspects of the image you're seeing are real, and what aspects were dreamed up by the system doing the extrapolating.
It would be quite obvious unless the raw data from before the extrapolation is destroyed, for which there is no reason, nor is it possible to stop others in the vicinity from receiving that raw data.
That assumes that the "raw data" is reasonably human-comprehensible (which neural network weights and activations are notoriously not) and/or that you have time to sit down and analyze the data at your leisure.
> WaveNetEQ is a generative model, based on DeepMind’s WaveRNN technology, that is trained using a large corpus of speech data to realistically continue short speech segments enabling it to fully synthesize the raw waveform of missing speech.
I don't think you need to train for each person specifically; you could just train a model for all heads, then maybe transmit a few high-quality pics when the call starts and interpolate from that afterward.
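Something like this, maybe (class and method names are made up for illustration; the point is that only the reference stills are personal, while the model itself ships once for everyone):

```python
import numpy as np

class GenericHeadModel:
    """Hypothetical model trained on many faces rather than one person."""

    def __init__(self) -> None:
        self.reference_frames: list[np.ndarray] = []

    def set_reference(self, frames: list[np.ndarray]) -> None:
        # A handful of high-quality stills sent once when the call starts;
        # they fix the identity/appearance for the rest of the call.
        self.reference_frames = frames

    def animate(self, keypoints: np.ndarray) -> np.ndarray:
        # Re-render the reference identity to match the driving pose.
        # (Stub: a real model would warp/synthesize from the references.)
        return np.zeros_like(self.reference_frames[0])
```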
Excellent idea, and we'll surely be seeing something like this; there are already AR apps that map facial expressions to avatars.
Downside could be some uncanny valley if the models are not very high quality.
But if I had to make a prediction, I'd expect we'll get much more value from higher bandwidth, ultra high definition streaming and features like 3d cameras / virtual reality. I think we have a tendency to really underestimate how important high definition is for human communication.
> I'd expect we'll get much more value from higher bandwidth, ultra high definition streaming and features like 3d cameras / virtual reality. I think we have a tendency to really underestimate how important high definition is for human communication.
Low latency is probably more important to me.
Recently I seem to have a 3-second delay on many VC calls at work (and just for me, it seems), and I either end up interrupting people or feel reluctant to talk at all, since it becomes impossible to time gaps and conversations right.
Despite that, I get a crystal-clear HD picture of all participants, but I'd happily sacrifice video quality (in fact I'd accept audio only in some cases) for a more real-time experience (disabling video doesn't seem to have any effect).
> Despite that, I get a crystal-clear HD picture of all participants, but I'd happily sacrifice video quality (in fact I'd accept audio only in some cases) for a more real-time experience (disabling video doesn't seem to have any effect).
If you're really willing to sacrifice video completely, at least for Zoom, and probably for lots of other videoconferencing solutions, you can call into meetings with your phone. In fact, I think Zoom allows you to join with the computer for video and the phone for audio, which might be the best of both worlds.
This is a long shot, but are you running on battery while this is happening? I had some weird issues that worked themselves out after plugging in the charger; probably had to do with power saving and CPU throttling.
> Downside could be some uncanny valley if the models are not very high quality.
That can be controlled, since these compression algorithms usually work by making a prediction and sending the difference between the prediction and the actual value.
That works both for lossless compression, where the difference is sent in full, and for lossy compression, where only the most important part of the difference is sent.
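A toy version of that mechanism (not any particular codec; the "predicted" frame here is whatever the receiver's copy of the generative model guessed, so only the residual has to cross the wire):

```python
import numpy as np

def encode(actual: np.ndarray, predicted: np.ndarray, lossy: bool) -> np.ndarray:
    # Both sides compute the same prediction, so only the residual is sent.
    residual = actual.astype(np.int16) - predicted.astype(np.int16)
    if lossy:
        # Keep only the coarse part of the difference; small errors are dropped.
        residual = (residual // 8) * 8
    return residual

def decode(residual: np.ndarray, predicted: np.ndarray) -> np.ndarray:
    return np.clip(predicted.astype(np.int16) + residual, 0, 255).astype(np.uint8)
```

With `lossy=False` the reconstruction is exact; with `lossy=True` the error (and therefore any uncanny-valley artifact) is bounded by the quantization step.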
This is very loosely what Nvidia's DLSS game upscaling does: a generalized NN trained on super-high-resolution game engine output. You can run a game at something like a quarter to half resolution and it upscales the rest.
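Very hand-wavy sketch of that idea (definitely not Nvidia's actual pipeline; the "upscaler" below is just a nearest-neighbour placeholder where the trained network would go):

```python
import numpy as np

def render_low_res(height: int = 540, width: int = 960) -> np.ndarray:
    # Stand-in for the game engine rendering at reduced resolution.
    return np.zeros((height, width, 3), dtype=np.uint8)

def nn_upscale(frame: np.ndarray, factor: int = 2) -> np.ndarray:
    # Stand-in for the learned upscaler; nearest-neighbour repeat as a placeholder.
    return frame.repeat(factor, axis=0).repeat(factor, axis=1)

full_frame = nn_upscale(render_low_res(), factor=2)  # 1080x1920 output
```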
Very cool idea. The coding used in H.264 is a variant of the DCT, so moving one layer of abstraction up from there basically moves from semi-analog to fully digital. I agree that it should only require a fraction of the bandwidth, because you'd only be sending parametric data rather than full video.
I think this is largely possible, and accuracy to a human is very different from the MSE accuracy used in a traditional lossy compression algorithm.
To a human, for example, the exact pattern of every strand of hair isn't important at all -- all that matters is that the hairstyle and hair color stays the same.
The algorithm also doesn't need to worry about encoding and reconstructing skin blemishes, because people might actually enjoy not having to put on makeup for a video call.
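To make the MSE-vs-perception point concrete, a toy comparison (the `features` function is a crude stand-in for something like an intermediate layer of a pretrained vision network):

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2))

def features(img: np.ndarray) -> np.ndarray:
    # Crude stand-in for a learned feature extractor: per-channel means,
    # which ignore fine spatial detail such as individual hair strands.
    return img.astype(np.float32).mean(axis=(0, 1))

def perceptual_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(features(a) - features(b)))
```

Shuffling hair-strand pixels around produces a large MSE while leaving this kind of feature distance, and a human's impression, essentially unchanged.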
I was thinking the same thing today. I wonder if it could be done on the spot: capture your image from the camera initially, then send the rest as data points for deepfake generation on the other side, based on your own image. That would be amazing for low/limited-bandwidth situations.
> Now, instead of having to send video data, you only have to send a representation of the facial movements so that the recipients can render it on their end.
MPEG-4 Part 2 actually had something like that, called "face and body animation" (FBA). As far as I know, there are no implementations in widespread use.
Mentioned in Jim Al-Khalili's "Revolutions" episode about the smartphone, currently on Netflix (my lad was watching it; really good for juniors or non-technical people IMO).