This and other ways of using ML for video compression (like DLSS) scare me a bit because they kind of mess with reality. With MPEG compression, you just see less detail when there's less bandwidth, and you know exactly what you're missing. But we seem to be moving toward methods where, instead of just dropping data, we fill in what's most likely to be there. There's something that just seems wrong about that.
IMO this is a fantastic use of this technology. The bandwidth is low enough to enable videoconferencing on a dial-up connection. I only wonder how big the model is - gigabytes I bet.
The article states that it does send over at least one key-frame.
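That's also roughly why the bandwidth works out: after the one key-frame, each subsequent frame only needs a small set of facial keypoints, not pixels. A rough back-of-envelope (the landmark count, precision, and frame rate here are my assumptions, not from the article):

```python
# Back-of-envelope: per-frame payload if you only send facial keypoints.
# Assumed numbers, not from the article.
landmarks = 68          # a common facial-landmark count (e.g. dlib's 68-point model)
bytes_per_coord = 2     # 16-bit fixed point per coordinate
fps = 25

bytes_per_frame = landmarks * 2 * bytes_per_coord   # x and y per landmark
bits_per_second = bytes_per_frame * 8 * fps
print(bytes_per_frame, bits_per_second)  # 272 bytes/frame, 54400 bps
```

So even uncompressed keypoints land around 54 kbps, just under a 56k dial-up ceiling, and delta-coding them would bring that down a lot further.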
That key-frame might be used to fit coefficients of a much smaller model, similar to how Active Appearance Models approximate a wide range of faces with a relatively small model.
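The core idea there is just linear dimensionality reduction: represent any face as a mean face plus a weighted sum of a few basis vectors, so a whole image compresses down to a handful of coefficients. A minimal sketch with synthetic data (a real AAM also models shape, not just appearance, and these dimensions are made up):

```python
# Sketch of the AAM-style idea: mean face + a few PCA basis vectors,
# so each face is encoded as a small coefficient vector.
# Synthetic data; all sizes here are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are 200 vectorized face images (64x64 grayscale = 4096 dims),
# secretly generated from a hidden 10-dimensional linear model plus noise.
n_faces, dim, k_true = 200, 4096, 10
basis_true = rng.normal(size=(k_true, dim))
faces = rng.normal(size=(n_faces, k_true)) @ basis_true \
        + rng.normal(scale=0.01, size=(n_faces, dim))

# Build the model: mean face + top-10 principal components.
mean_face = faces.mean(axis=0)
_, _, vt = np.linalg.svd(faces - mean_face, full_matrices=False)
components = vt[:10]          # the "small model": 10 basis vectors

# Encoding a face = projecting onto the basis: 4096 numbers -> 10 coefficients.
new_face = faces[0]
coeffs = (new_face - mean_face) @ components.T
reconstruction = mean_face + coeffs @ components

rel_error = np.linalg.norm(new_face - reconstruction) / np.linalg.norm(new_face)
print(coeffs.shape, rel_error)
```

With the data really living near a 10-dimensional subspace, the 10 coefficients reconstruct the face almost exactly; the sender would only need to transmit those coefficients per frame.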