I would be extremely surprised if someone were able to create a neural network that comes anywhere close to H.264/H.265. A lot of research has gone into video codecs. Neural networks are good at adaptation, but useless at forming concepts about how the data is structured. For example, in video we do motion compensation because we know video captures motion: objects move in physical reality. A neural network would have to do the same in order to reach the same compression levels. And I also doubt it could outperform dedicated engineers at motion estimation and search. But it's certainly interesting to see the development.
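For a sense of what those engineers hand-tuned, here is a toy sketch of the exhaustive block-matching motion search a classic encoder performs (numpy, grayscale frames; the function name is mine, and real codecs use much smarter search patterns plus sub-pixel refinement):

```python
import numpy as np

def motion_search(prev_frame, block, y, x, search_range=8):
    """Exhaustive block matching: find the offset into the previous frame
    that best predicts `block`, by sum of absolute differences (SAD)."""
    h, w = block.shape
    best_sad, best_mv = np.inf, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + h > prev_frame.shape[0] or xx + w > prev_frame.shape[1]:
                continue
            candidate = prev_frame[yy:yy + h, xx:xx + w].astype(np.int32)
            sad = int(np.abs(candidate - block.astype(np.int32)).sum())
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad  # encoder stores the vector plus the residual
```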
While I agree with you that modern video codecs work so well because they embody knowledge about the statistical structure of natural scenes, there is no reason why a data-driven / neural network approach could not also learn these sorts of hierarchical spatiotemporal contingencies.
Image-labeling neural networks are a good proof of concept of this possibility. After all, what is a text label other than a concise (i.e., highly compressed) representation of the pixel data in the image? Obviously, representing that a cat is in an image is quite lossy compared to representing that a particular cat is in an image (and where it is, how it's scaled, etc.). However, it's easy to imagine (in principle) layering on a hierarchy of other features, each explaining more and more of the statistics of the scene, until the original image is reproduced to arbitrary precision.
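Not a neural network, but the layering idea itself is easy to demonstrate with a toy residual pyramid in numpy; the coarsest level plays the role of the "label", and each finer layer stores only what the layers above failed to explain (names are illustrative):

```python
import numpy as np

def build_layers(image, levels=4):
    """Toy residual hierarchy: each layer keeps only what the coarser
    summary failed to predict, so keeping all layers is lossless and
    truncating the fine layers gives progressively lossier results."""
    layers = []
    current = image.astype(np.float64)
    for _ in range(levels):
        coarse = current[::2, ::2]  # crude 2x downsample as the "summary"
        predicted = np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)
        predicted = predicted[:current.shape[0], :current.shape[1]]
        layers.append(current - predicted)  # residual detail layer
        current = coarse
    layers.append(current)  # coarsest summary, the analogue of the "label"
    return layers
```

A learned codec would replace the crude downsample with learned features, but the accounting is the same: every added layer narrows the residual.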
Could this outperform a hardwired video compressor? In terms of file size/bitrate, my intuition is yes and probably by a lot. In terms of computational efficiency, no idea.
But isn't there a fundamental difference between labeling and compression? For compression I would like stable guarantees for all data; for labeling it is enough to do well most of the time. Think of the classic stereotype that all Asian faces look alike to Europeans: that's still fine for labeling a human face, and useful. But for image compression, having different quality depending on the subject would be useless!
Not really a fundamental difference. The better you can predict the data, the less you have to store. And the performance of all compression varies wildly depending on subject matter. The deeper the understanding of the data, the better the compression - and the harder it becomes to distinguish artefacts from signal. A compression algorithm that started giving you plausible but incorrect faces when you turned the data rate down too far wouldn't be useless - it would be a stunning technical achievement.
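The "predict better, store less" point isn't just a metaphor; it's Shannon's source-coding bound, and you can see it directly by measuring the entropy of the residuals a good predictor leaves behind (a toy numpy sketch; the signal and function name are made up for illustration):

```python
import numpy as np

def entropy_bits_per_symbol(symbols):
    """Bits per symbol an ideal entropy coder would need for this stream."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# A smooth signal: predicting "same as the last sample" leaves tiny,
# low-entropy residuals; with no prediction you store the raw values.
signal = (128 + 40 * np.sin(np.linspace(0, 20, 5000))).astype(np.int64)
residuals = np.diff(signal)                 # prediction error stream
print(entropy_bits_per_symbol(signal))      # several bits/sample raw
print(entropy_bits_per_symbol(residuals))   # far fewer: mostly 0 and +/-1
```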
> Think of the classic stereotype that all Asian faces look alike
> to Europeans: that's still fine for labeling a human face, and
> useful. But for image compression, having different quality
> depending on the subject would be useless!
You bring up a very interesting phenomenon, but I think your example actually supports my assertion. My understanding is that Europeans who (presumably having some initial difficulty telling Asian faces apart) go on to live in Asia (or live among Asians in other contexts) report that, after a while, Asian faces start to "look white" to them. I would suggest that plastic changes (i.e., learning) in the brain's facial-recognition circuitry underlie this changing qualitative report. In other words, it's not that the faces are miscategorized as having more Caucasian features, but rather that they start to look more like "familiar faces".
An extreme case of this not happening can be found among persons with prosopagnosia (face blindness). These people have non-functioning or poorly functioning facial-processing circuitry and exhibit great difficulty distinguishing faces. Presumably, to the extent they can distinguish people at all, they must use specific features in a very cognitive way ("this person has a large nose and brown hair") rather than being able to apprehend the face "all at once".
Incidentally, I think there are myriad other examples of this phenomenon, especially among specialists (wine tasters, music enthusiasts, etc.) who have highly trained processing circuitry for their specialties and can make discriminations that the casual observer simply cannot. Another example that just came to mind is distinguishing phonemes that are not typical in one's native tongue. One reason these sounds are so difficult for non-native speakers to produce is that they are difficult for non-native speakers to distinguish, and it simply takes time to learn to "hear" them.
All this is to say that your perceptual experience is not as stable as you think it is. Any sort of AI compression need only be good enough for you or some "typical person". If the compressor was trained on Asian faces (and others besides), then it should be able to "understand" and compress them, perhaps even better than a naive white person can tell them apart. I could even imagine the AI being smart enough to "tune" its encoding to the viewer's preferences and abilities.
But there is also a long history of novel compression schemes losing to finely tuned classic schemes in the general case. Remember fractal or wavelet image compression? On the other hand, I like your sentiment - I am just more cautious.
> Neural networks are good at adaptation, but useless at forming concepts about how the data is structured. For example, in video we do motion compensation because we know video captures motion: objects move in physical reality. A neural network would have to do the same in order to reach the same compression levels.
You're not supposed to - not yet, anyway. Image generation and modeling scene dynamics are hard tasks, and thumbnail scale is where we are at the moment. Nevertheless, those and other papers do demonstrate that NNs are perfectly capable of learning 'objectness' from video footage (to which we can add GAN papers showing that GANs learn 3D movement from static 2D images). More directly connected to image codecs, there is work on NN prediction of optical flow and on inpainting.
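To connect that to codecs: once a network can predict optical flow, an encoder can warp the previous frame into a prediction of the next one and code only the residual, exactly as motion compensation does today. A minimal warping sketch in numpy (nearest-neighbor; the flow field itself would come from the learned model, and the function name is mine):

```python
import numpy as np

def predict_next_frame(prev_frame, flow):
    """Backward-warp the previous frame with a (learned) flow field.
    flow[y, x] = (dy, dx) says where pixel (y, x) came from in
    prev_frame; the encoder then only codes next_frame - prediction."""
    h, w = prev_frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.rint(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + flow[..., 1]).astype(int), 0, w - 1)
    return prev_frame[src_y, src_x]
```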