This is pretty neat. But is it just me or does the dog picture look better in JPEG?
When zoomed in, the JPEG artifacts are quite apparent and the RNN produces a much smoother image. However, to my eye when zoomed out the high frequency "noise", particularly in the snout area, looks better in JPEG. The RNN produces a somewhat blurrier image that reminds me of the soft focus effect.
It looks like they are using a "multi-scale structural similarity" (MS-SSIM) metric as a proxy for how well a human will think a compressed image reproduces the original image. Both the JPEG and the RNN images are compressed to the same MS-SSIM score, which in this case makes the RNN image take 25% fewer bytes.
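For anyone who wants to poke at the numbers themselves, computing MS-SSIM is only a few lines with TensorFlow - a rough sketch (the file names here are made up, and I'm not claiming this is the exact evaluation setup from the paper):

    # Compare a JPEG and an RNN-compressed image against the original by MS-SSIM.
    import tensorflow as tf

    def load_image(path):
        # Decode to a float tensor in [0, 255] with a batch dimension.
        data = tf.io.read_file(path)
        img = tf.image.decode_image(data, channels=3)
        return tf.cast(img, tf.float32)[tf.newaxis, ...]

    original = load_image("dog_original.png")   # hypothetical file names
    jpeg = load_image("dog_jpeg.jpg")
    rnn = load_image("dog_rnn.png")

    # Scores are in [0, 1]; higher means closer to the original.
    print("JPEG MS-SSIM:", float(tf.image.ssim_multiscale(original, jpeg, max_val=255.0)))
    print("RNN  MS-SSIM:", float(tf.image.ssim_multiscale(original, rnn, max_val=255.0)))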
Part of the reason JPEG looks better in this case is that they have chosen to feature a detail- and noise-rich portion of the image, which favors JPEG... High JPEG compression has a sharpening effect around edges, including "ringing" which hides well in noise. However, if you view the full-sized comparison of the dog images, you can see that JPEG has essentially no detail in the darker portions of the image. Presumably both the MS-SSIM metric and humans would judge RNN better for these portions.
Thanks for describing MS-SSIM - I've always been concerned about using L2 or similar metrics for image similarity. Is there any work on learning an image similarity metric? I would think if you Mechanical Turk'd a bunch of A/B images (with or without perturbations of various sorts, or even totally different images) and asked the user to score the similarity, you should be able to learn a pretty interesting scoring function.
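Something along those lines can be sketched pretty directly - embed both images with a small CNN and regress against the human rating. A toy PyTorch sketch with made-up data (not any published method):

    import torch
    import torch.nn as nn

    class Embedder(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, 64),
            )
        def forward(self, x):
            return self.net(x)

    embed = Embedder()
    head = nn.Linear(64, 1)   # maps the embedding difference to a similarity score
    opt = torch.optim.Adam(list(embed.parameters()) + list(head.parameters()), lr=1e-3)

    # Pretend batch: pairs of images plus a human similarity rating in [0, 1].
    img_a = torch.rand(8, 3, 64, 64)
    img_b = torch.rand(8, 3, 64, 64)
    human_score = torch.rand(8, 1)

    opt.zero_grad()
    pred = torch.sigmoid(head(torch.abs(embed(img_a) - embed(img_b))))
    loss = nn.functional.mse_loss(pred, human_score)
    loss.backward()
    opt.step()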
>High JPEG compression has a sharpening effect around edges, including "ringing" which hides well in noise.
Ringing is the opposite of a sharpening effect, no? JPEG essentially acts as a low-pass filter over the DCT tiles of the image, so you get ringing because you eliminate the high-frequency components that make the edge sharp and you're left with a wavelet that extends past the sharp edge and isn't cancelled out the way it used to be.
Nope, ringing is the effect of a high-pass (or band-pass) filter - it's the opposite of smoothing.
JPEG loses information by quantization, not low-passing. It is true that small high-frequency coefficients will get rounded to 0 during quantization, and this does save quite a few bits. But the ringing is just from being imprecise about each DCT coefficient, coding a value that is either too large or too small.
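You can see the effect in a toy experiment - a numpy/scipy sketch on a single 1-D block, not the real JPEG pipeline:

    import numpy as np
    from scipy.fftpack import dct, idct

    block = np.array([0, 0, 0, 0, 255, 255, 255, 255], dtype=float)  # hard edge

    coeffs = dct(block, norm='ortho')
    step = 60.0                                  # coarse quantizer step
    quantized = np.round(coeffs / step) * step   # the lossy part
    reconstructed = idct(quantized, norm='ortho')

    print(np.round(reconstructed, 1))
    # The flat runs of 0s and 255s come back with ripples next to the edge -
    # that ripple is the ringing, and it comes purely from the coefficients
    # being coded imprecisely.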
Remember that a brickwall band pass filter (low, high, or anywhere else) is equivalent to convolution with this https://www.dsprelated.com/josimages_new/mdft/img1768.png in the spatial domain. Any time quantization is high enough for coefficients in a given band to become zero while the surrounding ones remain nonzero it will introduce at least some level of ringing.
wyager accurately described why in terms of the DCT though -- cosines naturally "ring" and the only way a non-sinusoid can be represented is through the combination of many other terms, so any time a coefficient contributing to this shape is eliminated, the structure "decays" so to speak and the underlying sinusoids become part of the reconstruction.
I don't think it makes sense to say that ringing is (necessarily) an effect of high-pass, low-pass, or band-pass filters. Any underdamped second-order system could exhibit ringing. If I concatenate a low-pass filter with one of these ringing filters, I get a filter that both rings and low-passes.
> Presumably both the MS-SSIM metric and humans would judge RNN better for these portions.
That doesn't remotely hold for detailed/high-contrast regions. If I can see a huge difference, and vastly prefer one over the other, the metric is not useful by its own definition. Please don't get emotional about figures of merit.
I'm guessing a genuinely useful figure of merit would have put both file sizes a bit closer together, and the NN would have shined, especially since it works so well in low contrast areas.
One thing I've seen suggested for algorithms like seam carving is creating a small embedded "noise texture sample library" that can be mixed and matched to synthesize "more of" a surrounding texture, rather than simply cloning that texture.
You can, of course, recognize (and remove) textures using the convolution nodes of a convolutional NN, encode the texturing parameters as part of the output of the network, and then reverse the process with a deconvolutional NN, with the textures reappearing at the end. The deconvolution nodes are in effect equivalent to that "noise texture sample library."
But this is, in essence, two algorithms: one that compresses textures, and then another that compresses "everything else" once you've subtracted the texture. Thinking about the second algorithm, I wonder whether you need to go to the trouble of an ML approach at all: a non-textured image would effectively be a smooth color gradient, perfectly amenable to wavelet compression.
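To make the second half concrete, here's a rough sketch of the wavelet side (assuming numpy and PyWavelets; a synthetic gradient stands in for the de-textured residual):

    import numpy as np
    import pywt

    # Smooth gradient "image": db2 has vanishing moments, so nearly all of the
    # detail coefficients of a linear ramp are (close to) zero.
    x, y = np.meshgrid(np.linspace(0, 1, 256), np.linspace(0, 1, 256))
    smooth = 100 * x + 80 * y

    coeffs = pywt.wavedec2(smooth, 'db2', level=4)
    approx, details = coeffs[0], coeffs[1:]

    # Throw away the small detail coefficients (the lossy step).
    kept = [approx]
    for (ch, cv, cd) in details:
        kept.append(tuple(pywt.threshold(c, value=1.0, mode='hard') for c in (ch, cv, cd)))

    reconstructed = pywt.waverec2(kept, 'db2')[:256, :256]
    nonzero = np.count_nonzero(approx) + sum(np.count_nonzero(c) for t in kept[1:] for c in t)
    print("coefficients kept:", nonzero, "of", smooth.size)
    print("max abs error:", np.abs(reconstructed - smooth).max())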
JPEG 2000 had about a 20% reduction in size over typical JPEG, while producing virtually no blocking artifacts, 16 years ago[1]. Almost no one uses it, though. Now in 2016 we are using neural networks to get a similar reduction, except the dog's snout looks blurry, and with a process that I assume is much more resource intensive. It's interesting for sure, but if people didn't care about jp2, they would have to be drinking some serious AI Kool-aid to want something like this.
Rather than using misguided phrases like "Kool-aid", it's important to consider several reasons why JPEG 2000 adoption has lagged and why there is interest in neural networks:
1. JPEG 2000 is patent-encumbered.
2. Neural network methods have a good chance of getting significantly better.
3. Given how similar architectures/ops have been found to be very effective across a variety of tasks, it's not unlikely that in the future all machines will have a dedicated neural inference chip, e.g. the NVIDIA Jetson TK1 or Google TPU. In this scenario, the more mileage you can get out of such dedicated hardware the better.
1 seems valid, but 2 and 3 don't apply going back over the past 16 years. That would be a kind of vaporware argument if we had to wait 16, 10, or even 5 years for NN improvements over JPEG 2000 to pan out.
There's also BPG [1], which has a JavaScript decoder so it can be used on the web today, with (in my estimate) about a 40% improvement over JPEG. It's a bit disingenuous to use something so far from the state of the art as a baseline, but it's still an interesting line of research IMO.
BPG is derived from HEVC and is thus as patent-encumbered as JPEG 2000. But we have new codecs like Daala or AV1 that could be used as still image formats once they're mature enough.
I made a website where one could compare various formats:
> Now in 2016 we are using neural networks to get a similar reduction
It's a proof of concept, not to be used in real life for anything. It would be too expensive. Also, it's not a refined product, just an attempt to see if neural networks can do better compression.
Yeah, that's not lost on me. I'm just trying to put the accomplishment in context. To me (misguided as I may be, LOL), it means the state of the art for NN photo compression is getting close to what we could do with wavelets in the late '90s. I don't mean to say that's bad; in fact I think that's pretty amazing.
It's also good to maintain some perspective on what it's actually worth to do better than JPEG. We have had actual usable products better than JPEG for a long time (I mentioned J2K but there's been so many others like WebP and BPG etc), and no one cares. People use GIF as a short-form video format--that's how much no one cares about this stuff. So I think it's mostly worthwhile as research, to increase our ability with neural nets in general, rather than as an eventual product. But who knows?
"The next challenge will be besting compression methods derived from video compression codecs,
such as WebP (which was derived from VP8 video codec), on large images since they employ tricks
such as reusing patches that were already decoded."
Beating block-based JPEG with a global algorithm doesn't seem that exciting.
I would be extremely surprised if someone were able to create a neural network that can come anywhere close to H.264/265. A lot of research has happened in the area of video codecs. Neural networks are good at adaptation, but useless at forming concepts about how the data is structured. For example: in video we do motion compensation, because we know video captures motion since objects move in physical reality. A neural network would have to do the same in order to get the same compression levels. And I also doubt it can outperform dedicated engineers at motion estimation and search. But it's certainly interesting to see the development.
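For reference, the kind of structure being hard-coded is conceptually simple - here's a toy exhaustive block-matching motion search in numpy (real encoders use far smarter search strategies, sub-pixel refinement and rate-distortion decisions; this is just the basic idea):

    import numpy as np

    def find_motion_vector(prev_frame, cur_frame, top, left, block=16, radius=8):
        """Offset (dy, dx) into prev_frame that best predicts the block of
        cur_frame at (top, left), by sum of absolute differences."""
        target = cur_frame[top:top + block, left:left + block].astype(int)
        best, best_sad = (0, 0), np.inf
        h, w = prev_frame.shape
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + block > h or x + block > w:
                    continue
                sad = np.abs(prev_frame[y:y + block, x:x + block].astype(int) - target).sum()
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
        return best, best_sad

    # The encoder then stores the motion vector plus the (hopefully small)
    # residual between the predicted block and the actual block.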
While I agree with you that the reason why modern video codecs work so well is because they embody knowledge about the statistical structure of natural scenes, there is no reason why a data driven / neural network approach could not also learn these sorts of hierarchical spatiotemporal contingencies as well.
The image labeling neural networks are a good proof of concept of this possibility. After all, what is a text label other than a concise (ie highly compressed) representation of the pixel data in the image. Obviously, being able to represent that a cat is in an image is quite lossy as compared to being able to represent that a particular cat is in an image (and where it is, and how it's scaled, etc). However, it's easy to imagine (in principle) layering on a hierarchy of other features, each explaining more and more of the statistics in the scene, until the original image is reproduced to arbitrary precision.
Could this outperform a hardwired video compressor? In terms of file size/bitrate, my intuition is yes and probably by a lot. In terms of computational efficiency, no idea.
But isn't there a fundamental difference between labeling and compression? For compression I would like stable guarantees for all data. For labeling it is enough to do better. Think of the classic stereotype that all asian faces look alike to europeans: that's ok for still labeling a human face, and useful. But for image compression to have different quality based on the subject would be useless!
Not really a fundamental difference. The better you can predict the data, the less you have to store. And the performance of all compression varies wildly depending on subject matter. The deeper the understanding of the data, the better the compression - and the more difficult to distinguish artefacts from signal. A compression algorithm that started giving you plausible, but incorrect faces if you turned the data rate too far down wouldn't be useless - it would be a stunning technical achievement.
> Think of the classic stereotype that all asian faces look alike to europeans: that's ok for still labeling a human face, and useful. But for image compression to have different quality based on the subject would be useless!
You bring up a very interesting phenomenon, but I think your example actually supports my assertion. My understanding is that europeans who (presumably initially having some difficulty telling asian faces apart) go on to live in Asia (or live among asians in other contexts) report that, after a while, asian faces start to "look white" to them. I would suggest that plastic changes (i.e. learning) in the brain's facial recognition circuitry underlie this changing qualitative report. In other words, it's not that the faces are miscategorized as having more caucasian features, but rather that they start to look more like "familiar faces".
An extreme case of this not happening can be found among persons with prosopagnosia (face blindness). These people have non- or poorly functioning facial processing circuitry and exhibit great difficulty distinguishing faces. Presumably to the extent they can distinguish people at all, they must use specific features in a very cognitive way ("the person has a large nose and brown hair") rather than being able to apprehend the face "all at once".
Incidentally, I think there are a myriad of other examples of this phenomenon, especially among specialists (wine tasters, music enthusiasts, etc.) who have highly trained processing circuitry for their specialties and can make discriminations that the casual observer simply cannot. Another example that just came to mind is that of distinguishing phonemes that are not typical in one's native tongue. One reason these sounds are so difficult for non-native speakers to produce is that they are difficult for non-native speakers to distinguish, and it simply takes time to learn to "hear" them.
All this is to say that your perceptual experience is not as stable as you think it is. Any sort of AI compression need only be good enough for you or some "typical person". If the compressor was trained on asian faces (and others, besides) then it should be able to "understand" and compress them, perhaps even better than a naive white person. I could even imagine the AI being smart enough to "tune" its encoding to the viewer's preferences and abilities.
But there is also a long history of novel compression schemes losing to fine-tuned classic schemes in the general case. Remember fractal or wavelet image compression? On the other hand, I like your sentiment - I would just be more cautious.
> Neural networks are good at adaptation, but useless at forming concepts about how the data is structured. For example: in video we do motion compensation, because we know video captures motion since objects move in physical reality. A neural network would have to do the same in order to get the same compression levels.
You're not supposed to. Image generation and modeling scene dynamics are hard tasks, and thumbnail scale is where we are at the moment. Nevertheless, those and other papers do demonstrate that NNs are perfectly capable of learning 'objectness' from video footage (to which we can add GAN papers showing that GANs learn 3D movement from static 2D images). More directly connected to image codecs, there's work on NN prediction of optical flow and inpainting.
Why does a blog page showing static content do madness like this? I'd think google engineers of all people would know better. The site doesn't even work without javascript from a 3rd-party domain.
Engineers are expensive. Google's research blog probably isn't a very high-traffic surface, and most visitors are probably on modern machines and fast networks. It doesn't make sense to spend a bunch of engineering resources on optimizing something like this. So instead, you do whatever is easiest to implement and maintain.
> Instead of using a DCT to generate a new bit representation like many compression schemes in use today, we train two sets of neural networks - one to create the codes from the image (encoder) and another to create the image from the codes (decoder).
So instead of implementing a DCT on my client I need to implement a neural network? Or are these encoder/decoder steps merely used for the iterative "encoding" process? It seems like the representation of a "GRU" file is different from any other.
It sounds complicated, but neural networks these days are basically just a bunch of filters with nonlinear cut-offs and binning. (From a signal processing point of view.)
Super simple to implement the feed-forward scenario for decoding.
Not entirely sure what the residual memory aspect of these networks add in terms of complexity, but it's probably just another vector add-multiply, or something to that effect.
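To put "just another vector add-multiply" in perspective: a single GRU step in plain numpy is only a handful of matrix multiplies and pointwise nonlinearities (the sizes below are made up):

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def gru_step(x, h, W, U, b):
        # W, U, b are 3-element sequences holding the update/reset/candidate parameters.
        Wz, Wr, Wh = W
        Uz, Ur, Uh = U
        bz, br, bh = b
        z = sigmoid(Wz @ x + Uz @ h + bz)             # update gate
        r = sigmoid(Wr @ x + Ur @ h + br)             # reset gate
        h_cand = np.tanh(Wh @ x + Uh @ (r * h) + bh)  # candidate state
        return (1 - z) * h + z * h_cand

    # Illustrative sizes: 32-dim input codes, 64-dim hidden state.
    rng = np.random.default_rng(0)
    W = [rng.standard_normal((64, 32)) * 0.1 for _ in range(3)]
    U = [rng.standard_normal((64, 64)) * 0.1 for _ in range(3)]
    b = [np.zeros(64) for _ in range(3)]
    h = gru_step(rng.standard_normal(32), np.zeros(64), W, U, b)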
Yes, you would. But if this becomes widely used, you would expect the decode portion to be distributed to work on a variety of platforms, just like JPEG is today. Although I don't think this is going to happen. If they can do the same for video, leveraging temporal correlations, then I can see it taking off. In that case the decode neural network would be embedded in the codec.
Many NNs can be compressed considerably without losing much performance. The runtime of RNNs is more concerning, as is whether anyone wants to move to a new image format, but it's still interesting pure research in terms of learning to predict image contents. It's a kind of unsupervised learning, which is one of the big outstanding questions for NNs at the moment.
They basically ripped me a new one, said it was a stupid idea, and that I shouldn't make suggestions in a question. Then I took the suggestions and details out (but left the basic concept in there) and they gave me a lecture on the basics of image compression.
Made me really not want to try to discuss anything with anyone after that.
Expecting 2 years of research by around 20 of the best DNN researchers on the planet to be compressed into a StackOverflow answer before it has even been done seems like a lot to ask.
Not a huge fan of the negativity on StackOverflow, but until August 2016 (when the paper this was based on was published - or maybe 2015, with DRAW[2]) people actively working in this area didn't think it was possible.
Also, the single answer there certainly didn't say anything like it was a stupid idea. I don't think the author of that answer knew much about autoencoders.
Also^2, your question isn't really anything like what this addresses. Your question concentrates on the idea of compressing a large set of images, and sharing some kind of representation.
That certainly is possible without using a ML approach. And yes, autoencoders have been around for a long time.
But hoverboards are a great idea, too.
[1] This paper has 7 authors, Gregor et al has 5 authors, DRAW has 6.
Well thought of, seriously, people can be too negative.
It's obvious that, from a theoretical standpoint, an RNN can help with compression...
A while ago I thought about using an RNN or NN for encryption, i.e. cracking it (I made something like a post on that topic, but saw no replies whatsoever).
Yeah, the moderation there is getting a bit out of hand. Recently they deleted my answer (which was a direct quote, with a reference link provided, formatted as a quote) because it was 'plagiarism'. Before that, they requested that I quote from the link - the quotes were added in an edit.
It's quite exciting to see progress on a data driven approach to compression. Any compression program encodes a certain amount of information about the correlations of the input data in the program itself. It's a big engineering task to determine a simple and computationally efficient scheme which models a given type of correlation.
It seems to me like the data driven approach could greatly outperform hand tuned codecs in terms of compression ratio by using a far more expressive model of the input data. Computational cost and model size is likely to be a lot higher though, unless that's also factored into the optimization problem as a regularization term: if you don't ask for simplicity, you're unlikely to get it!
Lossy codecs like JPEG are optimized to permit the kinds of errors that humans don't find objectionable. However, it's easy to imagine that this is not the right kind of lossiness for some use cases. With a data driven approach, one could imagine optimizing for compression which only loses information irrelevant to a (potentially nonhuman) process consuming the data.
This seems so overly complicated, with the RNN learning to do arithmetic coding and image compression all at once. Why not do something like autoencoders to compress the image? Then you need only send a small hidden state. You can compress an image to many fewer bits like that. Then you can clean up the remaining error by sending the smaller delta, which itself can be compressed, either by the same neural net or with standard image compression.
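Something like this bare-bones sketch (PyTorch, arbitrary sizes, untrained - just to show the shape of the idea):

    import torch
    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, latent=128):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
                nn.Flatten(),
                nn.Linear(64 * 16 * 16, latent),
            )
            self.decoder = nn.Sequential(
                nn.Linear(latent, 64 * 16 * 16), nn.ReLU(),
                nn.Unflatten(1, (64, 16, 16)),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            code = self.encoder(x)          # the small code is what you would transmit
            return self.decoder(code), code

    model = AutoEncoder()
    image = torch.rand(1, 3, 64, 64)
    recon, code = model(image)
    delta = image - recon                   # residual to send (and compress) separately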
The idea of using NNs for compression has been around for at least 2 decades. The real issue is that it's ridiculously slow. Performance is a big deal for most applications.
It's also not clear how to handle different resolutions or ratios.
I see a number of paths for neural network compression.
The simplest is a network with inputs of [X,Y] and outputs of [R,G,B], where the image is encoded into the network weights. You have to train the network per image. My guess is it would need large, complex images before you could get compression rates comparable to simpler techniques.
An example of this can be seen at http://cs.stanford.edu/people/karpathy/convnetjs/demo/image_...
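A minimal version of that demo's idea looks something like this (PyTorch sketch with a random placeholder image; you overfit the net to one real image and the weights become the "file"):

    import torch
    import torch.nn as nn

    H, W = 128, 128
    image = torch.rand(H, W, 3)        # placeholder for a real image in [0, 1]

    # One (x, y) coordinate pair per pixel, normalized to [0, 1].
    ys = torch.linspace(0, 1, H).repeat_interleave(W)
    xs = torch.linspace(0, 1, W).repeat(H)
    coords = torch.stack([xs, ys], dim=1)      # [H*W, 2]
    targets = image.reshape(-1, 3)             # [H*W, 3]

    net = nn.Sequential(
        nn.Linear(2, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 3), nn.Sigmoid(),
    )
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    for step in range(2000):                   # overfitting is the point here
        opt.zero_grad()
        loss = ((net(coords) - targets) ** 2).mean()
        loss.backward()
        opt.step()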
In the same vein, you could encode video as a network of [X,Y,T] --> [R, G, B]. I suspect that would be getting into lifetime of the universe scales of training time to get high quality.
The other way to go is a neural net decoder. The network is trained to generate images from input data. You could theoretically train a network to do an IDCT, so it is also within the bounds of possibility that you could train a better transform with better quality/compressibility characteristics. This is one network for all possible images.
You can also do hybrids of the above techniques, where you train a decoder to handle a class of images and then provide an input bundle.
I think the place where Neural Networks would excel would be as a predictive+delta compression method.
Neural networks should be able to predict based upon the context of the parts of the image that have already been decoded.
Imagine a neural network image upscaler that doubled the size of a lower-resolution image. If you store a delta map to correct any areas where the upscaler guesses excessively wrong, then you have a method to store arbitrary images. Ideally you can roll the delta encoding into the network as well. Rather than just correcting poor guesses, the network could rank possible outputs by likelihood. The delta map then just picks the correct guess, which, if the predictor is good, should result in an extremely compressible delta map.
The principle is broadly similar to the approach taken in wavelet compression, except that a neural network can potentially go "That's an eye/frog/egg/box, I know how this is going to look scaled up".
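A crude version of that scheme, with plain bilinear interpolation standing in for the neural upscaler (numpy/scipy sketch):

    import numpy as np
    from scipy.ndimage import zoom

    full = np.random.rand(256, 256)       # stand-in for the real image
    low = full[::2, ::2]                  # the downsampled version we store

    predicted = zoom(low, 2, order=1)     # the "upscaler" guess (bilinear here)
    delta = full - predicted              # correction map stored alongside

    # The better the predictor, the closer delta is to zero everywhere and the
    # better it entropy-codes; a trained upscaler should beat bilinear handily.
    print("mean abs delta:", np.abs(delta).mean())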
Now that Google is fully on the neural network deep learning train with their Tensor Processing Units, we'll be seeing NNs applied to everything. There was an article about translation; now imagine compression. It is a bit amusing, but there's nothing wrong with it - this is great stuff, and I am glad they are sharing all this work.
Not an expert in this, but you would need a lot of eigenfaces to capture the variance (i.e. a lot of upfront cost storing all the eigenimages). It might be good for something that is very standardized (e.g passport photos where everyone is in the same position) but otherwise I think there probably is too much variance to keep the number of eigenimages to a reasonable number.
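For a sense of scale, a quick eigenfaces sketch (assuming scikit-learn, which can download the LFW faces dataset) shows how many coefficients you end up keeping per face:

    import numpy as np
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA

    faces = fetch_lfw_people(min_faces_per_person=20).images   # shape [n, h, w]
    n, h, w = faces.shape
    X = faces.reshape(n, h * w)

    k = 150                                  # number of eigenfaces kept
    pca = PCA(n_components=k).fit(X)
    codes = pca.transform(X)                 # each face is now just k floats
    reconstructed = pca.inverse_transform(codes)

    err = np.abs(reconstructed - X).mean()
    print(h * w, "pixels ->", k, "coefficients per face, mean abs error", round(float(err), 3))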