I used to roll my eyes at crime television shows whenever they said "Enhance" over a low-quality image.
Now it seems the possibility of that becoming realistic is increasing at a steady clip, based on this paper and other enhancement techniques I've seen posted here.
Except, and this is really the fundamental catch, it's not so much "enhance" as it is "project a believable substitute/interpretation".
You fundamentally can't get back information that has been destroyed or was never captured in the first place.
What you can do is fill in the gaps with plausible values.
I don't know whether this sounds like I'm splitting hairs, but it's really important that the general public not think we're extracting information with these procedures; we're interpolating or projecting information that is not there.
Very useful for artificially generating skins for each shoe on a shoe rack in a computer game or simulation, potentially disastrous if the general public starts to think it's applicable to security camera footage or admissible as evidence...
To give specific examples from their test data, it added stubble to people who didn't have stubble, gave them a different shape of glasses, changed the color of cats, and changed the color and brand of a sports shoe.
And even then, I'm a little suspicious of how close some of the images got to the original without being given color information.
It appears that info was either hidden in the original in a way not apparent to humans or was implicit in their data set in some way that would make it fail on photos of people with different skin tones.
I haven't read the paper in full detail, but reading between the lines I'm guessing that there's a significant portion of manual processing and hand waving involved. From the abstract, emphasis mine:
> the second stage uses a pixel-wise nearest neighbor method to map the smoothed output to multiple high-quality, high-frequency outputs in a controllable manner.
My interpretation is that they select training data by hand and generate a bunch of outputs, repeating the process until they like the final result. From the paper:
> we allow a user to have an arbitrarily-fine level of control through on-the-fly editing of the exemplar set (e.g., "resynthesize an image using the eye from this image and the nose from that one").
There's nothing weak or negative about that; it's exactly what you'd expect. Obviously for a given input there will be multiple plausible outputs. With any such system it would make sense to allow some control in choosing among the outputs.
> Except, and this is really the fundamental catch, it's not so much "enhance" as it is "project a believable substitute/interpretation".
I would argue that this is a form of enhancement though, and in some cases it will be enough to completely reconstruct the original image. For example, if I give you a scanned PDF, and you know for a fact that it was size 12 black Arial text on a white background, this can feasibly let you reconstruct the original image perfectly. The 'prior' that has been encoded by the model from the large amount of other images increases the mutual information between the grainy image and the high-res one. The catch is that uncertainty cannot be removed entirely, and you need to know that the target image comes from roughly the same distribution as the training set. But knowing this gives you information that is not encoded in the pixels themselves, so you can't necessarily argue that some enhancement is impossible. For example with celebrity images, if the model is able to figure out who is in the picture, this massively decreases the set of plausible outputs.
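To make the known-font idea concrete, here is a minimal sketch of how a strong prior plus a known degradation can make the inverse problem nearly unique. Everything in it is invented for illustration: random arrays stand in for rendered glyph templates, and block averaging plus noise stands in for the scan.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a known font: each "glyph" is a 32x32 high-res template.
# In a real setting these would be rendered from the known typeface and size.
alphabet = "ABCDEFGHIJ"
templates = {c: (rng.random((32, 32)) > 0.5).astype(float) for c in alphabet}

def degrade(img, factor=4, noise=0.05):
    """Block-average downsampling plus noise: a crude model of a low-res scan."""
    h, w = img.shape
    small = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return small + rng.normal(0.0, noise, small.shape)

# "Scan" one unknown glyph.
truth = "D"
observed = degrade(templates[truth])

# Knowing the prior (the alphabet) and the degradation, score every candidate
# against the observation and keep the best match.
scores = {c: np.sum((degrade(t, noise=0.0) - observed) ** 2)
          for c, t in templates.items()}
print("best guess:", min(scores, key=scores.get), "| truth:", truth)
```

With a real font renderer in place of the random templates, the same argmax over candidates is essentially maximum-likelihood decoding of the low-res scan; when several candidates score equally well, the ambiguity the thread is talking about remains.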
> The catch is that you need to know that the target image comes from roughly the same distribution as the training set.
When humans think about "enhance", they imagine extracting subtle details that were not obvious from the original, which implies that they know very little about what distribution the original image comes from. If they did, they wouldn't have a need for "enhance" 99% of the time -- the remaining 1% is for artistic purposes, which this is indeed suited for.
It'll be interesting to see how society copes with the removal of the "photographs = evidence" prior.
> when enhancing celebrity images, if the model is able to figure out who is in the picture this massively decreases the set of plausible outputs.
The benefit depends on how predictable the phenomenon is that you are interpolating from. Sometimes it will be quantitatively better than a low-resolution version, sometimes not.
A good example is compression algorithms for media. They work because the sound or image is predictable, and they become ineffective as the input gets more unpredictable. Still, if the compressed output is all you have, then running the decompression will probably be better than just reading the raw compressed data; you just have to be aware of the limitations.
> You fundamentally can't get back information that has been destroyed or was never captured in the first place.
I love this cliché. I've seen it thousands of times, and probably written it myself a few times. We all repeat stuff like that ad nauseam, without ever thinking.
Because it's fundamentally flawed, especially in the context that it has usually been applied to, namely criticising the CSI:XYZ trope of "enhancing images".
The truth is that there is a lot more information in a low-res image than meets the eye.
Even if you can't read the letters on a license plate, they can often be recovered by an algorithm. If the Empire State Building is in the background, it's likely to be a US license plate. Maybe only some letters would result in the photo's low-res pattern. If you only see part of a letter, knowing the font may allow you to rule out many letters or numbers, etc.
It's similar to that guy who used Photoshop's swirl effect to hide his face, not knowing that the effect is deterministic, and can easily be undone.
The error mostly appears to be in assuming that the information has been destroyed, when in reality it's often just obscured. And neural nets are excellent at squeezing all the information out of noisy data.
> It's similar to that guy who used Photoshop's swirl effect to hide his face, not knowing that the effect is deterministic, and can easily be undone.
The effect doesn't only need to be deterministic; it also needs to be invertible.
A low-res image has multiple "inverses" (yikes), supposedly each with an associated probability (if you were to model it that way). So it would be more honest if the algorithm showed them all.
Showing them all seems a bit impossible because the number would blow up really quickly, wouldn't it? Maybe it could categorise them, but that could be misleading, too... I don't know.
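As a quick illustration of the deterministic-and-invertible distinction, the sketch below (assuming scikit-image is available; the test image and parameter values are arbitrary) swirls an image and un-swirls it with the negated strength, then contrasts that with block averaging, which maps many distinct inputs to the same output and therefore has no single inverse.

```python
import numpy as np
from skimage import data
from skimage.transform import swirl

image = data.camera() / 255.0   # 512x512 test image, scaled to [0, 1]

# Deterministic and (numerically) invertible: swirl, then swirl back.
swirled = swirl(image, strength=10, radius=250)
restored = swirl(swirled, strength=-10, radius=250)
print("swirl round-trip error:", np.abs(image - restored).mean())  # small, interpolation only

# Many-to-one and therefore not invertible: 8x8 block averaging.
# Countless distinct 512x512 images collapse onto the same 64x64 output.
small = image.reshape(64, 8, 64, 8).mean(axis=(1, 3))
print("pixels in:", image.size, "pixels out:", small.size)
```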
>> You fundamentally can't get back information that has been destroyed or was never captured in the first place.
> I love this cliché. I've seen it thousands of times, and probably written it myself a few times. We all repeat stuff like that ad nauseam, without ever thinking.
It is not a cliché; it is an absolute truth. Information that is not present cannot be retrieved. There may be more information present than is immediately obvious, though.
> Neural nets are excellent at squeezing all the information out of noisy data
Maybe, but they are also good at overfitting to noisy data (the original article is an example of such overfitting).
It's not a cliché, it's true. You fundamentally can't get back information that has been destroyed or was never captured in the first place.
Yes, a low-res image has lots of information. You can process that information in many ways. Missing data can't just be magically blinked into existence though.
Copy/pasting bits of guessed data is NOT getting back information that has been destroyed or never captured. Obscured data is very different from non-existent data. Could the software recreate a destroyed painting of mine based on a simple sketch? Of course not, because it would have to invent details it knows nothing about.
I think it's almost dangerous to call this line of thinking a cliché. It should be celebrated, not ridiculed.
For anyone put off by the .ps.gz, it's actually just a normal web page that links to the full article in HTML and PDF. Not sure what they were thinking with that URL. I almost didn't bother to look. (Maybe that's what they were thinking?)
I seem to remember from my computer vision class way back when that there's a fundamental theoretical limit to the amount of detail you can get out of a moving sequence. Recovering frequencies a little higher than the pixel sampling is definitely possible, but I recall the theoretical maximum being something like 10x. I also get the feeling, from looking around at available software, that in practice 2-3x is the most you can achieve in ideal conditions, and most video is far from ideal.
> I don't know whether this sounds like I'm splitting hairs
Somewhat no, but somewhat yes. Thing is, while there can be lots of input images that generate the same output, it could be that only one (or a handful) of them would occur in reality. If this happens to sometimes be the case, and if you could somehow guarantee this was the case in some particular scenario, it could very well make sense to admit it as evidence. Of course, the issue is that figuring this out may not be possible...
> we're interpolating or projecting information that is not there
But that's not fully accurate either. Sometimes the result will really be a more accurate representation of reality than the blurred image. Maybe it could be described as an educated guess: sometimes wrong, sometimes invaluable.
It would be interesting to see the results starting with higher-quality images. With camera quality increasing, there should often be more data to start with.
Exactly, this may be possible [0], but only if the NN has seen such images before; the output will match the training data but says nothing about reality.
No, but think of these blurred images as a "hash": in an ideal situation, you only have one value that encodes to a certain hash value, right? So if you are given a hash X, you can technically work out that it was derived from value Y. You're not getting back information that was lost; in a way it was merely encoded into the blurred image, and it should be possible to produce a real image which, when blurred, will match what you have.
Don't get me wrong, I think we're still far, far off from a situation where we can do this reliably, but I can see how you could get the actual face out of a blurred image.
> you only have one value that encodes to a certain hash value, right?
Errr wrong. A perfect hash, yes. But they're never perfect. You have a collision domain and you hope that you don't have enough inputs to trigger a birthday paradox.
Look at the pictures in the article. It's an outline of the shoe. That's your hash. ANY shoe with that general outline resolves to that same hash.
If your input is objects found in the Oxford English Dictionary, you'll have few collisions. An elephant doesn't hash to that outline. But if your input is the Kohl's catalog, you'll have an unacceptable collision rate.
Hashes are attempts at creating a _truncated_ "unique" representation of an input. They throw away data (bits) they hope isn't necessary to distinguish between possible inputs. A perfect hash for all possible 32-bit values is 32 bits. You can't even have a collision-free 31-bit hash.
So back to the blurry security camera footage of a license plate or a face. Sure, that "hash" can reliably tell you that it wasn't a sasquatch that committed the robbery, but it literally doesn't contain the data necessary to _ever_ prove it was the suspect in question, even if the techs _can_ prove that the suspect hashes to the image in the footage.
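A toy illustration of that pigeonhole argument (all numbers invented: tiny 8x8 binary "images", with thresholded 2x2 block averaging playing the role of the blurred outline):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# The "hash" here is an extremely low-res thumbnail: average 4x4 blocks of an
# 8x8 binary image and threshold back to black/white.
def outline_hash(img):
    blocks = img.reshape(2, 4, 2, 4).mean(axis=(1, 3))
    return tuple((blocks > 0.5).astype(int).ravel())

buckets = Counter()
for _ in range(100_000):
    img = rng.integers(0, 2, (8, 8))       # one of 2**64 possible inputs
    buckets[outline_hash(img)] += 1

# Only 2**4 = 16 possible outputs for 2**64 possible inputs:
# collisions aren't a risk, they're a certainty.
print("distinct outputs:", len(buckets))
print("most crowded bucket:", buckets.most_common(1))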
FYI (not because it’s particularly relevant to the sort of hashing that is being talked about, but because it’s a useful piece of info that might interest people, and corrects what I think is a misunderstanding in the parent comment): perfect hash functions are a thing, and are useful: https://en.wikipedia.org/wiki/Perfect_hash_function. So long as you’re dealing with a known, finite set of values, you can craft a useful perfect hash function. As an example of how this can be useful, there’s a set of crates in Rust that make it easy to generate efficient string lookup tables using the magic of perfect hash functions: https://github.com/sfackler/rust-phf#phf_macros. (A regular hash map for such a thing would be substantially less efficient.)
Crafting a perfect hash function with keys being the set of words from the OED is perfectly reasonable. It’ll take a short while to produce it, but it’ll work just fine. (rust-phf says that it “can generate a 100,000 entry map in roughly .4 seconds when compiling with optimizations”, and the OED word count is in the hundreds of thousands.)
> So back to the blurry security camera footage of a license plate or a face. Sure, that "hash" can reliably tell you that it wasn't a sasquatch that committed the robbery, but it literally doesn't contain the data necessary to _ever_ prove it was the suspect in question, even if the techs _can_ prove that the suspect hashes to the image in the footage.
For a face, sure; but for printed text or license plates there are effective deblurring algorithms that in some cases can rebuild a readable image.
A good piece of software (IMHO) is this one (it was freeware, it's now commercial, and this is the last freeware version):
For the first, choose "Out of Focus Blur" and play with the values; you should get a decent image at roughly Radius 8, Smooth 40%, Correction Strength 0%, Edge Feather 10%.
For the second, choose "Motion Blur" and play with the values; you should get a decent image at roughly Length 14, Angle 34, Smooth 50%.
Fortunately there is a limit: the universe (in a practical sense). You cannot encode all of its states in a hash, since that would require more states than the hash itself has, as you already mentioned (pigeonhole). But representing macroscopic data like text (or basically anything bigger than atomic scale) uniquely can be done with 128+ bits. Double that and you are likely safe from collisions, assuming the method you use is uniform and not biased toward some inputs.
If you want easy collision examples, take a look at people using CRC32 as a hash/digest. It is notoriously prone to collisions (since it's only 32 bits).
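For instance, a minimal birthday-style search with only the standard library (the 8-byte random inputs are an arbitrary choice) typically finds a CRC32 collision after a few tens of thousands of inputs, roughly the square root of the 2**32 output space:

```python
import os
import zlib

# Birthday-style search: CRC32 has only 2**32 possible outputs, so a collision
# among random inputs is expected after roughly sqrt(2**32) ~ 80,000 tries.
seen = {}
tries = 0
while True:
    data = os.urandom(8)               # arbitrary 8-byte random input
    digest = zlib.crc32(data)
    if digest in seen and seen[digest] != data:
        print(f"collision after {tries} inputs:",
              seen[digest].hex(), "and", data.hex(), "->", hex(digest))
        break
    seen[digest] = data
    tries += 1
```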
That won't work. A lot of people have tried to create systems that they claim always compress movies or files or something else. Yet, none of those systems ever come to market. They get backers to give them cash, then they disappear. The reason they don't come to market is that they don't exist. Look up the pigeon-hole principle. It's the very first principle of data compression.
You can't compress a file by repeatedly storing a series of hashes, then hashes of those hashes, down into smaller and smaller representations. The reason you cannot do this is that you cannot create a lossless representation smaller than the original's entropy. If you could, you would get down to ever smaller files, until you had one byte left. But you could never decompress such a file, because there is no single correct interpretation of such a decompression. In other words, your decompression is not the original file.
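You can watch the pigeonhole principle at work with a few lines of standard-library Python (the input size and pass count are arbitrary): high-entropy data does not keep shrinking when you compress the compressed.

```python
import os
import zlib

# High-entropy input: 100 kB of random bytes.
data = os.urandom(100_000)

# "Compress the compressed" a few times -- the size does not keep shrinking;
# it actually grows slightly each pass because of format overhead.
for i in range(5):
    data = zlib.compress(data, 9)
    print(f"pass {i + 1}: {len(data)} bytes")
```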
Without getting too technical because I hate typing on a phone, you're technically right in the sense of a theoretical hash.
But in real life there's collisions.
And real-life image or sound compression, blur, artifacts and low resolution fundamentally destroy information in practice. It is no longer the comparatively difficult but theoretically possible task of reversing a perfect hash, but more like mapping a name to the bucket RXXHXXXX, where X could be anything.
There are lots of plausible values we can replace X with, but without an outside source of information, we can't know what the real values in the original name were.
Out of sheer curiosity I had a go at manually enhancing the Roundhay Garden Scene by dramatically enlarging the frames, stacking them, aligning them, and erasing the most blurred ones and the obvious artifacts.
The funniest part was that the resolution really does go up if you turn 1 px into 40 and align the frames accurately (then adjust opacity to the level of blur).
The crime television thing would be possible if you have enough frames of the gangster.
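For anyone who wants to automate that manual process, here is a rough sketch of the same idea (assuming SciPy and scikit-image; the upscale factor and interpolation settings are arbitrary choices): upscale each frame, register it against the first, and average the stack.

```python
import numpy as np
from scipy.ndimage import shift, zoom
from skimage.registration import phase_cross_correlation

def stack_frames(frames, upscale=4):
    """Upscale each frame, register it against the first, and average the stack.

    frames: list of 2-D grayscale arrays of the same (mostly static) scene.
    """
    big = [zoom(f.astype(float), upscale, order=3) for f in frames]
    reference, aligned = big[0], [big[0]]
    for frame in big[1:]:
        # Sub-pixel shift needed to align this frame with the reference.
        offset, _, _ = phase_cross_correlation(reference, frame, upsample_factor=10)
        aligned.append(shift(frame, offset))
    return np.mean(aligned, axis=0)
```

Averaging mostly buys noise reduction and a bit of sub-pixel detail; it won't conjure a readable face out of a handful of blurry pixels.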
Approaches like these are hallucinating the high-resolution images, though; not something that we'd ever want used for police work. That said, I wonder if it would perform better than eyewitness testimony...
To play devil's advocate though, modern neuroscience and neuropsychology basically tell us that our brains reconstruct and recreate our memories every time we try to remember them. Our memories are highly malleable and prone to false implantation... and yet witness testimony is still the gold standard in courts.
I wouldn't want to see it used as evidence in court (and I doubt it would be allowed anyway, but IANAL), but I could see this being useful in certain circumstances for generating the photo-realistic equivalent of a police sketch, e.g. if you had low-res security footage of a suspect and an eyewitness to guide the output.
It would be useful to reduce the number of suspects: calculate possible combinations, match them against the mugshot database, and investigate/interrogate those people. Or if you're the NSA/KGB, you can match against the social media picture database, and then ask the social media company to tell you where those users were at the time of the crime (since the social media app on the phone tracks its users' location...).
You could, e.g., ostensibly produce a set of plausible license plates, which could then be narrowed down by matching the car color and model to produce a small set of valid records.
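A hypothetical sketch of that narrowing-down step (every plate, character set and registry entry below is invented for illustration): enumerate the plausible plates from the ambiguous characters, then intersect them with a registry already filtered by vehicle description.

```python
from itertools import product

# Invented example: the deblurred plate reads "R?8-5?3", and the two unclear
# positions could each only be one of a few visually confusable characters.
position_1 = "BDR8"
position_2 = "0OQD"
candidates = {f"R{a}8-5{b}3" for a, b in product(position_1, position_2)}

# Invented registry, already filtered down to e.g. blue hatchbacks.
registry = {"RB8-5O3", "RD8-5Q3", "XYZ-123"}

matches = candidates & registry
print(f"{len(candidates)} plausible plates, {len(matches)} in the registry:", matches)
```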
Sure, but if we go by how the police work now, they will take a plate produced by the computer as 100% given and arrest/shoot the owner of that plate because "the computer said so".
This image from the article shows that the original image and the fantasy image are not alike at all. The faces appear to be different ages. The computer even fantasized a beauty mark.
> This image from the article shows that the original image and the fantasy image are not alike at all.
This is another avenue that could be further explored, which I quite like. That is, a non-artist can doodle images and create a completely new photo-realistic image based on the line drawings.
I was modifying a few images (from a link in another comment here: https://affinelayer.com/pixsrv/ ) and the end results were interesting.
The low-resolution-to-high-resolution image synthesis reminds me of the unblur tool that Adobe demoed during Adobe MAX in 2011. Here is the relevant clip if you're interested: https://www.youtube.com/watch?v=xxjiQoTp864
That demo was quite impressive, but the technique is completely different. Adobe uses deconvolution to recover information and details that are actually in the picture but not visible (unintuitively, blurring is a mathematically reversible transformation: if you know the characteristics of the blur, you can reverse it; in fact most of the magic in Adobe's demo comes from knowing the blur kernel and path in advance, and I'm not sure how well it works in practice on real photos). But the neural net demoed in this post just "makes up" the missing info using examples from photos it learned from; there is no information recovery.
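To illustrate the "known kernel" point, here is a minimal Wiener-style deconvolution sketch (this is not Adobe's algorithm; the synthetic image, box kernel and regularization value are all made up): because the blur kernel is known, most of the detail comes back.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "photo" and a known 5x5 box blur (a crude out-of-focus kernel).
image = rng.random((128, 128))
kernel = np.zeros_like(image)
kernel[:5, :5] = 1 / 25.0

K = np.fft.fft2(kernel)
blurred = np.real(np.fft.ifft2(np.fft.fft2(image) * K))

# Wiener-style inverse filter: because the kernel is known, the blur can be
# largely undone -- genuine recovery of detail, no learned prior involved.
eps = 1e-3   # regularization; keeps near-zero frequencies from blowing up
restored = np.real(np.fft.ifft2(
    np.fft.fft2(blurred) * np.conj(K) / (np.abs(K) ** 2 + eps)))
print("mean abs error after deblurring:", np.abs(image - restored).mean())
```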
You'll get something that looks plausible, for sure; maybe not what was originally there, though. In the future, someone will be falsely convicted of a crime because a DNN "enhance" decided to put their picture into some fuzzy context.
You don't specify, but presumably you mean a true confession.
It could also be used to extract a false confession. If the prosecutor says "We have proof you were there at the scene" and shows you some generated image, then you as an innocent person have to weigh the chances of the jury being fooled by the image (and even if it's not admissible in court, it may be enough to convince the investigating team that you are responsible and stop them looking for the real perpetrator) against the expected sentences if you maintain your innocence vs. "admitting" your guilt.
Yup. In a court of law, the value as evidence is going to be weighted fairly low, even with expert testimony. It may be enough to get a warrant, or a piece in the process of deduction during the investigation phase.