Perceptual Image Hashing (bertolami.com)
104 points by specular on Aug 8, 2014 | 34 comments



This made me think of something, so bear with me for a second. I took Andrew Ng's Machine Learning class on Coursera, and one of the things he talked about was how, given a certain data set, a logistic regression model might learn to just always predict 1, since 98% of the time the true value is 1.

The examples given here were all predicted to have 95%+ similarity, but given this dataset alone, I could just as easily be convinced that the algorithm assumes all images are reasonably similar. In other words, the author provides no counter-examples demonstrating that the algorithm can detect dissimilar images, such as a photo of a car and a clipart of pizza. I would expect that pair to have very low similarity, and if the algorithm demonstrated the ability to discern between similar and dissimilar images, I would be more convinced of perceptual image hashing as a viable technique.


While your comment applies in a broader sense (we need to test both cases in which we expect the algorithm to find similarity and cases in which we expect it to find dissimilarity - lest we embrace confirmation bias), I believe your reference to machine learning is a bit off.

In this case, we are not looking at model training; we are just evaluating the outcome of a (static) algorithm. On the other hand, in the context of ML you should try to include as much information as possible in your training dataset. This is a concept that comes up in many fields, some of them only tangentially related to machine learning (e.g. excitation signals for system identification, speaking both in terms of the frequency domain and differential equations).

tl;dr: there is no "training set" to speak of in the presented article - hence, concepts such as overfitting and information content (of the training set) do not quite apply here.


I wasn't making a direct comparison between machine learning and this particular task. The article just made me think of that machine learning problem, and I would like to see more examples that show a wide range of results. For instance, there are no examples of false positives -- how often would that happen in an image processor? Probably a lot, but I wouldn't know it based on this particular article.


It works, or at least the examples above work fairly well -- I've tested a few for an automated image-processing app I'm working on.


I've used this technique before for image searching; here are some tips:

The algorithm presented here can be summarized as "take the low-frequency DCT components of the luminance channel, quantized down to one bit". Hashing is a very misleading term for what you get from it: in particular, it doesn't have anything like the uniform coverage of the output space you would expect from a hash function, as almost all encountered images fall into a very small range of hashes. Also, the bits in the hash are not all created equal; assuming you encode them in a zig-zag pattern, truncation is equivalent to a low-pass filter, so if you want to make the hash different sizes while matching on the same image elements, you'll want to alter the quantization amount, not the output DCT size. (This is directly equivalent to what JPEG does: it uses a fixed DCT size but lossily quantizes the information in it to varying degrees for different compression quality settings.) Experiment with different quantization techniques; you don't have to treat every bit in the output the same.
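
(If it helps to see the recipe concretely, here's a minimal sketch of that DCT-and-quantize step in Python, assuming Pillow, NumPy, and SciPy; it's my own rough approximation, not the article's code.)

    import numpy as np
    from PIL import Image
    from scipy.fftpack import dct

    def phash_like(path, hash_size=8, highfreq_factor=4):
        # Shrink to a small grayscale image: the luminance channel only.
        size = hash_size * highfreq_factor
        img = Image.open(path).convert("L").resize((size, size))
        pixels = np.asarray(img, dtype=np.float64)
        # 2D DCT-II, then keep only the top-left (low-frequency) block.
        coeffs = dct(dct(pixels, axis=0, norm="ortho"), axis=1, norm="ortho")
        low = coeffs[:hash_size, :hash_size]
        # Quantize each coefficient down to one bit against the median
        # (DC term excluded so overall brightness doesn't dominate).
        med = np.median(low.flatten()[1:])
        return (low > med).flatten()

    def hamming(a, b):
        return int(np.count_nonzero(a != b))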

This technique is only suitable for a small subset of the image files you will encounter in the real world: it's mostly unsuitable for anything but photographs and photograph-like images (3d renders, paintings, etc.). Many classes of image have no useful distinguishing information in the low-frequency luminance space: screenshots, pictures of text, line drawings, many forms of graph and diagram, most "clip art" style drawings, and pictures that include layout elements in the image file (like most of the meme/Successories/demotivator format pictures you'll see on imageboards). Of course, an image with a flat or heavily compressed luminance channel will yield no information at all.

You don't have to do anything special to match flipped images, as the DCT-II you should be using (iirc) has the same output for a mirror image. It's also insensitive to scaling and mild translation.

For debugging purposes, note that the transformation performed is lossy but easily reversible: you can undo every step but quantization to generate a 'stereotype' image of a particular hash value and its neighbors. There is no more information available to the matching algorithm than you can see with your own eyes in the round-trip conversion, so if the round-tripped image preserves important distinguishing detail, the matching will work; if it turns the image into a meaningless grey smear, you'll get tons of false positive matches.
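
(Roughly, continuing the hypothetical phash_like sketch above, the round trip could look like this; the output size and scaling constant are arbitrary choices of mine.)

    import numpy as np
    from PIL import Image
    from scipy.fftpack import idct

    def stereotype(bits, hash_size=8, out_size=64, scale=32.0):
        # Undo everything except quantization: treat each hash bit as a
        # +scale/-scale low-frequency coefficient and inverse-DCT it.
        coeffs = np.zeros((out_size, out_size))
        block = np.where(bits.reshape(hash_size, hash_size), scale, -scale)
        coeffs[:hash_size, :hash_size] = block
        pixels = idct(idct(coeffs, axis=1, norm="ortho"), axis=0, norm="ortho")
        # Rescale to 0-255 and view as a grey "stereotype" image.
        pixels -= pixels.min()
        pixels *= 255.0 / max(pixels.max(), 1e-9)
        return Image.fromarray(pixels.astype(np.uint8), mode="L")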


Is there a good technique / library for comparing screenshot-like images?



A more tactful question might be, "How is it different from existing implementations?" And then don't ask "Am I missing something?"

Best practice in engineering is always to see what else is out there, so assuming the author hasn't conducted a review of the field might be read as rudeness or snark.

Many developers, even when they do compare and contrast their own work as they develop it, don't write up the results or list their peer projects, so it's otherwise not an unfair question to ask.


But I actually want to know what I'm missing-- what's interesting about this "closed source but available for academic purposes" application which appears to simply be a re-implementation of existing work?

I'm trying to give benefit of the doubt here actually.


It's using a very similar technique to pHash, actually. DCT-based hash and all. It's not brand new, but perhaps a variation on a theme.

I've actually been working in this domain for the last week or so, and DCT-based hashes are quite accurate, but slow as hell. Average hashes (aHash) or difference hashes (dHash) are much faster and rather good at weeding out large numbers of images. A mix of approaches is usually a good idea.
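
(For comparison, rough sketches of the aHash and dHash ideas in Python, assuming Pillow and NumPy; the parameter names are my own.)

    import numpy as np
    from PIL import Image

    def ahash(path, hash_size=8):
        # Average hash: one bit per pixel, set if above the mean luminance.
        img = Image.open(path).convert("L").resize((hash_size, hash_size))
        pixels = np.asarray(img, dtype=np.float64)
        return (pixels > pixels.mean()).flatten()

    def dhash(path, hash_size=8):
        # Difference hash: shrink to (hash_size+1) x hash_size grayscale,
        # then compare each pixel with its right-hand neighbour.
        img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
        pixels = np.asarray(img, dtype=np.int16)
        return (pixels[:, 1:] > pixels[:, :-1]).flatten()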


Yup, it's been shown that the combined accuracy of a set of poor classifiers can be greater than that of a known 'good' classifier.

The implication is that there's no single 'perfect' classifier for a problem; engineers have used this notion for many years, having multiple systems vote for fail-proof operations.


The article directly describes the pHash algorithm presented in the Hacker Factor post, while your second link is an implementation of the aHash algorithm from the same post.

libpHash's implementation is slightly more sophisticated (and slower): it adds a box filter over the image before downscaling and uses the median of the AC coefficients rather than the mean for computing the hash bits. It also offers a few alternative hashing methods.


Hash functions impose practical mappings between unlike domains. When volatile (~1-to-1), identifying (crypto); when correlated (~n-to-n), perceiving (AI).

(Tweet summary of the article. I'm becoming increasingly fascinated by hash functions. I'm finding all important this tension between abstraction/correlation/perception & groundedness/volatility/identification.)

https://twitter.com/elzr/status/497893639000190976


That is a remarkable way to put it. But aren't hash functions -- by definition -- n-to-1? That seems to be what the perceiving function is about: map multiple sensations into a single correlated perception.


Thanks for pointing it out, you're right! I rewrote the summary (the original is still at the linked tweet). So it's ~1-to-1 (usually; that's what I mean by practical) when it's volatile. And I'm calling it ~n-to-n (again, usually, practically) when it's correlated: mapping ~n correlations in the source domain to ~n correlations in the destination.

To my mind, an ~n-to-1 mapping would actually be a bit more like idealistic Platonic classifying than Wittgensteinian perception (~"we spin perceptual threads by twisting n attributes like fiber on fiber. And the strength of the thread does not reside in the fact that some 1 fiber runs through its whole length, but in the overlapping of the fibers.").

What do you think? :)


Also look at Semantic Hashing, which uses deep learning to compress an input (e.g. a document) into a low dimensional bit vector with fast lookup:

http://www.cs.toronto.edu/~rsalakhu/papers/semantic_final.pd...

The idea is that you could hash one billion objects, and do a very fast lookup of similar objects.

There is later academic work that refines this approach. For example, here is work applying it to image search: http://www.cs.utexas.edu/~grauman/temp/GraumanFergus_Hashing...

You can consider this a variant of Locality-Sensitive Hashing, where the hash function is specifically induced through a machine learning technique.
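
(A toy illustration of the plain LSH idea for cosine similarity, with a random -- not learned -- hash function; the semantic hashing papers above replace these random hyperplanes with ones induced by training.)

    import numpy as np

    def random_hyperplane_hash(vectors, n_bits=64, seed=0):
        # One bit per random hyperplane: the sign of the dot product.
        # Hamming distance between codes then approximates the angle
        # (cosine similarity) between the original vectors.
        rng = np.random.default_rng(seed)
        planes = rng.standard_normal((vectors.shape[1], n_bits))
        return vectors @ planes > 0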


Does the term "perceptual hashing" only apply to the domain of images? What would you call this concept applied to text files? I am familiar with Levenshtein distance, but that family of algorithms provides a similarity score or value, whereas I'd like a hash as the output.


What would you be hashing, on a text file? What the text "looks like," e.g. particular density and repeated patterns of alphanumerics?

Or would you be hashing the content, the language, the words used, the meaning of the words?

Stylometry is the study of written style. It's used forensically to see if two texts were written by the same person, to identify anonymous/pseudonymous authors, and there's "adversarial stylometry," which is intentionally changing a writing style to make it look like someone else wrote something.

This deck seems to summarize a lot of these points and issues, but offers no "stylometric hash" solution: http://wiki.uni.lu/mine/docs/RDPresentation.ppt

Plagiarism detection is another field that might do these sorts of analyses. How many metaphors do you have to change, and how many synonyms do you have to replace, before it's an original work -- or would it always still hash nearby?

Textometry tries to make texts more analyzable. Searching for "wordprints" or "stylometric fingerprints" seems to turn up more results than "hash".


Hm. Well, I want to hash the pattern of bits in the text file, like a cryptographic hash does (suppose md5 or sha-1 for simplicity's sake).

I'll provide some examples of input and output. These examples happen to contain no linefeeds.

Suppose:

    Hello, World! --> 65a8e27d8879283831b664bd8b7f0ad4
Then I want something like:

    Hello, Worlds! --> 65a8e27d8879283831b664bd8b7f4ad4
...Rather than what md5 currently provides:

    Hello, Worlds! --> d0478649ad1c15f0e623846c3e26ebeb
Basically, I love hashes and I use them all the time, but for many purposes I am only accidentally using the cryptographic features and in fact I would sometimes find it nice if similar inputs had similar outputs.

Said differently: I'd like a hashing algorithm where the Levenshtein distance between any given two outputs correlates with the Levenshtein distance between the corresponding inputs.

I can imagine lots of uses for such a tool. ...But I can't imagine how it could be possible to make one: files (or strings) vary in length, for one thing, but a good hash does not! Of course, I couldn't imagine md5 before I saw it in action, either.


Perhaps I'm missing the point, but I came up with a naive and sloppy partial solution: https://gist.github.com/peterc/737d9178f02118f8e315 .. there are some weaknesses to this solution but it does perform similarly to your examples.


Thanks, Peter. I don't speak Ruby. Uh, line 5 contains a small function definition, which is applied to each ...slice of the input string, is that right?

Do you (or anyone else) have time to describe this code in English or pseudocode?

I'll start.

    Let there be a function called hash which takes a
        string named str and an integer named
        length_of_hash (which defaults to 20).
    Slice the string (str) into length_of_hash equally-sized
        pieces?
    Take numeric value of each character (of each slice??)
        and do.. something to it, something involving
        modulo 256, unless it's zero.  Save all the
        results.  
    Express each numeric result as hexadecimal and append
        all those together.  Return it, probably.
Clearly there are bugs in the translation. :)


In short, convert the input string into an array of its character values, group those values into sub-arrays of length 'x', add together the aligned values in each sub-array and modulo each by 256, output the resulting values in hex pairs joined together in a big string.

Example, with a hash length of 4: "helloworld" => [104, 101, 108, 108, 111, 119, 111, 114, 108, 100] => [[104, 101, 108, 108], [111,119,111,114], [108, 100]] => [67, 64, 219, 222] => "4340dbde"

So if one character in the string is different, only one hex value in the output will vary too. However, a flaw of the plan is that multiple changes spaced out at an interval matching your hash "length" will all land on the same hex value in the output. You run into the pigeonhole principle if you want a hash of limited size to preserve the Levenshtein distance of arbitrarily long inputs, but I suspect there are far smarter solutions :-)
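
(In Python terms, my rough reading of that description -- not the gist itself -- would be something like this; it reproduces the "helloworld" example above.)

    def naive_hash(s, length_of_hash=20):
        # Sum character values column-wise across chunks of the string,
        # modulo 256, then emit each column as a hex pair.
        sums = [0] * length_of_hash
        for i, ch in enumerate(s):
            col = i % length_of_hash
            sums[col] = (sums[col] + ord(ch)) % 256
        return "".join("%02x" % v for v in sums)

    naive_hash("helloworld", 4)  # => "4340dbde"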


Yes, that's possible. You could use a topic clustering algorithm, for example.

Text -> topic_id.


In text analysis you can broadly focus either on style or on topic. Stylometry does the former and typically focuses on highly frequent function words; this technique is good at identifying authors. There are various techniques that focus on topic, such as distributional semantic models and bag-of-words models; these are good at clustering documents about the same topics together (regardless of author style). I think that in both cases a text can be reduced by the model to a single vector, which could be considered a hash, but in the end what matters is how well the models perform on a task such as identifying authors or finding documents on a given topic.


A similar concept exists in audio hashing, called acoustic fingerprinting [1]. I've used software like Picard [2] to de-dupe my music library based on similarity. Similarity [3] also incorporates this, as does Shazam.

[1] http://en.wikipedia.org/wiki/Acoustic_fingerprint

[2] https://musicbrainz.org/doc/MusicBrainz_Picard

[3] http://www.similarityapp.com/


Yes, such a thing exists for text files: there are techniques used for near-duplicate detection (MinHash), and then there are latent topic models (e.g. using LDA, LSI, or autoencoders) which map documents onto a lower-dimensional ("semantic") space that is supposed to give similar representations to semantically similar documents.
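
(For the MinHash part, a hand-rolled sketch -- salted MD5 over character shingles; a real implementation would use a proper library and word shingles, so treat this only as an illustration.)

    import hashlib

    def shingles(text, k=5):
        # Overlapping k-character substrings of the document.
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def minhash(text, n_hashes=64):
        # One signature entry per salt: the minimum hash value over all
        # shingles. Matching entries estimate Jaccard similarity.
        sig = []
        for salt in range(n_hashes):
            sig.append(min(
                int(hashlib.md5(("%d:%s" % (salt, sh)).encode()).hexdigest(), 16)
                for sh in shingles(text)))
        return sig

    def estimated_similarity(sig_a, sig_b):
        return sum(a == b for a, b in zip(sig_a, sig_b)) / float(len(sig_a))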


How would an adversary break this?

My first idea would be to rotate it by something that's halfway between the steps used. Say, 360/32 degrees. How does that compare?

Also: because it discards the high-frequency data, one should be able to construct something like this: http://cvcl.mit.edu/hybrid/CatDogHybrid.jpg - where to us, at a close distance, it looks like one thing, but to the hash it looks like something else.


It would be pretty easy to assign any image an arbitrary value. Compare the image's current (unquantized) DCT values to the desired ones, find the value that differs from the target but is closest to being quantized as the desired bit, and apply a filter that just barely flips it. Each time you do this you reduce the Hamming distance score between the image and the target by 1.

If getting close enough results in unsightly blotches on the image, reduce the power of the low frequency luminance channel across the board, which will mask the changes by making the unmodified high frequency components more noticeable. That looks like what's being done in the catdog image, at least.

You could increase the saturation as well, as this fingerprinting system ignores color.
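
(Very roughly, that first step, expressed against the hypothetical phash_like sketch upthread: this nudges a DCT coefficient directly on the downscaled grayscale array rather than applying a real filter to the full image, so it's only an outline of the idea.)

    import numpy as np
    from scipy.fftpack import dct, idct

    def nudge_toward(pixels, target_bits, hash_size=8, margin=1.0):
        # Flip the wrong hash bit whose DCT coefficient sits closest to
        # the median threshold, cutting Hamming distance to the target by 1.
        coeffs = dct(dct(pixels, axis=0, norm="ortho"), axis=1, norm="ortho")
        flat = coeffs[:hash_size, :hash_size].flatten()
        med = np.median(flat[1:])
        bits = flat > med
        wrong = np.flatnonzero(bits != target_bits)
        if wrong.size == 0:
            return pixels
        idx = wrong[np.argmin(np.abs(flat[wrong] - med))]
        r, c = divmod(idx, hash_size)
        coeffs[r, c] = med + margin if target_bits[idx] else med - margin
        return idct(idct(coeffs, axis=1, norm="ortho"), axis=0, norm="ortho")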


If this sort of thing fascinates you, here's a related approach: http://grail.cs.washington.edu/projects/query/ .. for some reason numerous people were working on image search features like this (i.e. draw what you want to find) 10 years ago but they don't seem to have caught on other than with Google Images.


Very interesting approach.

I have worked with image feature extraction in the past. Although DCT coefficients have been used as a way to analyze texture features, the idea of generating the hash (step 5) seems to be new.

I am curious, however, why you are discarding color information. Usually for reverse image search this kind of information can be quite useful.


Not new. pHash (phash.org) has been doing it for a while now. :)


This seems to work well for modifications that are basically in-place and high frequency, but how well does it handle things like one image being a crop of another, or translation, or flipping?


Hashes are supposed to change a lot with small changes in the source data. I wouldn't call this hashing, but fingerprinting, and there are plenty of fingerprinting algorithms for images, some related to steganography.


A cryptographic hash, yes. Not all hashes are cryptographic hashes.



