How Perceptual Hashes Work (hackerfactor.com)
354 points by brudgers on June 3, 2011 | 75 comments



The two main problems with this approach are that it is not rotation-invariant and that it does not work well if the image is damaged or added to. A more robust system (that, admittedly, will take longer) is to use one of the affine-invariant feature detection algorithms pioneered by SIFT. SURF is a faster, open-sourced version of SIFT that has many implementations. Essentially it scans chunks of the image at different scales and identifies features that peak even as the chunk around them gets bigger. Once these are identified, they are described in a way that normalizes them to the same size and orientation for lookup. Since these features should presumably be scattered throughout the image, the image can be recognized even if certain features are obscured or modified. It's certainly not as straightforward as a DCT metric on a downsampled image, but the nature of widespread image capture, creation and manipulation usually requires this robustness.
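
To give a flavour of it, here's a minimal sketch of feature-based matching with OpenCV's Python bindings (assuming a build where SIFT is exposed as cv2.SIFT_create; the file names are placeholders):

    import cv2

    # Load the two images to compare as grayscale (placeholder paths).
    img1 = cv2.imread("original.png", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("cropped_or_rotated.png", cv2.IMREAD_GRAYSCALE)

    # Detect scale-invariant keypoints and compute descriptors for each.
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Brute-force matching with Lowe's ratio test, which keeps only
    # matches that are clearly better than the runner-up candidate.
    bf = cv2.BFMatcher()
    good = [m for m, n in bf.knnMatch(des1, des2, k=2)
            if m.distance < 0.75 * n.distance]

    # Lots of good matches -> the images probably share content, even if
    # one of them is cropped, rotated or partially obscured.
    print(len(good), "good matches")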


As a followup (and this is rather unrelated to the original post), you can combine a feature detector with a statistical clustering algorithm to automatically identify the generic visual properties of objects in an unsupervised manner. One of the first papers attempting this is http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.72.... but many have followed it. From what I understand these still simply check for the existence of features in an image instead of their spatial relationship ('a picture of a cat has two ears, a tail, and a blob for a body' versus 'a picture of a cat has two ears positioned on the end of a blob with a tail on the other end of the blob'). Nevertheless, they represent the current state-of-the-art in automatic object classification and recognition algorithms.


I went to a talk by Yann LeCun [1] a few months back, "Learning Feature Hierarchies for Vision." The current state of the art in this field is mind-blowing to an outsider. The final demo was a program that he trained in a matter of seconds to recognize and distinguish faces of various random audience members, in real time, from different viewing angles.

[1] http://yann.lecun.com/


Is any of that software available?


Some of the libraries are [1]. Probably not the whole kit and caboodle. Interestingly, there was a focus on special purpose hardware that made convolutional network learning possible in realtime. One of the other demos was an autonomous driving robot that learned to recognize obstacles in video. Again, just mind-blowing.

[1] http://www.cs.nyu.edu/~yann/software/index.html


How serious a problem is non-rotational invariance given the purposes for which Tineye may be using perceptual hashing?

In Tineye's data set, wouldn't rotational issues tend to be edge cases, since the vast majority of images on the web are already properly oriented? And given that most of the edge cases are likely to involve 90, 180 or 270 degrees of rotation, the additional computational requirements to cover those would appear to be significantly less than those of truly rotation-invariant matching - i.e. wouldn't rotating the low-frequency image 90 degrees and generating a second hash, plus creating reverse-order hashes of the original and the second hash, work?
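
Spelled out, the idea would be something like this (just a sketch; phash here is a hypothetical stand-in for the article's DCT hash, not anything TinEye has published):

    from PIL import Image

    def rotation_hashes(path, phash):
        # phash is assumed to be a perceptual-hash function along the
        # lines of the DCT hash described in the article (hypothetical).
        img = Image.open(path)
        hashes = set()
        for angle in (0, 90, 180, 270):
            hashes.add(phash(img.rotate(angle, expand=True)))
        return hashes

    # At query time, call a candidate a match if any of its rotation
    # hashes is within a small Hamming distance of any stored hash.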

It seems to me that for applications such as cryptography where the costs of false positives are high, a SIFT algorithm makes sense. But for a free consumer oriented search tool, might it be considered overkill?

Link to SIFT: http://en.wikipedia.org/wiki/Scale-invariant_feature_transfo...


The difference is that you're throwing ad-hoc enhancements on top of an underlying framework which can't be modified beyond its basic principle. Yes, if you are looking for exact or simply scaled duplicates on other websites, a vector search over downsampled images would work OK. However, SIFT/SURF is a much more principled approach that can be extended to handle more cases "when you need it." It's the difference between hand-writing HTML and using a template engine. On one scale, of course it makes sense to just copy and paste some HTML. But as things get hairier over time, you have to switch to something more powerful.


>approach that can be extended to handle more cases "when you need it."

I guess my take on TinEye is that cases "when you need it" may be cases outside their current target market segment. Getting people to use their service is probably more important than using a sophisticated algorithm.


If you're interested in this article, then you may be interested in locality sensitive hashing (LSH), a randomized hash that has been used seemingly everywhere. I recently used it to speed up music source separation (papers pending).

The idea is similar to the one mentioned in this article, but more general. Unlike a cryptographically secure hash, where even a tiny change to x yields a completely different h(x), LSH says that if x and y are "near", then P(h(x) = h(y)) is "high". This property is important when doing robust similarity search. For example, if your image is noisy or rotated or scaled, you hope that you can still find the clean version in a database.
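
To make that property concrete, here's a minimal sketch of one classic LSH family, random hyperplanes for cosine similarity (the dimensions and bit counts are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    def lsh_signature(x, planes):
        # Each bit records which side of a random hyperplane x falls on;
        # vectors separated by a small angle agree on most bits with
        # high probability.
        return (planes @ x) > 0

    dim, n_bits = 64, 16
    planes = rng.standard_normal((n_bits, dim))

    x = rng.standard_normal(dim)
    y = x + 0.1 * rng.standard_normal(dim)   # a "near" neighbour of x
    z = rng.standard_normal(dim)             # an unrelated vector

    print(np.sum(lsh_signature(x, planes) == lsh_signature(y, planes)))  # close to 16
    print(np.sum(lsh_signature(x, planes) == lsh_signature(z, planes)))  # around 8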

LSH has been used in many application domains including images, video, music, text, bioinformatics, and more. LSH is not directly comparable to a feature extraction algorithm such as SIFT.

[Edited for clarity.]


Thank you for that term. LSH looks like a very useful technique.


This reminds me of OpenSSH's fingerprint visualization support ("VisualHostKey yes"):

    $ ssh A.B.C.D
    Host key fingerprint is b0:c9:c9:96:fb:fd:ac:a4:ff:70:8f:1b:35:f4:f9:2e
    +--[ECDSA  256]---+
    |                 |
    |                 |
    |      .       .  |
    |     o *     . ..|
    |      O S     o..|
    |     . .     . ..|
    |      .   o o   .|
    |       . + + +E. |
    |        o.++*....|
    +-----------------+
    
    me@A.B.C.D's password:
Original article introducing this feature: http://www.undeadly.org/cgi?action=article&sid=200806150... (2008/06/26).


I can't be the only one who thinks "this looks like the visualization algorithm is from nethack".


Plane of fire, or so.


This is pretty much the opposite of what the article is talking about -- the article is trying to get a hash from an image in order to compare that image to another, while you're talking about synthesizing an image from an arbitrary hash...


Hence "reminds me" :)


Well, one could use a variation of this technique to compare tons of other things... Music?


http://musicbrainz.org/doc/Audio_Fingerprint

The Picard MP3 tagger uses this kind of thing (and MusicBrainz' database of tracks and audio fingerprints) to identify files.


There's a publicly available implementation of a perceptual hashing algorithm called phash at http://phash.org. I use some of their c++ code to detect reposts on an image sharing site I run (http://lolstack.com).


Could you add RSS?


It's definitely on my summer to-do list. You're the first person to ask.


That was one of the best articles I've read in a while.


I've always wondered how services like Shazam work. I'm amazed that they can do this kind of perceptual hash against ANY 10 second portion of a song. How do they search against something like that when they don't know the start or end time of the segment that is being input?


I do research in music information retrieval. See the ISMIR 2003 paper below. In short, it searches for landmarks in the spectrogram, hashes those landmarks, then compares those hashes against database hashes for temporal continuity. http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
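
A toy sketch of the landmark idea, for the curious (peak picking plus pairing; the window size, threshold and fan-out here are arbitrary, and a real system does far more filtering than this):

    import numpy as np
    from scipy.ndimage import maximum_filter
    from scipy.signal import spectrogram

    def landmark_hashes(samples, rate, fan_out=5):
        # Spectrogram, then keep only prominent local peaks ("landmarks").
        f, t, S = spectrogram(samples, fs=rate, nperseg=1024)
        peaks = (S == maximum_filter(S, size=20)) & (S > np.median(S))
        fi, ti = np.nonzero(peaks)

        # Hash pairs of nearby peaks as (freq1, freq2, time delta); the
        # anchor's absolute time is kept alongside the hash, not in it,
        # which is what makes matching offset-invariant.
        order = np.argsort(ti)
        fi, ti = fi[order], ti[order]
        hashes = []
        for i in range(len(ti)):
            for j in range(i + 1, min(i + 1 + fan_out, len(ti))):
                hashes.append(((fi[i], fi[j], ti[j] - ti[i]), ti[i]))
        return hashes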

A seminal paper on audio fingerprinting is the one by Haitsma and Kalker. http://ismir2002.ismir.net/proceedings/02-fp04-2.pdf


There's a GPL implementation of an audio information retrieval approach here: http://code.google.com/p/audioscout/


Thanks!


A Matlab implementation and tutorial to give you an idea can be found here: http://labrosa.ee.columbia.edu/~dpwe/resources/matlab/finger...


Yes the delta part is clever. Shazam is well described here: www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf


For pop/rock/rap music it probably doesn't matter. For classical, I'm not sure.


I think you just suggested that classical music is all the same.


I meant to suggest the near opposite: that with pop/rock/rap, a 10s chunk is enough to get a signature for the whole piece, while for classical, the music changes enough that any 10s chunk may not adequately represent a different 10s chunk.


There was a cool HN article a while back on building your own Shazam clone in Java.. I think the author received a C&D..

Found it! http://www.redcode.nl/blog/2010/06/creating-shazam-in-java/


Thank you, I learned something.

Since TinEye pre-computes the hashes, do they use something like Redis to retrieve information? Redis seems perfect for such quick results, using the hash as the key and a URL or object of some kind as the value.


It seems to me that what you want is some kind of spatial index - you might not get an exact match on the hash, but instead get one that's one or two bits away, and you'll want something better than linear search to find it.


Here is a good overview of existing approaches to approximate sequence matching - http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.... . I also wrote in a fair bit of detail about solving the similar problem of searching through LaTeX syntax trees - http://scattered-thoughts.net/one/1291/799313/731344 . Adapting my code to searching by Hamming distance should be trivial should anyone want to play around with image searches.


That would be a good idea as long as you have a good distance between images (d(image1, image2) being small meaning image1 and image2 are perceptually equivalent). I have little experience in image processing, but this is generally a hard problem with complex signals (as images, music and speech are).

Most of what is described in the article concerns dimension reduction, where the goal overlaps with the above.


The article explicitly mentions Hamming distance, which is also what I mean by bit distance.


Yes, but his remarks suggest that Hamming distance is not very nice for pictures after his transformation (d(i1, i2) == 5 -> small, d(i1, i2) == 10 -> big). As I am not familiar with image signals, I don't know if this can somehow be mitigated with other techniques. Of course, the only way to be sure would be to actually test it, but I would not bet much on it.

Another issue I forgot to mention is that each "point" (image) is a k-dimensional vector. Organizing those so as to easily retrieve points which are close to each other quickly becomes intractable. Conventional techniques like kd-trees quickly break down when k is above a few units, because of the curse of dimensionality. This introduces a "non-linearity" in your distance which makes comparing points in a k-dimensional space quite different from doing so in 2 or 3 dimensions.


In the case of Hamming distance there is much more information to work with than in a bare metric. You can dump all the bitstrings into a suffix array, break the query into K pieces and search for an exact match for each piece. The resulting list of matches is usually small enough to just compute the Hamming distance against each element. I've used this on a corpus of 2m LaTeX strings with good results. See my post above for more details.
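
For 64-bit hashes the same trick is even simpler with a hash table per piece instead of a suffix array (a sketch; by the pigeonhole principle, two hashes within Hamming distance 3 must agree exactly on at least one of 4 pieces):

    from collections import defaultdict

    PIECES, PIECE_BITS = 4, 16            # 4 x 16 bits = 64-bit hashes
    index = [defaultdict(list) for _ in range(PIECES)]

    def pieces(h):
        # Split a 64-bit hash into four 16-bit chunks.
        return [(h >> (i * PIECE_BITS)) & 0xFFFF for i in range(PIECES)]

    def insert(h, image_id):
        for i, p in enumerate(pieces(h)):
            index[i][p].append((h, image_id))

    def query(h, max_dist=3):
        # Any stored hash within max_dist bits of h shares at least one
        # exact piece, so we only Hamming-check those candidates.
        results, seen = [], set()
        for i, p in enumerate(pieces(h)):
            for stored, image_id in index[i][p]:
                if image_id not in seen and bin(stored ^ h).count("1") <= max_dist:
                    seen.add(image_id)
                    results.append(image_id)
        return results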


Doesn't scaling the images down (to 32x32 for the pHash approach) achieve essentially the same thing? Images that differ only slightly will likely scale down to the same thumbnail to begin with, and the resulting hash still bears some relationship to that thumbnail — so you should be able to look at similar hashes to find similar inputs.
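
For reference, the whole 32x32 DCT hash being discussed fits in a few lines (a sketch; implementations differ in details such as how the DC term and the averaging are handled):

    import numpy as np
    from PIL import Image
    from scipy.fft import dctn

    def dct_hash(path, size=32, keep=8):
        # Shrink and desaturate, as in the article.
        img = Image.open(path).convert("L").resize((size, size))

        # 2-D DCT, then keep only the top-left (lowest-frequency) block.
        coeffs = dctn(np.asarray(img, dtype=float), norm="ortho")
        low = coeffs[:keep, :keep].flatten()

        # Average the low frequencies, leaving out the first (DC) term so
        # it doesn't dominate, then take one bit per coefficient.
        avg = low[1:].mean()
        return "".join("1" if c > avg else "0" for c in low)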


Often you get things like images that have had text added to them (e.g. I often use tineye to find the original source image that someone has added a caption to), which means that the thumbnails will differ to some degree.


That's exactly what I'm saying - a simple index will only find exact hash matches, but what you want is images with "similar" hashes. If you define "similar" as a low Hamming distance - i.e. a small number of bit differences - and you want to find these image hashes with something better than a linear search / exhaustive combinatorial bit-twiddling, then you'll need to be smarter about your index.


Tineye must take this so much further: one image I looked up was about half of one of the pictures returned; I think it was a page from a book.

From the reddit link further down - "A Fourier transform takes a signal in the time domain and breaks it down into its frequency components. Simplified, it takes a CD and produces sheet music.", "To be clear, OP says this is a matching algorithm - it's not what tineye uses, because matching the signature from the searched-for image with the database of previous signatures which is probably a nightmare."


Wow, I had no idea it was so easy. Could this in fact be used to combat copyright infringement for the likes of The Oatmeal on large scale, given someone (Google) with the requisite computing power?


From the looks of it this is actually TinEye's main commercial activity, they merely expose the search interface to demonstrate their technology.


For that specific situation, this wouldn't help too much, seeing as the uploaders are already removing source info, so it's reasonable to assume they will fudge the image somehow to change the hash value.


"With pictures, high frequencies give you detail, while low frequencies show you structure."

I have a vague idea of what this means but can someone please explain it in a bit more detail?


In short, a wave with a high frequency has a short wavelength, and vice-versa.

The Discrete Cosine Transform is a variant of a 2-dimensional Fourier Transform. The 1-D version of a Fourier Transform is what we use to break a signal, like a sound wave, into its constituent frequencies. It takes as input the wave amplitude at various times, and returns amplitudes for various frequencies. If you were to take waves of those frequencies and amplitudes, and add them together, you would get back the original sound wave you started with. (I'm hand-waving away a bunch of details like phase, boundary conditions, undersampling, and overtones--but this is the general idea.)

You can make the Fourier Transform and its relatives deal with images the same way as sound, by pretending that the image is periodic, i.e. that you are tiling an infinite wall with copies of that image. You could create this same wall by overlaying waves of color on top of each other. The Fourier Transform will find these waves, the same way it found the frequencies for the sound.

With sound, low frequency = slow vibration = long wavelength (imagine an oscilloscope). High frequency = rapid vibration = short wavelength. So if you were to try to draw a picture using waves instead of a brush, you would use low frequencies for large things like a head. You would use medium frequencies to add smaller objects like eyes. You would use high frequencies to give small details, like hair or freckles, or the specific shape of a specific person's head.
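
You can see this directly by keeping only the low-frequency DCT coefficients of a photo and transforming back (a sketch; "face.png" is a placeholder for any image):

    import numpy as np
    from PIL import Image
    from scipy.fft import dctn, idctn

    img = np.asarray(Image.open("face.png").convert("L"), dtype=float)

    # Transform to the frequency domain, zero everything except the
    # lowest-frequency corner, and transform back.
    coeffs = dctn(img, norm="ortho")
    lowpass = np.zeros_like(coeffs)
    k = 16                                  # low frequencies kept per axis
    lowpass[:k, :k] = coeffs[:k, :k]
    blurry = idctn(lowpass, norm="ortho")

    # The result keeps the structure (where the head is) but loses the
    # detail (hair, freckles); raising k brings the detail back.
    Image.fromarray(np.clip(blurry, 0, 255).astype(np.uint8)).save("structure_only.png")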


Ok, I'll take a shot at this.

Let's work in one dimension rather than two dimensions. It's easy enough to extend later.

You know that any signal is the sum of a (potentially infinite) number of sine waves. For example, a square wave is the sum of ever higher-frequency (but smaller-amplitude) sine waves.

The higher frequencies are necessary to get the sharp edges.

If you strip the high frequencies, the sharp edges disappear, leaving only the larger motions of the lower-frequency (but bigger-amplitude) waves.

So the low frequencies are the hill, and the high frequencies are the grass.

Does that make sense?

Edit: Here's an image: http://cnx.org/content/m0041/latest/fourier4.png
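
Or, in a few lines of numpy (partial Fourier sums; more terms give sharper edges):

    import numpy as np

    t = np.linspace(0, 1, 1000, endpoint=False)

    def square_wave_partial_sum(t, n_terms):
        # A square wave is the sum of odd harmonics with 1/k amplitudes.
        k = np.arange(1, 2 * n_terms, 2)          # 1, 3, 5, ...
        return (4 / np.pi) * np.sum(np.sin(2 * np.pi * np.outer(k, t)) / k[:, None], axis=0)

    rounded = square_wave_partial_sum(t, 3)    # low frequencies only
    sharp = square_wave_partial_sum(t, 50)     # high frequencies restore the edges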


Thanks, that is a bit more helpful.


another way i like to think of it (someone please correct me if i'm wrong) is that high-frequency means high-detail (information is needed very frequently across the image to specify how it looks) whereas low-frequency means low-detail

(is that completely off or is it an analogous transform?)


What is misleading when one talks about the Fourier transform of pictures is that it has nothing to do with the waves emitted by the colored particles and received by our eyes. It is more about the spatial distribution of intensities.

Applied to sound, this "frequency view" is much more natural: we hear a sound, and there is a low and a high part of it. That's because our ears really do real-time frequency analysis, a kind of biological Fast Fourier transform.

From what I remember, doing this transformation is just a matter of taking the original signal s, measuring the amplitude n of its lowest-frequency component f, subtracting that component from s, and recursing on the result with the next frequency. The theorem proves that if you go to the limit you get two equivalent representations of the signal, one being the wave itself, s = f(t), the other being its "spectrum", s1 = f(freq) (a function of the frequencies).

For many purposes, f(freq) is much more convenient than f(t), including comparisons, frequency shifting, extraction, compression, etc.
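
A quick numpy illustration of the two equivalent representations (time domain versus spectrum):

    import numpy as np

    # A test signal: two sinusoids, sampled for one second.
    rate = 1000
    t = np.arange(rate) / rate
    s = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

    # The same signal as a function of frequency.
    spectrum = np.fft.rfft(s)
    freqs = np.fft.rfftfreq(len(s), d=1 / rate)

    # The representations are equivalent: transforming back recovers s.
    assert np.allclose(s, np.fft.irfft(spectrum, n=len(s)))

    # In the frequency view the 50 Hz and 120 Hz components stand out,
    # which is what makes comparison, filtering and shifting easy.
    print(freqs[np.argsort(np.abs(spectrum))[-2:]])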

It applies equally well to images, but for me the frequency representation of a picture is not perceptually useful, maybe because our eyes are not Fourier transforming what we see.

All that is an old story for me (I studied acoustics at IRCAM), so please correct me if my memory is wrong.


indeed. it took me forever to wrap my head around fft of an image.

one thing you can do is read how JPEG works; the DCT is a lot like a generalized FFT.


I believe he's referring to the DCT transform, in which the image is interpreted as a mixture of sinusoidal waves. Low frequencies then correspond to differences between separate areas of the picture, while high frequencies correspond to texture.

I'm not a big fan of the periodic transforms, but they do have that nice perceptual interpretation.


Also, the DCT is what is used for JPEG, so maybe that helps you think about this a bit more intuitively: http://en.wikipedia.org/wiki/JPEG


libpuzzle is good for this type of work: http://libpuzzle.pureftpd.org/project/libpuzzle


That is completely brilliant! I want to go out and write an image diff viewer that uses this on blobs in the image to detect pieces moving around!


I don't understand? :)


so you have a before and after image, say a webpage mockup where the logo moved from top left to top right.

the blob finder finds the interesting pieces of each version, and the perceptual hash picks which blobs match each other, and the software can say with reasonable certainty that the top left part of the image was moved to top right.

don't eat my lunch :)


This is really interesting, but I'm sad that they skip over the step "convert to grayscale." There's a ton of ways to convert an image to grayscale, each with their own pros and cons.

How do you weight each channel? Do you convert to HSL and just use L? Do you instead use Lab? HSV? Do you do a global or local algorithm? So many questions!


It was my understanding that (for a lot of the steps) it does not matter what mechanism you use, so long as you use the same one.

The problem you are addressing would matter if someone were trying to query TinEye's database without submitting the image to TinEye's servers.


What a fascinating topic, it's too bad that Gazopa (mentioned at the end of the article) just announced that they're shutting down their CBIR service:

http://gazopablog.blogspot.com/2011/05/shut-down-notice-from...


Isn't tineye's signature algorithm based on Fourier transform?

http://www.reddit.com/r/programming/comments/bvmln/how_does_...


As the post notes, that algorithm is great at matching, but sucks at searching. Hashes, on the other hand, are probably better at searching. My guess is that you use some variation of hashing to get a set of candidate images, and then do more detailed examinations.


I have a feeling there is a deep connection between perceptual hashes and compressed sensing. Could someone more familiar with the latter weigh in?


Kinda sorta not really. Compressed sensing and sparse coding show that, under certain sparsity assumptions, you can perfectly reconstruct your original data with fewer bits than previously thought. It is a coding principle.

Perceptual hashes (or hashes in general) are used for fast indexing and retrieval. You cannot recreate the original data from a hash, pretty much by definition.

So hashes and coding algorithms both provide smaller representations of data. But hashes are used for indexing and do not provide the original data, or even an approximation thereof, while compressed sensing can.


The key thing here is that compressed sensing is an attempt to throw away redundant data as early as possible in the process. Any data which is redundant in the actual data stream, or can be inferred from prior knowledge about the stream does not need to be measured.

Perceptual hashing is instead an attempt to make the matching problem easier by throwing away data that is seen as irrelevant. In the case of the described algorithm, low frequencies are chosen as relevant and high frequencies as irrelevant. As you point out, this involves losing the ability to recover the original signal and adds the risk of mismatching in cases where data thought to be irrelevant to the task is actually relevant.


If we take the principles from compressed sensing and use a random-lens approach to subsampling the original image, we can create a fingerprint of the image which also happens to be able to reconstruct the original.

Both techniques are compressions that rely on the sparse properties of images to divine which bits are meaningful and which are redundant. It appears to me that using compressed sensing is just a smarter way of doing it. Maybe a hash that starts with a random subsample is inherently slower for comparing millions of images, but I shouldn't think so.


I agree. Briefly looking at the description of it, it is some sort of compressed sensing. The differences from traditional CS are minimal in fact, but the scheme is in line with some of the work undertaken in manifold signal processing. The differences are:

- The proposed hash is deterministic. Generally in CS you want to rely on random projections (though there are some results for deterministic schemes) in order to get some sort of universality and, by the same token, some sort of robustness.

- Steps 3 and 4 are the most fascinating steps, because they are clearly one of the approaches used in manifold signal processing for images. To summarize, in order for pictures to be close to each other on a manifold, you really want to defocus them. I'll put something on my blog on the matter. This is the reason why the hashes of two images next to each other are close in the "hash" or manifold space.

- For one image, the hash seems to provide 16 measurements (16 bits of the hash result). That would be OK if the initial picture were at the size and color of the picture after steps 1 and 2, so in effect that information is lost. However, in CS you also have "lossy" schemes such as the 1-bit compressed sensing approach, where you retain only the sign of the measurement (i.e. a little like step 6). The reconstructions of these 1-bit pictures are not the original, but they are close.

(ps: I write a small blog on CS).


I developed my answer there:

Are Perceptual Hashes an instance of Compressive Sensing ? http://nuit-blanche.blogspot.com/2011/06/are-perceptual-hash...


Just an interesting note: from an implementation perspective, random projection is used in both cases (LSH and CS).


See also SIFT


Isn't SIFT still patented, though?


I don't understand why compressing an image is going to generate the same 8x8 image each time, no matter what aspect ratio it was originally or whether it has been stretched before. If you stretch and then recompress a bunch of times, don't you eventually lose the information?

That math is for some reason totally counter-intuitive to me. Could someone do a proof?


From my understanding, it's not an md5 hash or equivalent. It's just encoding the 64 bits as a "hash", thus allowing a fuzzy comparison between hashes. Someone correct me if I am wrong.

Obviously, this isn't robust enough to find all matches. A simple cropping would throw it completely off.


"I'll use a picture of my next wife, Alyson Hannigan."

That had me doing a few google searches.



