Audio Fingerprinting with Python and Numpy (2013) (willdrevo.com)
210 points by sillysaurus3 on Sept 30, 2016 | 40 comments



I used dejavu to fingerprint the entire ~2007 snapshot of the modarchive (a repository of tracker-based music), which was about 120,000 individual songs, in order to identify a song I'd first heard back in ~2006. There was definitely something special about finally identifying the song after almost a decade.

https://linux.ucla.edu/~eqy/molasses.html

The biggest surprise for me was that the entire process took only a few days on a single laptop (Sandy Bridge, no SSD).


It's almost insane how much artistic output the demoscene has produced, and yet it has remained almost entirely underground.

That's almost a year of round-the-clock music, playing 24/7, for free and completely unencumbered by any sort of licensing or playback restrictions. Most modern media players will also play these formats back without trouble (VLC, for example).

Awesome write up, I'm looking forward to reading it all.


I used dejavu for a college project with some classmates. We made an app called JamJar which stitches concert videos together based on audio fingerprints. It worked super well! We've neglected the app for a few months now, but you can still check it out at http://projectjamjar.com

Huge thanks to Deja-Vu! Awesome library


Hey! Creator here. That's so cool to see! I made Dejavu as a fun side project in grad school, and it's super fun to see all the cool stuff people make with it.

I get a couple emails a week about it, but probably the weirdest/coolest was a guy in Spain who used Dejavu to make two rap-dueling robots who spoke in Basque.


Hey, I wanted to thank you personally, too. As far as I know, yours is the only completely open source implementation of the “shazam algorithm”, and it helped me a lot during my thesis work. The blog post was also great.

Actually, I encourage anyone interested to look up Panako[1], which is meant to be a framework for comparison of different audio fingerprinting techniques.

[1]: http://panako.be/releases/Panako-latest/readme.html


Oh very cool! It looks like this runs in realtime as well?


Wow that is such an awesome idea! Well executed too. One small bit of feedback: the video player seems to 'forget' your audio setting when you switch to a different recording. Might not be worth fixing, but just in case you weren't aware...


BTW, the projectjamjar.com site crashed my Firefox 49.0.1 (Linux) every time I tried to open it.

It loads the page fine, but two seconds after that it crashes.


Hah, indeed (I can reproduce it here as well). It would be worth reporting it to Firefox.


Excellent article. His/her explanation of the theory behind audio fingerprinting was simple, to the point, and most importantly, very well explained. Thanks for submitting this.


Thanks! Glad you liked it. It's always been a frustration of mine (even reading / writing academic papers) when the author obfuscates or makes something more complicated than it needs to be.

You only truly understand something when you can explain it in lay (or close to) terms, I find.


Now you just have to explain to the language pedants why you called an audio fingerprinting tool "deja vu" (already seen) and not "deja entendu" (already heard :)


Oh, totally interesting. Didn't know that.

In the American vernacular, you get "deja vu" whenever you have that "I've been here before..." feeling.


I am curious why streaming services don't incorporate audio fingerprinting à la Shazam into their services. Does Shazam hold some patent that would prevent this?


Yes. Shazam has a bunch of patents and has sued people before:

https://en.wikipedia.org/wiki/Shazam_(service)#Patent_infrin...

The core algorithm is well known and fun to implement:

https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf

One blogger posted a simple implementation of it and was contacted by Shazam's lawyers:

http://royvanrijn.com/blog/2010/07/patent-infringement/


Dolby and Philips also have a bunch of patents. It seems like basically any version of audio fingerprinting has been patented. So unless you can come up with a fundamentally different algorithm, you're going to get sued.


Thanks for the links. I know that in the US patenting an algorithm is legal, but that is not the case in Europe, where software patents are generally not allowed. Given that, I wonder why streaming services based in Europe, such as Spotify or SoundCloud, don't offer it, considering the algorithms are well known.


I think the claim of it being impossible to patent algorithms in Europe is a bit overstated. In my experience, you just add "Method and apparatus that implements <whatever_algorithm>" and it's granted.

Pretty sure Shazam/Philips/Dolby patents cover Europe.


Why would a streaming service need it? Don't they already know the identity of the music they're streaming? Or am I missing something?


You fingerprint a song whose artist you don't know, and once it's identified you add it to your library or a playlist.

Contrast this with how it works now: you want to know which artist performs a song, so you Shazam it in one app, then once it's identified you open up Deezer or Spotify or whatever, type it into the search box, and only then add it to your library or a playlist.

By integrating fingerprinting you can identify and add new music all in the same app.


I'd speculate a poor fingerprinting mechanism would actually increase frustration.


Interesting how easy this is, while spoken word recognition is so hard. I think this is because 1) human speakers vary more in terms of the signal than our various song-playing devices and 2) spoken word recognition depends strongly on context, whereas song identity doesn't. Said another way, a lot of the challenges in language processing are not signal processing problems.


I tried dejavu a few months ago, but dropped my project due to its large database size. Now I'm trying pyAudioAnalysis: https://github.com/tyiannak/pyAudioAnalysis


I'm surprised the fingerprinting is so reliably repeatable that you can use an exact hash, like SHA-1. I would have guessed that noise, or especially filtering, could shift the peaks a few hertz around. Why isn't this a problem?


Good question. The answer is that time is an invariant here.

Dejavu uses a locality-sensitive hash (LSH), just like any other approximate-search hash might. The key thing to note is that we're binning both the frequency and time axes of the spectrogram, which gives us room for noise/error. In fact, you can tune the granularity at which this happens by adjusting the Dejavu FFT window (DEFAULT_WINDOW_SIZE), which creates a spectrogram with fewer (and therefore larger) cells.

The trade-off with smaller cells is that with too much granularity (or if the audio is even a tiny bit stretched, or there are small Doppler-shift effects), we may miss the fingerprints we want (false negatives). On the other hand, with too little granularity, our frequency bins become too large and another song/query audio might match when it shouldn't (false positives).

Luckily in either case, we don't need to see all the fingerprints, just enough to align properly in time.

So to answer your question, Dejavu is fairly resistant to noise (and can be tuned with FFT settings) as long as the audio's original timing is unchanged.
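
In rough Python, the idea looks something like this (a simplified sketch of the constellation approach, not Dejavu's actual implementation; the neighborhood size, amplitude cutoff, and fan-out value are made-up starting points):

    import hashlib
    import numpy as np
    from scipy.ndimage import maximum_filter
    from scipy.signal import spectrogram

    def fingerprint(samples, rate=44100, fan_out=5):
        # Spectrogram with coarse windows: this is the frequency/time binning
        # mentioned above (a larger nperseg changes the cell granularity).
        freqs, times, spec = spectrogram(samples, fs=rate, nperseg=4096, noverlap=2048)
        spec = 10 * np.log10(spec + 1e-10)  # work in dB, avoid log(0)

        # Keep only points that are local maxima and reasonably loud.
        peaks = np.argwhere((maximum_filter(spec, size=20) == spec) &
                            (spec > spec.mean() + 10))   # rows of (freq_idx, time_idx)
        peaks = peaks[np.argsort(peaks[:, 1])]           # sort by time

        hashes = []
        for i, (f1, t1) in enumerate(peaks):
            # Pair each peak with a few peaks occurring shortly after it and hash
            # (freq1, freq2, time_delta). Amplitude never enters the hash, which
            # is why loudness changes don't matter.
            for f2, t2 in peaks[i + 1:i + 1 + fan_out]:
                dt = t2 - t1
                if dt > 0:
                    key = "{}|{}|{}".format(f1, f2, dt).encode()
                    hashes.append((hashlib.sha1(key).hexdigest()[:20], int(t1)))
        return hashes

Matching then boils down to looking up each (hash, offset) pair in the database and checking that enough matches line up at one consistent time offset.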


I guess one explanation for the robustness is that noise is typically additive, and other types of distortions (loudness compression, EQ, reverb, etc.) are usually linear filters, i.e. they only modify the amplitude (and phase) of already-existing frequencies. If the underlying peaks are strong, this normally does not change the peak locations.
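
A quick toy check of that claim (nothing from the article, just an illustrative numpy/scipy snippet): run a tone-plus-noise signal through a simple low-pass "EQ" and confirm that the strongest FFT bin doesn't move.

    import numpy as np
    from scipy.signal import butter, lfilter

    rate = 8000
    t = np.arange(rate) / rate
    signal = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(rate)  # 440 Hz tone + noise

    b, a = butter(2, 1000 / (rate / 2))   # gentle low-pass standing in for an EQ
    filtered = lfilter(b, a, signal)

    peak_before = np.argmax(np.abs(np.fft.rfft(signal)))
    peak_after = np.argmax(np.abs(np.fft.rfft(filtered)))
    print(peak_before, peak_after)        # same bin (440) despite the amplitude change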


Exactly. Typically you won't see radio/streaming services messing around with reverb, but certainly different stereo systems have their own EQ (changes phase, amplitude), and oftentimes the format/bitrate (again: phase, amplitude) or loudness (amplitude) will be different.

Most people don't realize this, but usually artists/labels will have a differently mastered version of a song on each platform. Spotify, for instance, has its own normalization algorithm that it puts tracks through to even out the listening experience in terms of loudness (RMS). Of course artists and their mastering engineers want to have some control over how that sounds, so they will change it.


I wish projects like MusicBrainz [1] were more popular, along with tools like Picard [2]. They use the AcoustID [3] audio fingerprinting service/library.

[1] https://musicbrainz.org/

[2] https://picard.musicbrainz.org/

[3] https://acoustid.org/


Some audio fingerprinting databases aim at identifying the whole song, usually for tagging purposes. These are a bit different from music-identification apps like TrackID, which need just a few seconds of a sample.


In the past, I've been in discussions about using a fingerprinting technique for videos. Provide a small clip of a movie, and out pops the title of the movie. This was always intended to be used on a small and well-defined library, and never intended to be used on something like YouTube.

One major problem with video is that you could have SD/HD versions and/or full frame/original aspect ratio types of differences of the same movie. One idea I wanted to play with was to detect edit points. The number of frames between edits could be used as the fingerprints. The entire concept was never anything more than a thought exercise. For the purpose of the exercise, we had to assume that there is no audio with the picture.

There are a lot of FFT libraries for processing image data, since most image compression techniques use some sort of FFT. Could this same type of fingerprinting be used for visual images? Could the amplitude of the RGB frequencies be used over time? The data set would grow with three channels of color, but wouldn't that also help decrease false positives by making the combinations more unique?


SD vs HD is largely a solved problem in academia -- the key phrase is "resolution independence." Also, some robustness to cropping, scaling, and flipping is desired in robust fingerprinting systems.


Just thinking out loud: perhaps start with calculating perceptual hashes [1] of frames in both videos and use the inter-frame hash distances of the videos as fingerprints. As I understand it, perceptual hashing is robust against both resolution and color variations.

[1]: http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-I...
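
For the curious, here's a minimal average-hash ("aHash") sketch in the spirit of the linked post; this is a hypothetical helper using Pillow, not code from the article:

    from PIL import Image

    def average_hash(image):
        # Shrink to 8x8 grayscale, threshold each pixel against the mean,
        # and pack the 64 resulting bits into an int.
        small = image.convert("L").resize((8, 8), Image.LANCZOS)
        pixels = list(small.getdata())
        mean = sum(pixels) / len(pixels)
        bits = 0
        for i, p in enumerate(pixels):
            if p > mean:
                bits |= 1 << i
        return bits

    def hamming(a, b):
        return bin(a ^ b).count("1")

The sequence of inter-frame distances hamming(h[i], h[i+1]) over a clip could then serve as the fingerprint suggested above, and it survives resolution changes because everything is reduced to 8x8 before hashing.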


I suggest using LIRE. You may study my use case here. http://www.semanticmetadata.net/2016/03/04/lire-use-case-wha...


If you want to detect edits, you don't need an FFT. At an edit point, the whole image changes almost instantly (ignoring wipes). That means you just need to check whether the sum of per-pixel diffs from one frame to the next is above some threshold.
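
A hedged sketch of that idea (it assumes frames arrive as same-sized grayscale numpy arrays, e.g. decoded with imageio or OpenCV, and the threshold is a made-up starting point):

    import numpy as np

    def find_cuts(frames, threshold=30.0):
        # Flag a cut wherever the mean absolute per-pixel difference between
        # consecutive frames jumps past the threshold.
        cuts, prev = [], None
        for idx, frame in enumerate(frames):
            frame = frame.astype(np.float32)
            if prev is not None and np.mean(np.abs(frame - prev)) > threshold:
                cuts.append(idx)
            prev = frame
        return cuts

The gaps between successive cut indices (np.diff of the result) would then be the "frames between edits" fingerprint described upthread.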


I once used a modified version of Echoprint to fingerprint a few million tracks from a music service we were working on. The most fun bit was maxing out 80 cores and a LAN segment with a mix of Celery workers that fetched tracks, fed them to a C++ fingerprinter, and stored the data in Postgres.

The EP fingerprints were a lot smaller, though. IIRC it used a mix of beat and tonal detection.


Echoprint is wonderful for fingerprinting speed (C++), but the fingerprint size is actually smaller in Dejavu (binary(10) field in SQL for each fingerprint).

The other interesting differences to note are that Echoprint doesn't use a constellation-plus-offsets fingerprinting approach, and its fingerprints are meant to be the same across all platforms / use cases so you can compare them.

As a direct result, you also can't get the offset in seconds that your query audio refers to like you can with Dejavu.

When I coded up this project, I wanted something that was more customizable, allowing you to decide the speed, number of fingerprints, size of the fingerprints, etc., to match your own false-positive / memory / CPU requirements.

When you do, you sacrifice interoperability between Dejavu index installations, but you gain that application-specific performance. Which library is better of course depends on your use case.


We modified EP to take multiple fingerprints to achieve the same result with offsets (down to 10 seconds, I think), and built a web UI prototype for matching audio from a desktop browser.

It didn't end up becoming a telco service solely due to commercial agreements, but it was a lot of fun and almost embarrassingly accurate with ABBA songs (since we ended up trying a lot of variations on the first entries in the catalogue).


This is awesome. I'm going to use this to fingerprint a huge library of tracks in a local language that I have.


Would it be possible to create some software that will prevent apps like Shazam from recognizing the song?


Not really. It's not like a neural network, which approximates some crazy function over the feature space and leaves brittle points you can exploit to get false positives.

You could, for instance, insert other tracks into the audio additively to try and confuse the fingerprint retrieval logic into suspecting a different track, but since this and many other fingerprinting techniques depend on the actual frequency of the audio emitted, there's no shortcut to obscuring the actual track.



