
This was the smart approach when Shazam launched in 2008. I would have done exactly the same thing - gone straight to developing a method to turn every song into a hash as computationally efficiently as possible. If you launched this today the default R&D approach would be to train a model which may turn out to be far less efficient and more expensive to host. It feels like the kind of thing a model might be good at, but given that there are a finite number of songs, taking a hash-based approach is probably way more performant.



> to turn every song into a hash

Just to be clear, it's not turning each song into a hash.

It's turning each song into many hundreds (thousands?) of hashes.

And then you're looking for the greatest number of mostly-consecutive matches of tens (or low hundreds) of hashes from your shorter sample.
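
As a rough illustration of that matching step, here is a minimal Python sketch. It assumes each fingerprint is a (hash, time offset) pair and votes for the song and time shift that line up the most hashes. The function names and toy data are hypothetical, and this is not Shazam's actual implementation.

```python
from collections import Counter, defaultdict

def build_index(songs):
    """Map each hash to every (song_id, time_offset) where it occurs."""
    index = defaultdict(list)
    for song_id, fingerprints in songs.items():
        for h, t in fingerprints:
            index[h].append((song_id, t))
    return index

def best_match(index, sample):
    """Vote per (song_id, time_shift); hashes that line up share one shift."""
    votes = Counter()
    for h, t_sample in sample:
        for song_id, t_song in index.get(h, ()):
            votes[(song_id, t_song - t_sample)] += 1
    if not votes:
        return None, 0
    (song_id, _shift), count = votes.most_common(1)[0]
    return song_id, count

# Toy database: a handful of hashes per song (real systems store thousands).
songs = {
    "song_a": [(11, 0), (22, 1), (33, 2), (44, 3), (55, 4)],
    "song_b": [(22, 0), (66, 1), (33, 2), (77, 3), (11, 4)],
}
index = build_index(songs)

# A short sample that starts two time steps into song_a.
sample = [(33, 0), (44, 1), (55, 2)]
match = best_match(index, sample)
```

A stray hash can collide with another song (hash 33 appears in both above), but only the true song collects many votes at a single consistent shift.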

Also, I don't think this would be done with training a model today, because you're adding many, many new songs each day, that would necessitate constant retraining. Hashes are still going to be the superior approach, not just for efficiency but for robustness generally.


I'm an MLE, I would probably chop the songs into short segments, add noise (particularly trying to layer in people talking, room noise, and apply frequency-based filtering), and create a dataset like that. Then I would create contrastive embeddings with a hinge loss with a convnet on the spectrogram.

Ultimately this looks the same, but the "hashes" come from a convnet now. But you still are doing some nearest neighbor thing to actually choose the best match.

I imagine this is what 90% of MLEs would do; I'm not sure whether it would work better or worse than what Shazam did. Before knowing that Shazam works, I might have thought this was a pretty hard problem. Knowing that Shazam works, I am very confident the approach above would be competitive.
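
To make the loss concrete, here is a minimal pure-Python sketch of a triplet-style hinge loss on embedding vectors, plus a toy noise augmentation. The convnet itself is left out; the vectors below are hand-picked stand-ins for its outputs, and all names and the margin value are assumptions for illustration.

```python
import math
import random

def add_noise(segment, noise_level, rng):
    """Toy augmentation: mix Gaussian noise into a clean segment."""
    return [x + rng.gauss(0.0, noise_level) for x in segment]

def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer than the negative by at least `margin`."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)

# Augmentation: a noisy copy of a clean segment would be the "positive" input.
rng = random.Random(0)
clean = [0.0, 0.5, 1.0, 0.5]
noisy = add_noise(clean, 0.1, rng)

# Hand-picked embeddings standing in for the network's outputs.
anchor = [0.0, 0.0]          # clean segment of song A
positive = [0.0, 1.0]        # noisy copy of the same segment
far_negative = [3.0, 4.0]    # segment of another song, already far away
near_negative = [0.0, 1.5]   # segment of another song, still too close

loss_easy = triplet_hinge_loss(anchor, positive, far_negative)
loss_hard = triplet_hinge_loss(anchor, positive, near_negative)
```

Only the near negative produces a nonzero loss, so training pressure goes to the pairs the embedding still confuses.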


Why add noise to the training set, rather than attempt to denoise the input?


So you want a locality-sensitive hash, or embedding, and you want it to be noise resistant.

The ML approach is to define a family of data augmentations A, and a network N, such that for any augmentation f in A, we have N(f(x)) ~= N(x). Then we learn the weights of N, and on real noisy data x' have N(x')~=N(x).

The denoising approach is to define a set of denoising algorithms D and hash function H, so that H(D(x'))~=H(x). This largely relies on D(x')~=x, which may have real problems.

So the neural network learns the function we actually need, with the properties we want, where the denoiser is designed for a proxy problem.

But that's not all...

Eventually our noise model needs extending (eg, reverb is a problem): the ML approach adds a new set of augmentations to A. This is fine: it's easy to add new augmentations.

But the denoiser might need some real algorithm work, and hope that there's no bad interaction with other parts of the pipeline, or too much additional compute overhead. (And de-reverb is notoriously hard.)


That could work, but I think denoising a song to be a perfect match to the original recording is probably a very hard problem. It is so hard that your model will still need to be robust to some deviation from the original track, and therefore you need to do what I said above anyway.

Generally it's much easier to generate noised pairs from clean input than it is to do the reverse, i.e. go record lots of noised inputs from the wild and match to the original song. So the denoising problem you mention would be tougher still due to covariate shift. I think the features you learn trying to fingerprint the song through noise will probably be a bit more robust, but I don't have a mathematical proof.


Because then you’re training it on data that is more similar to the operating environment for the application. It’s a better fit for purpose. If the target environment was a clean audio signal, you’d optimise for that instead.


Adding noise is generally helpful for regularization in ML. Most modern deep learning approaches do this in one way or the other - mostly dropout. It improves generalization capabilities of the model.


To start from an original song and move it towards something that resembles a real-life recording? IOW: make the NN learn to distinguish between the song's sound and its environment?


I'd assume adding noise is done once per song and is thus a bit computationally cheaper than trying to denoise each input.


Depends what the model targets. If I gave this problem to a bunch of musicians, they’d be pulling out features like the key, tempo, meter, chord progressions, any distinctive riffs or basslines, etc. Those are the things hashes could be built from and would be more information-dense than samples of the particular recording.

Using a model to deconstruct a song like that might enable the ability to recognize someone playing the opening bars of Mr. Brightside on a piano in a loud bar as well as its drunkest patrons.


"Features" was Pandora's original design (Music Genome Project). IIRC you can still see it in the UI in how they describe songs.


you couldn't recognize anything ambient like, let's say, Loscil


Hard times for Indian classical music, as well...


> that would necessitate constant retraining

You wouldn't necessarily need to retrain that frequently. If your model outputs hashes / vectors that can be used for searching, you just need to run inference on your new data as it comes in.
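
A minimal sketch of that idea, assuming a frozen embedding model and a brute-force nearest-neighbor search. `toy_embed` is a hypothetical placeholder for a trained network, and `EmbeddingIndex` is an illustrative name, not a real library.

```python
import math

def toy_embed(audio):
    """Hypothetical stand-in for a frozen, already-trained model."""
    n = len(audio)
    return [sum(audio) / n, sum(abs(x) for x in audio) / n]

class EmbeddingIndex:
    def __init__(self, embed):
        self.embed = embed   # never retrained when songs are added
        self.entries = []    # list of (song_id, vector)

    def add(self, song_id, audio):
        """Adding a song is one inference call plus an append."""
        self.entries.append((song_id, self.embed(audio)))

    def search(self, audio):
        """Return the song whose stored vector is closest to the query."""
        query = self.embed(audio)
        return min(self.entries, key=lambda e: math.dist(query, e[1]))[0]

index = EmbeddingIndex(toy_embed)
index.add("song_a", [1.0, 1.0, 1.0, 1.0])
index.add("song_b", [-1.0, 1.0, -1.0, 1.0])

# A new release arrives: embed it and store it, with no training step.
index.add("song_c", [0.0, 0.0, 0.0, 0.0])

# A slightly noisy recording of the new song.
result = index.search([0.1, -0.1, 0.1, 0.0])
```

At real scale the linear scan would be replaced by an approximate nearest-neighbor index, but the add path stays inference-only.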


Definitely this, embeddings would be the modern approach.


The "modern" approach sounds like it would be a lot worse than this approach both in terms of input and runtime performance.

Trendy ("modern") is not necessarily better.


You would probably take the approach that is used for face recognition. You train a model that can tell if two faces are the “same”. Then you match an unknown face against your database of faces. You can get clever and pull out the “encoding” of a face from the trained model.


You would just rename hashes to embeddings.


The smart approach in 1975 was to use Parsons code [1], which also turned songs into hashes, computable in your head. You could then find your song again as simply as looking up a word in a dictionary. Hopefully this idea won't die any time soon.

[1]: https://en.wikipedia.org/wiki/Parsons_code
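
For anyone unfamiliar, Parsons code records only whether each note goes up (U), down (D), or repeats (R) relative to the previous one, with `*` for the first note. A minimal sketch, assuming the melody is already given as a list of pitch numbers (MIDI note numbers here, which is my choice for the example):

```python
def parsons_code(pitches):
    """Encode a melody as '*' followed by U/D/R for each pitch change."""
    if not pitches:
        return ""
    code = ["*"]
    for prev, cur in zip(pitches, pitches[1:]):
        if cur > prev:
            code.append("U")
        elif cur < prev:
            code.append("D")
        else:
            code.append("R")
    return "".join(code)

# Opening of "Twinkle Twinkle Little Star": C C G G A A G
twinkle = [60, 60, 67, 67, 69, 69, 67]
code = parsons_code(twinkle)

# Transposing the melody leaves the code unchanged.
transposed = parsons_code([p + 5 for p in twinkle])
```

The code ignores key, rhythm, and interval size, which is why it is easy to compute by ear and also why it throws away so much information.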


That requires identifying the melody, which is certainly not something all humans can do, and was probably not generally doable by a machine in 1975. It also throws away a huge amount of information, and requires starting from the beginning of the melody.


I'd guess that this won't work well for EDM (electronic dance music) tracks.


> This was the smart approach when Shazam launched in 2008.

Nitpick, but Shazam launched in 2002 as a dial-in service that replied with a text-message of the result. The first phone app was for BREW in 2006.

The 2008 date is just when Apple launched the app store; it was not possible for a third party to make an iPhone app before 2008.


That’s more than a nitpick, that’s incredible! It still feels a bit magic today; I can’t imagine how magic that seemed in 2002!


It launched in Britain. I remember dialing 2580 in a nightclub, waving the phone around, and everyone being impressed when the identification was correct, as it usually was. Even with relatively obscure music.


A friend was in the beta and demoed it to us in a bar... it was insane

In the UK you dialled 2580 from your (non smart) cellphone, it would hang up after a few seconds and you’d get an SMS right away with the ID of the track


tbh for tools like Shazam there's no fundamental difference between a database + hashing algorithm and a self-supervised model; both are great indexing & compression solutions, just for different scales of data.


If you trained a model for this, how would you avoid having to run the entire training process again every time you needed to add another song?

I wonder if there's a way to build an embeddings model for this kind of thing, such that you can calculate an embedding vector for each new song without needing to fully retrain.


You'd just have the network generate fingerprints for any given song, similar to how facial recognition is done.

Siamese networks are what you want: two identical stacks of layers (one cached in this case) produce the fingerprints, and then the final layers do the similarity matching.


If the model is successful, it will be able to predict the artist and song title for music it’s never heard.


No,

People who are highly skilled at this can be easily stumped. Sure, it might work for artists who are more focused (Taylor Swift), and it might pick out some interesting guest appearances (Eddie Van Halen on Beat It), but when you get multi-talented performers who change everything about what they do, they don't fit a "model". The most current example would be André 3000's latest release.


Um, yeah, you won't be able to model artists who don't follow a model (especially done so deliberately). As you say that is true of humans or computers alike. But it's not the problem anyone cares about and not what the parent comment intended.

Certainly a well trained model will be able to have incredible accuracy just with vocals alone. It will be able to identify Lady Gaga regardless of whether she is singing a new art pop track or old standard with Tony Bennett.


Top 40, pop != music

We could have a debate about the consistency of Gaga or Taylor Swift and the profit motive (and we could go all the way back to composers of the classical period with this).

What about all the people who back pop artists? I don't think picking out the Wrecking Crew is gonna be possible (it might be, but harder) https://en.wikipedia.org/wiki/The_Wrecking_Crew_(music)

I could also point you to Diplo who, as a "producer", is responsible for diverse sounds with his name directly on them, and then side projects like Major Lazer or M.I.A.'s Paper Planes that have his hallmarks but aren't "musically" linked. How about the collected work of Richard D. James? I'm not so sure that all the parts fit together outside the whole of them.

Stewart Copeland was the drummer for The Police, a very distinct and POP sound. Are we going to be able to use ML to take those works and correlate them to his film scores? How about his opera? Dave Grohl, Phil Collins, Sheila E, more drummers who became singers: what is the context for ML finding those connections (or people)?

John Cage's 4'33" is gonna be an interesting dilemma.

Do you think the player piano Black Hole Sun and C.R.E.A.M. covers from Westworld are picked up as stylized choices by Ramin Djawadi, and would it link those to the soundtrack of Game of Thrones?

Even with all the details it's sometimes hard to believe what talented people can do and how diverse their output can be!


Given our current trajectory this will probably be possible in 10 to 15 years


this is a rather funny joke

but if it is not that would be extremely impressive! determinism/freewill reduces to shazam!?

what's the training data to predict new song titles? heh

check out this reply from claude2:

>predict the next 3 new song titles from artist Taylor Swift

1. Last Dance with You - A reflective ballad about finding closure after a breakup. 2. Never Getting Back Together - A pop tune emphasizing that the same mistakes won't be made twice in a relationship. 3. 22 Was My Prime - A lighthearted look back on her early 20s as carefree years that can't be replicated.

...


a quick aside...

Whenever music is mentioned in conjunction with technology, one artist seems to always - in a very literal sense - pop up like a zombie in a B-movie... Taylor Swift. No idea who this person is or what they do but they appear everywhere, all at once.

It feels like a conspiracy.


Shazam is used to identify the artist and name of songs. I don’t want it guessing when precise information is available.


>This was the smart approach when Shazam launched in 2008.

A noteworthy mention would be that Sony's TrackID did most likely the same thing on their feature phones a few years before Shazam.


Hi - This is Chris Barton (founder of Shazam). Sony's TrackID was built by licensing (and later buying) a technology invented by Philips. That tech was invented after Shazam. Shazam was the first to create an algorithm that identifies recorded music with background noise in a highly scaled fashion. My co-founder, Avery Wang, invented the algorithm in 2000. Chris (www.chrisjbarton.com)


Hi Chris, it's Cornel Masson. I was at the London office from 2002-2006, then 4 more years working remotely from South Africa.

I worked on all the Java infrastructure around the recognition cluster (the latter being handcrafted C and assembly, optimised for specific Intel hardware).

The thing that Shazam got right was not just the core recognition tech, but the business processes and supporting systems around it. I remember how much work Chris had to do to convince the 4 major mobile networks in the UK to give Shazam the same 2580 dialing code (the middle 4 buttons, top to bottom, on an early 2000s feature phone).

A major part of the business is the constant sourcing and ingestion of the latest music, in all target markets (think Afrikaans pop in South Africa), deals with pluggers and record labels, etc. Initially, the back catalog was ripped from CD by a huge team of people in a warehouse, on custom workstations.


Cough Cough, Geoff Schmidt, Matthew Belmonte, Tuneprint?

Edit: tho for sure, the Philips algorithm was better than either of ours.


Tuneprint was published in 2004, no? Shazam filed their patents in 2000 and launched in 2002.


No no no, Tuneprint was well before that. By 2004 we were LONG gone. Shazam didn't show up until I think years later.

And I might be confusing them with another group but I thought, at the time, they were doing some goofy hash of the highest energy Fourier components -- a source of entertainment in our office. ;-)

I think Geoff had the vision and algorithm from the 90s as part of an ISEF project (!?). We had funding in 2001, when we got the real world go-to-your-car-and-get-a-cd and then we identify it ... using the audio signal alone.... demo working.

With a corpus of hundreds of thousands of songs. Positive match in less than 2 seconds.

Sadly, in 2001 there's no market for such whizbang amazing tech.


> Sadly, in 2001 there's no market for such whizbang amazing tech.

Shazam only launched one year after that, maybe the problem was in the marketing not the market itself?


2008? I remember using it in the UK in 2000/2001?


Pretty sure it launched in 2002 for 60p per song.


I don’t remember it being something you paid for.


The phone call was free, and the text they sent if the identification was correct was around 50p.


Was it iPhone or Android?


It was a phone number you called, which then SMSed you the result


It was originally an SMS service


This is true, they even tried to treat it as a speech recognition (ASR) problem in 2012: https://research.google/pubs/pub37754/


I could totally see RIAA shutting it down or ASCAP issuing a licensing invoice to stop a train on its tracks just because that's how they think


What makes you say this? They existed and there were a ton of similar products as well. None of them got any grief from those organizations.


none of the companies making the previous hashing apps had the money that AI companies do. around here, the phrase "fish in a barrel" might be used to describe the situation.


Well, under the hood a neural net kind of also builds hashes, just less accurate.



