Omnizart: Library for automatic music transcription (github.com/music-and-culture-technology-...)
284 points by pizza on Dec 18, 2021 | 41 comments



For anybody experimenting with this for piano: I highly recommend Google's 'Onsets and Frames' algorithm as embodied in their demo:

https://piano-scribe.glitch.me/

I built something similar that is a lot faster, but on a large-scale test the Google software handily outperforms my own (92% accuracy versus 87% or so, and that is a huge difference because it translates into roughly 30% fewer errors).
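For what it's worth, a back-of-envelope version of that accuracy-to-error translation, assuming "accuracy" here is note-level and errors are simply 1 − accuracy (which may not be exactly how the numbers above were measured):

```python
# Illustrative only: how a 92% vs. 87% accuracy gap becomes a relative
# reduction in errors, assuming error rate = 1 - accuracy.
acc_theirs, acc_mine = 0.92, 0.87
err_theirs, err_mine = 1 - acc_theirs, 1 - acc_mine
relative_reduction = (err_mine - err_theirs) / err_mine
print(f"{relative_reduction:.0%} fewer errors")  # ~38%, the same ballpark as above
```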


Wow, the Onsets and Frames algorithm is insanely interesting. It's like a mixture of run-length encodings of (vertical and horizontal) strings of 0-dimensional and 1-dimensional structures in time (onsets as points in time, frame activations as lines in time). But... hm... why stop at such low-dimensionality structures..! :^)
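For anyone who hasn't read the paper, here is a rough sketch of that "points and lines" idea, i.e. how per-frame onset and frame probabilities are typically decoded into notes. This is a simplified illustration, not Magenta's actual code; the threshold and the two probability matrices are assumptions.

```python
import numpy as np

def decode_notes(onsets, frames, threshold=0.5):
    """Turn (time, pitch) onset/frame probability arrays into note events.

    A note starts only where the onset head fires (a point in time), and is
    sustained for as long as the frame head stays active (a line in time).
    """
    notes = []  # (pitch, start_frame, end_frame)
    n_time, n_pitch = frames.shape
    for p in range(n_pitch):
        t = 0
        while t < n_time:
            if onsets[t, p] > threshold:
                start = t
                t += 1
                while t < n_time and frames[t, p] > threshold:
                    t += 1
                notes.append((p, start, t))
            else:
                t += 1
    return notes

# Tiny usage example with fake probabilities (6 frames, 2 pitches)
onsets = np.array([[0.9, 0.0], [0.1, 0.0], [0.0, 0.8], [0.0, 0.1], [0.0, 0.1], [0.0, 0.0]])
frames = np.array([[0.9, 0.0], [0.8, 0.0], [0.1, 0.9], [0.0, 0.9], [0.0, 0.2], [0.0, 0.0]])
print(decode_notes(onsets, frames))  # [(0, 0, 2), (1, 2, 4)]
```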


There has since been follow-up work to extend Onsets & Frames to multi-instrument music: https://magenta.tensorflow.org/transcription-with-transforme...


Shame that it uses quadratically scaling transformers - there are many sub-quadratic transformers that work just as well or better (https://github.com/lucidrains?tab=repositories) - because that 4-second sub-sample limitation seems quite unlike how I imagine most people experience music. Interesting, though. I wonder if I could take a stab at this...

Also interesting that the absolute timing of onsets worked better than relative timing - that also seems kind of bizarre to me, since when I listen to music it is never in absolute terms (it's not "wow, I just loved how this connects to the start of the 12th bar" but rather "wow, I loved that transition from what was playing 2 bars ago").

Another thing on relative timing: when I listen to music, very nuanced, gradual, and intentional deviations of tempo have significant sentimental effects for me - which suggests that you need a 'covariant' description of how the tempo changes over time. So not only do you need the relative timing of events, you also need the relative timing of the relative timing of events.

Some examples:

- Jonny Greenwood's Phantom Thread II from the Phantom Thread soundtrack [0]

- the breakdown in Holy Other's amazing "Touch" [1], where the song basically grinds to a halt before releasing all the pent up emotional potential energy.

[0] https://www.youtube.com/watch?v=ztFmXwJDkBY, especially just before the violin starts at 1:04

[1] https://www.youtube.com/watch?v=OwyXSmTk9as, around 2:20


There are already tools that can estimate the variation of tempo over time (rubato). Librosa's "tempo" function does the job well for some types of music - it can even give you a 3D/heatmap plot with the likelihood of every tempo value at every moment: https://librosa.org/doc/main/generated/librosa.beat.tempo.ht...
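A minimal librosa sketch of what that looks like in practice (the input file name is a placeholder): a per-frame tempo curve, plus the tempogram, which is where the likelihood-of-every-tempo-at-every-moment heatmap comes from.

```python
import librosa

# Sketch: estimate a time-varying tempo curve (one BPM value per onset-strength
# frame) and the tempogram, i.e. the likelihood of every tempo at every moment.
# "piece.wav" is a placeholder input file.
y, sr = librosa.load("piece.wav")
onset_env = librosa.onset.onset_strength(y=y, sr=sr)

# aggregate=None keeps a per-frame estimate instead of one global tempo value
tempo_curve = librosa.beat.tempo(onset_envelope=onset_env, sr=sr, aggregate=None)

# Full tempo-vs-time likelihood map (what you'd plot as a heatmap)
tempogram = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr)
print(tempo_curve.shape, tempogram.shape)
```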

Rubato is everywhere in classical music, and understanding rubato is an essential part of any automatic transcription system that aims to show you notes in musically meaningful units of time.


I would test whether a longer sub-window actually helps before worrying too much about this. I doubt it matters in practice. Whether it's how people experience music or not is rather immaterial.


I just tried this and it works very very nicely! Thank you for sharing!


I just tried this with Sexy Sadie and the result was awful.


For anyone interested, I've transcribed this song [1] using the Replicate link the author provided (Colab throws errors for me), using mode music-piano-v2. It spits out mp3s there instead of midis, so you can hear how it did [2].

[1] https://www.youtube.com/watch?v=h-eEZGun2PM [2] https://replicate.com/p/qr4lfzsqafc3rbprwmvg2cw5ve


Awesome, thanks for running the test!

I can't help but feel it is heavily impacted by the ambience of the recording as well. The midi is of course a very rigid and literal interpretation of what the model hears as pitches over time, but it lacks the subtlety of realizing that a pitch is sustaining because of an ambient effect, or that the attack is actually a little bit before the beginning of the pitch, etc.

If it could be enhanced to consider such things, I bet you would get much cleaner, more machine-like midis, which are generally preferable.


Listening to the input and output, the reproduction is like comparing a cat to a picture that resembles a cat, but isn't a cat.


Music's a lot more than a collection of notes ... and the timbre of one piano is about as far away from a mixture of reeds and dulcet electronics as you can get. (The Fulero is very nice.)


How do you get a midi?


You'd have to run it yourself. There's a Docker image available, but it's a pretty big download (11.7GB).
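If you do run it locally, the flow is roughly the following - a sketch wrapping the CLI subcommands described in the Omnizart README via Python; the exact subcommand names and output location may differ by version, and "song.wav" is a placeholder.

```python
import subprocess

# One-time download of the pretrained checkpoints (subcommand name per the README)
subprocess.run(["omnizart", "download-checkpoints"], check=True)

# Transcribe a recording; this should leave a MIDI file you can open
# in a DAW or notation software
subprocess.run(["omnizart", "music", "transcribe", "song.wav"], check=True)
```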


I'm still looking over this to see its capabilities, but if I'm reading this right, we can turn any mp3/wav into a set of midis, which allows us to import them into music editing software (like Finale). If this works, this is huge. Congrats to the team.


That is a tremendously big “if” considering how many times that problem has been attempted. Even just detecting the key of a song is awfully fuzzy.


Detecting the key of a song is also not deterministic. Some songs' keys are truly ambiguous and/or subjective.


Aye. Polyphonic pitch is a much simpler problem than "key", which is more akin to rigorous sentiment analysis or some other intractable goal.


My gut is that computing key (given reliable pitch data) is a lot easier than computing polyphonic pitch data (given audio).

For relatively "conventional" music, there are very strong signals of key like beginning and ending chords, and overall note distributions which will generally cluster around one particular scale. For less conventional music, this will be more ambiguous, but it would have been more ambiguous for a human listener too.


The result can be used as audio fingerprints, which is not a new thing. This has something to do with how things like Shazam work.


I tried transcribing from a YouTube link via the Colab link, but it generated a bunch of errors.

I found this link to be more helpful than the GitHub repo for understanding what it does:

https://music-and-culture-technology-lab.github.io/omnizart-...


I don't have absolute pitch. I once tried to "reverse engineer" a guitar solo using a tuner. Too much work, and not a very good result. I hope this finally brings music transcription to less skilled/gifted musicians and hobbyists.


Neither the replicate.com nor the colab.research.google.com demos work for me.

The Colab notebook is full of warnings and crashes with errors in the "Transcribe" box. Replicate.com does something, but the results are garbage.

What am I doing wrong?


Sounds incredible and I'm curious how well it works. For a quick intro to the state of the art in this space, watch the Melodyne videos on YouTube. In short, and without having tried Omniscient Mozart: I would not expect it to give perfect results without manual help. If it could aid transcription in a semi-automatic way, like Melodyne does, that would already be a big victory for an open-source alternative.


Been using Omnizart for drum transcription; however, the most accurate piano transcription model I've come across is from ByteDance -- https://github.com/bytedance/piano_transcription


A friend of mine is a violin teacher, and he has been asking if it's possible to build an app that would recognize the notes being played from a microphone, in realtime, and overlay them on the expected sheet music, to help kids see where they might have missed a note…

Which of the existing string music transcription libraries would fit the bill?


In the field of speech, you have the concept of forced alignment, aligning an audio recording with a transcript by way of text-to-speech and pattern matching, which can certainly be used to detect some sorts of errors consistently and various other errors with some probability. (I’m not up to the point of doing this, but I’m early in making software for audiobook recording and plan to see what I can manage in error detection eventually. Unfortunately from what I’ve seen just with a little bit of casual playing around, small errors are not generally well identified because there will be too many false positives.) You could probably adapt a lot of that stuff to music without too much trouble, though it’d certainly require changes and integration effort. My vague feeling is that this direction, with MIDI rendering and pattern matching, would probably be surprisingly effective and I suspect better than trying to more directly match the recording to the score with the help of just Fourier transformations, or converting the audio to a score with something like this and comparing them, though even with the forced alignment adaptations I imagine you’d be changing it to use mostly Fourier transformations for feature detection.
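As a rough (offline, not realtime) illustration of the "render the score and pattern-match" idea, here is a sketch using chroma features and DTW in librosa. The file names are placeholders, and the reference is assumed to be a synthesized rendering of the score; large deviations along the warping path are candidate spots where the player drifted from the expected notes or timing.

```python
import librosa

# Align a student's recording to a reference rendering of the score using
# dynamic time warping over chroma (pitch-class) features.
perf, sr = librosa.load("student_take.wav")          # placeholder file names
ref, _ = librosa.load("reference_rendering.wav", sr=sr)

chroma_perf = librosa.feature.chroma_cqt(y=perf, sr=sr)
chroma_ref = librosa.feature.chroma_cqt(y=ref, sr=sr)

# D is the accumulated cost matrix, wp the frame-to-frame warping path
D, wp = librosa.sequence.dtw(X=chroma_ref, Y=chroma_perf, metric="cosine")
print("alignment path length:", len(wp), "final cost:", D[-1, -1])
```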

(I’m a good programmer in general, but still decidedly a layman in audio matters. Take my thoughts here with a grain of salt.)


This is what Yousician does for singing (and other analog instruments) but unfortunately not for the violin: https://yousician.com/singing

Haven't tried it but maybe the following app could help?

> Trala uses signal processing, groundbreaking technology that analyzes the sound of your violin and gives you instant feedback on pitch and rhythm every time you practice. When you play the wrong notes, we’ll help you get back on track.

https://www.trala.com/ (I just remembered it being recommended somewhere.)


Does anyone know if the latency of this is good enough for realtime processing? I've been looking for something to transform guitar audio into MIDI for a synth for years, and all the commercial solutions are pretty poor. The best I've found is buying a Guitar Hero guitar, which has MIDI out.


Probably not feasible but I've always thought an app that graded your ability to sing along to your favorite music would be fun. Being able to automate the isolation of the vocal track to judge the user against would obviously be required, so it's nice to see some advancements in that direction.


There’s plenty of video games that do this going back to 2003’s Karaoke Revolution - if not earlier. Plenty of other games and series followed suit including as Rock Band, Guitar Hero, Lips, SingStar. There’s plenty of mobile apps that work similarly. AFAIK they basically track your pitch and timing. I remember Lips in particular rewarding vibrato as well.

The algorithms, as I understand them, grade you based on sustained pitch (adjusted for octave) and cadence. Generally they are easy to fool - you can basically say gibberish, but as long as it's within the expected parameters you'll still get a good score.
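A toy sketch of that kind of octave-forgiving pitch scoring (the audio file and the target melody here are placeholders; real games compare against the song's actual vocal line):

```python
import numpy as np
import librosa

# Track the singer's f0, fold sung and target pitches onto one octave
# (pitch classes), and count frames that land within a semitone of the target.
y, sr = librosa.load("singer.wav")  # placeholder recording
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C6"), sr=sr)

sung_midi = librosa.hz_to_midi(f0)
target_midi = np.full_like(sung_midi, 60.0)  # placeholder: a held middle C

# semitone distance on a 12-note circle, i.e. ignoring octave errors
diff = np.abs((sung_midi - target_midi + 6) % 12 - 6)
score = np.mean(diff[voiced] <= 1.0)
print(f"in-tune frames: {score:.0%}")
```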

They are still rather fun though if you enjoy that sort of thing.


Check out the Rock Band video game series, they had a really nice (non automated) version of this: https://youtu.be/pV9_O9bKiLg


I think what you're describing is the SingStar game on PS3 :D We used to play it a lot with friends. Truly fun.


What does transcription mean?


Well, usually (?) it means turning a piece of music (audio) into written sheet music/a score. It means that in jazz and...I believe many kinds of contemporary music. (In classical music I've heard it applied to e.g. turning an orchestral score into a piano score, i.e. an arrangement.) Here they appear to just make MIDI files, avoiding the problem of how to write the notes most readably, which is a significant part of transcription, in both these senses.

All the examples given here, though, appear to be of a super-simple variety: dead-simple chords, all notes with robotic, mathematically simple timing over fixed tempos, apparently produced by drum machines - like toy music, the kind of music that's no challenge at all to transcribe. In the real world I wouldn't bother transcribing it by hand, as there's nothing to be learnt by doing so; you can hear exactly what's going on without it. So, that's weird.


In this context it means listening to music and writing it down in music notation.


To transfer / scribe something that you have heard. Just like a "transcript" is the written record of spoken words, transcription is the act of transferring something heard into written form.

Transcription is a valued skill among musicians. The ability to hear a piece of music and write it down can be more than a little harder than transcribing speech.

Not sure if apocryphal, but Mozart was known as an expert transcriber, able to hear an orchestral work and write down all the different instruments' parts from just one listen.


Seems conceivable. I remember a friend in high school band camp with a cassette player and a stack of manuscript paper who transcribed a Chicago song by playing a couple seconds of it, writing down all the parts and then playing the next couple seconds. I really wish I'd understood back then that this was a learnable skill. It wasn't until I was in my 30s and had a few years of choral singing under my belt that I began to develop some simulacrum of that ability.


Scribe means to write, and trans- means across or over.

In this case they mean to take music (audio) and write down in musical notation everything that’s being played.



Wish they'd use a more globally recognisable song for the demo.



