Omnizart: Library for automatic music transcription (github.com/music-and-culture-technology-...)
284 points by pizza on Dec 18, 2021 | 41 comments



For anybody experimenting with this for piano: I highly recommend Google's 'Onsets and Frames' algorithm as embodied in their demo:

https://piano-scribe.glitch.me/

I built something similar that is a lot faster, but on a large-scale test the Google software handily outperforms my own (92% accuracy versus 87% or so, and that is a huge difference because it translates into roughly 30% fewer errors).
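For what it's worth, a back-of-envelope version of that accuracy-to-error translation, assuming "accuracy" here is note-level and errors are simply 1 − accuracy (which may not be exactly how the numbers above were measured):

```python
# Illustrative only: how a 92% vs. 87% accuracy gap becomes a relative
# reduction in errors, assuming error rate = 1 - accuracy.
acc_theirs, acc_mine = 0.92, 0.87
err_theirs, err_mine = 1 - acc_theirs, 1 - acc_mine
relative_reduction = (err_mine - err_theirs) / err_mine
print(f"{relative_reduction:.0%} fewer errors")  # ~38%, the same ballpark as above
```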


Wow, the Onsets and Frames algorithm is insanely interesting. It's like a mixture of run-length encodings of (vertical and horizontal) strings of 0-dimensional and 1-dimensional structures in time (onsets as points in time, frame activations as lines in time). But... hm... why stop at such low-dimensionality structures..! :^)
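For anyone who hasn't read the paper, here is a rough sketch of that "points and lines" idea, i.e. how per-frame onset and frame probabilities are typically decoded into notes. This is a simplified illustration, not Magenta's actual code; the threshold and the two probability matrices are assumptions.

```python
import numpy as np

def decode_notes(onsets, frames, threshold=0.5):
    """Turn (time, pitch) onset/frame probability arrays into note events.

    A note starts only where the onset head fires (a point in time), and is
    sustained for as long as the frame head stays active (a line in time).
    """
    notes = []  # (pitch, start_frame, end_frame)
    n_time, n_pitch = frames.shape
    for p in range(n_pitch):
        t = 0
        while t < n_time:
            if onsets[t, p] > threshold:
                start = t
                t += 1
                while t < n_time and frames[t, p] > threshold:
                    t += 1
                notes.append((p, start, t))
            else:
                t += 1
    return notes

# Tiny usage example with fake probabilities (6 frames, 2 pitches)
onsets = np.array([[0.9, 0.0], [0.1, 0.0], [0.0, 0.8], [0.0, 0.1], [0.0, 0.1], [0.0, 0.0]])
frames = np.array([[0.9, 0.0], [0.8, 0.0], [0.1, 0.9], [0.0, 0.9], [0.0, 0.2], [0.0, 0.0]])
print(decode_notes(onsets, frames))  # [(0, 0, 2), (1, 2, 4)]
```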


There has since been follow-up work to extend Onsets & Frames to multi-instrument music: https://magenta.tensorflow.org/transcription-with-transforme...


Shame that it uses quadratically scaling transformers - there are many sub-quadratic transformers that work just as well or better (https://github.com/lucidrains?tab=repositories) - because that 4-second sub-sample limitation seems quite unlike how I imagine most people experience music. Interesting, though. I wonder if I could take a stab at this...

Also interesting that the absolute timing of onsets worked better than relative timing - that also seems kind of bizarre to me, since when I listen to music it is never in absolute terms (it's not "wow, I just loved how this connects to the start of the 12th bar" but rather "wow, I loved that transition from what was playing 2 bars ago").

Another thing on relative timing: when I listen to music, very nuanced, gradual, and intentional deviations of tempo have significant sentimental effects for me - which suggests that you need a 'covariant' description of how the tempo changes over time. So not only do you need the relative timing of events, you also need the relative timing of the relative timing of events.

Some examples:

- Jonny Greenwood's Phantom Thread II from the Phantom Thread soundtrack [0]

- the breakdown in Holy Other's amazing "Touch" [1], where the song basically grinds to a halt before releasing all the pent up emotional potential energy.

[0] https://www.youtube.com/watch?v=ztFmXwJDkBY, especially just before the violin starts at 1:04

[1] https://www.youtube.com/watch?v=OwyXSmTk9as, around 2:20


There are already tools that can estimate the variation of tempo over time (rubato). Librosa's "tempo" function does the job well for some types of music - it can even give you a 3D/heatmap plot with the likelihood of every tempo value at every moment: https://librosa.org/doc/main/generated/librosa.beat.tempo.ht...
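A minimal librosa sketch of what that looks like in practice (the input file name is a placeholder): a per-frame tempo curve, plus the tempogram, which is where the likelihood-of-every-tempo-at-every-moment heatmap comes from.

```python
import librosa

# Sketch: estimate a time-varying tempo curve (one BPM value per onset-strength
# frame) and the tempogram, i.e. the likelihood of every tempo at every moment.
# "piece.wav" is a placeholder input file.
y, sr = librosa.load("piece.wav")
onset_env = librosa.onset.onset_strength(y=y, sr=sr)

# aggregate=None keeps a per-frame estimate instead of one global tempo value
tempo_curve = librosa.beat.tempo(onset_envelope=onset_env, sr=sr, aggregate=None)

# Full tempo-vs-time likelihood map (what you'd plot as a heatmap)
tempogram = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr)
print(tempo_curve.shape, tempogram.shape)
```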

Rubato is everywhere in classical music, and understanding rubato is an essential part of any automatic transcription system that aims to show you notes in musically meaningful units of time.


I would test whether a longer sub-window actually helps before worrying too much about this. I doubt it matters in practice. Whether it's how people experience music or not is rather immaterial.


I just tried this and it works very very nicely! Thank you for sharing!


I just tried this with Sexy Sadie and the result was awful.


For anyone interested, I've transcribed this song [1] using the Replicate link the author provided (Colab throws errors for me), using mode music-piano-v2. It spits out mp3s there instead of midis, so you can hear how it did [2].

[1] https://www.youtube.com/watch?v=h-eEZGun2PM [2] https://replicate.com/p/qr4lfzsqafc3rbprwmvg2cw5ve


Awesome, thanks for running the test!

I can't help but feel it is heavily impacted by the ambience of the recording as well. The midi is of course a very rigid and literal interpretation of what the model hears as pitches over time, but it lacks the subtlety of realizing that a pitch is sustaining because of an ambient effect, or that the attack is actually a little bit before the beginning of the pitch, etc.

If it could be enhanced to consider such things, I bet you would get much cleaner, more machine-like midis, which are generally preferable.


Listening to the input and output, the reproduction is like comparing a cat to a picture that resembles a cat, but isn't a cat.


Music's a lot more than a collection of notes ... and the timbre of one piano is about as far away from a mixture of reeds and dulcet electronics as you can get. (The Fulero is very nice.)


How do you get a midi?


You'd have to run it yourself. There's a Docker image available, but it's a pretty big download (11.7GB).
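If you do run it locally, the flow is roughly the following - a sketch wrapping the CLI subcommands described in the Omnizart README via Python; the exact subcommand names and output location may differ by version, and "song.wav" is a placeholder.

```python
import subprocess

# One-time download of the pretrained checkpoints (subcommand name per the README)
subprocess.run(["omnizart", "download-checkpoints"], check=True)

# Transcribe a recording; this should leave a MIDI file you can open
# in a DAW or notation software
subprocess.run(["omnizart", "music", "transcribe", "song.wav"], check=True)
```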


I'm still looking over this to see its capabilities, but if I'm reading this right, we can turn any mp3/wav into a set of midis, which allows us to import them into music editing software (like Finale). If this works, this is huge. Congrats to the team.


That is a tremendously big “if” considering how many times that problem has been attempted. Even just detecting the key of a song is awfully fuzzy.


Detecting the key of a song is also not deterministic. Some songs' keys are truly ambiguous and/or subjective.


Aye. Polyphonic pitch is a much simpler problem than "key", which is more akin to rigorous sentiment analysis or some other intractable goal.


My gut is that computing key (given reliable pitch data) is a lot easier than computing polyphonic pitch data (given audio).

For relatively "conventional" music, there are very strong signals of key like beginning and ending chords, and overall note distributions which will generally cluster around one particular scale. For less conventional music, this will be more ambiguous, but it would have been more ambiguous for a human listener too.


The result can be used as audio fingerprints, which is not a new thing. This has something to do with how things like Shazam work.


I tried transcribing from a YouTube link via the Colab link, but it generated a bunch of errors.

I found this link to be more helpful than the GitHub repo for understanding what it does:

https://music-and-culture-technology-lab.github.io/omnizart-...


I don't have absolute pitch. I once tried to "reverse engineer" a guitar solo using a tuner. Too much work, and not a very good result. I hope this finally brings music transcription to less skilled/gifted musicians and hobbyists.


Neither the replicate.com nor the colab.research.google.com demos work for me.

The Colab notebook is full of warnings and crashes with errors in the "Transcribe" box. Replicate.com does something, but the results are garbage.

What am I doing wrong?


Sounds incredible and I'm curious how well it works. For a quick intro to the state of the art in this space, watch the Melodyne videos on YouTube. In short, and without having tried Omniscient Mozart: I would not expect it to give perfect results without manual help. If it could aid transcription in a semi-automatic way, like Melodyne does, that would already be a big victory for an open-source alternative.


Been using Omnizart for drum transcription; however, the most accurate piano transcription model I've come across is from ByteDance -- https://github.com/bytedance/piano_transcription


A friend of mine is a violin teacher, and he has been asking if it's possible to build an app that would recognize the notes being played from a microphone, in realtime, and overlay them on the expected sheet music, to help kids see where they might have missed a note…

Which of the existing string music transcription libraries would fit the bill?


In the field of speech, you have the concept of forced alignment, aligning an audio recording with a transcript by way of text-to-speech and pattern matching, which can certainly be used to detect some sorts of errors consistently and various other errors with some probability. (I’m not up to the point of doing this, but I’m early in making software for audiobook recording and plan to see what I can manage in error detection eventually. Unfortunately from what I’ve seen just with a little bit of casual playing around, small errors are not generally well identified because there will be too many false positives.) You could probably adapt a lot of that stuff to music without too much trouble, though it’d certainly require changes and integration effort. My vague feeling is that this direction, with MIDI rendering and pattern matching, would probably be surprisingly effective and I suspect better than trying to more directly match the recording to the score with the help of just Fourier transformations, or converting the audio to a score with something like this and comparing them, though even with the forced alignment adaptations I imagine you’d be changing it to use mostly Fourier transformations for feature detection.
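As a rough (offline, not realtime) illustration of the "render the score and pattern-match" idea, here is a sketch using chroma features and DTW in librosa. The file names are placeholders, and the reference is assumed to be a synthesized rendering of the score; large deviations along the warping path are candidate spots where the player drifted from the expected notes or timing.

```python
import librosa

# Align a student's recording to a reference rendering of the score using
# dynamic time warping over chroma (pitch-class) features.
perf, sr = librosa.load("student_take.wav")          # placeholder file names
ref, _ = librosa.load("reference_rendering.wav", sr=sr)

chroma_perf = librosa.feature.chroma_cqt(y=perf, sr=sr)
chroma_ref = librosa.feature.chroma_cqt(y=ref, sr=sr)

# D is the accumulated cost matrix, wp the frame-to-frame warping path
D, wp = librosa.sequence.dtw(X=chroma_ref, Y=chroma_perf, metric="cosine")
print("alignment path length:", len(wp), "final cost:", D[-1, -1])
```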

(I’m a good programmer in general, but still decidedly a layman in audio matters. Take my thoughts here with a grain of salt.)


This is what Yousician does for singing (and other analog instruments) but unfortunately not for the violin: https://yousician.com/singing

Haven't tried it but maybe the following app could help?

> Trala uses signal processing, groundbreaking technology that analyzes the sound of your violin and gives you instant feedback on pitch and rhythm every time you practice. When you play the wrong notes, we’ll help you get back on track.

https://www.trala.com/ (I just remembered it being recommended somewhere.)


Does anyone know if the latency of this is good enough for realtime processing? I've been looking for something to transform guitar audio into MIDI for a synth for years, and all the commercial solutions are pretty poor. The best I've found is buying a Guitar Hero guitar, which has MIDI out.


Probably not feasible but I've always thought an app that graded your ability to sing along to your favorite music would be fun. Being able to automate the isolation of the vocal track to judge the user against would obviously be required, so it's nice to see some advancements in that direction.


There’s plenty of video games that do this going back to 2003’s Karaoke Revolution - if not earlier. Plenty of other games and series followed suit including as Rock Band, Guitar Hero, Lips, SingStar. There’s plenty of mobile apps that work similarly. AFAIK they basically track your pitch and timing. I remember Lips in particular rewarding vibrato as well.

The algorithms, as I understand them, grade you based on sustained pitch (adjusted for octave) and cadence. Generally they are easy to fool - you can basically say gibberish, but as long as it's within the expected parameters you'll still get a good score.
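A toy sketch of that kind of octave-forgiving pitch scoring (the audio file and the target melody here are placeholders; real games compare against the song's actual vocal line):

```python
import numpy as np
import librosa

# Track the singer's f0, fold sung and target pitches onto one octave
# (pitch classes), and count frames that land within a semitone of the target.
y, sr = librosa.load("singer.wav")  # placeholder recording
f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                             fmax=librosa.note_to_hz("C6"), sr=sr)

sung_midi = librosa.hz_to_midi(f0)
target_midi = np.full_like(sung_midi, 60.0)  # placeholder: a held middle C

# semitone distance on a 12-note circle, i.e. ignoring octave errors
diff = np.abs((sung_midi - target_midi + 6) % 12 - 6)
score = np.mean(diff[voiced] <= 1.0)
print(f"in-tune frames: {score:.0%}")
```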

They are still rather fun though if you enjoy that sort of thing.


Check out the Rock Band video game series, they had a really nice (non automated) version of this: https://youtu.be/pV9_O9bKiLg


I think what you're describing is the SingStar game on PS3 :D We used to play it a lot with friends. Truly fun.


What does transcription mean?


Well, usually (?) it means turning a piece of music (audio) into written sheet music/a score. It means that in jazz and...I believe many kinds of contemporary music. (In classical music I've heard it applied to e.g. turning an orchestral score into a piano score, i.e. an arrangement.) Here they appear to just make MIDI files, avoiding the problem of how to write the notes most readably, which is a significant part of transcription, in both these senses.

All the examples given here, though, appear to be of a super-simple variety: dead-simple chords, all notes with robotic, mathematically simple timing over fixed tempos, apparently produced by drum machines - like toy music, the kind of music that's no challenge at all to transcribe. In the real world I wouldn't bother transcribing it by hand, as there's nothing to be learnt by doing so; you can hear exactly what's going on without it. So, that's weird.


In this context it means listening to music and writing it down in music notation.


To transfer / scribe something that you have heard. Just like a "transcript" is the written record of spoken words, transcription is the act of transferring something heard into written form.

Transcription is a valued skill among musicians. The ability to hear a piece of music and write it down can be more than a little harder than transcribing speech.

Not sure if apocryphal, but Mozart was known as an expert transcriber, able to hear an orchestral work and write down all the different instruments' parts from just one listen.


Seems conceivable. I remember a friend in high school band camp with a cassette player and a stack of manuscript paper who transcribed a Chicago song by playing a couple seconds of it, writing down all the parts and then playing the next couple seconds. I really wish I'd understood back then that this was a learnable skill. It wasn't until I was in my 30s and had a few years of choral singing under my belt that I began to develop some simulacrum of that ability.


Scribe means to write, and trans- means across or over.

In this case they mean to take music (audio) and write down in musical notation everything that’s being played.



Wish they'd use a more globally recognisable song for the demo.



