Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion (stability.ai)
374 points by JonathanFly on Sept 13, 2023 | 203 comments



The solo piano was interesting because of how clean it is. I can imagine going from that sample to a score without too much difficulty. Once it's in a symbolic format it becomes much more flexible and re-usable.

While this does not seem to be the trend, I hope more gen AI in the audio and visual realms starts to produce more structured / symbolic output. For example, if I were Adobe I would be training models, not to output full images, but either layers or brush strokes and tool palette usage. Same for organizations that have all the component tracks of music to work with.


Yessssss! I thought about MusicGAN and Markov chains last night thinking “Why can’t we just codify all chords and use a GAN to generate markov chains on chords of a key and have AI generate instruments and waveform from those chains?” IANA researcher but in my head, that sounded logical.


That's existed for decades. It's called Band in a Box. It's also cheezy as hell.


lol, no. Not autogenerate midi (although their latest versions of BiaB are pretty darn good now) but generate waveforms together. It would be similar to having AI generate whole scores of music but ensuring it's all in sync and in key. Not taking sample database of 88 sound files and triggering them when the midi-note strikes.


That's not how BiaB works at all. It has all kinds of patterns built into it. So it knows how to generate, say, a bluegrass bassline in a given key. There are plenty of ways to play back MIDI with high sound quality, including feeding it into an AI-driven VST like NotePerformer.


Then explain why you categorize it as cheesy? Sounds like it’s pretty cool.


he's saying the output is cheesy. it sounds like stuff you'd hear on a demo track for a kid's toy piano


It's not THAT bad, it's just that for a program targeting jazz the playback is rather... square.


Yeah I definitely imagine certain genres sounding off due to the rhythmic nature of computer timing. Jazz doesn’t follow rules like that so good jazz has these timing idiosyncrasies that make it sound the way it sounds. That and a ridiculous obsession with adding half step and quarter tone intervals.


This is almost the exact approach I took for my project Neptunely (https://neptunely.com/). Working on bringing it to a VST at the moment so it's more portable.


I strongly agree about generating "editables" rather than finalized media. In fact, that's why text generators are more useful than current media generators: text is editable by default. Here's a tweetstorm about it: https://x.com/jsonriggs/status/1694490308220964999?s=20


Audio is definitely editable. While generative audio is new, I am hopeful that a host of interesting applications (audio2audio etc.) will emerge within its ecosystem. Promising signal separation (audio-to-stems) and pitch detection tools already exist for raw audio signals. If you want to force Stability to focus on symbolic representations (such as severely lossy MIDI), I hope you will instead first try adapting to tools that work fundamentally with rich audio signals. Perhaps there will be room for symbolic music AI, and perhaps Stability will even develop additional models that generate schematic music, but please, please don't sacrifice audio generality for piano-roll thinking alone. LoRAs will undoubtedly be usable to generate more schematic audio via the Stable Audio model -- I imagine they could easily be repurposed to develop sample libraries compatible with DAW (digital audio workstation), sequencer and tracker production workflows.


Audio is editable, but editing it is a much rarer skill than text editing. Anyone who has completed primary school has basic proficiency in editing text, and those who have gone through college or held a job where email communication is common have many years of experience with it.


What is audio2audio? Can I beat box into a mic and have professionally produced tracks come out the other end?


Train the model with midi notes as text in the prompt and the audio as target. It will learn to interpret notes.


Not all music is well represented with notes, nor are audio datasets with high-quality note representations readily available. But I guess if you work hard enough you can get close: https://www.youtube.com/watch?v=o5aeuhad3OM My example still sounds like the chiptune simulation that it is, however.


It's OK: the model would create music even from a vague prompt, and it will learn even better from the notes, imperfect as they are, because it has the interpreted version in the training target.


That raises an interesting difference between cleaning AI-generated sound and cleaning ordinary recordings. In an ordinary recording, there is an objective reality to discover -- a certain collection of voices was summed to create a signal. With (most? the best?) existing AI audio generation, the waveform is created from whole cloth, and extracting voices from it is an act of creation, not just discovery.

I've come across AI-generated music that outputs something like MIDI and controls synthesizers. Its audio quality was crystal-clear, but the music was boring. That's not to say the approach is a dead-end, of course -- and indeed, as a musician, the idea of that kind of output is exciting. But getting good data to train something that outputs separate MIDI-ish voices seems much harder than getting raw audio signals.


Generative models can certainly create midi, but no one has done it yet. Given the technique is making video, audio, images, and language, all you need to do is train and build a model with an appropriate architecture.

It’s easy to forget this is all pretty new stuff and it still costs a lot to make the base models. But the techniques are (more or less) well documented and implementable with open source tools.


> Generative models can certainly create midi, but no one has done it yet.

Note sequence generation from statistical models has a long history, at least as long if not longer than text generation.

Have a look at section 2.1 of this survey paper [0] that cites a paper from 1957 as the first work that applies Markov models to music generation.

And, of course, plenty of follow-up work 6 decades later on GANs, LSTMs, and transformers.

[0]: https://www.researchgate.net/publication/345915209_A_Compreh...


Yes, in fact I think at some point everyone has written their own Markov generators or at least run Dissociated Press. But we've really only seen meaningfully high-quality output over the last few years.


I think it depends on how you define that. People were quite happy with HMM-based MIDI generators that could generate Beethoven- or Mozart-like sequences 10, maybe even 15 or 20 years ago. But of course other people pointed out the problems of it being boring eventually. Then LSTMs improved long-term dependencies and people were impressed by the improved quality of generating whole musical pieces. But still others thought it was not good enough. Then the goalposts moved again with transformers and neural vocoders and now we want top-40 direct audio generation. And these latest systems can kind of sort of do it! But still there are people who demand better. And so on, things will continue to improve.

Progress only moves as fast as expectations, and expectations move with technology. Music is not special in this respect. So you could say at any given time in the past that some people "see meaningfully high quality" and others are disappointed. You see exactly both these sides of the spectrum even now with text-to-image and text-to-audio technology.


> cites a paper from 1957

By Fred Brooks no less…

https://en.m.wikipedia.org/wiki/Fred_Brooks


Do you know if anyone has tried training a text-to-music or text-to-midi model where the training data includes things like emotion labels for each note interval or chord progression?


That sounds expensive and inefficient. People's interpretations of music (and abstract art more generally) can be shockingly different; I suspect the model would not get a clear signal from the result.

But that makes me wonder to what extent labeling can be programmed -- extracting chord changes, dynamics changes, tempo, gross timbral characteristics, etc.


And maybe even labels like popularity/play count/etc so it has a better sense of what “sounds good” to certain groups


It has been done - first by OpenAI (MuseNet, which is no longer available) and later by Stanford (Anticipatory Music Transformer): https://nitter.net/jwthickstun/status/1669726326956371971


I believe Spotify's Basic Pitch[0] is already some work towards building something like this.

[0]: https://basicpitch.spotify.com/about


We’ve done it! wavtool.com


That’s really neat. How long have you been working on this?


Thanks! It grew out of an old side project. Been full time on it since December.


Having music editable for human post production is necessary for most professional adoption. Generating MIDIs would make much more sense than generating raw audio.

This is what we do with AI images: you can fix them in Photoshop, etc. You cannot do this for raw audio due to how music is produced.


Build or seek out a MIDI-generating model. I hope Stable Audio is never the place for that. MIDI is deeply lossy, and it would be a tragedy if it were the only music representation. Imagine if instead of phonographs, compact discs and streaming audio we only had piano rolls. What a loss indeed.


Midi is not lossy, midi is symbolic. There's a huge difference.


MIDI is not inherently lossy. You could encode anything in it, just as you can encode any novel as an integer.

In practice, though, transformations from audio to MIDI discard an enormous amount of important information, with the possible exception of transcriptions of performances on piano (where volume, frequency, duration and a good physical model of a piano are enough to reconstruct everything important about the signal) and similar instruments.


MPE/MIDI 2.0 pretty much theoretically fixes the "discard an enormous amount of important information", in that you can have an extremely large number of parameters describing every single note and/or physical performance nuance.

Contemporary MIDI is ready to describe almost any human performance in an almost absurd level of detail. Whether anything can do something useful with the description is a different story.


No, it's lossy. It's an event model at a fixed data rate. You can only do so many things sequentially, even if you could represent any possible musical concept as a MIDI event. So even if you're not sticking to note-on, note-off, it's still extremely lossy.


MIDI is able to accommodate nearly everything that can be represented through a musical score and instrumental performance. What are you hoping to accomplish with AI-generated waveforms that can't be done with MIDI?


Absolutely not. I can write a score where 16 people each strike a xylophone note at the same instant. I pick xylophone because it's a very sharp transient, and in theory the performers could perform so well that it's literally one attack. If that's not enough, make an XM file with 16 channels and guarantee that they'll all be the same attack, or run a modular synth and some mults and trigger it off the rise time of a clock pulse, no computer involved.

Midi has to fire all the notes one at a time. You cannot represent even everything represented by a score through MIDI because it's serial and sends one message at a time. Stretch that out to 16 sharp clicky attacks and you'll notice that MIDI cannot fire them all at once, it'll blur.


The MIDI1 spec allows for up to 16 channels on each port, and would definitely be capable of handling 16 xylophones at once. It's complicated, but the current MIDI spec (and the pre-spec MPE extension) combines data, so what you think of as 16 xylophones is to the MIDI device 1 instrument consisting of 16 xylophones.

However, if you want to control attack, reverb, tone, volume, i.e., all the stuff that makes it seem like a real instrument and not a computer-generated tone, then you would need to dedicate some of the 16 channels on each port to each of those different controls. To handle 16 xylophones, you would need a MIDI interface with enough ports to handle the note data plus however many control channels you want to use. (Note that from the interface's perspective, the data incoming from 1 port is just a single channel, so an "8-channel" MIDI interface is actually handling 128 channels of MIDI data.)


You're not following me. It's a single serial bus at 3,125 bytes per second. Expressive controls like polyphonic pressure take at least two bytes, sometimes three: that's for a single message. A note on is typically three bytes, channel, key and velocity.

All these messages have to take turns. It's not a tracker, or a modular synth, where you can parallel click rise times electronically: it's not DINsync, where a chain of voltage pulses synchronize individual sequencers and aren't themselves notes.

1000 messages a second (at three bytes for each note-on) seems like a lot but it really isn't. With your 16 instruments (any drum, the xylophones, whatever) you can fire about 62 notes on all instruments per second. That seems like a lot too, but it's a hard limit, and it means your sixteen instruments have to 'blur' across 16 milliseconds to all fire off a note.

That means every time you fire all the instruments as one click, instead the attacks make up a 960hz tone. That's NOT one attack. Humans on percussion instruments can do better than that. It's the equivalent of 5.5 feet of space between speakers playing back: if you're time-aligning a tweeter and a midrange to produce a unified click, and you misalign one of the drivers by putting it five and a half feet back from where it would be, you'll notice the misalignment. Midi's timing issues are also noticeable.
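A rough back-of-the-envelope sketch of that arithmetic, assuming plain 3-byte note-on messages sent over a single MIDI 1.0 DIN port with no running-status optimization:

    # Rough MIDI 1.0 throughput, assuming 3-byte note-on messages
    # (status, key, velocity) and no running-status optimization.
    BAUD_RATE = 31_250        # bits/second on a MIDI 1.0 DIN cable
    BITS_PER_BYTE = 10        # 8 data bits + start bit + stop bit
    NOTE_ON_BYTES = 3

    bytes_per_second = BAUD_RATE / BITS_PER_BYTE            # 3,125 bytes/s
    note_ons_per_second = bytes_per_second / NOTE_ON_BYTES  # ~1,042 messages/s

    simultaneous_notes = 16   # e.g. 16 xylophones struck "at the same instant"
    spread_ms = 1000 * simultaneous_notes * NOTE_ON_BYTES / bytes_per_second

    print(f"~{note_ons_per_second:.0f} note-on messages per second")
    print(f"16 'simultaneous' note-ons are smeared over ~{spread_ms:.1f} ms")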


Each MIDI port is serial, but multiple ports on an interface can operate in parallel.

Consider that Hollywood composers and music producers are able to use MIDI without issue for live-previews of orchestral compositions and live recordings and performances. The problem you are raising simply doesn't exist in the real world, accept that there are things about MIDI you don't understand and move on.


I can certainly accept that for you, the real world is multiple MIDI ports in parallel. You're correct in that if everything's in parallel on as many ports as you need, that'll help a lot.

Still ain't quite DINsync or modular, but there ya go :)


This is absurd. Sure, someone below posits that MIDI could perhaps represent a piano performance by Arturo Benedetti Michelangeli. I think it has been able to do a passing job at that, when you provide a decent piano. Regardless, piano rolls have been able to come close since the early 20th century. But how well does MIDI represent music performed by John Coltrane? Jimi Hendrix? It falls on its face. The long fetishized Western music notation abstraction, which MIDI poorly simulates, completely fails for many important examples of music. I would even venture to say that MIDI fails for most of them. But yes, MIDI is well-optimized for piano music where an acceptable piano or simulation is available.


These diffusion models give an ability to create far more than can be communicated with MIDI + instrument. Riffusion gave a hint of this - rather than just notes and drum hits + some processing, it becomes one big pulsating, expressive mass which would not be reproducible without the granularity of a diffusion model. These are reminiscent of some of the serendipity of live recordings with lots of tracks where interesting things happen from interplay of many different layers. Generating a few dozen clips generally would give me 2 or 3 with a beautiful emotional passage which really lights up the pleasurable music part of my brain. Mass farming these clips seems like a good route to some amazing music.


The problems have long been known and articulated: http://www.music.mcgill.ca/~gary/courses/papers/Moore-Dysfun...


Your example of the failures of MIDI is based on a 35-year-old paper (from 1988!!!) about an earlier version of MIDI?

When that paper was written it took several weeks and many millions of dollars of equipment to render primitive, mono-color 3d graphics. Desktop computers had 512 kilobytes of RAM and the highest-end desktops 32 MB of hard drive storage space. Computer screens had two colors: black and green. Audio cards capable of making beeps and clicks were the cutting-edge. WIFI was still a decade away.


You clearly didn't read the article, and clearly don't understand how prevalent the MIDI 1.0 specification is today. MIDI 2.0 is a very recent development (this year LOL!) and has yet to be commercially adopted. The 1984 design is what is largely in use today. At the time of initial development, commercial synthesizers, not sound cards, were the intended generators of sound utilizing MIDI: https://www.vintagesynth.com/roland/juno106.php.


MPE, which is the model for the most important part of MIDI 2.0, has been around for several years now; there are numerous software synths (and a few hardware ones) that can use it, and several hardware controllers ("instruments") that can deliver it.


> What

"Intention" (as a tentative term)

The question becomes: what has impeded the creation of a MIDI file that can be confused with an actual concert from Arturo Benedetti Michelangeli.


Literally nothing is preventing this other than that nobody has bothered to take the time to do it.

The current version of MIDI is capable of replicating any of his performances, even down to the randomness.

Note that if you want to replicate the audio quality of his performances, you will need a high-quality MIDI instrument; the ones that ship with Windows will not suffice. These MIDI instruments can range from a few dollars to thousands of dollars. (See, e.g., Native Instruments)


> nobody has bothered to take the time to do it

In that case, we have a theoretical suggestion that «nothing is preventing this», but not actual proof based on a "Turing test"-like scenario that would have specialists fooled, to corroborate that the new MIDI 2 would suffice.


I gave you the knowledge to do this research yourself, but since you are unwilling to do so, here is an example of the performance possible using MIDI instruments: https://www.youtube.com/watch?v=CvaChiq6gf0

As it is clear that you intend to keep shifting the goalposts to make a point that can't be made, I will withdraw from further participation in this thread.


well ok, you could say it's a lossy format for capturing physical movement, which it certainly is. My point is that it is not a format for capturing music any more than a score on a page is a format for capturing music. Both are instructions for a performer of music (one machine, one human), which is a very different beast.


MIDI 2.0 improves a lot of things, such as how much dynamicity and variation you can have. MIDI 1.0 is a standard from the early 80s. It indeed has shortcomings, but the upside is editability.

It's then up to the remaster / musician / actual interpretation / post production to turn the score into something less of an event model.


MIDI is an extraordinarily lossy music representation. Even Claude Shannon would facepalm at the assertion that it could, in theory, represent audio faithfully. It is not its purpose, it is decidedly not its practice, and it is a ludicrously irrelevant example of pedantry to say otherwise. The false equivalency asserted by the commons can be aggravating :D


MIDI is not a lossy format for audio because it's not a representation of audio, period. It's a format for conveying the motion of a piano keyboard, meant from the beginning to be usable for various forms of audio.


>For example, if I were Adobe I would be training models, not to output full images, but either layers or brush strokes and tool palette usage. Same for organizations that have all the component tracks of music to work with.

I really like this idea. Create new tools for artists to use, rather than whatever we're accepting as use now. Current full-image generation is boring to me in the same way the choice of invisibility as a superpower is. Invisibility ultimately slides into pervy tendencies, just as deep fakes will slide the same way, or into some other inappropriate use.


Hopefully, the entire industry will NOT move in such a schematic and lossy direction. Use separate tools to analyze audio streams please. Don't throw the timbre baby out with the bathwater. MusicGen utilizes a tokenized transformer model for music, which is attractive for symbolic translation use cases. However, the overall audio quality is far more lossy than the examples you hear from Stable Audio. I believe that symbolic representation should not be a foundational approach to adequately represent and generate rich audio signals.


I was wondering the same thing, definitely seems like generating the raw waveform runs into all kinds of weird issues (like they touched on in this post). I would imagine that training data would be a serious chokepoint here. Given how much discourse is currently kicking off around the intellectual property rights of just the final product (the mastered track), I can't imagine many musicians would be eager to share what is effectively the "proof of ownership" (track stems or MIDIs).


There are a lot of Lora models that are being made to generate textures, maps, diagrams, backgrounds, etc. You don’t need to wait for adobe, open source models like stable diffusion let you do whatever you think is useful. I’d look to the open source world for creative innovation. Adobe is just doing what’s on the product management roadmap.


I am approaching this from the symbolic angle via MIDI at neptunely (https://neptunely.com/)


It's interesting tech, but none of the musical pieces impressed me (I play multiple instruments and have written and arranged music); most sounded too repetitive and not very imaginative. This is also an issue with diffusion-based art AI in general: it's good at a limited set of things but gets rather repetitive after a while. I could see using this as background music where quality is not important, like in games, though I doubt you could run the AI generator inside a game; you could generate the output as an asset. People like Hans Zimmer and Ludwig Göransson have nothing to worry about.

Singing would be an interesting experiment, but I don't see that here.


I disagree with your diffusion based art assessment, and I think it’s probably colored by what most people seem to want to make with it. Just as with regular art, you need some sort of vision to go beyond what everyone else does. Prompting “pretty girl wearing sexy clothing” for the umpteenth time isn’t new.

AI art gets rid of the technical skill step but the rest is still there, although you may luck into something at random. If you're using ControlNet on Stable Diffusion or training your own models, you have a lot of control over the output as well.


I thought the "epic trailer music intense tribal percussion and brass" example was pretty good. Rather, good enough for something like a video game where the game engine is dynamically generating music based on the present situation in the game.

I could easily find that music entertaining if it started playing the moment my character triggered a trap and suddenly, "the floor is lava" or my character enters a scene with the quest of winning over one of the love interests =)


I'm surprised that you even leap to people like Hans Zimmer and others.

The people we need to worry about are aallll of the people earning a living from everything else, like background music for indie games, ambient music, etc.


I thought “Prompt: 116 BPM rock drums loop clean production” wasn't bad, but I'll grant that most of the rest would be an excellent way of showing (for instance) a death metal fan what their favoured music sounds like to those who don't have an ear for it :D


>where quality is not important, like in games

I don't believe that's a good example. Video game music is an important part of the gaming experience, but it's often taken for granted or overlooked.


What I haven't seen done well by these generative AIs so far is structure (having a chorus, a verse, a bridge, ...) and harmonic movement/progressions (except for maybe a V-I or I - VI - ii - V). And those two things are exactly what makes a song interesting and non-repetitive.


OpenAI's jukebox -- now 3 years old -- is creative and non-repetitive. Witness, for instance, its jam on Uptown Funk here:

https://www.youtube.com/watch?v=KCaya74_NHw

Or the changes shortly after 1:15, 2:15 and 2:40 in these extensions of Take On Me:

https://www.youtube.com/watch?v=_3yOrUJ0SzY


Okay, but here it still seems like it's just taking over snippets. It did not come up with the progression.


If inventing new chord progressions is one of the requirements for musical creativity then neither Handel nor Paul Simon would qualify.


Yeah, the "rock drums" example was like a student in a practice. I'll be impressed when it can sound like Danny Carey.

From all of the hype, I want to be impressed with results. Instead, we get these mediocre at best examples of what it can do. They are not good sales pitches to me.


Sir, your dog can talk!

Yes, but not very well.


You joke, but even those videos of people saying their dog can talk are just like this. It's cute because it's a real dog making sounds we really want to believe in, when it's really just them mimicking sounds because they get pettin's and treats.

What I want is "AI" to do something impressive. Why are we trying to make the system generate the sounds itself? We don't make artists do that, we give them instruments. Give the models actual instruments, and then have it play them like a real artist. I will be much more impressed with an AI that understands composition and scoring, use of musical voices, key signatures. That would still be generative. I guess I just don't understand the point of the direction being taken. It's like a solution looking for a problem.


We work with what we have. We don't have a lot of recordings of the physical movements of musicians; we have recordings.

Similarly, we don't have recordings of the actions of painters; we have finished paintings -- but if you're not impressed with what AI can do in the visual sphere, your standards are, to put it mildly, high.


I'm not really sure how to take this. We absolutely have recordings of instruments. You can buy them as complete sets. You train on complete recordings, and then tell it how to use the sampled instruments to compose a song in the style of the trained data. Building something to make a waveform that looks like another waveform just seems like a very odd direction to take.

Yes, my standard is: if it isn't at least as good as what's available now, what's the point?


Well, we found with Midjourney et al that these models can work very well despite having no pre-conceived or symbolic notions of composition, color theory, perspective or anything. Yet they can produce really good results in the image generation space. It's the same idea here, except much earlier days.

In the same way, many successful musicians can't read sheet music or know music theory, they just know how to produce something that sounds good.


>In the same way, many successful musicians can't read sheet music or know music theory, they just know how to produce something that sounds good.

Right, because they can operate the instruments that make the sound with natural talent, but they don't have to draw the waveforms. Audio generation is much different than image generation. It's just very odd to me.


Hate to break it to you but there is a vast market for mediocre content, and even Danny Carey has a bad day once in a while.


I think you've confused the vast market's forced acceptance of mediocrity (because that's what's available) with their wanting the mediocrity. Proof of that is seen with the emergence of places for affordably licensed royalty-free options. The quality of production and styles today compared to those from the 90s/00s is amazing. There have been a few options on these sites that I would play in a set as a DJ. This ain't yo momma's needle drop selections.


This is the same kind of comment that got HN to seethe for months about how ChatGPT isn't the god programmer some clickbaity news sites claimed it was.

ChatGPT is good in the sense that having it is better than not having it, especially with how bad Google has become. Audio generation will also be good in this way: some people don't need your "musical expertise", just some calm background music to use with a tutorial video without having YouTube take it down for copyright infringement.


Yes, while working on my AI Melodies Assistant project, it quickly became clear that generating pleasant but boring music isn't too difficult. To create a catchy tune, an element of surprise is essential. In the end, I was able to use it as an assistant to compose 60 melodies that I'm happy with (https://www.melodies.ai/)


The real test is when this stuff is out in the wild and no one tells you it’s AI and the thought doesn’t cross your mind. Of course it’s not impressive nor surprising when the answer was given up front.


gamers don't want bad music either.

some go to video game music concerts or to fan covers


Thank you for sharing! On a tangent: I'm wondering if there are any good open source models/libraries to reconstruct audio quality. I'm thinking about an end-to-end open source alternative to something like Adobe Podcast [1] to make noisy recordings sound professional. Anecdotally it's supposed to be very good. In a recent search, I haven't found anything convincing. In my naive view this task seems much simpler than audio generation and the demand far bigger, since not everyone has a professional audio setup ready at all times.

[1] https://podcast.adobe.com/


We've been researching an audio denoiser for music that we will present at the AES conference in October. Description page: https://tape.it/denoising

We'll also publish a webapp where you can use the denoiser for free. Mail me if you want beta access to it (email in profile).

It won't be open-source though, although the paper will of course be public. It will also only reduce noise, and not reconstruct other aspects of audio quality. However, it can do so on any audio (in particular music), not just speech like Adobe Podcast, and it fully preserves the audio quality. It's designed exactly for the use case you want: to make noisy recordings sound professional.


Are you sure the demo sound files are correct on the website? Couldn't appreciate any glaringly obvious differences between the original and denoised with studio grade headphones here. Or, the originals aren't noisy enough.


denoising seems to fail in the guitar and vocals example


Can you clarify where it fails? It's designed to remove stationary noise only, and removes it very well in the guitar and vocals example.

Generally speaking, if you have other sounds that you don't want in the audio, we don't remove them - it's hard to decide from a musical point of view whether you want a certain sound or not. To give an extreme example: a barking dog probably doesn't belong in a Zoom conference, but it may very well belong in your audio recording. Removing such elements would be a creative decision.

The guitar and vocals example has certain clicks in the background that we don't remove - but the stationary noise is gone. Existing professional (and complex) audio restoration tools like iZotope RX don't remove those clicks, either. It's a conservative approach, sure, but in return you can throw any audio at it and it always improves it.


It's not open, but Nvidia has RTX Voice for free if you have an Nvidia card.

Only weird thing is it's designed to be used in real time, but I've had some luck cleaning up voice recordings by replaying them back through it via audio routing.


There seems to have been a fork in the road:

On one side the tech for literal denoising has stagnated a bit. It’s a very hard problem to remove all noise while keeping things like transients.

On the other side, AI is being rapidly developed for its ability to denoise by recreating the recording, just without the noise.


In our denoiser (see other comment), we worked on combining these two forks. That’s how we can mathematically guarantee great audio quality.

This combination was non-trivial as training old school DSP denoisers is not easily possible. We’ll describe the math needed in our paper. We hope our publication will help the wider community work not just on denoising but also tasks like automatic mixing.


I have had a lot of success with this: https://ultimatevocalremover.com/ for de-noising


https://youtu.be/o-kJ4_CuWzA

This video from MKBHD's studio channel dives into this topic


It's interesting that the Death Metal was the hardest to reproduce. I conclude that it's the most fundamentally human of all genres.


Well... they hardly tried all genres :)

It sounds like it can't handle lyrics or semantics that well so I suspect any genre where the lyricism is important would also be quite mushy and recognizably AI


The Beatles seemed to be the hardest music for JukeBox to emulate.


The sound sample seemed to fit the 'vibe', but lacked any discernible definition. Could it be that it's too sonically dense to easily reproduce? Perhaps this could be improved with a more tailored training set.


I think it was just the genre that was the least represented in the training data


dadabots here: haven't gotten good death metal with it. problem is there's not really much in the dataset.


(context: i make ai death metal & also i worked on stable audio). 100% it was a dataset problem. Diffusion models still work well when you train them on death metal: https://www.youtube.com/watch?v=rlsRMQzD_6Q


It sounds more like break core haha


Now imagine Spotify using this to generate individual earworms for everybody based on their personal tastes (likes, playlists).

Yes, AI is partly hype, but had someone told me this even two years ago, I wouldn't have believed it.


This is why it’s vital that AI is openly available. Imagine a world where Spotify is the only company that can do that, and they use it to make sure they never pay royalties again.


How is Spotify for finding new music based on your tastes? I’ve only used Amazon and Pandora; Amazon is quite poor, Pandora is pretty good. I suspect (although, without proof) that if a service can’t suggest new music, it will have trouble generating new music as well.

Anyway, I very much would rather run this sort of thing locally. You could just manually set your taste profile. Plus, music can be quite personal, imagine you start listening to too much music inspired by The Cure and suddenly Amazon starts advertising black makeup and antidepressants or something like that, it would be too disconcerting.


> How is Spotify for finding new music based on your tastes?

I haven't tried any alternatives really, but so far for me I'd say decent. I put on an album, and once over it'll play similarish stuff. If I don't like a song I'll skip it and it seems to incorporate that feedback.

Only thing is that it doesn't seem to be too adventurous and it adheres rather strictly to the local context. Meaning, if I played a stoner rock track, it'll continue suggesting stoner rock and not much else, even though I have quite varied music favorited in my library.

Overall though I've found a lot of new bands I enjoy that way so, positive experience for me.

edit: as an example, here are the two most recent ones it suggested where I ended up buying the albums on Bandcamp. Both have quite few monthly listeners (1-2k), so not what I'd call mainstream.

SUIR https://open.spotify.com/artist/6zOeQ2hyNfqi9UMHtyTSlF

Mount Hush https://open.spotify.com/artist/13clfeXxTPsDsqzSlLIBZJ


I'll +1. I'm generally a FOSS guy, so if Spotify hadn't helped so much in discovering new music that I really enjoy, I'd be potentially acquiring music via questionable means and just playing them as audio files directly.

There are two companies that have done well by Gabe Newell's "piracy is a service problem - not a price problem" position: Valve/Steam (who also contribute to FOSS through Proton and SteamOS which I heavily appreciate), and Spotify. Spotify makes discovering and aggregating music so easy that the alternatives don't seem appealing.


I've tried Amazon and Pandora. Spotify is so vastly superior at finding music I like it's in an entirely different league.

Having said that, starting about a year ago (maybe ~1.5 years?) Spotify started inserting obviously paid promotion tracks into my auto-generated "Daily Mix n" playlists and it seriously bothered me. When my playlists are made up of very specific genres of EDM and suddenly a pop song plays from a famous person I get seriously angry.

It hasn't happened in many months though so maybe they learned their lesson. I was so mad I seriously considered ending my Premium subscription right then and there when that track played.


I find the Discover Weekly playlists that are made for me are pretty hit or miss, but overall I have found many new songs I like with their help.


Similar - I found many good new things. I like the discovery playlists, even with the tracks I don't like. If they never missed, how could they ever suggest something actually different and exciting? It's "discover" not "average of what you already like".


Same! New songs, new artists, new genres, it's pretty cool! I agree it can be hit or miss but it feels like the longer I use it and the older, and more mature the platform gets, the more hit discover weekly ends up being!


Similarly to others, I haven't really used other things in a LONG time but Spotify's Discover Weekly playlist is usually a list of bops that I enjoy a lot. I frequently end up adding a huge portion of them to my liked songs and to regularly used playlists!


Machine-generated music might be functionally equivalent to human-generated music, but that ignores the cultural role of art as a shared human experience - witness the liturgy of live music. That can't happen with music tailored to each listener, it can't happen without tracks that are fixed in time and can be referred to. I can imagine it well-accepted for dynamic music such as gaming soundtracks, but I suppose that machine generation will be mostly a production technique resulting in branded pieces.


Saying that's impossible makes me immediately wonder whether it's not. There are already headphone dance parties. What if a musical act's output was being interpreted through genre lenses specific to each listener?


I know quite a few technically talented musicians who have next to no creativity in actually writing the music (aside from jazz style improv sessions). Most of them never really play live unless it's part of a similarly uncreative band of college friends. I wonder if having a catchy/complex AI generated song created for them to play live might be interesting to them. Gonna check in and see what they think.


> such as gaming soundtracks

These days I also feel like my workout playlists might as well be randomly generated dance music.


Spotify does not need to generate a tailored earworm for me. It could already suggest songs that I like based on my personal taste out of their 100-million-songs catalog - and it's absolutely unable to do it.


Building a tailored earworm might actually be easier.


I really want this. I have a band that I like, and I want more!

Or I'd like to take a song I like and make it educational, like making it include the periodic table of elements.


I keep thinking back to when we didn't have stabilityai and it was just google and meta teasing us with mouth watering papers but never letting us touch them. I'm so thankful stability exists.


Stability is great but Meta's MusicGen is available with code and weights while this isn't so that's a really odd place to make that comparison and complaint.


Before stable diffusion, nobody released weights at all. Meta et al only started sharing their models with the world when they realized how fast a developer ecosystem was building around the best models.

Without stability, all of AI would still be closed and opaque.


>Before stable diffusion, nobody released weights at all.

That's not true. There's been a lot of models with weights from every player before Stability.

>Without stability, all of AI would still be closed and opaque.

Most GANs (the practically spiritual predecessor to diffusion models) for example were available. Huggingface existed and has realistically done more to keep AI open. And again, this specific release we are talking about by Stability is not Open.

Stability is great, but you are rewriting history, and doing it on the release where it makes the least sense to do so.


Nah. Dunno where this is coming from but infamously no AI models were released by big players for years. Rewind 18 months and all you got is GPT-3.0 that no one seems to care about and Disco Diffusion-y type stuff.


You are looking at a very short part of recent history. It has not been like that at all.


I'm all ears. I was "in the room" from 2019 on. Can't name one art model you could run on your GPU from a FAANG or OpenAI before SD, and can't name one LLM with public access before ChatGPT, much less weights available till LLaMA 1.

But please, do share.


Openai - GPT2 2019 - https://openai.com/research/gpt-2-1-5b-release

Google - T5 - Feb 2020 - https://blog.research.google/2020/02/exploring-transfer-lear...

Both of these were and still are used heavily for on-going research and T5 has been found to be decently useful when fine-tuned.

Weights were available for both.



> Can't name one art model you could run on your GPU from a FAANG or OpenAI before SD

Google published dozens to promote Tensorflow:

https://experiments.withgoogle.com/font-map

https://experiments.withgoogle.com/sketch-rnn-demo

https://experiments.withgoogle.com/curator-table

https://experiments.withgoogle.com/nsynth-super

https://experiments.withgoogle.com/t-sne-map

The list goes on. Many are source-available with weights too.

> can't name one LLM with public access before ChatGPT, much less weights available till LLaMA 1.

Do any of these ring a bell?

- DistilBERT/MobileBERT/DeBERTa/RoBERTa/ALBERT

- FNet

- GPT2/GPT-Neo/GPT-J

- Marian

- MBart

- M2m100

- NLLB

- Electra

- T5/LongT5/T5-flan

- XLNet

- Reformer

- ProphetNet

- Pegasus

That's not comprehensive but may be enough to jog your memory.


I understand your point.

The gap in communication is we don't mean _literally_ no one _ever_ open-sourced models. I agree, that would be absurd. [1]

Companies, quite infamously and well-understood, _did_ hold back their "real" generative models, even from being available for pay.

Take a stab at a literal definition:

- post-GPT-2 LLMs (ex. PaLM, PaLM 2)

- art like DALL-E, Imagen, Parti

Loosely, we had Disco Diffusion for art, and GPT-3 for LLMs, and then Dall-E, then Midjourney. That was over an _entire year_, and the floodgates on private ones didn't open till post SD/ChatGPT.

[1] thank you for the lengths you went to highlight the best over a considered span of time, I would have just said something snarky :)

[2] I did not realize FLAN was open-sourced a month before ChatGPT, that's fascinating: we're stretching a bit, beyond that, IMHO: the BERTs aren't recognizable as LLMs.


All good. I've also been working on LLMs since 2019-ish, so I wanted to toss a hat in the ring for the underrepresented transformer models. They were cool (eg. dumb), fast and worked better than they had any right to. In a lot of ways they are the ancestors of ChatGPT and Llama, so it's important to at least bring them into the discussion.


> Can't name one art model you could run on your GPU from a FAANG or OpenAI before SD

CLIP could be used as an image generator, slowly.

> and can't name one LLM with public access before ChatGPT, much less weights available till LLaMA 1

InstructGPT was available on OpenAI playground for months before ChatGPT and was basically as capable as GPT3, people were really missing out. Don't know any good public models though.



In the image generation space, weights were never released for Imagen and DALL-E, but yes, you can find weights for more specialized generative models like StyleGAN (2, 3, etc.). Stable Diffusion was arguably one of the most influential open model releases, and I think the substantial investment in Stability AI is evidence of that.


There were open reproductions of DALLE1 like ruDALLE.


GPT-2, GPT-J, XLNET, BERT, Longformers and T5 were all freely available before Stable Diffusion was even a press release.


Stable Diffusion 1 contains a model OpenAI released. The CLIP encoder that was trained on text/image pairs at OpenAI.

https://huggingface.co/runwayml/stable-diffusion-v1-5

https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/m...

Uploaded to Hugging Face Jan 2021

https://huggingface.co/openai/clip-vit-large-patch14


Unfortunately MusicGen's output quality isn't strong enough. I applaud Meta for open sourcing it. The audio samples released for Stable Audio show much more promise. I look forward to code and model releases. I built out a Cog model for MusicGen and took it for a fairly extensive test drive and came back disappointed.


The way I see it regarding the point "but Meta is also releasing models": there was one span of time, between say 2014 and 2019, when ML was mostly just classifiers (nothing generative). People did open source those. Then there was a period between 2019 and 2023 when generative AI was possible. It's true that Meta is finally releasing models in that space now. But there was an excruciating 3-4 year period between 2019 and 2022 before Stable Diffusion was finally made and released, which opened the floodgates to others doing so as well. I'm eternally grateful to Emad and Stability AI for opening the gates that had been titillatingly closed for 4 annoying years.


Nah it's not because without the releases of Stability/ChatGPT it'd be the same situation. Cool nihilism though


Stability def helped push things forward; it probably even showed them that open source is inevitable.


Is the source for this available? I found no mention of it on the page.


It says source is coming


This is gonna be great to finetune on. There are only so many Boards of Canada / Aphex Twin songs out there, but I wish there were more, and this will let us generate more.


This is not the way.


Why not? Mostly for private use in my case. SDXL has created some beautiful works of art in my experiments and I would love to have a similar experience in the music world.


Come on, be creative and make something new instead of copying someone else.

It’s just kind of lame imo.

“Mostly” private use? Mmm. :thumbs down emoji:


I meant private use and maybe share with a few friends. I actually agree with you that we probably shouldn't finetune on great artists and try to sell the output without modification or added creativity. Private or close friends sharing is fun and life enriching and inspiring though in my eyes.


Boards of Canada came to their sound because in their youth, the brothers had to move to Canada for a time. Even though it was only a couple of years their experience made an indelible mark on them—particularly school days watching old National Film Board of Canada tapes on worn VCR heads.

When they moved back to Scotland and started their music they started incorporating both the machinery and the sounds from the tapes in their compositions. And they could play their compositions live. It was quite the rig.

It’s not just entertainment. It’s communicating a very specific feeling and perspective. Keep learning and create, don’t be satisfied with just copying.

The biggest difference here is in the doing. You have to grow into one mode over time and energy spent; the other is immediate gratification with minimal personal energy.

Everything valuable comes during the course of that process of growing and committing energy. And it’s so good. Don’t deny yourself.


I get that. I think it's really cool what they did and when musicians put in time and energy into making amazing tracks. I get enough satisfaction from my normal coding job though, I don't have time to dedicate my life to music like they have. So from that perspective I'm just happy that it's possible to get more music like that type. Just a cool thing that exists in the world now, but I still think working hard to realize an artistic vision is also cool, separately.


I think it's worth contemplating that if there were no Scottish brothers very temporarily moved to Canada, an entire sound may have gone un-established in the popular psyche.

So by shortchanging yourself in experience, skipping over all of the things that make art a practice and not just a material commodity, you may be missing out on such a moment. Nothing to do with what's cool or not. One has soul, the other is void.


there is very little creativity in most music already

(Axis of Awesome - Four Chord Song)

https://youtu.be/5pidokakU4I?t=52


Do you know where the sounds came from that you like so much? I think such a perspective is only reachable if you do not.

I recommend learning about that before deciding it’s satisfactory to reduce it to an algorithm suited for copying.

It’ll enrich your life. Endless copies will not. They take that music, that emergence of order out of chaos, and return it back to chaos.

It’s void.



As an amateur musician, I’d be more interested in these tools if, along with the text description, they took as input a melody or chord progression or performance data. Maybe ABC notation or a MIDI track? Anyone doing that?

Other cool things would be a way to generate a sampled instrument from a text description, or to generate a new track given a text description and all the previous tracks for other instruments. There could be a new generation of audio tools that let you generate placeholders or better for everything.


The analogue from Stable Diffusion would be ControlNet, where you can train a superimposed model on auxiliary data. This should be possible to do with chords, for example, just like you can do with human poses, 3D depth maps, etc. in Stable Diffusion using ControlNet.


It’s coming


thats gonna be dope af


Does this model support / "understand" concepts of spatial audio? For example, something like "an alarm moving around you in a circle".

When AudioGen was announced this was my first question, but from what I've been able to test the model just ignores spatial audio prompts.

Unfortunately I haven't been able to find any interest or online discussion about the importance/significance of spatial audio. Why not?


My guess is that it's not a very interesting problem because it's not particularly difficult to add spatial dimensions to arbitrary audio - after all, it is already commonly done in video games. All you have to do is manipulate the multichannel outputs with an understanding of the spatial positioning of each channel's speaker location relative to the listener and some basic trig.
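As a minimal sketch of what "basic trig" means here (stereo only; a true "circle around you" would need a surround or binaural/HRTF renderer, but the gain math is the same idea), assuming numpy is available:

    import numpy as np

    def pan_in_circle(mono, sample_rate, revolutions_per_second=0.25):
        # Spread a mono signal across stereo with an equal-power pan law,
        # sweeping the pan position back and forth over time.
        t = np.arange(len(mono)) / sample_rate
        azimuth = 2 * np.pi * revolutions_per_second * t
        pan = (np.sin(azimuth) + 1) / 2 * (np.pi / 2)  # 0 = hard left, pi/2 = hard right
        left = mono * np.cos(pan)                      # equal-power gains:
        right = mono * np.sin(pan)                     # cos^2 + sin^2 = 1
        return np.stack([left, right], axis=-1)

    # Example: a 440 Hz "alarm" tone sweeping around the stereo field for 4 seconds.
    sr = 48_000
    tone = 0.3 * np.sin(2 * np.pi * 440 * np.arange(4 * sr) / sr)
    stereo = pan_in_circle(tone, sr)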


Dolby wouldn't appreciate it.


Just like all examples of generative "AI" I've seen, there's always some bit of uncanny valley vibe present. In the audio examples, there's always this weird distortion, as if really poorly compressed sources were used as training data. The sounds are muddled together, and rarely do I hear clean musical voices. It's just a smear of sounds coming together that our brains try really hard to resolve into an "oh, that's a _____" situation. While the samples in TFA are probably the closest I've heard to date, the issue is still present.

I guess the thing that strikes me so odd about the generative thing is all of the press releases on people presenting things like it's a final product, yet it's clearly pre-release beta at best but more likely alpha versions of code in the results in quality. If a non-AI product released something that was so clearly not finished, it would be panned to no end for not working.


Everything looks very convincing apart from the airplane pilot and the sound effects. They sound very weird, as if one is hallucinating.


The airplane is super convincing as an encrypted Empire communication :-) (see Star Wars episode 5 IIRC)


The airplane one just sounds like a foreign language over a bad intercom, I think that could still be useful for some stuff.


Perhaps because generating good white noise requires randomness without autocorrelation or detectable patterns.
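A quick numpy illustration of that property -- white noise is just independent samples, so its autocorrelation is a single spike at lag zero and roughly nothing elsewhere:

    import numpy as np

    rng = np.random.default_rng(0)
    noise = rng.standard_normal(48_000)   # one second of Gaussian white noise at 48 kHz

    acf = np.correlate(noise, noise, mode="full")
    acf = acf[acf.size // 2:] / acf[acf.size // 2]   # normalize, keep lags >= 0

    print(f"lag 0: {acf[0]:.3f}")                                         # 1.000 by construction
    print(f"max |acf| at lags 1..1000: {np.abs(acf[1:1001]).max():.3f}")  # roughly 0.02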


I still consider OpenAI's JukeBox (now at least 2 years old!) far and away the most creative music AI. But the combination of coherence, sound quality and creativity of this model is (to my knowledge) easily best in class.


The sound quality of Jukebox is muddled. There are many inconsistencies. The loudness of vocals and the quality of instruments really stand out and not in a good way. Hard to talk about creativity because it's so subjective but I've found it lacking in all AI music including JukeBox. Don't get me wrong - this tech is amazing.


It's mushy and inconsistent, absolutely. But it also comes up with wild yet coherent changes that I've seen from nothing else.

At this comment I listed a few instances:

https://news.ycombinator.com/item?id=37499067


The death metal example reminded me of the continuous streaming death metal here:

https://www.youtube.com/watch?v=MwtVkPKx3RA


It's funny how some of those examples give me this creepy uncanny valley feel for music (the lowfi hip hop example) - I've never experienced it this way before.

It sort of reminds me of the audio effects they use to indicate that you're incapacitated and things start distorting in a weird way.

Entertaining !


gamechanging stuff for sample-based rap producers. haven't been able to log in yet but i think a good benchmark to start off with is to see if it can replicate the 'al green' sound from the early 70s - very distinct sounding production - drumless and instrumental.

you don't need 45 or 90 straight seconds of a coherent song rendered. just need to dip in the 45 sec clip and cut out 4 seconds here, another 4 there. reroll those cuts through stable audio, keep rolling, keep rolling. cut up and get a pile of clips together. arrange, layer, voila - you saved money on paying royalties for sampling.

the lofi melodic sample on the stability page was passable. thought the bluegrass one sounded great actually. imagine being able to program bluegrass like rap.

edit: oof. fully trained on a licensed commercial dataset from AudioSparx. muzak in, muzak out.


Ed Newton-Rex, VP of Audio at Stability, is speaking about how this was built at The AI Conference in 2 weeks. https://aiconference.com


This is yet another amazing release from Stability AI.

Will be adding this to my SaaS side grift and introduce generated music you can listen to while you're chatting with your PDFs.

Can't wait for the next one.


Another alternative: http://www.word.band

Can produce longer content, and more genres and range of music. Isn't 48khz though.


Is the extreme metal music lacking from the training set? Why do the extreme metal examples always sound horrible?


Metal is especially hard to mix in a way that keeps the voices distinct and clear. Maybe the training catalog includes a lot of low-budget metal.


Because metal sounds horrible.


This is the most impressive of this genre of SD audio so far, by a long shot. Really impressive!


As a hobby musician, why don't they start with instrument samples? Every sampler user out there would love a button press generate sample on the fly as a plugin. It would blow away gigs and gigs of ridiculous duplicative or near duplicate samples.


So... wait for a Llama for audio and train your own voice, so you can call your friends via text and the software instead of actually saying the words? This is going to be nice for authentication, proving to a third party that you are yourself.


Just wait for iOS17 https://youtu.be/oMt02DNbQlk


The bluegrass one is super weird. I can’t identify exactly why.


I can identify a bunch of things. The chord structure jumps all over randomly in a genre that usually does the opposite. The banjo is clearly not an actual banjo being strummed/frailed, but a weird agglomeration of bright toned instruments including both frailed/scruggs banjo and dobro, and maybe harmonica and fiddle creeping in. The AI doesn't know it's making a combination of instruments, so where it's trained on instruments blending, it thinks it can produce pre-blended sounds. I guess maybe this is more like a return to being a child hearing music for the first time with no preconceptions or expectations.


I think the super weird part is that it's not great? I understand this is most likely very impressive technologically but musically it is disjointed, inconsistent and fake sounding. Most of the "music" examples have weird phrasing and confusing harmonic rhythm.

Kudos to stability.ai for achieving this as I am sure it took a lot of effort and this is a huge leap forward in terms of generation of audio by generative AI.

However as a musician (BMus and MMus at 2 different conservatoires) I think it's important to say that the job risk being experienced by creative writers will not be extending to musicians... yet.


I feel like music composition is a fundamentally hard task for AI. Music production seems like it should be a lot easier but I haven’t seen that


From what I've seen in the generated tracks so far (this one and others), they're pretty good locally, but they just ignore the overall composition. For example, any generated blues track will have a vague blues feel, but won't keep the 12-bar style. The bluegrass example here doesn't even seem to keep to 4/4 (or is extremely fluid about it...). Maybe one day someone will add higher-level "what's the current section, how far are you into it" inputs to that model to get something better - literally preparing the structure first and then filling it in. That should get much better results for context like "you're playing blues in A with quick change and generating bars 3-4, match the previous bars in style".

I mean, ChatGPT knows how to plan this out https://chat.openai.com/share/976077c0-138b-4363-8065-3c8eed... Painting in that picture should be much easier than generating something free-flowing. Generating a good structure isn't that hard for most styles, because you can literally use the same pattern and do a few random changes that keep the key. (See lots of pop songs using the same 3-4 chord progression.)
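As a toy sketch of that "structure first, then fill in" idea (nothing Stable Audio actually does -- just the kind of bar-level plan that could be computed up front and handed to a generator as conditioning):

    # 12-bar blues plan in a given key; "quick change" puts the IV chord in bar 2.
    DEGREES = {"I": 0, "IV": 5, "V": 7}   # semitone offsets from the root
    NOTES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

    def twelve_bar_blues(key="A", quick_change=True):
        pattern = ["I", "IV" if quick_change else "I", "I", "I",
                   "IV", "IV", "I", "I",
                   "V", "IV", "I", "V"]
        root = NOTES.index(key)
        return [NOTES[(root + DEGREES[d]) % 12] + "7" for d in pattern]

    plan = twelve_bar_blues("A", quick_change=True)
    print(plan)       # ['A7', 'D7', 'A7', 'A7', 'D7', 'D7', 'A7', 'A7', 'E7', 'D7', 'A7', 'E7']
    print(plan[2:4])  # ['A7', 'A7'] -- the bars a model would be asked to fill in (bars 3-4)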


Yes, and surprisingly so. I never would have guessed we'd have AI stock photographs before AI muzak.


The AI seems to understand 4/4 time but doesn't understand groupings of 4 measures into phrases. It definitely doesn't understand ABABACA or even the basic parts of a song.

It is the musical equivalent of a meandering paragraph.


Absolutely. And all AI for music I have seen suffers that problem.

It makes me wonder whether the music generation should be stratified -- a coarse model lays out where parts like verse and chorus are, what distinguishes them, how to transition, etc., and then a finer-grained model fills in the details.


Also, the music is not bluegrass as much as it is old-time, a confusion that continually irritates old-time players.


You are right it feels off.

The position of the guitar in stereo is all over the place, higher frequency elements appear to come from the left while other parts are more centered.


Same for the death metal


I would love to use it for background music when I am working. I have specific tastes that depend on the task, mood, energy level, and ambiance.


If you're not attuned to Cryo Chamber (label), check them out. Maybe not fitting all use-cases, but a strong and deep catalogue.


Seems nobody has really cracked vocals with songs...


I wonder if it makes sense to generate a combo of instruments rather than individual voices and then combine those with an arranger DNN. I would think it'd be much easier to capture each instrument's transients and dynamics that way, much less allow more subtlety in how they combine, like allowing the lead voice to shift among instruments, or even let the listener choose how each voice expresses stylistically and how they should combine.

Trying to do all of that in a single DNN, much less parameterize it useably seems overly ambitious (or will be of more limited value ultimately).


i want something that can take in a song and transform it into a different genre


it's funny how they're all very impressive except the death metal


Humans take a long time to get good at art; in the meantime they still have to eat.

So they compete with generative AI for a fixed number of jobs. The AI is cheaper and faster. Humans stop training to become artists.

Without new training data, the generative AI models stagnate. Progress in art stops globally, forever.

But for a brief glorious moment, we were able to say "huh, that's not bad".


This is by design - the capital that backs modern art isn’t doing it for love of the art but for money.

For fine art, it’s a way for them to launder money and keep it out of bank accounts where it can be seized trivially.

For mass art, it’s about selling to enough rubes to make a profit.

Neither are impacted by a stagnation in art. If anything, they’re aided by it - suddenly the art you bought to launder money retains its value because it’s no longer the flavor of the week with the arts crowd.


This is crazy tech!



