Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion (stability.ai)
374 points by JonathanFly on Sept 13, 2023 | 203 comments



The solo piano was interesting because of how clean it is. I can imagine going from that sample to a score without too much difficulty. Once it's in a symbolic format it becomes much more flexible and re-usable.

While this does not seem to be the trend, I hope more gen AI in the audio and visual realms starts to produce more structured / symbolic output. For example, if I were Adobe I would be training models, not to output full images, but either layers or brush strokes and tool palette usage. Same for organizations that have all the component tracks of music to work with.


Yessssss! I thought about MusicGAN and Markov chains last night thinking “Why can’t we just codify all chords and use a GAN to generate markov chains on chords of a key and have AI generate instruments and waveform from those chains?” IANA researcher but in my head, that sounded logical.


That's existed for decades. It's called Band in a Box. It's also cheezy as hell.


lol, no. Not autogenerate midi (although their latest versions of BiaB are pretty darn good now) but generate waveforms together. It would be similar to having AI generate whole scores of music but ensuring it's all in sync and in key. Not taking sample database of 88 sound files and triggering them when the midi-note strikes.


That's not how BiaB works at all. It has all kinds of patterns built into it. So it knows how to generate, say, a bluegrass bassline in a given key. There are plenty of ways to play back MIDI with high sound quality, including feeding it into an AI-driven VST like NotePerformer.


Then explain why you categorize it as cheesy? Sounds like it’s pretty cool.


he's saying the output is cheesy. it sounds like stuff you'd hear on a demo track for a kid's toy piano


It's not THAT bad, it's just that for a program targeting jazz the playback is rather... square.


Yeah I definitely imagine certain genres sounding off due to the rhythmic nature of computer timing. Jazz doesn’t follow rules like that so good jazz has these timing idiosyncrasies that make it sound the way it sounds. That and a ridiculous obsession with adding half step and quarter tone intervals.


This is almost the exact approach I took for my project Neptunely (https://neptunely.com/). Working on bringing it to a VST at the moment so it's more portable.


I strongly agree about generating "editables" rather than finalized media. In fact, that's why text generators are more useful than current media generators: text is editable by default. Here's a tweetstorm about it: https://x.com/jsonriggs/status/1694490308220964999?s=20


Audio is definitely editable. While generative audio is new, I am hopeful that a host of interesting applications (audio2audio etc.) will emerge within its ecosystem. Promising signal separation (audio-to-stems) and pitch detection tools already exist for raw audio signals. If you want to force Stability to focus on symbolic representations (such as severely lossy MIDI), I hope you will instead first try adapting to tools that work fundamentally with rich audio signals. Perhaps there will be room for symbolic music AI, and perhaps Stability will even develop additional models that generate schematic music, but please, please don't sacrifice audio generality for piano-roll thinking alone. LoRAs will undoubtedly be usable to generate more schematic audio via the Stable Audio model -- I imagine they could easily be repurposed to develop sample libraries compatible with DAW (digital audio workstation), sequencer and tracker production workflows.


Audio is editable, but editing it is a much rarer skill than text editing. Anyone who has completed primary school has basic proficiency in editing text, and those who have gone through college or held a job where email communication is common have many years of experience with it.


What is audio2audio? Can I beat box into a mic and have professionally produced tracks come out the other end?


Train the model with midi notes as text in the prompt and the audio as target. It will learn to interpret notes.


Not all music is well represented with notes, nor are audio datasets with high-quality note representations readily available. But I guess if you work hard enough you can get close: https://www.youtube.com/watch?v=o5aeuhad3OM My example still sounds like the chiptune simulation that it is, however.


It's OK: the model would create music even from a vague prompt, and it will learn even better from the notes, imperfect as they are, because it has the interpreted version in the training target.


That raises an interesting difference between cleaning AI-generated sound and cleaning ordinary recordings. In an ordinary recording, there is an objective reality to discover -- a certain collection of voices was summed to create a signal. With (most? the best?) existing AI audio generation, the waveform is created from whole cloth, and extracting voices from it is an act of creation, not just discovery.

I've come across AI-generated music that outputs something like MIDI and controls synthesizers. Its audio quality was crystal-clear, but the music was boring. That's not to say the approach is a dead-end, of course -- and indeed, as a musician, the idea of that kind of output is exciting. But getting good data to train something that outputs separate MIDI-ish voices seems much harder than getting raw audio signals.


Generative models can certainly create midi, but no one has done it yet. Given the technique is making video, audio, images, and language, all you need to do is train and build a model with an appropriate architecture.

It’s easy to forget this is all pretty new stuff and it still costs a lot to make the base models. But the techniques are (more or less) well documented and implementable with open source tools.


> Generative models can certainly create midi, but no one has done it yet.

Note sequence generation from statistical models has a long history, at least as long if not longer than text generation.

Have a look at section 2.1 of this survey paper [0] that cites a paper from 1957 as the first work that applies Markov models to music generation.

And, of course, plenty of follow-up work 6 decades later on GANs, LSTMs, and transformers.

[0]: https://www.researchgate.net/publication/345915209_A_Compreh...


Yes, in fact I think at some point everyone has written their own Markov generators or at least run Dissociated Press. But we've really only seen meaningfully high-quality output over the last few years.


I think it depends on how you define that. People were quite happy with HMM-based MIDI generators that could generate Beethoven- or Mozart-like sequences 10, maybe even 15 or 20 years ago. But of course other people pointed out the problems of it being boring eventually. Then LSTMs improved long-term dependencies and people were impressed by the improved quality of generating whole musical pieces. But still others thought it was not good enough. Then the goalposts moved again with transformers and neural vocoders and now we want top-40 direct audio generation. And these latest systems can kind of sort of do it! But still there are people who demand better. And so on, things will continue to improve.

Progress only moves as fast as expectations, and expectations move with technology. Music is not special in this respect. So you could say at any given time in the past that some people "see meaningfully high quality" and others are disappointed. You see exactly both these sides of the spectrum even now with text-to-image and text-to-audio technology.


> cites a paper from 1957

By Fred Brooks no less…

https://en.m.wikipedia.org/wiki/Fred_Brooks


Do you know if anyone has tried training a text-to-music or text-to-midi model where the training data includes things like emotion labels for each note interval or chord progression?


That sounds expensive and inefficient. People's interpretations of music (and abstract art more generally) can be shockingly different; I suspect the model would not get a clear signal from the result.

But that makes me wonder to what extent labeling can be programmed -- extracting chord changes, dynamics changes, tempo, gross timbral characteristics, etc.


And maybe even labels like popularity/play count/etc so it has a better sense of what “sounds good” to certain groups


It has been done - first by OpenAI (MuseNet, which is no longer available) and later by Stanford (Anticipatory Music Transformer): https://nitter.net/jwthickstun/status/1669726326956371971


I believe Spotify's Basic Pitch[0] is already some work towards building something like this.

[0]: https://basicpitch.spotify.com/about


We’ve done it! wavtool.com


That’s really neat. How long have you been working on this?


Thanks! It grew out of an old side project. Been full time on it since December.


Having music editable for human post production is necessary for most professional adoption. Generating MIDIs would make much more sense than generating raw audio.

This is what we do with AI images: you can fix them in Photoshop, etc. You cannot do this for raw audio due to how music is produced.


Build or seek out a MIDI-generating model. I hope Stable Audio is never the place for that. MIDI is deeply lossy, and it would be a tragedy if it were the only music representation. Imagine if instead of phonographs, compact discs and streaming audio we only had piano rolls. What a loss indeed.


Midi is not lossy, midi is symbolic. There's a huge difference.


MIDI is not inherently lossy. You could encode anything in it, just as you can encode any novel as an integer.

In practice, though, transformations from audio to MIDI discard an enormous amount of important information, with the possible exception of transcriptions of performances on piano (where volume, frequency, duration and a good physical model of a piano are enough to reconstruct everything important about the signal) and similar instruments.


MPE/MIDI 2.0 pretty much theoretically fixes the "discard an enormous amount of important information", in that you can have an extremely large number of parameters describing every single note and/or physical performance nuance.

Contemporary MIDI is ready to describe almost any human performance in an almost absurd level of detail. Whether anything can do something useful with the description is a different story.


No, it's lossy. It's an event model at a fixed data rate. You can only do so many things sequentially, even if you could represent any possible musical concept as a MIDI event. So even if you're not sticking to note-on, note-off, it's still extremely lossy.


MIDI is able to accommodate nearly everything that can be represented through a musical score and instrumental performance. What are you hoping to accomplish with AI-generated waveforms that can't be done with MIDI?


Absolutely not. I can write a score where 16 people each strike a xylophone note at the same instant. I pick xylophone because it's a very sharp transient, and in theory the performers could perform so well that it's literally one attack. If that's not enough, make an XM file with 16 channels and guarantee that they'll all be the same attack, or run a modular synth and some mults and trigger it off the rise time of a clock pulse, no computer involved.

Midi has to fire all the notes one at a time. You cannot represent even everything represented by a score through MIDI because it's serial and sends one message at a time. Stretch that out to 16 sharp clicky attacks and you'll notice that MIDI cannot fire them all at once, it'll blur.


The MIDI1 spec allows for up to 16 channels on each port, and would definitely be capable of handling 16 xylophones at once. It's complicated, but the current MIDI spec (and the pre-spec MPE extension) combines data, so what you think of as 16 xylophones is to the MIDI device 1 instrument consisting of 16 xylophones.

However, if you want to control attack, reverb, tone, volume, i.e., all the stuff that makes it seem like a real instrument and not a computer-generated tone, then you would need to dedicate some of the 16 channels on each port to each of those different controls. To handle 16 xylophones, you would need a MIDI interface with enough ports to handle the note data plus however many control channels you want to use. (Note that from the interface's perspective, the data incoming from 1 port is just a single channel, so an "8-channel" MIDI interface is actually handling 128 channels of MIDI data.)


You're not following me. It's a single serial bus at 3,125 bytes per second. Expressive controls like polyphonic pressure take at least two bytes, sometimes three: that's for a single message. A note on is typically three bytes, channel, key and velocity.

All these messages have to take turns. It's not a tracker, or a modular synth, where you can parallel click rise times electronically: it's not DINsync, where a chain of voltage pulses synchronize individual sequencers and aren't themselves notes.

1000 messages a second (at three bytes for each note-on) seems like a lot but it really isn't. With your 16 instruments (any drum, the xylophones, whatever) you can fire about 62 notes on all instruments per second. That seems like a lot too, but it's a hard limit, and it means your sixteen instruments have to 'blur' across 16 milliseconds to all fire off a note.

That means every time you fire all the instruments as one click, instead the attacks make up a 960hz tone. That's NOT one attack. Humans on percussion instruments can do better than that. It's the equivalent of 5.5 feet of space between speakers playing back: if you're time-aligning a tweeter and a midrange to produce a unified click, and you misalign one of the drivers by putting it five and a half feet back from where it would be, you'll notice the misalignment. Midi's timing issues are also noticeable.
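A rough back-of-the-envelope sketch of that arithmetic, assuming plain 3-byte note-on messages sent over a single MIDI 1.0 DIN port with no running-status optimization:

    # Rough MIDI 1.0 throughput, assuming 3-byte note-on messages
    # (status, key, velocity) and no running-status optimization.
    BAUD_RATE = 31_250        # bits/second on a MIDI 1.0 DIN cable
    BITS_PER_BYTE = 10        # 8 data bits + start bit + stop bit
    NOTE_ON_BYTES = 3

    bytes_per_second = BAUD_RATE / BITS_PER_BYTE            # 3,125 bytes/s
    note_ons_per_second = bytes_per_second / NOTE_ON_BYTES  # ~1,042 messages/s

    simultaneous_notes = 16   # e.g. 16 xylophones struck "at the same instant"
    spread_ms = 1000 * simultaneous_notes * NOTE_ON_BYTES / bytes_per_second

    print(f"~{note_ons_per_second:.0f} note-on messages per second")
    print(f"16 'simultaneous' note-ons are smeared over ~{spread_ms:.1f} ms")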


Each MIDI port is serial, but multiple ports on an interface can operate in parallel.

Consider that Hollywood composers and music producers are able to use MIDI without issue for live-previews of orchestral compositions and live recordings and performances. The problem you are raising simply doesn't exist in the real world, accept that there are things about MIDI you don't understand and move on.


I can certainly accept that for you, the real world is multiple MIDI ports in parallel. You're correct in that if everything's in parallel on as many ports as you need, that'll help a lot.

Still ain't quite DINsync or modular, but there ya go :)


This is absurd. Sure, someone below posits that MIDI could perhaps represent a piano performance by Arturo Benedetti Michelangeli. I think it has been able to do a passing job at that, when you provide a decent piano. Regardless, piano rolls have been able to come close since the early 20th century. But how well does MIDI represent music performed by John Coltrane? Jimi Hendrix? It falls on its face. The long fetishized Western music notation abstraction, which MIDI poorly simulates, completely fails for many important examples of music. I would even venture to say that MIDI fails for most of them. But yes, MIDI is well-optimized for piano music where an acceptable piano or simulation is available.


These diffusion models give an ability to create far more than can be communicated with MIDI + instrument. Riffusion gave a hint of this - rather than just notes and drum hits + some processing, it becomes one big pulsating, expressive mass which would not be reproducible without the granularity of a diffusion model. These are reminiscent of some of the serendipity of live recordings with lots of tracks where interesting things happen from interplay of many different layers. Generating a few dozen clips generally would give me 2 or 3 with a beautiful emotional passage which really lights up the pleasurable music part of my brain. Mass farming these clips seems like a good route to some amazing music.


The problems have long been known and articulated: http://www.music.mcgill.ca/~gary/courses/papers/Moore-Dysfun...


Your example of the failures of MIDI is based on a 35-year-old paper (from 1988!!!) about an earlier version of MIDI?

When that paper was written it took several weeks and many millions of dollars of equipment to render primitive, mono-color 3d graphics. Desktop computers had 512 kilobytes of RAM and the highest-end desktops 32 MB of hard drive storage space. Computer screens had two colors: black and green. Audio cards capable of making beeps and clicks were the cutting-edge. WIFI was still a decade away.


You clearly didn't read the article, and clearly don't understand how prevalent the MIDI 1.0 specification is today. MIDI 2.0 is a very recent development (this year LOL!) and has yet to be commercially adopted. The 1984 design is what is largely in use today. At the time of initial development, commercial synthesizers, not sound cards, were the intended generators of sound utilizing MIDI: https://www.vintagesynth.com/roland/juno106.php.


MPE, which is the model for the most important part of MIDI 2.0, has been around for several years now; there are numerous software synths (and a few hardware ones) that can use it, and several hardware controllers ("instruments") that can deliver it.


> What

"Intention" (as a tentative term)

The question becomes: what has impeded the creation of a MIDI file that can be confused with an actual concert from Arturo Benedetti Michelangeli.


Literally nothing is preventing this other than that nobody has bothered to take the time to do it.

The current version of MIDI is capable of replicating any of his performances, even down to the randomness.

Note that if you want to replicate the audio quality of his performances, you will need a high-quality MIDI instrument; the ones that ship with Windows will not suffice. These MIDI instruments can range from a few dollars to thousands of dollars. (See, e.g., Native Instruments)


> nobody has bothered to take the time to do it

In that case, we have a theoretical suggestion that «nothing is preventing this», but not actual proof based on a "Turing test"-like scenario that would have specialists fooled, to corroborate that the new MIDI 2 would suffice.


I gave you the knowledge to do this research yourself, but since you are unwilling to do so, here is an example of the performance possible using MIDI instruments: https://www.youtube.com/watch?v=CvaChiq6gf0

As it is clear that you intend to keep shifting the goalposts to make a point that can't be made, I will withdraw from further participation in this thread.


well ok, you could say it's a lossy format for capturing physical movement, which it certainly is. My point is that it is not a format for capturing music any more than a score on a page is a format for capturing music. Both are instructions for a performer of music (one machine, one human), which is a very different beast.


MIDI 2.0 improves a lot of things, such as how much dynamicity and variation you can have. MIDI 1.0 is a standard from the early 80s. It indeed has shortcomings, but the upside is editability.

It's then up to the remaster / musician / actual interpretation / post production to turn the score into something less of an event model.


MIDI is an extraordinarily lossy music representation. Even Claude Shannon would facepalm at the assertion that it could, in theory, represent audio faithfully. It is not its purpose, it is decidedly not its practice, and it is a ludicrously irrelevant example of pedantry to say otherwise. The false equivalency asserted by the commons can be aggravating :D


MIDI is not a lossy format for audio because it's not a representation of audio, period. It's a format for conveying the motion of a piano keyboard, meant from the beginning to be usable for various forms of audio.


>For example, if I were Adobe I would be training models, not to output full images, but either layers or brush strokes and tool palette usage. Same for organizations that have all the component tracks of music to work with.

I really like this idea. Create new tools for artists to use, rather than whatever we're accepting as use now. Current full-image generation is boring to me in the same way the choice of invisibility as a superpower is. Invisibility ultimately slides into pervy tendencies, just as deep fakes will slide the same way, or into some other inappropriate use.


Hopefully, the entire industry will NOT move in such a schematic and lossy direction. Use separate tools to analyze audio streams please. Don't throw the timbre baby out with the bathwater. MusicGen utilizes a tokenized transformer model for music, which is attractive for symbolic translation use cases. However, the overall audio quality is far more lossy than the examples you hear from Stable Audio. I believe that symbolic representation should not be a foundational approach to adequately represent and generate rich audio signals.


I was wondering the same thing, definitely seems like generating the raw waveform runs into all kinds of weird issues (like they touched on in this post). I would imagine that training data would be a serious chokepoint here. Given how much discourse is currently kicking off around the intellectual property rights of just the final product (the mastered track), I can't imagine many musicians would be eager to share what is effectively the "proof of ownership" (track stems or MIDIs).


There are a lot of Lora models that are being made to generate textures, maps, diagrams, backgrounds, etc. You don’t need to wait for adobe, open source models like stable diffusion let you do whatever you think is useful. I’d look to the open source world for creative innovation. Adobe is just doing what’s on the product management roadmap.


I am approaching this from the symbolic angle via MIDI at neptunely (https://neptunely.com/)


It's interesting tech, but none of the musical pieces impressed me (I play multiple instruments and have written and arranged music); most sounded too repetitive and not very imaginative. This is also an issue with diffusion-based art AI in general: it's good at a limited set of things but gets rather repetitive after a while. I could see using this as background music where quality is not important, like in games, though I doubt you could run the AI generator inside a game; you could generate the output as an asset. People like Hans Zimmer and Ludwig Göransson have nothing to worry about.

Singing would be an interesting experiment, but I don't see that here.


I disagree with your diffusion based art assessment, and I think it’s probably colored by what most people seem to want to make with it. Just as with regular art, you need some sort of vision to go beyond what everyone else does. Prompting “pretty girl wearing sexy clothing” for the umpteenth time isn’t new.

AI art gets rid of the technical skill step but the rest is still there, although you may luck into something at random. If you're using ControlNet on Stable Diffusion or training your own models, you have a lot of control over the output as well.


I thought the "epic trailer music intense tribal percussion and brass" example was pretty good. Rather, good enough for something like a video game where the game engine is dynamically generating music based on the present situation in the game.

I could easily find that music entertaining if it started playing the moment my character triggered a trap and suddenly, "the floor is lava" or my character enters a scene with the quest of winning over one of the love interests =)


I'm surprised that you even leap to people like Hans Zimmer and others.

The people we need to worry about are aallll of the people earning a living from everything else, like background music for indie games, ambient music, etc.


I thought “Prompt: 116 BPM rock drums loop clean production” wasn't bad, but I'll grant that most of the rest would be an excellent way of showing (for instance) a death metal fan what their favoured music sounds like to those who don't have an ear for it :D


>where quality is not important, like in games

I don't believe that's a good example. Video game music is an important part of the gaming experience, but it's often taken for granted or overlooked.


What I haven't seen done well by these generative AIs so far is structure (having a chorus, a verse, a bridge, ...) and harmonic movement/progressions (except for maybe a V-I or I - VI - ii - V). And those two things are exactly what makes a song interesting and non-repetitive.


OpenAI's jukebox -- now 3 years old -- is creative and non-repetitive. Witness, for instance, its jam on Uptown Funk here:

https://www.youtube.com/watch?v=KCaya74_NHw

Or the changes shortly after 1:15, 2:15 and 2:40 in these extensions of Take On Me:

https://www.youtube.com/watch?v=_3yOrUJ0SzY


Okay, but here it still seems like it's just taking over snippets. It did not come up with the progression.


If inventing new chord progressions is one of the requirements for musical creativity then neither Handel nor Paul Simon would qualify.


Yeah, the "rock drums" example was like a student in a practice. I'll be impressed when it can sound like Danny Carey.

From all of the hype, I want to be impressed with results. Instead, we get these mediocre at best examples of what it can do. They are not good sales pitches to me.


Sir, your dog can talk!

Yes, but not very well.


You joke, but even those videos of people saying their dog can talk are just like this. It's cute because it's a real dog making sounds we really want to believe in, when it's really just them mimicking sounds because they get pettin's and treats.

What I want is "AI" to do something impressive. Why are we trying to make the system generate the sounds itself? We don't make artists do that, we give them instruments. Give the models actual instruments, and then have it play them like a real artist. I will be much more impressed with an AI that understands composition and scoring, use of musical voices, key signatures. That would still be generative. I guess I just don't understand the point of the direction being taken. It's like a solution looking for a problem.


We work with what we have. We don't have a lot of recordings of the physical movements of musicians; we have recordings.

Similarly, we don't have recordings of the actions of painters; we have finished paintings -- but if you're not impressed with what AI can do in the visual sphere, your standards are, to put it mildly, high.


I'm not really sure how to take this. We absolutely have recordings of instruments. You can buy them as complete sets. You train on complete recordings, and then tell it how to use the sampled instruments to compose a song in the style of the trained data. Building something to make a waveform that looks like another waveform just seems like a very odd direction to take.

Yes, my standard is: if it isn't at least as good as what's available now, what's the point?


Well, we found with Midjourney et al that these models can work very well despite having no pre-conceived or symbolic notions of composition, color theory, perspective or anything. Yet they can produce really good results in the image generation space. It's the same idea here, except much earlier days.

In the same way, many successful musicians can't read sheet music or know music theory, they just know how to produce something that sounds good.


>In the same way, many successful musicians can't read sheet music or know music theory, they just know how to produce something that sounds good.

Right, because they can operate the instruments that make the sound with natural talent, but they don't have to draw the waveforms. Audio generation is much different than image generation. It's just very odd to me.


Hate to break it to you but there is a vast market for mediocre content, and even Danny Carey has a bad day once in a while.


I think you've confused the vast market's forced acceptance of mediocrity (because that's what's available) with their wanting the mediocrity. Proof of that is seen with the emergence of places for affordably licensed royalty-free options. The quality of production and styles today compared to those from the 90s/00s is amazing. There have been a few options on these sites that I would play in a set as a DJ. This ain't yo momma's needle drop selections.


This is the same kind of comment that got HN to seethe for months about how ChatGPT isn't the god programmer some clickbaity news sites claimed it was.

ChatGPT is good in the sense that having it is better than not having it, especially with how bad Google has become. Audio generation will also be good in this way: some people don't need your "musical expertise", just some calm background music to use with a tutorial video without having YouTube take it down for copyright infringement.


Yes, while working on my AI Melodies Assistant project, it quickly became clear that generating pleasant but boring music isn't too difficult. To create a catchy tune, an element of surprise is essential. In the end, I was able to use it as an assistant to compose 60 melodies that I'm happy with (https://www.melodies.ai/)


The real test is when this stuff is out in the wild and no one tells you it’s AI and the thought doesn’t cross your mind. Of course it’s not impressive nor surprising when the answer was given up front.


gamers don't want bad music either.

some go to video game music concerts or to fan covers


Thank you for sharing! On a tangent: I'm wondering if there are any good open source models/libraries to reconstruct audio quality. I'm thinking about an end-to-end open source alternative to something like Adobe Podcast [1] to make noisy recordings sound professional. Anecdotally it's supposed to be very good. In a recent search, I haven't found anything convincing. In my naive view this task seems much simpler than audio generation and the demand far bigger, since not everyone has a professional audio setup ready at all times.

[1] https://podcast.adobe.com/


We've been researching an audio denoiser for music that we will present at the AES conference in October. Description page: https://tape.it/denoising

We'll also publish a webapp where you can use the denoiser for free. Mail me if you want beta access to it (email in profile).

It won't be open-source though, although the paper will of course be public. It will also only reduce noise, and not reconstruct other aspects of audio quality. However, it can do so on any audio (in particular music), not just speech like Adobe Podcast, and it fully preserves the audio quality. It's designed exactly for the use case you want: to make noisy recordings sound professional.


Are you sure the demo sound files are correct on the website? Couldn't appreciate any glaringly obvious differences between the original and denoised with studio grade headphones here. Or, the originals aren't noisy enough.


denoising seems to fail in the guitar and vocals example


Can you clarify where it fails? It's designed to remove stationary noise only, and removes it very well in the guitar and vocals example.

Generally speaking, if you have other sounds that you don't want in the audio, we don't remove them - it's hard to decide from a musical point of view whether you want a certain sound or not. To give an extreme example: a barking dog probably doesn't belong in a Zoom conference, but it may very well belong in your audio recording. Removing such elements would be a creative decision.

The guitar and vocals example has certain clicks in the background that we don't remove - but the stationary noise is gone. Existing professional (and complex) audio restoration tools like iZotope RX don't remove those clicks, either. It's a conservative approach, sure, but in return you can throw any audio at it and it always improves it.


It's not open, but Nvidia has RTX Voice for free if you have an Nvidia card.

Only weird thing is it's designed to be used in real time, but I've had some luck cleaning up voice recordings by replaying them back through it via audio routing.


There seems to have been a fork in the road:

On one side the tech for literal denoising has stagnated a bit. It’s a very hard problem to remove all noise while keeping things like transients.

On the other side, AI is being rapidly developed for its ability to denoise by recreating the recording, just without the noise.


In our denoiser (see other comment), we worked on combining these two forks. That’s how we can mathematically guarantee great audio quality.

This combination was non-trivial as training old school DSP denoisers is not easily possible. We’ll describe the math needed in our paper. We hope our publication will help the wider community work not just on denoising but also tasks like automatic mixing.


I have had a lot of success with this: https://ultimatevocalremover.com/ for de-noising


https://youtu.be/o-kJ4_CuWzA

This video from MKBHD's studio channel dives into this topic


It's interesting that the Death Metal was the hardest to reproduce. I conclude that it's the most fundamentally human of all genres.


Well... they hardly tried all genres :)

It sounds like it can't handle lyrics or semantics that well so I suspect any genre where the lyricism is important would also be quite mushy and recognizably AI


The Beatles seemed to be the hardest music for JukeBox to emulate.


The sound sample seemed to fit the 'vibe', but lacked any discernible definition. Could it be that it's too sonically dense to easily reproduce? Perhaps this could be improved with a more tailored training set.


I think it was just the genre that was the least represented in the training data


dadabots here: haven't gotten good death metal with it. problem is there's not really much in the dataset.


(context: i make ai death metal & also i worked on stable audio). 100% it was a dataset problem. Diffusion models still work well when you train them on death metal: https://www.youtube.com/watch?v=rlsRMQzD_6Q


It sounds more like break core haha


Now imagine Spotify using this to generate individual earworms for everybody based on their personal tastes (likes, playlists).

Yes, AI is partly hype, but had someone told me this even two years ago, I wouldn't have believed it.


This is why it’s vital that AI is openly available. Imagine a world where Spotify is the only company that can do that, and they use it to make sure they never pay royalties again.


How is Spotify for finding new music based on your tastes? I’ve only used Amazon and Pandora; Amazon is quite poor, Pandora is pretty good. I suspect (although, without proof) that if a service can’t suggest new music, it will have trouble generating new music as well.

Anyway, I very much would rather run this sort of thing locally. You could just manually set your taste profile. Plus, music can be quite personal, imagine you start listening to too much music inspired by The Cure and suddenly Amazon starts advertising black makeup and antidepressants or something like that, it would be too disconcerting.


> How is Spotify for finding new music based on your tastes?

I haven't tried any alternatives really, but so far for me I'd say decent. I put on an album, and once over it'll play similarish stuff. If I don't like a song I'll skip it and it seems to incorporate that feedback.

Only thing is that it doesn't seem to be too adventurous and it adheres rather strictly to the local context. Meaning, if I played a stoner rock track, it'll continue suggesting stoner rock and not much else, even though I have quite varied music favorited in my library.

Overall though I've found a lot of new bands I enjoy that way so, positive experience for me.

edit: as an example, here are the two most recent ones it suggested where I ended up buying the albums on Bandcamp. Both have quite few monthly listeners (1-2k), so not what I'd call mainstream.

SUIR https://open.spotify.com/artist/6zOeQ2hyNfqi9UMHtyTSlF

Mount Hush https://open.spotify.com/artist/13clfeXxTPsDsqzSlLIBZJ


I'll +1. I'm generally a FOSS guy, so if Spotify hadn't helped so much in discovering new music that I really enjoy, I'd be potentially acquiring music via questionable means and just playing them as audio files directly.

There are two companies that have done well by Gabe Newell's "piracy is a service problem - not a price problem" position: Valve/Steam (who also contribute to FOSS through Proton and SteamOS which I heavily appreciate), and Spotify. Spotify makes discovering and aggregating music so easy that the alternatives don't seem appealing.


I've tried Amazon and Pandora. Spotify is so vastly superior at finding music I like it's in an entirely different league.

Having said that, starting about a year ago (maybe ~1.5 years?) Spotify started inserting obviously paid promotion tracks into my auto-generated "Daily Mix n" playlists and it seriously bothered me. When my playlists are made up of very specific genres of EDM and suddenly a pop song plays from a famous person I get seriously angry.

It hasn't happened in many months though so maybe they learned their lesson. I was so mad I seriously considered ending my Premium subscription right then and there when that track played.


I find the Discover Weekly playlists that are made for me are pretty hit or miss, but overall I have found many new songs I like with their help.


Similar - I found many good new things. I like the discovery playlists, even with the tracks I don't like. If they never missed, how could they ever suggest something actually different and exciting? It's "discover" not "average of what you already like".


Same! New songs, new artists, new genres, it's pretty cool! I agree it can be hit or miss but it feels like the longer I use it and the older, and more mature the platform gets, the more hit discover weekly ends up being!


Similarly to others, I haven't really used other things in a LONG time but Spotify's Discover Weekly playlist is usually a list of bops that I enjoy a lot. I frequently end up adding a huge portion of them to my liked songs and to regularly used playlists!


Machine-generated music might be functionally equivalent to human-generated music, but that ignores the cultural role of art as a shared human experience - witness the liturgy of live music. That can't happen with music tailored to each listener, it can't happen without tracks that are fixed in time and can be referred to. I can imagine it well-accepted for dynamic music such as gaming soundtracks, but I suppose that machine generation will be mostly a production technique resulting in branded pieces.


Saying that's impossible makes me immediately wonder whether it's not. There are already headphone dance parties. What if a musical act's output was being interpreted through genre lenses specific to each listener?


I know quite a few technically talented musicians who have next to no creativity in actually writing the music (aside from jazz style improv sessions). Most of them never really play live unless it's part of a similarly uncreative band of college friends. I wonder if having a catchy/complex AI generated song created for them to play live might be interesting to them. Gonna check in and see what they think.


> such as gaming soundtracks

These days I also feel like my workout playlists might as well be randomly generated dance music.


Spotify does not need to generate a tailored earworm for me. It could already suggest songs that I like based on my personal taste out of their 100-million-songs catalog - and it's absolutely unable to do it.


Building a tailored earworm might actually be easier.


I really want this. I have a band that I like, and I want more!

Or I'd like to take a song I like and make it educational, like making it include the periodic table of elements.


I keep thinking back to when we didn't have stabilityai and it was just google and meta teasing us with mouth watering papers but never letting us touch them. I'm so thankful stability exists.


Stability is great but Meta's MusicGen is available with code and weights while this isn't so that's a really odd place to make that comparison and complaint.


Before stable diffusion, nobody released weights at all. Meta et al only started sharing their models with the world when they realized how fast a developer ecosystem was building around the best models.

Without stability, all of AI would still be closed and opaque.


>Before stable diffusion, nobody released weights at all.

That's not true. There's been a lot of models with weights from every player before Stability.

>Without stability, all of AI would still be closed and opaque.

Most GANs (the practically spiritual predecessor to diffusion models) for example were available. Huggingface existed and has realistically done more to keep AI open. And again, this specific release we are talking about by Stability is not Open.

Stability is great, but you are rewriting history, and doing it on the release where it makes the least sense to do so.


Nah. Dunno where this is coming from but infamously no AI models were released by big players for years. Rewind 18 months and all you got is GPT-3.0 that no one seems to care about and Disco Diffusion-y type stuff.


You are looking at a very short part of recent history. It has not been like that at all.


I'm all ears. I was "in the room" from 2019 on. Can't name one art model you could run on your GPU from a FAANG or OpenAI before SD, and can't name one LLM with public access before ChatGPT, much less weights available till LLaMA 1.

But please, do share.


Openai - GPT2 2019 - https://openai.com/research/gpt-2-1-5b-release

Google - T5 - Feb 2020 - https://blog.research.google/2020/02/exploring-transfer-lear...

Both of these were and still are used heavily for on-going research and T5 has been found to be decently useful when fine-tuned.

Weights were available for both.



> Can't name one art model you could run on your GPU from a FAANG or OpenAI before SD

Google published dozens to promote Tensorflow:

https://experiments.withgoogle.com/font-map

https://experiments.withgoogle.com/sketch-rnn-demo

https://experiments.withgoogle.com/curator-table

https://experiments.withgoogle.com/nsynth-super

https://experiments.withgoogle.com/t-sne-map

The list goes on. Many are source-available with weights too.

> can't name one LLM with public access before ChatGPT, much less weights available till LLaMA 1.

Do any of these ring a bell?

- DistilBERT/MobileBERT/DeBERTa/RoBERTa/ALBERT

- FNet

- GPT2/GPT-Neo/GPT-J

- Marian

- MBart

- M2m100

- NLLB

- Electra

- T5/LongT5/T5-flan

- XLNet

- Reformer

- ProphetNet

- Pegasus

That's not comprehensive but may be enough to jog your memory.


I understand your point.

The gap in communication is we don't mean _literally_ no one _ever_ open-sourced models. I agree, that would be absurd. [1]

Companies, quite infamously and well-understood, _did_ hold back their "real" generative models, even from being available for pay.

Take a stab at a literal definition:

- post-GPT-2 LLMs (ex. PaLM, PaLM 2)

- art like DALL-E, Imagen, Parti

Loosely, we had Disco Diffusion for art, and GPT-3 for LLMs, and then Dall-E, then Midjourney. That was over an _entire year_, and the floodgates on private ones didn't open till post SD/ChatGPT.

[1] thank you for the lengths you went to highlight the best over a considered span of time, I would have just said something snarky :)

[2] I did not realize FLAN was open-sourced a month before ChatGPT, that's fascinating: we're stretching a bit, beyond that, IMHO: the BERTs aren't recognizable as LLMs.


All good. I've also been working on LLMs since 2019-ish, so I wanted to toss a hat in the ring for the underrepresented transformer models. They were cool (eg. dumb), fast and worked better than they had any right to. In a lot of ways they are the ancestors of ChatGPT and Llama, so it's important to at least bring them into the discussion.


> Can't name one art model you could run on your GPU from a FAANG or OpenAI before SD

CLIP could be used as an image generator, slowly.

> and can't name one LLM with public access before ChatGPT, much less weights available till LLaMA 1

InstructGPT was available on OpenAI playground for months before ChatGPT and was basically as capable as GPT3, people were really missing out. Don't know any good public models though.



In the image generation space, weights were never released for Imagen and DALL-E, but yes, you can find weights for more specialized generative models like StyleGAN (2, 3, etc.). Stable Diffusion was arguably one of the most influential open model releases, and I think the substantial investment in Stability AI is evidence of that.


There were open reproductions of DALLE1 like ruDALLE.


GPT-2, GPT-J, XLNET, BERT, Longformers and T5 were all freely available before Stable Diffusion was even a press release.


Stable Diffusion 1 contains a model OpenAI released. The CLIP encoder that was trained on text/image pairs at OpenAI.

https://huggingface.co/runwayml/stable-diffusion-v1-5

https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/m...

Uploaded to Hugging Face Jan 2021

https://huggingface.co/openai/clip-vit-large-patch14


Unfortunately MusicGen's output quality isn't strong enough. I applaud Meta for open sourcing it. The audio samples released for Stable Audio show much more promise. I look forward to code and model releases. I built out a Cog model for MusicGen and took it for a fairly extensive test drive and came back disappointed.


The way I see it regarding the point "but Meta is also releasing models": there was one span of time, between say 2014 and 2019, when ML was mostly just classifiers (nothing generative). People did open source those. Then there was a period between 2019 and 2023 when generative AI was possible. It's true that Meta is finally releasing models in that space now. But there was an excruciating 3-4 year period between 2019 and 2022 before Stable Diffusion was finally made and released, which opened the floodgates to others doing so as well. I'm eternally grateful to Emad and Stability AI for opening the gates that had been titillatingly closed for 4 annoying years.


Nah it's not because without the releases of Stability/ChatGPT it'd be the same situation. Cool nihilism though


Stability def helped push things forward; it probably even showed them that open source is inevitable.


Is the source for this available? I found no mention of it on the page.


It says source is coming


This is gonna be great to finetune on. There are only so many Boards of Canada / Aphex Twin songs out there, but I wish there were more, and this will let us generate more.


This is not the way.


Why not? Mostly for private use in my case. SDXL has created some beautiful works of art in my experiments and I would love to have a similar experience in the music world.


Come on, be creative and make something new instead of copying someone else.

It’s just kind of lame imo.

“Mostly” private use? Mmm. :thumbs down emoji:


I meant private use and maybe share with a few friends. I actually agree with you that we probably shouldn't finetune on great artists and try to sell the output without modification or added creativity. Private or close friends sharing is fun and life enriching and inspiring though in my eyes.


Boards of Canada came to their sound because in their youth, the brothers had to move to Canada for a time. Even though it was only a couple of years their experience made an indelible mark on them—particularly school days watching old National Film Board of Canada tapes on worn VCR heads.

When they moved back to Scotland and started their music they started incorporating both the machinery and the sounds from the tapes in their compositions. And they could play their compositions live. It was quite the rig.

It’s not just entertainment. It’s communicating a very specific feeling and perspective. Keep learning and create, don’t be satisfied with just copying.

The biggest difference here is in the doing. You have to grow into one mode over time and energy spent; the other is immediate gratification with minimal personal energy.

Everything valuable comes during the course of that process of growing and committing energy. And it’s so good. Don’t deny yourself.


I get that. I think it's really cool what they did and when musicians put in time and energy into making amazing tracks. I get enough satisfaction from my normal coding job though, I don't have time to dedicate my life to music like they have. So from that perspective I'm just happy that it's possible to get more music like that type. Just a cool thing that exists in the world now, but I still think working hard to realize an artistic vision is also cool, separately.


I think it's worth contemplating that if there were no Scottish brothers very temporarily moved to Canada, an entire sound may have gone un-established in the popular psyche.

So by shortchanging yourself in experience, skipping over all of the things that make art a practice and not just a material commodity, you may be missing out on such a moment. Nothing to do with what's cool or not. One has soul, the other is void.


there is very little creativity in most music already

(Axis of Awesome - Four Chord Song)

https://youtu.be/5pidokakU4I?t=52


Do you know where the sounds came from that you like so much? I think such a perspective is only reachable if you do not.

I recommend learning about that before deciding it’s satisfactory to reduce it to an algorithm suited for copying.

It’ll enrich your life. Endless copies will not. They take that music, that emergence of order out of chaos, and return it back to chaos.

It’s void.



As an amateur musician, I’d be more interested in these tools if, along with the text description, they took as input a melody or chord progression or performance data. Maybe ABC notation or a MIDI track? Anyone doing that?

Other cool things would be a way to generate a sampled instrument from a text description, or to generate a new track given a text description and all the previous tracks for other instruments. There could be a new generation of audio tools that let you generate placeholders or better for everything.


The analogue from Stable Diffusion would be ControlNet, where you can train a superimposed model on auxiliary data. This should be possible to do with chords, for example, just like you can do with human poses, 3D depth maps, etc. in Stable Diffusion using ControlNet.


It’s coming


thats gonna be dope af


Does this model support / "understand" concepts of spatial audio? For example, something like "an alarm moving around you in a circle".

When AudioGen was announced this was my first question, but from what I've been able to test the model just ignores spatial audio prompts.

Unfortunately I haven't been able to find any interest or online discussion about the importance/significance of spatial audio. Why not?


My guess is that it's not a very interesting problem because it's not particularly difficult to add spatial dimensions to arbitrary audio - after all, it is already commonly done in video games. All you have to do is manipulate the multichannel outputs with an understanding of the spatial positioning of each channel's speaker location relative to the listener and some basic trig.
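As a minimal sketch of what "basic trig" means here (stereo only; a true "circle around you" would need a surround or binaural/HRTF renderer, but the gain math is the same idea), assuming numpy is available:

    import numpy as np

    def pan_in_circle(mono, sample_rate, revolutions_per_second=0.25):
        # Spread a mono signal across stereo with an equal-power pan law,
        # sweeping the pan position back and forth over time.
        t = np.arange(len(mono)) / sample_rate
        azimuth = 2 * np.pi * revolutions_per_second * t
        pan = (np.sin(azimuth) + 1) / 2 * (np.pi / 2)  # 0 = hard left, pi/2 = hard right
        left = mono * np.cos(pan)                      # equal-power gains:
        right = mono * np.sin(pan)                     # cos^2 + sin^2 = 1
        return np.stack([left, right], axis=-1)

    # Example: a 440 Hz "alarm" tone sweeping around the stereo field for 4 seconds.
    sr = 48_000
    tone = 0.3 * np.sin(2 * np.pi * 440 * np.arange(4 * sr) / sr)
    stereo = pan_in_circle(tone, sr)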


Dolby wouldn't appreciate it.


Just like all examples of generative "AI" I've seen, there's always some bit of uncanny valley vibe present. In the audio examples, there's always this weird distortion, as if really poorly compressed sources were used as training data. The sounds are muddled together, and rarely do I hear clean musical voices. It's just a smear of sounds coming together that our brains try really hard to resolve into an "oh, that's a _____" situation. While the samples in TFA are probably the closest I've heard to date, the issue is still present.

I guess the thing that strikes me so odd about the generative thing is all of the press releases on people presenting things like it's a final product, yet it's clearly pre-release beta at best but more likely alpha versions of code in the results in quality. If a non-AI product released something that was so clearly not finished, it would be panned to no end for not working.


Everything looks very convincing apart from the airplane pilot and the sound effects. They sound very weird, as if one is hallucinating.


The airplane is super convincing as an encrypted Empire communication :-) (see Star Wars episode 5 IIRC)


The airplane one just sounds like a foreign language over a bad intercom, I think that could still be useful for some stuff.


Perhaps because generating good white noise requires randomness without autocorrelation or detectable patterns.
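A quick numpy illustration of that property -- white noise is just independent samples, so its autocorrelation is a single spike at lag zero and roughly nothing elsewhere:

    import numpy as np

    rng = np.random.default_rng(0)
    noise = rng.standard_normal(48_000)   # one second of Gaussian white noise at 48 kHz

    acf = np.correlate(noise, noise, mode="full")
    acf = acf[acf.size // 2:] / acf[acf.size // 2]   # normalize, keep lags >= 0

    print(f"lag 0: {acf[0]:.3f}")                                         # 1.000 by construction
    print(f"max |acf| at lags 1..1000: {np.abs(acf[1:1001]).max():.3f}")  # roughly 0.02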


I still consider OpenAI's JukeBox (now at least 2 years old!) far and away the most creative music AI. But the combination of coherence, sound quality and creativity of this model is (to my knowledge) easily best in class.


The sound quality of Jukebox is muddled. There are many inconsistencies. The loudness of vocals and the quality of instruments really stand out and not in a good way. Hard to talk about creativity because it's so subjective but I've found it lacking in all AI music including JukeBox. Don't get me wrong - this tech is amazing.


It's mushy and inconsistent, absolutely. But it also comes up with wild yet coherent changes that I've seen from nothing else.

At this comment I listed a few instances:

https://news.ycombinator.com/item?id=37499067


The death metal example reminded me of the continuous streaming death metal here:

https://www.youtube.com/watch?v=MwtVkPKx3RA


It's funny how some of those examples give me this creepy uncanny valley feel for music (the lowfi hip hop example) - I've never experienced it this way before.

It sort of reminds me of the audio effects they use to indicate that you're incapacitated and things start distorting in a weird way.

Entertaining !


gamechanging stuff for sample-based rap producers. haven't been able to log in yet but i think a good benchmark to start off with is to see if it can replicate the 'al green' sound from the early 70s - very distinct sounding production - drumless and instrumental.

you don't need 45 or 90 straight seconds of a coherent song rendered. just need to dip in the 45 sec clip and cut out 4 seconds here, another 4 there. reroll those cuts through stable audio, keep rolling, keep rolling. cut up and get a pile of clips together. arrange, layer, voila - you saved money on paying royalties for sampling.

the lofi melodic sample on the stability page was passable. thought the bluegrass one sounded great actually. imagine being able to program bluegrass like rap.

edit: oof. fully trained on a licensed commercial dataset from AudioSparx. muzak in, muzak out.


Ed Newton-Rex, VP of Audio at Stability, is speaking about how this was built at The AI Conference in 2 weeks. https://aiconference.com


This is yet another amazing release from Stability AI.

Will be adding this to my SaaS side grift and introduce generated music you can listen to while you're chatting with your PDFs.

Can't wait for the next one.


Another alternative: http://www.word.band

Can produce longer content, and more genres and range of music. Isn't 48khz though.


Is the extreme metal music lacking from the training set? Why do the extreme metal examples always sound horrible?


Metal is especially hard to mix in a way that keeps the voices distinct and clear. Maybe the training catalog includes a lot of low-budget metal.


Because metal sounds horrible.


This is the most impressive of this genre of SD audio so far, by a long shot. Really impressive!


As a hobby musician, why don't they start with instrument samples? Every sampler user out there would love a button press generate sample on the fly as a plugin. It would blow away gigs and gigs of ridiculous duplicative or near duplicate samples.


So... wait for a Llama for audio and train your own voice, so you can call your friends via text and the software instead of actually saying the words? This is going to be nice for authentication, proving to a third party that you are yourself.


Just wait for iOS17 https://youtu.be/oMt02DNbQlk


The bluegrass one is super weird. I can’t identify exactly why.


I can identify a bunch of things. The chord structure jumps all over randomly in a genre that usually does the opposite. The banjo is clearly not an actual banjo being strummed/frailed, but a weird agglomeration of bright toned instruments including both frailed/scruggs banjo and dobro, and maybe harmonica and fiddle creeping in. The AI doesn't know it's making a combination of instruments, so where it's trained on instruments blending, it thinks it can produce pre-blended sounds. I guess maybe this is more like a return to being a child hearing music for the first time with no preconceptions or expectations.


I think the super weird part is that it's not great? I understand this is most likely very impressive technologically but musically it is disjointed, inconsistent and fake sounding. Most of the "music" examples have weird phrasing and confusing harmonic rhythm.

Kudos to stability.ai for achieving this as I am sure it took a lot of effort and this is a huge leap forward in terms of generation of audio by generative AI.

However as a musician (BMus and MMus at 2 different conservatoires) I think it's important to say that the job risk being experienced by creative writers will not be extending to musicians... yet.


I feel like music composition is a fundamentally hard task for AI. Music production seems like it should be a lot easier but I haven’t seen that


From what I've seen in the generated tracks so far (this one and others), they're pretty good locally, but they just ignore the overall composition. For example, any generated blues track will have a vague blues feel, but won't keep the 12-bar style. The bluegrass example here doesn't even seem to keep to 4/4 (or is extremely fluid about it...). Maybe one day someone will add higher-level "what's the current section, how far are you into it" inputs to that model to get something better - literally preparing the structure first and then filling it in. That should get much better results for context like "you're playing blues in A with quick change and generating bars 3-4, match the previous bars in style".

I mean, ChatGPT knows how to plan this out https://chat.openai.com/share/976077c0-138b-4363-8065-3c8eed... Painting in that picture should be much easier than generating something free-flowing. Generating a good structure isn't that hard for most styles, because you can literally use the same pattern and do a few random changes that keep the key. (See lots of pop songs using the same 3-4 chord progression.)
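As a toy sketch of that "structure first, then fill in" idea (nothing Stable Audio actually does -- just the kind of bar-level plan that could be computed up front and handed to a generator as conditioning):

    # 12-bar blues plan in a given key; "quick change" puts the IV chord in bar 2.
    DEGREES = {"I": 0, "IV": 5, "V": 7}   # semitone offsets from the root
    NOTES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

    def twelve_bar_blues(key="A", quick_change=True):
        pattern = ["I", "IV" if quick_change else "I", "I", "I",
                   "IV", "IV", "I", "I",
                   "V", "IV", "I", "V"]
        root = NOTES.index(key)
        return [NOTES[(root + DEGREES[d]) % 12] + "7" for d in pattern]

    plan = twelve_bar_blues("A", quick_change=True)
    print(plan)       # ['A7', 'D7', 'A7', 'A7', 'D7', 'D7', 'A7', 'A7', 'E7', 'D7', 'A7', 'E7']
    print(plan[2:4])  # ['A7', 'A7'] -- the bars a model would be asked to fill in (bars 3-4)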


Yes, and surprisingly so. I never would have guessed we'd have AI stock photographs before AI muzak.


The AI seems to understand 4/4 time but doesn't understand groupings of 4 measures into phrases. It definitely doesn't understand ABABACA or even the basic parts of a song.

It is the musical equivalent of a meandering paragraph.


Absolutely. And all AI for music I have seen suffers that problem.

It makes me wonder whether the music generation should be stratified -- a coarse model lays out where parts like verse and chorus are, what distinguishes them, how to transition, etc., and then a finer-grained model fills in the details.


Also, the music is not bluegrass as much as it is old-time, a confusion that continually irritates old-time players.


You are right it feels off.

The position of the guitar in stereo is all over the place, higher frequency elements appear to come from the left while other parts are more centered.


Same for the death metal


I would love to use it for background music when I am working. I have specific tastes that depend on the task, mood, energy level, and ambiance.


If you're not attuned to Cryo Chamber (label), check them out. Maybe not fitting all use-cases, but a strong and deep catalogue.


Seems nobody has really cracked vocals with songs...


I wonder if it makes sense to generate a combo of instruments rather than individual voices and then combine those with an arranger DNN. I would think it'd be much easier to capture each instrument's transients and dynamics that way, much less allow more subtlety in how they combine, like allowing the lead voice to shift among instruments, or even let the listener choose how each voice expresses stylistically and how they should combine.

Trying to do all of that in a single DNN, much less parameterize it useably seems overly ambitious (or will be of more limited value ultimately).


i want something that can take in a song and transform it into a different genre


it's funny how they're all very impressive except the death metal


Humans take a long time to get good at art; in the meantime they still have to eat.

So they compete with generative AI for a fixed number of jobs. The AI is cheaper and faster. Humans stop training to become artists.

Without new training data, the generative AI models stagnate. Progress in art stops globally, forever.

But for a brief glorious moment, we were able to say "huh, that's not bad".


This is by design - the capital that backs modern art isn’t doing it for love of the art but for money.

For fine art, it’s a way for them to launder money and keep it out of bank accounts where it can be seized trivially.

For mass art, it’s about selling to enough rubes to make a profit.

Neither are impacted by a stagnation in art. If anything, they’re aided by it - suddenly the art you bought to launder money retains its value because it’s no longer the flavor of the week with the arts crowd.


This is crazy tech!



