Riffusion – Stable Diffusion fine-tuned to generate music (riffusion.com)
2421 points by MitPitt on Dec 15, 2022 | 465 comments



Other author here! This got posted a little earlier than we intended, so we didn't have our GPUs scaled up yet. Please hang on and try throughout the day!

Meanwhile, please read our about page http://riffusion.com/about

It’s all open source and the code lives at https://github.com/hmartiro/riffusion-app --> if you have a GPU you can run it yourself

This has been our hobby project for the past few months. Seeing the incredible results of stable diffusion, we were curious if we could fine tune the model to output spectrograms and then convert to audio clips. The answer to that was a resounding yes, and we became addicted to generating music from text prompts. There are existing works for generating audio or MIDI from text, but none as simple or general as fine tuning the image-based model. Taking it a step further, we made an interactive experience for generating looping audio from text prompts in real time. To do this we built a web app where you type in prompts like a jukebox, and audio clips are generated on the fly. To make the audio loop and transition smoothly, we implemented a pipeline that does img2img conditioning combined with latent space interpolation.
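
For anyone curious about the interpolation piece, here is a minimal sketch (not our exact pipeline code) of spherical interpolation (slerp) between two latents, which is the basic ingredient for smooth transitions:

    import torch

    def slerp(t, a, b):
        # Spherical interpolation between latent tensors a and b at fraction t in [0, 1].
        a_n, b_n = a.flatten(), b.flatten()
        omega = torch.acos(torch.clamp(torch.dot(a_n / a_n.norm(), b_n / b_n.norm()), -1.0, 1.0))
        if omega.abs() < 1e-6:                      # nearly parallel: plain lerp is fine
            return (1.0 - t) * a + t * b
        so = torch.sin(omega)
        return (torch.sin((1.0 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b

    # e.g. ten intermediate noise latents between two seeds (SD latents are 4x64x64 at 512x512)
    a, b = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
    frames = [slerp(i / 9, a, b) for i in range(10)]  # each one gets denoised into a spectrogram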


Wow, I am blown away. Some of these clips are really good! I love the Arabic Gospel one. John and George would have loved this so much. And the fact that you can make things that sound good by going through visual space feels to me like the discovery of a Deep Truth, one that goes beyond even the Fourier transform because it somehow connects the aesthetics of the two domains.


I can simultaneously burst a bubble and provide fuel for more -- the alignment of the intrinsic manifolds of different domains has been an interesting topic in zero shot research for a few years. I remember seeing at CVPR 2018 the first zero shot...classifier, I think? Which, if I recall correctly, trained on two domains that ended up basically aligned with each other automatically, enough to provide very good zero shot accuracy.

Calling it a Deep Truth might be a bit of an emotional marketing spin but the concept is very exciting nonetheless I believe.


It is a Deep Truth in that the universe is predictable and can be represented (at least the parts we interact with) mathematically. Matrix algebra is a helluva drug. I could imagine someone developing the ability to listen to spectrograms by looking at them.


There is a whole piece in Gödel, Escher, Bach where they look at vinyl records, since all the sound data is in there.


On deep truths: a review of the concept of harmony in the universe: https://www.sciencedirect.com/science/article/pii/S240587262...


I can't listen to them, but I can certainly point out different instruments, background noise sources and the like, and get an idea of the tone of a piece. This is easy. The hard part is distilling texture, timbre etc. of each sound.


Well it's no surprise that it kinda sorta works. Neural networks are very good at learning the underlying structure of things and working with suboptimally represented inputs. But if working with images of spectrograms works better than just samples in time domain, that is a valid and non-obvious finding.


My characterization of it as a Deep Truth might just be a reflection of my ignorance of the current state of the art in AI. But it's still pretty frickin' cool nonetheless.


Alright so this is a pretty amazing new development. I want to tell you something about what the state of the art is in AI. When you wrote that it is a deep truth it was before I actually listened to the pieces. I had just read the descriptions. At the time, I thought that you were probably right because I was thinking that music is only pleasing because of the structure of our brains; it's not like vision where originally we are interpreting the world and that's where art comes from. Music is purely sort of abstract or artistic. However, after I listened to the pieces, I realised that they really sound exactly like the instruments that are making the physical noises. For example it really sounds exactly like a physical piano. So I don't know about a deep truth, but it does seem that there is a physical sense that the music represents which it can successfully mimic using this essentially image generating capability.

One thing about all of these amazing AI developments is that I still make some long comments by dictating to Google. When it first got to the point that it was able to catch almost everything that I was saying I was absolutely blown away. However, it's really not that good at taking dictation, and I have to go back and replace each and every individual comma and period with the corresponding punctuation mark. Seeing such amazing developments happening month after month, year after year, it makes me feel like we are really approaching what some people have called the singularity. When I read about net positive fusion being announced my first instinct was to think: oh of course, now that ChatGPT is available, of course announcing a major fusion breakthrough would happen within days to weeks; it just makes perfect sense that AIs can solve problems that have confounded scientists for decades. To see just how far we still have to go, take a look at how this comment read before I manually corrected it to what I had actually said.

-- [I copied and pasted the below to the above and then corrected it. Below is the original version. This is how I dictate to Google sometimes, on Android. Normally I would have further edited the above but in this case I wanted to show how far basic things like dictation still have to go. By the way I dictated in a completely quiet room. I can't wait for more advanced AI like ChatGPT to take my dictation.]

Alright so this is a pretty amazing our new development period I want to tell you something about out why the state of the heart is is in a i period when you wrote that it is a deep truth it was before I actually listen to The Pieces, I have just read the descriptions period at the time, I thought that you were probably right because I was thinking that music is only pleasing because of the structure of our brains it's not like vision where originally we are interpreting the world and that Where Art comes from music is purely so dove abstract or artistic period however, after I listen to the pieces, I realise that they really sound exactly like the instruments that are making the physical noises period for example it really sounds exactly like a physical piano period so I don't know about out a deep truth karma but it does seem that there is a physical sense that the music are represents which it can successfully mimic using this essentially image generating capability period one thing about all of these amazing AI development, is that I still make some long comments by dictating to Google. When it first got to the point that it was able to catch almost everything then was saying I was absolutely blown away period however, it's really not that good at taking dictation karma and I have to go back and replace each and every individual, and period with with the corresponding punctuation mark period seeing such an amazing developments happening month after month year after year ear makes me feel like we are really approaching what some people have called the singularity period when I read about out net positive fusion being announced my first Instinct was to think oh of course it's now that that chat GPT is available of course announcing a major fusion breakthrough would happen within in days to weeks it just makes perfect sense DJ eyes can solve problems that have have confounded scientists for decades period to see just how far we still have to go take a look at how this comment red before I manually corrected it to what I had actually set


As one of the meatsacks whose job you're about to kill... eh, I got nothin, it's damn impressive. It's gonna hit electronic music like a nuclear bomb, I'd wager.


As a listener, I think you're probably still safe. Can you use this to help you though? Maybe.

It's impressive what it produces, but I think it probably lacks substance in the same way the visual AI art stuff does. For the most part, it passes what I call the at-a-glanceness test. It's little better than apophenia (the same thing that makes you see shapes in clouds, faces in rocks, or think you've recognised a familiar word in a foreign language; the last one can happen more often though).

So, I think these tools will be used to do background work (ie for visuals maybe help with background tasks in CGI or faraway textures in games). I know less about audio, but I assume it could maybe help a DJ create a transition between two segments they want to combine, as opposed to making the whole composition for them, but idk if that example makes sense.

Now, onto a more human point: I think that people often listen to music because it means something to them. Similar for people who appreciate visual art.

I also love interactive and light art, and I love talking to other artists at light festivals who make them because of the stories and journeys behind their art too. Humans and art are a package deal, IMO.

Edit: typos and to add: Also, I think prompt authorship is an art unto itself. I'm amazed what people can craft with it, but I'm more impressed by the craft itself than the outputs. Don't get me wrong, the outputs are darn cool, but not if you look closer. And it's impossible to look beneath the surface altogether, as there is nothing in the output but the pixels.


I think this type of generative stuff opens up entirely new possibilities. For the longest time I've wanted to host a rowing or treadmill competition, where contestants submit a music track. The tracks are mashed up with weighting based on who is in the lead and by how much.

I don't know of existing tech that can generate actual good mashups in realtime given arbitrary mp3s, but this has promise!


It's not too hard these days with open source BPM detection and stem separation libraries: https://github.com/deezer/spleeter
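
For example, a quick sketch assuming Spleeter and librosa are installed (filenames are placeholders):

    import librosa
    from spleeter.separator import Separator

    # Stem separation into vocals / drums / bass / other
    separator = Separator("spleeter:4stems")
    separator.separate_to_file("track.mp3", "stems/")   # writes one wav per stem

    # BPM detection
    y, sr = librosa.load("track.mp3")
    tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
    print(f"estimated tempo: {float(tempo):.1f} BPM")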


no, because is a function ("AI") that generates an image of a spectogram given text.

neither a set of MP3 nor a set of spectrograms from MP3s supplies the function arguments

or a connection to a path that uses that function


It says all StableDiffusion capabilities work, so you can prompt it with an image (either "img2img" or "textual inversion"). Their UI just doesn't expose it.
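
With the diffusers library an img2img call looks roughly like this (a sketch only; the base SD 1.5 checkpoint and the filenames are stand-ins, not the Riffusion setup):

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Condition on an existing spectrogram image and nudge it toward a new prompt
    init = Image.open("seed_spectrogram.png").convert("RGB").resize((512, 512))
    out = pipe(prompt="jazzy saxophone solo", image=init, strength=0.6).images[0]
    out.save("new_spectrogram.png")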


In general all this stuff is chopping the bottom off the market. AI art, code, writing, music, etc. can all generate passable "filler" content, which will decimate all human employment generating same.

I don't think this stuff is a threat to genuinely innovative, thoughtful, meaningful work, but that's the top of the market.

That being said the bottom of the market is how a lot of artists make their living, so this is going to deeply impact all forms of art as a profession. It might soon impact programming too because while GPT-type systems can't do advanced high level reasoning they will chop the bottom off the market and create a glut of employees that will drive wages down across the board.

Basic income or revolution. That's going to be our choice.


Basic income or revolution. That's going to be our choice.

Evolution.

We have such vast wealth and our historic methods for trying to make sure most people are taken care of are failing us. Those methods were rooted in the nuclear family with a head of household earning most of the money and jobs designed with an assumption that he had a full-time homemaker wife buying the groceries, cooking the meals etc so he could focus on his job.

We need jobs to evolve. In the US at least, we need to move away from tying all benefits (such as medical benefits and retirement) to a primary earner. We need to make it possible to live a comfortable life without a vehicle. We need to make it possible for small households to find small homes that make sense for them, both financially and in terms of lifestyle.

There is a lot we can do to make this not a disaster and make it possible for some people to survive on very little while pursuing their bliss so that we stop this trend of pitting The Haves against The Have Nots and make the current Have Nots a group that has real hope of creating their own brilliant tech or such someday while not being utterly miserable if they aren't currently wealthy.


Those making the decisions can very well just say "WE and the 10-20% still needed just need to live comfortably, and the other 80% can live in slums on the edge of town".


That sounds like the "revolution" option.


Sadly, if we look at human history, it usually resolves to that.


But not always. And education and communication are some of the forces that can help avoid that.

Knowledge is power.


education is controlled by government and communication is controlled by corporations


> Basic income or revolution. That's going to be our choice.

So many menial jobs are kind of like basic income anyway - you put in 2 hours of actual work to pad out the entire day at some shitty low end job, knowing all the time that your contribution isn't valued and that if your employer ever got their shit together your job wouldn't even be needed, and the robots are coming for it anyway. You get paid a small amount for doing nothing much useful.

The rich today are rich largely because they or their ancestors were plunderers. Perhaps they plundered the planet, exploiting the cheap energy that fossil fuels provide. Perhaps they plundered our social cohesion building skinner boxes that manipulate the minds of millions just to gain eyeballs and clicks.

Why should the bill for these past excesses fall on those who never benefited from them? In previous times, a young person of average intellect could get a job on a farm or factory and be a valued contributor. What happens when automation removes the last of these jobs - do we really expect people to put up with more and more menial and slavish existences?

Basic income is, like carbon taxes, an obvious solution. Maybe it will take off when a tipping point arrives - when the rich class decides that their repugnance to giving someone a "free ride" is overtaken by their need to have the masses dulled and stupefied, sitting at home with blinds drawn in front of their playstations, so they don't revolt at the obvious unfairness of the world.


Paying people to stay home and play on social media led to the mass explosion of conspiracy theories at the beginning of covid. Ultimately, unemployment and covid checks go a long way toward explaining the January 6th insurrection. Which is just to say that giving a free ride and narcotic forms of entertainment to the masses isn't necessarily the safety valve for "the rich" that it's made out to be.

Work gives people dignity. And idle hands are the devil's plaything. Put that together and UBI would be a disaster. Also, it's not "the rich class" who decides whether or not to bestow such a lifestyle on the masses... that in itself is a conspiratorial line of thought. Right down that road is the thought "hey, this UBI isn't enough!"

Jobs disappear. Other jobs replace them. Often, jobs are not fun, and often they feel meaningless, but working is still much more dignified than not working. Raising generations who've never worked and simply take their UBI and breed - what would even be the point of educating such people? Eventually they'd just be totally disposable and, no doubt, be disposed of.


Plunderers; well put. Capitalists can't lie their way to infinite growth forecasts and suck all the wealth into 401ks that do nothing but rob everyone else's grandchildren. It's a cycle that has been going on since existence itself, an ebb and flow that accelerates, crashes, and takes off again, leaving in its wake humanity as we know it.


Chopping the bottom off makes things higher up the ladder more accessible though. The original Zelda took six people multiple years to build, but one person could develop something similar but much better looking in a few weeks with Unity and AI generated assets. It obviously won't be a AAA title, but people have shown that they're happy to play slightly rough, retro games if they're fun. All this holds true for writing, music, art and other areas as well.

The big problem is that it's hard to filter through the huge amount of content being produced by humans to find things that you'll like, so we rely on kingmakers curating the culture. This means a few huge winners taking all and a lot of great creative work at the same level going unrewarded. If we can solve the content discovery problem in a more personalized and fair way and make it easier for people to support creators they like that would go a long way towards cushioning the job losses that AI will create.


> Basic income or revolution. That's going to be our choice.

Third option. Mandatory 4 day weeks.

Although I'd specify it as no one can work more than X hours a week.

And then adjust X down - or up in short timescales but likely down overall - as needed.

The competition is for "work". If AI is taking large chunks of "work" off the table. Spread the rest of it around.

Now notionally people will tell you that there is no finite "work" limit. You are effectively limiting competition.

To which I say - good. The rat race IS the competition. Don't we all want to slow it down a little? If F1 can put limits on a race, we should too for humanity.

Work smarter, not harder.


The only thing that affects whether you have a job is the Federal Reserve, not how good productivity tools are. You always have comparative advantage vs an AI, so you always have the qualifications for an entry level job.

There will never be a revolution and there's no such thing as late capitalism. Well, not if the Fed does their job.


I see a lot of AI naysayers neglecting the comparative advantage part.

If AI completely eliminates low skill art labour from the job pool, it's not like those affected by it are gonna disintegrate, riot, and restructure society. They have the choice of filling an art niche an AI can't or they can spend that time learning other, more in-demand skills. This also ignores the fact that some companies would rather reallocate you to more profitable projects even if your art skills don't change.

Selling a product with relative value like a painting or a sculpture will always be an uphill battle. Now that there's more competition from AI, it just gives artists/businesses incentive to find what people want that an AI can't deliver. Worst case scenario, employment rates in this sector are rough while the market recalibrates. Interested to see how these technologies develop.


That seems a bit like wishful thinking.

People don't have unlimited ability to learn new skills. Training takes time, and someone who spent several years honing their craft won't be able to pick up a new skill overnight.

On top of that, people have preferences regarding their work – even if someone has the ability to do different work, they might find it less meaningful and less satisfying.

Finally, don't ignore the speed at which AI capabilities improve. Compare GPT-1 with the current model, and how quickly we got here. Eventually we'll get to a point where humans just won't be able to catch up quickly enough.


Agree 100%. When I was young and idealistic I believed in the "learn new skills" mantra, but learning completely new skills would look a lot different at 50 than it did at 20. When career choices were being made 30 years ago it would have been hard to predict the current & upcoming AI-driven destruction of lower-end "thinking" jobs. Attempting to retrain after ~30 years puts you at a massive disadvantage vs a new graduate (I mentor some of our company's graduates & trainees & I've been assigned a guy in his mid-40s, after a few months I just don't see how he'll get to a point where he's adding value). Not really a personal whinge, as my skillset isn't under immediate threat from any AI I've heard about, though the rate of change in the field is something to behold.


I agree that intentional retraining doesn’t really work, but I don’t think it matters. As I said, all that matters for whether you have a job is the Federal Reserve. If you hire random people to do a computer job, some of them will just turn out to magically learn everything on the job.


I think specifically in the area of creative "products" such as art and music you have to think about the customer as well. I have zero interest in AI-created art or music. None. The value of art is its humanity; its expression of the artist's message, vision, and passion. AI doesn't have that, so it's not of any interest to me.

I don't know how many customers feel the same way, but I won't be purchasing any AI art or music or knowingly giving it any of my attention.


The AI is a tool the human used to make it. Sometimes clumsily, but sometimes they write poems as text prompts and it's an illustration, or things like that. If an AI is making and selling art by itself, it's probably become sentient, and not patronizing it would be speciesism.

Although personally, I think using "AI art" to create impossible photographs is more interesting and doesn't compete with illustrators as much.


I think it's an interesting perspective but I will be very surprised if it's one that is common when this becomes more of a real choice. If there's two mp3s and one of them is more enjoyable to listen to, very few people will stick with the song they enjoy less because it's not AI generated.

Maybe a parallel would be furniture; there are people who buy hand crafted furniture but it's kind of a luxury. Most people just have Ikea and wouldn't pay more for the same (or have less good furniture) just to get some artisanal dinner table chair.


How would you even know? Vast majority of art you don't purchase directly, but as a part of some product. At most you get a line in the credits and what's stopping anyone from inventing a pseudonym for AI.


> I have zero interest in AI-created art or music.

I'm afraid in the near future we will all be bombarded with AI-created music, art and text whether we want it or not.


The top of the market started at the bottom. Entry level is requiring higher and higher skills and capabilities.


> basic income or revolution

I’ve been trying to play through the scenario in my head. At least in terms of software developers being replaced by AI, I think we’re going to first see AI doing work in parallel with or monitored by humans. Basically, Google will take AI and send it off to do work that they lack the staff to do. Now, on the other hand, they could also play it out over time, where first they feign an inability to staff people due to finances so there are layoffs/terminations, and then maybe a quarter later they replace those people with low cost AI compute time that is orders of magnitude more productive.

In any case, AI disrupting people’s ability to feed, shelter, and clothe themselves is sure to trigger a pretty brutal and hostile response, which would be grounds for legislation and perhaps a class war.

The weird part is that if the potential of AI is truly orders of magnitude expansion beyond what we already have, then the longterm surely has room for a tiny little mankind fief. But, in order to get to the long term our hyper-competitive technocratic overlords may strangle out part of or all of the rest of us while justifying accelerating through the near-term window to achieve AI-dominance.


If the only people who can have meaningful good paying jobs are thoughtful geniuses we're in a lot of trouble as a society still.


> Basic income or revolution. That's going to be our choice.

I fear you are right. But neither of those is going to be an easy transition, if only because the effects of all this innovation are felt disproportionately by people in countries where such a revolution will not do anything to give solace.

Basic income assumes that the funds to do this are available and revolution assumes that the powers that be are the parties that are in the way of a more equitable division of the spoils. Neither of those are necessarily true for all locations affected.


>Basic income or revolution. That's going to be our choice

Basic income sounds good in theory in some imaginary futuristic society of harmony and grace.

In real life, it's a way for the masses to be controlled down to their very subsistence by the state. Where the state is basically an intermediary for big private interests and lobbies.


It gets better very quickly and we have no idea where its limitations are. In other words, we have no idea when the development will slow down significantly, or how much of the bottom it will have chopped off by then - whether it's 10% or maybe 100%.

> Basic income or revolution. That's going to be our choice.

I'm definitely pro basic income, but I heard an interesting remark a few weeks ago. And that's that COVID was kind of a UBI experiment (in the US), albeit very limited, and it turned out that if people don't have to worry about making a living and don't have a job to work in, then they'll start doing stupid things on the internet. Like make up stupid conspiracy theories about vaccines. I can't remember who said this, it was one of the guests on Lex Fridman's podcast. I'm also not sure if it's a valid analogy but reminds me of Vonnegut's Player Piano.


As a musician and listener I'm inclined to agree. There were a couple cool examples I bumped into, but some prompts generate results that don't represent any single word or combination of words that were presented to the AI.

What this means for the future is maybe a little more unsettling however.


I fully agree with what you wrote. This AI-generated music, while a great achievement, still sounds soulless. It's one thing to look at AI-generated pictures for a few seconds, but listening to this music with its gibberish "lyrics" for minutes really creeps me out - it's the "uncanny valley" all over again, I guess.

Regarding "can you use this to help you though?" - yeah, you could probably use it as a source of inspiration, but at the risk of getting sued by someone whose music you didn't even know you were copying...


Yea, it's uncanny valley, sure. For now.

With Stable Diffusion and similar generative systems we have seen a leap in generative art/media, with significant improvements arriving within just a few months. What makes you think this was the last or only leap in the next 5 to 10 years? As if progress would just stop here? Huh?!

Do you think we hit a ceiling where progress is only tangential? A line which is impossible to cross? Otherwise I don't get this mindset in the face of these modern generative AI systems popping up left and right.


I think this is a good point. To make this useful for music creators, and to make music creation more generally accessible, the output needs to be more useful. We are working on that at https://neptunely.com


Potentially it could be used as temporary atmospheric music for pre-viz video shots


These are tools. Don't think of them as replacements, they aren't. But as tools that will help us be creative. As smart as these apps seem, they will still need a human to decide where and how to use them. They won't replace us but we need to adapt to a new reality.


I hear this a lot (in relation to various jobs) and I still don't get it. Yes, it is a tool. Yes, if it can, it will replace humans. That's the whole point.

For some reason people tend to think that these tools/AI/ML systems will never be good enough to do their job (or a specific job). This argument can take different forms, sometimes stating that it will just do the boring part of the work (e.g. with programming) or that it will still need human creativity (maybe, but not necessarily and that's not the point) or that it will just replace low level, unskilled or mediocre professionals. And somehow everyone thinks they are not mediocre (i.e. average). But even these assumptions are unfounded. Why would anyone think that these systems will top out below their skill levels? Why would anyone think that they can't become superhuman?

They did in chess, go, I think poker too. Not to mention protein folding. And without much of a hitch between mediocre/good enough and superhuman. Because that difference is just interesting for us, but doesn't necessarily mean that there is a huge step, that the system needs to undergo serious development and that it would take a long time. (Like decades or so.) People thought that was the case when AlphaGo beat Fan Hui, saying that Lee Sedol was a completely different level. Which, of course, he is. Still, it just took DeepMind half a year to improve AlphaGo to that level.

So yeah, you can be pretty sure that if this track (no pun intended), this solution, is good enough, then it will quickly evolve into something that will replace some music creators.


>As smart as these apps seem, they will still need a human to decide where and how to use them. They won't replace us

Well, they will, if the AI plus 1 human deciding "where and how to use them" can replace producers and musicians playing...


It already is a replacement. You can make a visual novel video game with AI generated character art, backgrounds, music, run your dialogs through AI if you can't write well yourself - and your game will have higher production values than 90% of the competition. All those artists you would normally hire or commission the above stuff from are now out of the process if you want. Sure, it's not a particularly high bar, but it's only going to rise from here.


Kurt Vonnegut and Player Piano has a message for you.


These will be full replacements in no time, give or take 10 years.


10 years‽ StableDiffusion was released August 22nd of this year!


I love simple generative approaches to get ideas, and go from there. This seems like an extension of that (well, it's what I'm going to try - sample the output, make stems, pull MIDI etc). Will make the creative process more interesting for me, not less.

Having said that, it's not my job, and I can see where the issues lie there.


I can't think of a genre that would embrace it faster. The pay-for-knock-off rap beat market will feel more pressure from this kind of tool, especially as loop-oriented as it already is.


Why do you think this will kill your job? To me this looks like an extension of the hip-hop genre.


I am an active musician, but I don't actually make money at it, I was mostly joking, but: I believe that we are (one determined smart person + six months) away from bots on Youtube and other streaming platforms that generate endless "new" music that follows those simple formulas (beat, bass, sample = loop, several loops connected up = song) 24/7.

Raves that have no human DJs and never stop.

Good? Bad? Not for me to say, really, the most I make at it is a couple hundred bucks for two nights in a bar playing FM radio hits, and there's lots of people younger than me who like that music, so obviously I'm doing it for different reasons and I don't anticipate losing access to as many bar gigs as I want for the rest of my life.

But certain genres are very tolerant of low-effort music, and I think the people who are monetizing low-effort music are gonna lose their income streams. I do different things than those people, but I still consider them compatriots, even if I don't care for their art.


Isn't this just a sampler with extra steps?


All the AI music I’ve heard so far has a really unpleasant resonant quality to it. Why is that? Can it be removed?


I've done some work on AI audio synthesis and the artifacts you're hearing in these clips are coming from the algorithm that is used to go from the synthesized spectrogram to the audio (the Griffin-Lim algorithm).

Audio spectrograms have two components: the magnitude and the phase. Most of the information and structure is in the magnitude spectrogram so neural nets generally only synthesize that. If you were to look at a phase spectrogram it looks completely random and neural nets have a very, very difficult time learning how to generate good phases.

When you go from a spectrogram to audio you need both the magnitudes and phases, but if the neural net only generates the magnitudes you have a problem. This is where the Griffin-Lim algorithm comes in. It tries to find a set of phases that works with the magnitudes so that you can generate the audio. It generally works pretty well, but tends to produce that sort of resonant artifact that you're noticing, especially when the magnitude spectrogram is synthesized (and therefore doesn't necessarily have a consistent set of phases).

There are other ways of using neural nets to synthesize the audio directly (Wavenet being the earliest big success), but they tend to be much more expensive than Griffin-Lim. Raw audio data is hard for neural nets to work with because the context size is so large.
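
To make that concrete, here is a minimal librosa sketch (not the Riffusion code): discard the phases of a real clip and let Griffin-Lim estimate a compatible set.

    import numpy as np
    import librosa
    import soundfile as sf

    y, sr = librosa.load(librosa.ex("trumpet"))                 # any mono clip
    S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))     # magnitude only, phase discarded

    # Griffin-Lim alternates between enforcing the known magnitudes and STFT consistency,
    # converging on a set of phases that "works with" those magnitudes.
    y_rec = librosa.griffinlim(S, n_iter=60, hop_length=256, n_fft=1024)
    sf.write("reconstructed.wav", y_rec, sr)                    # compare against the original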


Phase is critical for pitch. Here is why. The spectral transformation breaks up the signal into frequency bins. The frequency bins are not accurate enough to convey pitch properly. When a periodic signal is put through an FFT, it will land in a particular frequency bin. Say that the frequency of the signal is right in the middle of that bin. If you vary its pitch a little bit, it will still land in the same bin. Knowing the amplitude of the bin doesn't give you the exact pitch. The phase information will not give it to you either. However, between successive FFT samples, the phase will rotate. The more off-center the frequency is, the more the phase rotates. If the signal is dead center, then each successive FFT frame will show the same phase. When it is off center, the waveform shifts relative to the window, and so the phase changes for every sample. From the rotating phase, you can determine the pitch of that signal with great accuracy.


Yes, this is exactly right and is why Griffin-Lim generated audio often has a sort of warbly quality. If you use a large FFT you can mitigate the issues with pitch because the frequency resolution in your spectrogram is higher, so the phase isn't so critical to getting the right pitch. But the trade-off of a bigger FFT is that the pitches now have to be stationary for longer.

The other place where phase is critical is in impulse sounds like drum beats. A short impulse is essentially just energy over a broad range of frequencies, but the phases have been chosen such that all the frequencies cancel each other out everywhere except for one short duration where they all add constructively. Without the right phases, these kinds of sounds get smeared out in time and sound sort of flat and muffled. The typing example on their demo page is actually a good example of this.


So what is phase? From dabbling with waveforms in audio editors, sampling, and later learning a little bit about complex numbers, phase seems eventually equivalent to what would sound like changing pitch, modulating the frequency of a periodic signal.

The simplest demonstration of it is the doppler shift. But it's not at all that simple because moving relative to the source the sound pressure and thus the perceived loudness also change, distorting the wave form, thereby introducing resonant frequencies. Now imagine that the transducer is always moving, eg. a plucked string.

The ideal harmonic pendulum swings periodically, only losing amplitude to attenuation. But the resonant transducer picks up reflections of its own signal, like coupled pendulums, which are intractable according to the three body problem.

On top of that, our hearing is fine tuned to voices and qualities of noise.


Phase is the offset in time. The functions sin(θ) and sin(θ + c), for arbitrary real c, represent the same frequency signal; they are offset from each other horizontally by c, and that c is a phase difference. It has an interpretation as an angle, when the full cycle of the wave is regarded as degrees around a circle; and that's what I mean by rotating phase.

When you take a window of samples of a signal, and run the FFT on it, for every frequency bin, the calculation determines what is the amplitude and phase of the signal. If you have a frequency bin whose center is 200 Hz, and there is a 200 Hz signal, then what you get for that frequency bin is a complex number. The complex number's magnitude ("modulus") is the amplitude of that signal, and its angle ("argument") is the phase.

If the signal is exactly 200 Hz, and if the successive FFT windows move by a multiple of 1/200th of a second, then the phase will be the same in successive FFT windows.

But suppose that the signal is actually 201 Hz: a little faster. Then with each successive FFT window, the phase will not line up any more with the previous window; it will advance a little bit. We will see a rotating complex value: same modulus, but the angle advancing.

From how fast the angle advances relative to the time step between FFT windows, we can deduce that we are capturing a 201 Hz signal in that bin (on the hypothesis that we have a pure, periodic signal in there).
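
A toy numpy version of that 200 Hz vs 201 Hz example (made-up sample rate and window size, rectangular window):

    import numpy as np

    sr = 48_000       # sample rate
    N = 240           # window length -> bin spacing sr/N = 200 Hz, so bin 1 is centered at 200 Hz
    hop = 240         # hop of exactly 1/200 s: a pure 200 Hz tone keeps the same phase each window

    f_true = 201.0    # slightly off-center signal
    t = np.arange(N + hop) / sr
    x = np.sin(2 * np.pi * f_true * t)

    def bin1_phase(start):
        # phase of the 200 Hz bin for a window starting at sample `start`
        return np.angle(np.fft.rfft(x[start:start + N])[1])

    dphi = bin1_phase(hop) - bin1_phase(0)
    dphi = (dphi + np.pi) % (2 * np.pi) - np.pi     # wrap to (-pi, pi]

    f_est = 200.0 + dphi / (2 * np.pi * hop / sr)   # bin center + phase-rotation correction
    print(f_est)                                    # ~201.0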

How is the phase determined in the frequency bin? It's basically a vector correlation: a dot product. The samples are a vector which is dot-producted with a complex unit vector. The complex unit vector in the 200 Hz bin is essentially a 200 Hz sine and cosine wave, rolled into a single vector with the help of complex numbers. Sine and cosine are 90 degrees apart in phase, so they form a rectilinear basis (coordinate system). The calculation projects the signal, expressing it as a sum of the sine and cosine vectors. How much of one versus the other is the phase. A signal that is 100% correlated with the sine will have a phase angle of 0 degrees or possibly 180. If it correlates with the cosine component, it will be 90 or 270. Or some mixture thereof.

Because a complex number is two real numbers rolled into one, it simplifies the calculation: instead of doing a dot product with a sine and cosine vector to separately correlate the signal to the two coordinate bases, the complex numbers do it in one dot product operation. When we go around the unit circle, each position on the circle is cos(θ) + isin(θ). These complex values give us samples of both functions. Exactly such values are stuffed into the rows of the DFT matrix: complex values from the unit circle divided into equal divisions.

If you look here at the definition of the ω (omega) parameter:

https://en.wikipedia.org/wiki/DFT_matrix

It is the N-th complex root of unity. But what that really means is that it is a 1/Nth step of the way around the unit circle. For instance if N happened to be 360, then ω is the complex number whose |ω| = 1 (unit vector), and whose argument is 1 degree: one degree around the circle. The second row of the DFT matrix has 1, ω, ω², ω³, ... the second row represents the lowest frequency (after zero, which is the first row). It captures a single cycle of a sine and cosine waveform, in N samples. The values in that row step around the unit circle in the smallest increment, so they go around the circle exactly once. The subsequent rows go around the circle in skipped steps, yielding higher frequencies: 1, ω², ω⁴ for twice around the circle; 1, ω³, ω⁶ for three times, ... we get all the harmonics up to our N resolution.


> on the hypothesis that we have a pure, periodic signal in there

That pure sine wouldn't generate any artefacts. It would result in a 200Hz output from the AI if it throws the phase information out. You wouldn't hear a difference unless its an (aptly so called) complex signal. Eg. 200 and 201 Hz layered is an impure signal with a period below 1Hz, far outside the scope. Eventually the signals will cancel out completely. [1]

The important point is, I think, that FFT doesn't simply look at the offset aka phase. Rather, 201 Hz looks like a 200 Hz that is moving. So it encodes phase-shift in the delta of the offset between two windows. For a sum of 200 and 201 Hz it has to assume that the magnitude is also changing, which I find entirely counterintuitive.

From the mathematical perspective, this seems like boring homework, far detached from acoustics. So, I don't know. The funny thing is that rotation is very real in the movement of strings. If the orbit in one point is elliptic, that's like two sinusoids at different magnitudes offset by some 90 degrees, in a simplified model. But it has nearly infinite coupled points along its axis. As they excite each other, and each point has a different distance to the receiver, that's where phase shift happens.

> If you look here at the definition of the ω (omega) parameter

I wasn't going to make drone, but I will take a look.

1: https://graphtoy.com/?f1(x,t)=100*sin(x)&v1=true&f2(x,t)=100...


I wonder if this could be improved by using the Hartley transform instead of the Fourier transform.


Considering Stable Diffusion generates 3-channel (RGB) images, maybe it would be possible to train it on amplitude and phase data as two different channels?


People have tried that, but the model essentially learns to discard the phase channel because it is too hard for it to learn any useful information from it.


Got any citations... that sounds like a fascinating thing to read about.


We took a look at encoding phase, but it is very chaotic and looks like Gaussian noise. The lack of spatial patterns is very hard for the model to generate. I think there are tons of promising avenues to improve quality though.


Phase itself looks random, but what makes the sound blurry is that the phase doesn't line up like it should across frequencies at transients. Maybe something the model could grab hold of better is phase discontinuity (deviation from the expected phase based on the previous slices) or relative phase between peaks, encoded as colour?

But the same thing could be done as a post-processing step, finding points where the spectrum is changing fast and resetting the phases to make a sharper transient.


That makes a lot of sense, I would be keen to see attempts at that.


I'm curious why, instead of using magnitude and phase, you wouldn't use real and imaginary parts?


There have been some attempts at doing this, some of which have been moderately successful. But fundamentally you still have the problem that from the NN's perspective, it's relatively easy for it to learn the magnitude but very hard for it to learn the phase. So it'll guess rough sizes for the real and imaginary parts, but it'll have a hard time learning the correct ratio between the two.

Models which operate directly on the time domain have generally had a lot more success than models that operate on spectrograms. But because time-domain models essentially have to learn their own filterbank, they end up being larger and more expensive to train.


I wonder if there might be room for a hybrid approach, with a time-domain model taking machine-generated spectrograms as input and turning them into sound. (Just a thought, no idea whether it actually makes sense.)


would it be an approach to use separate color channels for the freq amplitude and freq phase in the same picture? Maybe the network then has a better way of learning the relationships and there would be no need for the postprocessing to generate a phase.


RAVE attacks the phase issue by using a second step of training. I don't completely understand it, but it uses a GAN architecture to make the outputs of a VAE sound better.


Griffin-Lim is slow and is almost certainly not being used.

A neural vocoder such as Hifi-Gan [1] can convert spectra to audio - not just for voices. Spectral inversion works well for any audio domain signal. It's faster and produces much higher quality results.

[1] https://github.com/jik876/hifi-gan


If you check their about page they do say they're using Griffin-Lim.

It's definitely a useful approach as an early stage in a project since Griffin-Lim is so easy to implement. But I agree that these days there are other techniques that are as fast or faster and produce higher quality audio. They're just a lot more complicated to run than Griffin-Lim.


Author here: Indeed we are using Griffin-Lim. Would be exciting to swap it out with something faster and better though. In the real-time app we are running the conversion from spectrogram to audio on the GPU as well because it is a nontrivial part of the time it takes to generate a new audio clip. Any speed up there is helpful.
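
For reference, a minimal torchaudio sketch of Griffin-Lim running on the GPU (not our actual code, and the n_fft/hop settings are illustrative, not our parameters):

    import torch
    import torchaudio

    device = "cuda" if torch.cuda.is_available() else "cpu"

    griffin_lim = torchaudio.transforms.GriffinLim(
        n_fft=1024, hop_length=256, n_iter=32, power=1.0   # power=1.0: input is a plain magnitude spectrogram
    ).to(device)

    mag = torch.rand(1, 513, 512, device=device)   # stand-in magnitude spectrogram (freq x time)
    waveform = griffin_lim(mag)                    # (1, num_samples), stays on the GPU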


I think this is because the generation is done in the frequency domain. Phase retrieval is based on heuristics and not perfect, so it leads to this "compressed audio" feel. I think it should be improvable


The link is down now, so I don't know about this one. But most generated music is generated in the note domain, rather than the audio domain. Any unpleasant resonance would introduced in the audio synthesis step. And audio synthesis from note data is a very solved problem for any kind of timbre you can conceive of, and some you can't.


You're probably talking about the artifacts of converting a low resolution spectrogram to audio.


Can the spectrogram image be AI upscaled before transforming back to the time domain?


Yes it exists: https://ccrma.stanford.edu/~juhan/super_spec.html

But the issue is not that the spectrogram is low quality.

The issue is that the spectrogram only contains the amplitude information. You also need phase information for generating audio from the spectrogram


Interesting, can't you quantize and snap to a phase that makes sense to create the most musical resonance?


What happens if you run one of the spectrogram pictures through an upscaler for images like ESRGAN ?


It sounds kind of like the visual artifacts that are generated by resampling in two dimensions. Since the whole model is based on compressing image content, whatever it's doing DSP-wise is more-or-less "baked in", and a probable fix would lie in doing it in a less hacky way.


The first ever recordings had people shouting to get anything to register. They sounded like tin. Fast forward to today.

Looking back at image generation just a year or two ago and people would have said similar things.

Not hard to imagine the trajectory of synthesized audio taking a similar path.


Presumably for similar reasons that the vast majority of AI generated art and text is off-puttingly hideous or bland. For every stunning example that gets passed around the internet, thousands of others sucked. Generating art that is aesthetically pleasing to humans seems like the Mt. Everest of AI challenges to me.


I think your comment is off-topic to the post you are replying to. That wasn't asking about the general aesthetic quality - more about a specific audio artifact.

> For every stunning example that gets passed around the internet, thousands of others sucked.

From personal experience this is simply untrue. I don't want to debate it because you seem to have strong feelings about the topic.


Even if you remove the artifact, the exact same comment applies. It generates a somewhat less interesting version of elevator music. This is not to crap on what they did. As I said, the underlying problem is extremely difficult and nobody has managed to solve it.

I don't feel strongly about this topic at all.


> It generates a somewhat less interesting version of elevator music.

This iteration does, but that's an artifact of how it's being generated: small spectrograms that mutate without emotional direction (by which I mean we expect things like chord changes and intervals in melodies that we associate with emotional expressions - elevator music also stays in the neutral zone by design).

I expect with some further work, someone could add a layer on top of this that could translate emotional expressions into harmonic and melodic direction for the spectrogram generator. But maybe that would also require more training to get the spectrogram generator to reliably produce results that followed those directions?


The vast majority of human generated art is hideous or bland. Artists throw away bad ideas or sketches that didn’t work all the time. Plus you should see most of the stuff that gets pasted up on the walls at an average middle school.


Hard disagree. The average middle school picture will have certain aspects exaggerated, giving you insights into the mind's eye of the creator, how they see the world, what details they focus on. There is no such mind's eye behind AI art, so it's incredibly boring and mundane, no matter how good a filter you apply on top of its fundamental lack of soul or anything interesting to observe in the picture beyond surface level. It's great for making art assets for businesses to use, it's almost a perfect match, as they are looking to have no controversial soul to the assets they use, but lots of pretty bubblegum polish.


Perhaps most of the AI art out there (that honestly represents itself as such) is boring and mundane, but after many hours exploring latent space, I assure you that diffusion models can be wielded with creativity and vision.

Prompting is an art and a science in its own right, not to speak of all the ways these tools can be strung together.

In any case, everything is a remix.


I have to agree, the act of coming up with a prompt is one and the same with providing "insights into the mind's eye of the creator, how they see the world, what details they focus on" - two people will describe the same scene with completely different prompts.


And the vast majority of professionally produced artwork is for business use. It’s packaging design or illustration or corporate graphics or logos or whatever.

I don’t get the objection.


> For every stunning example that gets passed around the internet, thousands of others sucked

…implying there may be an art to AI art. Hmm.

Meanwhile, the degree to which it is off-puttingly hideous in general can be seen in the popularity of Midjourney — which is to observe millions of folks (of perhaps dubious aesthetic taste) find the results quite pleasing.


Not sure about this. Models like Midjourney seem to put out very consistently good images.


I've compiled/run a dozen different image to sound programs and none of them produce an acceptable sound. This bit of your code alone would be a great application by itself.

It'd be really cool if you could implement an MS paint style spectrum painting or image upload into the web app for more "manual" sound generation.


Amazing work! Did you use CLIP or something like that to train genre + mel-spectrogram? What datasets did you use?


I was very surprised this was not mentioned.


/u/threevox on reddit made a colab for playing with the checkpoint:

https://colab.research.google.com/drive/1FhH3HlN8Ps_Pr9OR6Qc...


Hi Hayk, I see that the inference code and the final model are open source. I am not expecting it, but is the training code and the dataset you used for fine-tuning, and process to generate the dataset open source?


"fine-tuned on images of spectrograms paired with text"

How many paired training images / text and what was the source of your training data? Just curious to know how much fine tuning was needed to get the results and what the breadth / scope of the images were in terms of original sources to train on to get sufficient musical diversity.


The audio sounds a bit lossy. Would it be possible to create high quality spectrograms from music, downsample them, and use that as training data for a spectrogram upscaler?

It might be the last step this AI needs to bring some extra clarity to the output.


This is amazing! This is a fantastic concept generator. The verisimilitude with specific composers and techniques is more than a little uncanny. A few thoughts after exploring today…

- My strongest suggestion is finding some strategy for smoothing over the sometimes harsh-sounding edge of the sample window
- Perhaps it could be filling in/passing over segments of what is sounded to user as a larger loop? Both giving it a larger window to articulate things but maybe also showcasing the interpolation more clearly…
- Tone control may seem challenging but I do wonder if you couldn’t “tune” the output of the model as a whole somehow (given the spectrogram format it could be a translation/scale knob potentially?)


When you say fine tuned do you mean fine tuned on an existing stable diffusion checkpoint? If so which?

It would be very interesting to see what the stable diffusion community that is using automatic1111 version would do with this if it were made into an extension.


Yes from https://huggingface.co/runwayml/stable-diffusion-v1-5. Our checkpoint works with automatic1111, and if you'd like to make an extension to decode to audio, it should be pretty straightforward: https://github.com/hmartiro/riffusion-inference/blob/main/ri...


Can you run this on any hardware already capable of running SD 1.5? I am downloading the model right now, might play with this later.

Guessing at the speed with which AI is developing these days someone is going to have the extension up in two hours at most.


I bet the AUTOMATIC1111 web UI music plugin drops within 48 hours.


I have made a basic version here:

https://github.com/enlyth/sd-webui-riffusion


Yes! Although to have real time playback with our defaults you need to be able to generate 40 steps at 512x512 in under 5 seconds.
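
A rough way to time your own setup (a sketch using the diffusers library; the base SD 1.5 checkpoint linked above stands in here, and any SD 1.x checkpoint should give a similar timing):

    import time
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    start = time.time()
    pipe("funky synth solo", num_inference_steps=40, height=512, width=512)
    print(f"{time.time() - start:.1f}s for 40 steps at 512x512")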


Good to know. I was just so close with just under 7s using 40 steps and Euler a as sampler.


Super clever idea of course. But leaving aside how it was produced, I’ll be one of those who is underwhelmed by the musicality of this. I am judging this in terms of classical music. I repeatedly tried to get it to just play pure piano music without any other add-ons (cymbals etc). It kept mixing the piano with other stuff.

Also the key question is - would something like this ever produce something as hauntingly beautiful and unique as classical music pieces?


Hayk! How smart are you! I loved your work on SymForce and Skydio - totally wasn't expecting you to be co-author on this!

On a serious note, I'd really love some advice from you on time management and how you get so much done? I love Skydio and the problems you are solving, especially on the autonomy front, are HARD. You are the VP of Autonomy there and yet also managed to get this done! You are clearly doing something right. Teach us, senpai!


Hello - this is awesome work. Like other commenters, I think the idea that if you are able to transfer a concept into a visual domain (in this case via fft) it becomes viable to model with diffusion is super exciting but maybe an oversimplification. With that in mind, do you think this type of approach might work with panels of time series data?


Did you have a data set for training the relationship between words and the resulting sound?


Super! Makes sense since Skydio is also amazing.

How much data is used for fine tuning? Since spectrograms are (surely?) very out of distribution for the pre training dataset, how much value does the pre training really bring?


To be honest, we're not sure how much value image pre training brings. We have not tried to train from scratch, but it would be interesting.

One thing that's very important though is the language pre-training. The model is able to do some amazing stuff with terms that do not appear in our data set at all. It does this by associating with related words that do appear in the dataset.


Hi, I really admire the skill you put at work on this project. At the same time, I think everyone is overlooking how crucial and problematic the training factor is.

Why was stable diffusion able to generate spectrograms? Because it was fed some. Presumably, those original spectrograms were scraped with little concern over creators' permissions, just like it has been for artists' work in order to produce art-looking image generation. Please, research what has been happening in the art community lately. https://www.youtube.com/watch?v=Nn_w3MnCyDY

A protest on ArtStation has been shown to influence Midjourney's results, proving that huge amounts of proprietary work are constantly scraped without the creators' permission. AIs like these work so well just because they steal and remix real artists' work in the first place. There are going to be legal wars about this.

Stable Diffusion doesn't have an official music generation Ai precisely because it couldn't train it with the same approach without being sued by music labels right away, while isolated artists don't have the same power.

So, back to my question: have you wondered whose work is Stable Diffusion remixing here? Your endeavour is great technically, but as we progress into the future we have to be more aware of the ethical implications that come with different forms of progress.

You could try to base your project on a collection of free-to-use spectrograms, and see how it performs. If you do, I think it could actually be very interesting and useful to discuss the results here on Hacker News.

Cheers!


What I would really like to know: what happens if one trains that model from scratch? (Or is that not possible because the training requirements are different? Sorry for my ignorance, I've never fine-tuned a diffusion model before.)

In my experience (CNN-based imagery segmentation), proven architectures (e.g. U-Net) performed similarly with or without fine-tuning existing models (mostly trained on ImageNet, Cityscapes, etc.) IF the domain was rather different.

At least in the field of imagery segmentation there is not much of a point in fine-tuning an off-the-shelf model on let's say medical imagery.

So maybe it's the same for the stable diffusion model. I don't see how some knowledge about the relationship between the prompt and given imagery describing that prompt should help this model map the prompt to a spectrogram of the given prompt.


You can embed images in spectrograms.. might sound weird though
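For the curious, a minimal sketch of the trick (treating a grayscale image as an STFT magnitude and inverting it with Griffin-Lim; file names and parameters are illustrative, and this is not the app's own pipeline):

    # Turn a picture into audio by pretending it is a magnitude spectrogram.
    import numpy as np
    import librosa
    import soundfile as sf
    from PIL import Image

    img = Image.open("picture.png").convert("L").resize((512, 512))
    S = np.flipud(np.asarray(img, dtype=np.float32) / 255.0)  # rows = frequency bins (low at bottom), columns = time
    audio = librosa.griffinlim(S, n_iter=32, hop_length=256)  # estimate phase and invert to a waveform
    sf.write("picture.wav", audio, 22050)

Viewed in a spectrogram analyzer, the resulting audio shows the original picture; played back, it mostly sounds like shaped noise.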


This is groundbreaking! All other attempts at AI-generated music have, IMO, fallen flat... These results are actually listenable, and enjoyable! It's almost frightening how powerful this can be


Obviously this needs a little more polish, but I've wanted this for so long I'm willing to pay for it now if it helps push the tech forward. Can I give you money?


What sort of setup do you need to be able to fine tune Stable Diffusion models? Are there good tutorials out there for fine tuning with cloud or non-cloud GPUs?


Reach out to the Beatstars CEO. He was looking for an AI play for his music producers marketplace. Probably solid B2B lead there.


Amazing work. Can this be applied to voice?

Example prompt: “deep radio host voice saying ‘hello there’”

Kind of like a more expressive TTS?


Author here: It can certainly be applied to voice, but the model would need deeper training to speak intelligibly. If you want to hear more singing, you can try a prompt like "female voice", and increase the denoising parameter in the settings of the app.

That said, our GPUs are still getting slammed today so you might face a delay in getting responses. Working on it!


Amazing work! Do you plan on open-sourcing the code to train the model?


The site isn't working for me? Anything I have to fix on my side to make it work?


Crashes repeatedly on iOS in Firefox (my usual browser), is OK on Safari though, so probably not a webkit thing.


This is super awesome.

Have you already explored doing the same with voice cloning?


How many songs did you use for the training data?


is classical music harder? noticed you didn't have any classical music tracks. i wonder if it is because it is more structured?


funny that Hayk is an early skydio guy!

2 amazing AI projects. Huge respect :)


This really is unreasonably effective. Spectrograms are a lot less forgiving of minor errors than a painting. Move a brush stroke up or down a few pixels, you probably won't notice. Move a spectral element up or down a bit and you have a completely different sound. I don't understand how this can possibly be precise enough to generate anything close to a cohesive output.

Absolutely blows my mind.


Author here: We were blown away too. This project started with a question in our minds about whether it was even possible for the stable diffusion model architecture to output something with the level of fidelity needed for the resulting audio to sound reasonable.


Any chance of spoken voice-work being possible? It would be interesting to see if a model could "speak" like James Earl Jones or Steve Blum.


James Earl Jones: https://fakeyou.com/tts/result/TR:9ek4x6eb80kq49e94grnhctk4g...

Steve Blum: https://fakeyou.com/tts/result/TR:xmjjq9ty0hnsyjrjnw806k6rnp...

Furiously working on voice-to-voice (web, real time, and singing!) Should be out the door tomorrow!


Excellent work! Singing would be amazing - karaoke can finally sound good :p

Have you released a tool for volumetric capture? I'm applying this to LED lighting fixture setup for tv/film/live shows and 3D positioning is the last step to fully automated configuration.

My goal is real-time sync between 3D model and real world.


Be careful not to choke on your aspirations :P


This already exists [1].

[1] https://www.respeecher.com/


Are there any open source models with good quality?

I had a look around several months ago, and it seems like everything is locked behind SaaS APIs.


have a look at UberDuck, they do something like this


Wasn't this Fraunhofer's big insight that led to the development of MP3? Human perception actually is pretty forgiving of perturbations in the Fourier domain.


You probably mean Karlheinz Brandenburg, the developer of MP3, who worked on psychoacoustics. Not completely off though, as he did the research at a Fraunhofer research institute, which takes its name from Joseph von Fraunhofer, the inventor of the spectroscope.


Does the institute not also claim that work?


Fair enough. But for me, when talking about `having an insight`, I don't imagine a non-human entity doing that. And to be pedantic (talking about Germans doing research, I hope everyone would expect me to be), the institute is called Fraunhofer IIS. `Fraunhofer` would colloquially refer to the society, which is an organization with 76 institutes total. Although, of course, the society will also claim the work...


It's an interesting question, one I hadn't thought of before. But in common language, it sometimes makes sense to credit the institution, other times just the individuals. I think it may be based more on how much the institution collectively presents itself as the author and speaks on behalf of the project versus the individuals involved. Here is my own general intuition for a few contrasting cases:

Random forests: Ho and Breiman, not really Bell Labs and UC-Berkeley

Transistors: Bardeen, Brattain, and Shockley, not really Bell Labs (thank the Nobel Prize for that)

UNIX: Primarily Bell Labs, but also Ken Thompson and Dennis Ritchie (this is a hard one)

GPT-n: OpenAI, not really any individual, and I can't seem to even recall any named individual from memory


Bringing the right people together and having the right environment that gives rise to "having an insight" can be a big part as well.


By the way, it is a publicly funded non-profit organisation.


In very limited situations. You can move a frequency around (or drop it entirely) if it's being masked by a nearby loud frequency. Otherwise, you would be amazed at the sensitivity of pitch perception.


The easy example of this is playing a slightly out of tune guitar, or a mandolin where the strings in the course aren't matched in pitch perfectly. You can hear it, and it's just a few cents off.


You can also add another neural-network to "smooth" the spectrogram, increase the resolution and remove artefacts, just like they do for image generation.


Pretty sure that's how RAVE works


It's...not effective though. Am I listening to the wrong thing here? Everything I hear from the web app is jumbled nonsense.


I think the progression from church bells to electronic beats is especially good: https://www.riffusion.com/about/church_bells_to_electronic_b...


I think we're at the point, with these AI generative model thingies, where the practitioners are mesmerized by the mechatronic aspect like a clock maker who wants to recreate the world with gears, so they make a mechanized puppet or diorama and revel in their ingenuity.


And that's a bad thing?

How do you think human endeavours progress other than by small steps?


Look at GAN art from a few years ago, compared to MidJourney v4.


Really? They sound quite clearly like the prompt to me if I “squint my ears” a little


This is a genius idea. Using an already-existing and well-performing image model, and just encoding input/output as a spectrogram... It's elegant, it's obvious in retrospective, it's just pure genius.

I can't wait to hear some serious AI music-making a few years from now.


This idea is presented by Jeremy Howard in literally the first Deep Learning for Coders class (most recent edition). A student wanted to classify sounds but only knew how to do vision, so they converted the sounds to spectrograms, fine-tuned the model on the labelled spectrograms, and the classification worked pretty well on test data. That of course does not take the merit away from the Riffusion authors though.
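For reference, the "sound as image" step from that example amounts to a few lines (parameters and file names are illustrative, not the course's or this model's exact settings):

    # Render an audio clip as a mel-spectrogram PNG that an off-the-shelf vision model can consume.
    import numpy as np
    import librosa
    from PIL import Image

    y, sr = librosa.load("clip.wav", sr=22050, mono=True)
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=256)
    S_db = librosa.power_to_db(S, ref=np.max)                      # log scale, roughly matches perception
    img = (255 * (S_db - S_db.min()) / (S_db.max() - S_db.min())).astype(np.uint8)
    Image.fromarray(np.flipud(img)).save("clip_spectrogram.png")   # low frequencies at the bottom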


The idea of connecting CV to audio via spectrograms predates Jeremy Howard's course by quite a bit. That's not really the interesting part here though. The fact that a simple extension of an image generation pipeline produces such impressive results with generative audio is what is interesting. It really emphasizes how useful the idea of stable diffusion is.

edit: added a bit more to the thought


The idea to apply computer vision algorithms to spectrograms is not new. I don't know who first came up with it, but I first came across it about a decade ago.

I just ran a quick Google Scholar search, and the first result is https://ieeexplore.ieee.org/abstract/document/5672395

This is from 2010. I didn't go looking, but it wouldn't surprise me if the idea is older than that.


There were a number of systems designed for composers in the 90s (also continuing through to today) designed for the workflow of converting a sound to a spectrogram, doing visual processing on the image, and then re-synthesizing the sound from the altered spectrogram. Many were inspired by Xenakis' UPIC system which was designed around the second half of this workflow: you'd draw the spectrogram with a pen and then synthesize it.

https://en.wikipedia.org/wiki/UPIC

Edit: my favorite of all these systems was Chris Penrose's HyperUPIC which provided a lot of freedom in configuring how the analysis and synthesis steps worked.


Makes me wonder if we will see a generalization of this idea. Just like in a CPU, where 90%+ of what you want to do can be modeled with very few instructions (mov, add, jmp..), we could see a set of very refined models (Stable Diffusion, GPT, etc.) and all of their abstractions on top (ChatGPT, Riffusion, etc.).


Maybe next up is a model that generates Piet code

https://www.dangermouse.net/esoteric/piet.html


and you ask Stable Diffusion to generate Piet code for a slightly better version of Stable Diffusion (or ChatGPT)... which you can then use to generate a better version, and so on. Singularity here we come!


Perhaps GPT could run on top of Stable-diffusion, generating output in the form of written text (glyphs).


Indeed, I think this would be a cost-effective way to go forward.


For what it's worth, people were trying the same thing with GANs (I also played with doing it with StyleGAN a bit), but the results weren't as good.

The amazing thing is that the current diffusion models are so good that the spectrograms are actually reasonable enough despite the small room for error.


As someone who loves making music and loves listening to music made by other humans with intention, it just makes me sad.

Sure, AI can do lots of things well. But would you rather live in a world where humans get to do things they love (and are able to afford a comfortable life while doing so) or a world where machines do the things humans love and humans are relegated to the remaining tasks that machines happened to be poorly suited for?


As someone who loves making music and loves listening to music (regardless of its origins, in my case), it doesn't make me that sad. Sure, at first, I had an uncomfortable feeling that AI could make this sacred magic thing that only I and other fellow humans know how to do... But then I realized same thing is happening with visual art, so I applied the same counterarguments that've been cooking in my head.

I think that kind of attitude is defeatist - it's implying that humans will be stopped from making music if AI learns how to do it too. I don't think that will happen. Humans will continue making music, as they always have. When Kraftwerk started using computers to make music back in the 70s, people were also scared of what that will do to musicians. To be fair, live music has died out a bit (in a sense that there aren't that many god-on-earth-level rockstars), but it's still out there, people are performing, and others who want to listen can go and listen.

Maybe consumers will start consuming more and more AI music instead of human music [0], but the worst thing that can happen is that music will no longer be a profitable activity. But then again, today's music industry already has some elements of automation - washed-out rhythms, sexual themes over and over again, rehashing the same old songs in different packages... So nothing's gonna change in the grand scheme of things.

[0] https://www.youtube.com/watch?v=S1jWdeRKvvk


> but the worst thing that can happen is that music will no longer be a profitable activity.

For me, the worst that could happen is that people spend so much time listening to AI generated music, that human musicians can no longer find audiences to connect to. It's not just about economics (though that's also huge). It's the psychological cost of all of us spending greater and greater fractions of our lives connected to machines and not other people.


Music was always about people. Even today, as most people listen to the mass-produced run-of-the-mill muzak, there is still a significant audience that seeks the "human element" for the sake of itself.

Black metal community, for example, has always rejected all forms of "automation" and considers it not kvlt - rawness is a sought-after quality, defined as having people performing as close to the recording equipment as possible.

There's also a rapper named Bones (Elmo O'Connor) who's never signed a contract with a label, does only music he wants to do, releases a couple albums every year. There's something about his approach that makes his music sound very organic and honest. I listen to him more than I listen to any mass produced rapper.

So in conclusion, music was always about people. Unless AI reaches AGI level, I don't think it will ever impact music enough to kill all audience.


I agree with this so much that I’d take off the comment about AGI: I think that even if/when AGI lands there will always be audiences that seek raw, organic, and honest artistic expression from other humans.


The vast majority of music produced is listened to by nobody, or a handful of people, so this is already the case really.


I play the piano (badly). There are many other people who can play much better than I. There are simple computer programs which can play better. It doesn't stop me from enjoying it or playing it. Computers have been beating people at Chess for years yet you still see people everywhere enjoying the game. At some point computers will be better than humans at absolutely everything but it shouldn't stop you as a human from enjoying anything.


Sure, but a large part of enjoyment in creativity for many is the joy of sharing it with an audience. To the degree that people are spending their available attention on AI-generated content, they have less time and attention available to spend listening to and watching art created by humans.


I would rather live in a world where humans get to do things they love because they can (and not because they have to earn their bread), and machines get to do basically everything that needs to be done but no human is willing to do it.

Advancing AI capabilities in no way detracts from this. You talk about humans being "relegated to the remaining tasks" - but that's a consequence of our socioeconomic system, not of our technology.


> but that's a consequence of our socioeconomic system, not of our technology.

Those two are profoundly intertwined. Our tech affects our socioeconomic systems and vice versa.


Sure, so now that we have new tech, let's update the socioeconomic system to accommodate it.


If only it was that easy.


It's not, but at least it's feasible. Trying to suppress technology instead is futile in the long term.


Musicians already make much (most?) of their money via gigs and I don't think going to watch an AI play at a gig will be all too common. I think we'll be fine. Might have to adapt though.


I'd rather live in the world where humans do things that are actually unique and interesting, and aren't essentially being artificially propped up by limiting competition.

I don't see this as a threat to human ingenuity in the slightest.


There are still chess tournaments for humans, even though our smartphone could play chess better than any grandmaster.


I'm super excited about the Audio AI space, as it seems permanently a few years behind image stuff - so I think we're going to see a lot more of this.

If you're interested, the idea of applying Image processing techniques to Spectrograms of audio is explored in brief in the first lesson of one of the most recommended AI courses on HN: Practical Deep Learning for Coders https://youtu.be/8SF_h3xF3cE?t=1632


> I can't wait to hear some serious AI music-making a few years from now.

I think this will be particularly useful for musical compositions in movies and film, where the producer can "instruct" the AI about what to play, when, and how to transition so that the music matches the scene progression.


Not only that but sampling. I'd say there's at least one sample from something in most modern music. This can essentially create "sounds" that you're looking for as an artist. I need a sort of high pitched drone here... Rather than dig through sample libraries you just generate a few dozen results from a diffusion model with some varying inputs and you'd have a small sample set on the exact thing you're looking for. There's already so much processing of samples after the fact, the actual quality or resolution of the sample is inconsequential. In a lot of music, you're just going after the texture and tonality and timbre of something... This can be seen in some Hans Zimmer videos of how he slows down certain sounds massively to arrive at new sounds... or in granular synthesis... This is going to open up a lot of cool new doors.


I was thinking gaming where music can and should dynamically shift based on different environmental and player conditions.


I suspect that if you had tried this with previous image models the results would have been terrible. This only works since image models are so good now.


You already hear a ton of them. Lofi music on these massively popular channels is basically auto-generated "music" + auto-generated artwork.


I dabble in music production and know some of the people in the "Lofi" world, so I know for a fact that this is not true. It's just a formulaic sub-genre where people are trying to make similar instrumentals with the same vibe. It would be jarring to listen to a playlist while studying and each song had wildly different tempos, instruments, etc.

Also, the music doesn't sound "Lofi" because it's generated by algorithms. A lot of hard work and software goes into taking a clean, pitch-perfect digital signal and making it sound like something playing on a record player from the 70s.


Do you have any sources for more information about this?


Some of this is really cool! The 20 step interpolations are very special, because they're concepts that are distinct and novel.

It absolutely sucks at cymbals, though. Everything sounds like realaudio :) composition's lacking, too. It's loop-y.

Set this up to make AI dubtechno or trip-hop. It likes bass and indistinctness and hypnotic repetitiveness. Might also be good at weird atonal stuff, because it doesn't inherently have any notion of what a key or mode is?

As a human musician and producer I'm super interested in the kinds of clarity and sonority we used to get out of classic albums (which the industry has kinda drifted away from for decades), so the way for this to take over for ME would involve a hell of a lot more resolution of the FFT imagery, especially in the highs, plus another layer of AI-ification that decides what different parts of the song exist (a further layer that controls abrupt switches of prompt).

It could probably do bad modern production fairly well even now :) exaggeration, but not much, when stuff is really overproduced it starts to get way more indistinct, and this can do indistinct. It's realaudio grade, it needs to be more like 128kbps mp3 grade.


> composition's lacking, too. It's loop-y.

Well no wonder, it has absolutely no concept of composition beyond a single 5s loop, if I understand correctly.

> It absolutely sucks at cymbals, though. Everything sounds like realaudio :)

> It could probably do bad modern production fairly well even now :) exaggeration, but not much, when stuff is really overproduced it starts to get way more indistinct, and this can do indistinct. It's realaudio grade, it needs to be more like 128kbps mp3 grade.

I haven't sat down yet to calculate it, but is the output of SD at 512*512px at 24bit enough to generate audio CD quality in theory?


No.

And I suspect this will always have phase smearing, because it's not doing any kind of source separation or individual synthesis. It's effectively a form of frequency domain data compression, so it's always going to be lossy.

It's more like a sophisticated timbral morph, done on a complete short loop instead of an individual line.

It would sound better with a much higher data density. CD quality would be 220500 samples for each five second loop. Realtime FFTs with that resolution aren't practical on the current generation of hardware, but they could be done in non-realtime. But there will always be the issue of timbres being distorted because outside of a certain level of familiarity and expectation our brains start hearing gargly disconnected overtones instead of coherent sound objects.

What this is not doing is extracting or understanding musical semantics and reassembling them in interesting ways. The harmonies in some of these clips are pretty weird and dissonant, and not what you'd get from a human writing accessible music. This matters because outside of TikTok music isn't about 5s loops, and longer structures aren't so amenable to this kind of approach.

This won't be a problem for some applications, but it's a long way short of the musical equivalent of a MidJourney image.

Generally we're a lot more tolerant of visual "bugs" than musical ones.
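A back-of-the-envelope comparison of the two data budgets mentioned above:

    # 5 seconds of CD-quality audio vs. one 512x512 spectrogram image.
    cd_samples = 44_100 * 5           # 220,500 16-bit samples per 5-second loop (per channel)
    spectrogram_values = 512 * 512    # 262,144 values, typically 8 bits each in a PNG
    print(cd_samples, spectrogram_values)
    # Similar counts, but the spectrogram keeps only magnitudes (no phase) and each column
    # already summarizes hundreds of samples, so fine temporal detail is gone.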


I think an approach like this could generate interesting sounds we as humans would never think of. Or meshing two sounds in ways we could barely imagine or implement.

But of course something like this, which only thinks in 5s clips can not generate a larger structure, like even a simple song. Maybe another algorithm could seed the notes and an algorithm like this generates the sounds via img2img.


>and not what you'd get from a human writing accessible music

The timbral qualities of the posted samples remind me of some of the stuff I heard from Aphex Twin, like Alberto Balsalm. Not accessible by a long shot but definitely human


This is huge.

This show me that Stable Diffusion can create anything with the following conditions:

1. Can be represented as a static item in two dimensions (their weaving together notwithstanding, it is still piece-by-piece statically built)

2. Acceptable with a certain amount of lossiness on the encoding/decoding

3. Can be presented through a medium that at some point in creation is digitally encoded somewhere.

This presents a lot of very interesting changes for the near term. ID.me and similar security approaches are basically dead. Chain of custody proof will become more and more important.

Can stable diffusion work across more than two dimensions?


Now I'm wondering about feeding Stable Diffusion 2D landscape data with heightmaps and letting it generate maps for RTS videogames. I mean, the only wrinkle there is an extra channel or two.


Any image generator can do well on any two-dimensional data, including SD, Dalle, Midjourney.

One feature of SD not discussed much, in my opinion, is the deterministic seed it provides to the user. This is what enables the smooth transition between each second of music it generates and the next. Moving the cursor through latent space only minimally, so that the next piece of information changes ever so slightly, definitely sounds good to the human ear.


Being able to blend between prompts and attention weightings smoothly from a fixed seed is definitely a fantastic and underexplored avenue; it makes me recall "vector synthesis" common in wavetable synthesizers since the '80s as discussed here[0]. I feel we are just a couple of months from seeing people start using MIDI controllers to explore these kinds of spaces. Something could be hacked together today, but it will be interesting to see once the images can be generated in nearly realtime as the controls are adjusted.

[0] https://www.soundonsound.com/techniques/synth-school-part-7
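To make that concrete, here's a minimal sketch of sweeping a single "knob" value between two prompt embeddings from a fixed seed with the diffusers library (the model ID, prompts, and the plain linear blend are assumptions for illustration; spherical interpolation is a common refinement):

    # Fix the seed, then blend two prompt embeddings as if turning a crossfader knob.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "riffusion/riffusion-model-v1", torch_dtype=torch.float16
    ).to("cuda")

    def embed(prompt):
        tokens = pipe.tokenizer(prompt, padding="max_length",
                                max_length=pipe.tokenizer.model_max_length,
                                truncation=True, return_tensors="pt")
        return pipe.text_encoder(tokens.input_ids.to("cuda"))[0]

    emb_a, emb_b = embed("lo-fi hip hop beat"), embed("church bells")
    for t in [0.0, 0.25, 0.5, 0.75, 1.0]:          # imagine t driven by a MIDI knob
        blended = (1 - t) * emb_a + t * emb_b      # linear blend of the two prompt embeddings
        image = pipe(prompt_embeds=blended, num_inference_steps=40,
                     generator=torch.Generator("cuda").manual_seed(42)).images[0]
        image.save(f"blend_{t:.2f}.png")           # one spectrogram frame per knob position

Because the seed is fixed, the only thing changing between frames is the conditioning, which is what keeps consecutive outputs close to each other.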


>Being able to blend between prompts and attention weightings smoothly from a fixed seed is definitely a fantastic and underexplored avenue;

Agree totally. Before SD was created, I thought it was impossible to replicate a prompt more than once. The deterministic/fixed seed is a big innovation of SD, and how well it works in practice is simply amazing.

From the article: >Those who tried this method, however, soon found that, without analogue filters to run through the harmonic content of waveforms, picking out and exaggerating their differing compositions, most hand‑drawn waveforms sounded rather ordinary and often bland, despite the revolutionary way in which they were created.

Yes, the technique that the Riffusion people created, demonstrated, and shared is the holy grail of electronic music synthesis. I would imagine it has some way to go before it is applied to electronic music effectively: integration with some tools, practice by musicians on the new tool, some fine-tuning, etc.


I would argue that its high-fidelity representations of 3D space imply that the model's weights are capable of pattern-matching in multiple dimensions, provided the input is embedded into 2D space appropriately.


Can you expand on what you mean with the identity/security services?


Something unlikely to be affected: OIDC, PGP, etc. as these require signals that have full fidelity to authorize access.

Something likely to be affected: anything using biometrics as a password instead of a name.


I think there has to be a better way to make long songs...

For example, you could take half the previous spectrogram, shift it to the left, and then use the inpainting algorithm to make the next bit... Do that repeatedly, while smoothly adjusting the prompt, and I think you'd get pretty good results.

And you could improve on this even more by having a non-linear time scale in the spectrograms. Have 75% of the image be linear, but the remaining 25% represent an exponentially downsampled version of history. That way, the model has access to what was happening seconds, minutes, and hours ago (with less detail the further back in time you go).
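A rough sketch of that sliding-window idea using an off-the-shelf inpainting pipeline (the checkpoint, prompt, and half/half split are assumptions; for real music you'd want an inpainting model fine-tuned on spectrograms, and each finished window would still need to be converted back to audio):

    # Keep the left half of the previous spectrogram, mask the right half, and let an
    # inpainting model paint the continuation. Checkpoint and prompt are assumptions.
    import numpy as np
    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    # Black = keep (left half), white = regenerate (right half).
    mask = Image.fromarray(np.hstack([np.zeros((512, 256), np.uint8),
                                      255 * np.ones((512, 256), np.uint8)]))

    spectrogram = Image.open("seed_spectrogram.png").convert("RGB").resize((512, 512))
    segments = [spectrogram]
    for _ in range(8):                                               # extend by 8 more windows
        shifted = Image.new("RGB", (512, 512))
        shifted.paste(spectrogram.crop((256, 0, 512, 512)), (0, 0))  # old right half -> new left half
        spectrogram = pipe(prompt="mellow jazz piano", image=shifted,
                           mask_image=mask, num_inference_steps=40).images[0]
        segments.append(spectrogram)
    # Invert each new right half to audio (e.g. Griffin-Lim) and concatenate for a longer track.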


Perhaps you could do a hierarchical approach somehow, first generating a "zoomed out" structure, then copying parts of it into an otherwise unspecified picture to fill in the details.

But perhaps plain stable diffusion wouldn't work - you might need different neural networks trained on each "zoom level" because the structure would vary: music generally isn't like fractals and doesn't have exact self-similarity.


You seem smart. How do I follow you?


Authors here: Fun to wake up to this surprise! We are rushing to add GPUs so you can all experience the app in real-time. Will update asap


Awesome, there is another project out there that does it on CPU: https://github.com/marcoppasini/musika. Maybe mix the two, i.e. take the initial output of Musika, convert it to a spectrogram, and feed it to Riffusion to get more variation...


"fine-tuned on images of spectrograms paired with text"

How many paired training images / text and what was the source of your training data? Just curious to know how much fine tuning was needed to get the results and what the breadth / scope of the images were in terms of original sources to train on to get sufficient musical diversity.


Fascinating stuff.

One of the samples had vocals. Could the approach be used to create solely vocals?

Could it be used for speech? If so, could the speech be directed or would it be random?


I bet a cool riff on this would be to simply sample an ambient microphone in the workplace and use that to generate and slowly introduce matching background music that fits the current tenor of the environment. Done slowly and subtly enough, I'd bet the listener may not even be entirely aware it's happening.

If we could measure certain kinds of productivity it might even be useful as a way to "extend" certain highly productive ambient environments a la "music for coding".


>in the workplace

Or at a house party, club or restaurant... as more people arrive or leave and the energy level rises or declines..or human rhythms speed up or slow down...so does the music...


DJs are getting automated away too!


Or perhaps use it in a hospital to play music that matches the state of a patient’s health as they are passing away.


I would not want to go to the hospital for a mild ear infection and hear the AI start blasting death metal.


Reactive generative music would be so cool


Producing images of spectrograms is a genius idea. Great implementation!

A couple of ideas that come to mind:

- I wonder if you could separate the audio tracks of each instrument, generate separately, and then combine them. This could give more control over the generation. Alignment might be tough, though.

- If you could at least separate vocals and instrumentals, you could train a separate model for vocals (LLM for text, then text to speech, maybe). The current implementation doesn't seem to handle vocals as well as TTS models.


I think you'd have to start with separate spectrograms per instrument, then blend the complete track in "post" at the end.


This opens up ideas. One thing people have tried to do with stable diffusion is create animations. Of course, they all come out pretty janky and gross, you can't get the animation smooth.

But what if a model was trained not on single images, but on animated sequential frames, in sets, laid out on a single visual plane. So a panel might show a short sequence of a Disney princess expressing a particular emotion as 16 individual frames collected as a single image. One might then be able to generate a clean animated sequence of a previously unimagined Disney princess expressing any emotion the model has been trained on. Of course, with big enough models one could (if they can get it working) produce text-prompted animations across a wide variety of subjects and styles.


We are back to sprite maps


That's an interesting idea. I wonder if this would work with inpainting - erase the 16th cell and let the AI fill it in. Then upscale each frame. Has anyone experimented with this?



Well Look at that. I'm totally not surprised, lol.


a different approach: analog television


The vocals in these tracks are so interesting. They sound like vocals, with the right tone, phonemes, and structure for the different styles and languages, but no meaning.

Reminds me of the soundtrack to Nier Automata which did a similar thing: https://youtu.be/8jpJM6nc6fE


That's glossolalia, and it's not that uncommon in human-created art.


I think AI would be great at generating similar things. Might be very nice for generating fake languages, too.


Another related audio diffusion model (but without text prompting) here: https://github.com/teticio/audio-diffusion


oh wow this one works really well


Earlier this year it was graphic designers, last month it was software engineers, and now musicians are also feeling the effects.

Who else will AI send looking for a new job?


Musicians were made to get a day job long before you were born ;)


Although I do wonder how much an earlier technology, audio reproduction, contributed to that. My grandmother worked for a time as a piano player as part of a nightclub orchestra. It was a stable job back then. I have to wonder how many musician jobs were killed off by the jukebox and related technologies.


Honestly none of them should. I think the moral panic around these things is way overstated. They are cool but hardly about to replace anyone's job.


Have you tried AI asset generators? They work extremely well. Just yesterday a friend of mine showed me the progress they made in their game. It is incredible. Designers are 100% losing their jobs over this.


I'm a professional game developer and excited AI enthusiast.

While I've seen a lot of cool stuff which helps generation for hobby projects or smaller indie games it's nowhere near the quality and consistency needed to come close to the work of a skilled human artist at a larger studio.


Yes, and I studied Game Development in Germany for 3 years in Düsseldorf, came third at the national Gameforge newcomer award with my team "Northlight Games", and still have many connections to the people in the business (if this somehow matters). The quality of 2D assets is at a very professional level and has already replaced jobs in projects I know.

To give you an example join the public discord of https://www.scenario.gg and check out results. Come back and tell me those aren't on a professional level.

I am not saying that designers won't be needed anymore but AI is definitely able to replace jobs and speed up progress in game development.


If I was a musician, this post would not make me worry for a second


If a hack based on an image generator already has promising results for music generation, then imagine what will happen if something dedicated to music is built from the ground up.


Politicians, bureaucracy.

GPT-3, what policy should we apply to increase tax revenue by 5% given these constraints?

GPT-3, please tell me some populist thing to say to win the next election, or how should I deflect these corruption charges.


"We should place a tax on all copyright lawyers and use it to fund GPU manufacturing and AI development. At your next stump speech, mention how the entertainment industry is stealing jobs from construction workers. Your corruption charges won't matter because voters only care about corruption when it's not in their favor."


Isn’t this the plot of Deus Ex?


The raw outputs of these tools will be best consumed by experts. Until general AI, these are just better tools for the same workers.


They were killed off by the ability to record the data. Every city used to have their own music stars :)


This was the first AI thing to fill me with a feeling of existential dread.


What is with the hyperbole in this thread? This stuff sounds like incoherent noise. It is noticeably worse than AI audio stuff I heard 5 years ago. What is going on with the responses here?


I assume the stuff from 5 years ago was essentially spitting out a midi output which would be fed in to a traditional tool to play samples. So it's going to sound a lot sharper while being a lot less sophisticated. The real breakthrough here is this is generating everything from scratch and it still resembles the prompt.

One of the automated prompts was "Eminem anger rap", I'm confident if you had showed me the audio without the prompt I could identify which artist it sounded like.

And this is just a basic first attempt at reusing a tool not even designed for audio. I can only imagine how powerful it could be after some trivial revisions like using GPT-3 to generate coherent lyrics.


I feel exactly the opposite way, but I suppose everyone has a different ear and taste. I think a good 3-4% of what this produces sounds damn amazing and beautiful. I've been vibing to it a lot. Fantastic stuff! There is also the feeling of shock and awe like with ChatGPT where you give it a prompt about a niche thing you think it will definitely not understand and it turns out it understands it shockingly well. As an example I just gave it a prompt "Avril 4th" and the result literally gave me chills.


Usage of an image generator to produce passable music fragments, even if they sound a bit distorted, is very surprising. That type of novelty is why we come here.


People did the same with GANs years ago with similar odd results. I do think the kinks will eventually be ironed out but i don’t think this is it.


This looks great and the idea is amazing. I tried the prompts "speed metal" and "speed metal with guitar riffs" and got some smooth rock-ballad type music. I guess there was no heavy metal in the learning samples haha.

Great work!


Gregorian death metal folk also seems to have lacked seed tunes but the thing is just in its infancy so soon we'll be banging our tonsured heads to the folky beats of ...

...OK, need to create a band name generator to work in tandem with this thing. Let's see what one of its brethren in ML makes of it...

- "Echoes of the Past": This name plays on the idea of Gregorian chanting, which is often associated with the distant past, and combines it with the intense and aggressive sound of death metal.

- "The Order of the Black Chant": This name incorporates elements of both the religious connotations of Gregorian chanting and the dark, heavy sound of death metal, creating a sense of mystery and danger.

- "Foretold in Blood": This name evokes both the ancient, mystical nature of Gregorian chanting and the violent themes of death metal, creating a sense of ancient prophecy coming to pass.

- "Crypt of the Silent Choir": This name brings together the eerie, otherworldly sound of Gregorian chanting with the underground, underground feel of death metal, creating a sense of hidden secrets and forbidden knowledge.

"The Order of the Black Chant" it shall be.


Somewhat unrelated, but given the descriptions, perhaps you should take a listen to the Darktide 40k soundtrack if you haven't: https://www.youtube.com/watch?v=D4hEOMSjzdo


Yes, that sounds like something which goes in the right direction. Now just drop the modern stuff - the synth bass beat doesn't really fit the description - and have some more Gregorian growl and Lucifer's your uncle.


It does seem to lack a lot in heavy music in general. My first attempt was to get it to generate something akin to AC/DC and it got fairly close but it still seemed a bit too clean and pop-like. Then I tried to get something closer to nu metal or deathcore and it just kept generating some smooth upbeat jazz. Which is not that bad in its own right, but not at all what I asked for.

Edit: After a bit of playing around I at least got some credible results with "electric guitar solo, glam rock". It also understands grunge.


Yeah... I just wanted some metal and there's no way to get anything gruff or aggressive out of this at all :-(


yeah, I also couldn't get it to do any folk metal. We shall have to wait for metal AI a short while yet haha


Fun! I tried something similar with DCGAN when it first came out, but that didn't exactly make nice noises. The conversion to and from Mel spectrograms was lossy (to put it mildly), and DCGAN, while impressive in its day, is nothing like the stuff we have today.

Interesting that it gets such good results with just fine-tuning the regular SD model. I assume most of the images it's trained on are useless for learning how to generate Mel spectrograms from text, so a model trained from scratch could potentially do even better.

There's still the issue of reconstructing sound from the spectrograms. I bet it's responsible for the somewhat tinny sound we get from this otherwise very cool demo.


Interesting. I experimented a bit with the approach of using diffusion on whole audio files, but I ultimately discarded it in favor of generating various elements of music separately. I'm happy with the results of my project of composing melodies (https://www.youtube.com/playlist?list=PLoCzMRqh5SkFPG0-RIAR8...) and I still think this is the way to go, but that was before Stable Diffusion came out. These are interesting results though, maybe it can lead to something more.


It may be clearer to those of you who are smarter than me, but I guess I've only recently begun to appreciate what these experiments show--that AI graphical art, literature, music and the like will not succeed in lowering the barriers to humans making things via machines but in training humans to respond to art that is generated by machines. Art will not be challenging but designed by the algorithm to get us to like it. Since such art can be generated for essentially no cost, it will follow a simple popularity model, and will soon suck like your Netflix feed.


> Since such art can be generated for essentially no cost, it will follow a simple popularity model, and will soon suck like your Netflix feed.

I'm not so sure. Considering how successful AI-driven social media feeds are, which already include substantial AI-generated content, why would a feed consisting entirely of such content be any less successful? The quality will only keep increasing.

> Art will not be challenging but designed by the algorithm to get us to like it.

I don't think these advancements are a threat to art created by humans, just as any art created by humans isn't a threat to other art. It's just... more art.

Eventually, AI will be capable of being truly creative, instead of being trained on human art and producing permutations of it, which will also be wonderful.

The role of humans will be to train these models to produce art we find enjoyable. Imagine if your AI media feed was an infinite stream of artworks personalized just for your taste. It will be TikTok on steroids. I can't say I'm thrilled by that prospect, because it will also be used for exploiting users, but the entertainment potential is huge.


Right now, the AI is trained on mostly human generated art. It will be interesting to see what happens when the training set itself is mostly AI generated. The role of the "artist" in the future will not be to create art directly, but to influence art by manipulating the training set. For instance an artist who can flood the Internet with a trillion captioned images will be able to spawn an art movement, even if it short lived.


I’d been wondering (naively) if we’d reached the point where we can’t see any new kinds of music now that electronic synthesis allows us to make any possible sound. Changes in musical styles throughout history tend to have been brought about by people embracing new instruments or technology.

This is the most exciting thing I’ve seen in ages as it shows we may be on the verge of the next wave of new technology in music that will allow all sorts of weird and wonderful new styles to emerge. I can’t wait to see what these tools can do in the hands of artists as they become more mainstream.


'Make any possible sound' is less important than 'make x sound easily' by way of tools and accumulated knowledge. Also, what audiences are receptive to matters a lot - you could have made noise rock in the 40s, but I can't imagine it would have sold a lot of records.


The interpolation from keyboard typing to jazz is incredible. This is what AI art should be.


Really cool. Can't get this to work on the homepage though.

Might be a traffic thing?

Edit: Works now. A bit laggy but it works. Brilliant!


I also don't hear anything, even when my prompt was selected...


Me neither, perhaps the web app is a bit buggy?


I'm getting this back when I try to hear cats sing me a rock opera:

{"data":{"success":true,"worklet_output":{"error":"Model version 5qekv1q is not healthy"},"latency_ms":530}}


Same here, servers are overloaded probably. Shame, I was really looking forward to a Wu Tang Clan and Jamiroquai collab


Same earlier, but I can now get it to work very intermittently, with the error "Uh oh! Servers are behind, scaling up..."



Nice reference! I had never seen that site before, but those albums had a significant impact on my musical journey.

There was a purple victorian house in Colorado Springs where the living room was converted into a record and cd store called Life By Design. I picked up these albums and a ton of other obscure music there. I was so happy to not have to drive all the way up to Wax Trax in Denver to be able to discover new artists.


Wow those examples are shockingly good. It's funny that the lyrics are garbled analogously to text in stable diffusion images.

The audio quality is surprisingly good, but does sound like it's being played through an above-average quality phone line. I bet you could tack on an audio-upres model afterwards. Could train it by turning music into comparable-resolution spectrograms.


How come the Stable Diffusion model helps here? Does the fact that it knows what an astronaut on a horse looks like have an effect on the audio? Would starting the training from an empty model work too?


I think they only mentioned the horse to illustrate to people that they are using the same tool as what was used to generate those types of images. It's painting a picture for the uninitiated audience. From what I understood, this model is trained on spectrograms instead of horses and the like, resulting in this product.


I'm floored, the typing to jazz demo is WILD! Please keep pushing this space, you've got something real special here.


The results of this are similar to my nitpicks of AI generated images (well, duh!). There's definitely something recognizable there, but something's just not quite right about it.

I'm quite impressed that there was enough training data within SD to know what a spectrogram looks like for the different sounds.


I just wanted to say you guys did an amazing job packaging this up. I managed to get a local instance up and running against my local GPU in less than 10 minutes.


This is so good that I wondered if it's fake. Really impressive results from generated spectrograms! Also really interesting that it's not exactly trained on the audio files themselves - I wonder if the usual copyright-based objections would even apply here.


regarding those usual objections, i'd argue that a spectrogram representation of a given piece of audio is just a different (lossy) encoding of the same content/information, so any hypothetical objections would still apply here.


You would be absolutely correct. The lossiness is in the resolution of the image (512x512 is pretty terrible), but given enough image resolution it's just an FFT transform, and the only reason that stuff falls short is that people don't give it enough resolution in turn. If you did wild overkill on the resolution of an FFT transform, you could do anything you wanted with no loss of tone quality. If you turned that into visual images and did diffusion with it, you could do AI diffusion at convincing audio quality.

In theory the tone quality is not an objection here. When it sounds bad it's because it's 512x512, because the FFT resolution isn't up to the task, etc. People cling to very inadequate audio standards for digital processing, but you don't have to.


Why not? Music copyright was not even about audio recordings originally.


@haykmartiros, @seth_, thank you for open sourcing this!

Played a bit with the very impressive demos, now waiting in queue for my very own riff to get generated.

Great as this is, I'm imagining what it could do for song crossfades (actual mixing instead of plain crossfade even with beat matching).


Does anyone have any good guides/tutorials for how to fine-tune Stable Diffusion? I'm not talking about textual inversion or dreambooth.


Plug this into a video game and you could have GTA 6 where the NPCs have full dialogue with the appearance of sentience, concerts where imaginary bands play their imaginary catalogue live to you and all kinds of other dynamically generated content.


Is there a different mapping of FFT information to a two dimensional image that would make harmonic relationships more obvious?

I.e., use a polar coordinate system where the angle at 12 o'clock is 440 Hz, and the 12 chromatic notes are mapped to the angles of the hours. Maybe red pixel intensity is bit-mapped to octave, e.g. first, third and eighth octave: 0b10100001.

Time would be represented by radius. Unfortunately the space wouldn't wrap nicely like if there was a native image format for representing donuts.


You need the mapping to be only to 1 dimension, since you need the 2nd dimension for time


This is genuinely amazing. Like with any AI there are areas it's better at than others. I hope it doesn't go unnoticed just because people try typing "death metal" and are not happy with the results. This one seems to excel at 70-130BPM lo-fi vaporwave/washed out electronic beats. Think Ninja Tune or the modern lo-fi beats to study to. Some of this stuff genuinely sounds like something I'd encounter on Bandcamp or Soundcloud.

I think I'm beginning to crack the code with this one, here's my attempt at something like a "DJ set" with this. My goal was to have the smoothest transitions possible and direct the vibe just like I would doing a regular DJ set. https://www.youtube.com/watch?v=BUBaHhDxkIc

I wonder if this could be the future of DJing or perhaps beginning of a new trend in live music making kind of like Sonic Pi. Instead of mixing tracks together, the DJ comes up with prompts on the spot and tweaks AI parameters to achieve the desired direction. Pretty awesome.


Really fascinating. I'd be interested to know more about how it was trained, with what data exactly.


Congratulations this is an amazing application of technology and truly innovative. This could be leveraged by a wide range of applications that I hope you'll capitalize on.


Incredible stuff, Seth & Hayk. I've been thinking nonstop about new things to build using Stable Diffusion and this is the first one that really took my breath away.


This is amazing! Would it be possible to use it to interpolate between two existing songs (i.e. generate spectrograms from the audio and transition between them)?


This is so completely wild. Love the novelty and inventiveness.

Could anyone help me understand whether using SVG instead of bitmap image would be possible? I realize that probably wouldn't be taking advantage of the current diffusion part of Stable-Diffusion, but my intuition is maybe it would be less noisy or offer a cleaner/more compressible path to parsing transitions in the latent space.

Great idea? Off base entirely? Would love some insight either way :D


A musician friend of mine told me that this is (I freely translate) a perversion: building in the frequency domain and then returning to the time domain. Don't shoot the messenger.

Personally I like the results. I'm totally untrained and couldn't hear any of the issues many comments are pointing out.

I guess that all of lounge/elevator music and probably most ad jingles will be automated soon, if automation costs less than human authors.


"Horse Carriage Driver says horseless carriages are abominations. More at 12!"


adamsmith143, commenter. Witness here the apathy of someone unaffected by the suffering of millions and uncommitted to the soul of humanity. DIAF.


Sorry my sympathy for "starving" artists isn't sufficient for your liking. 95% of modern art is drivel that can and should be replaced by AI art. A recent walk through the MoMa in NY had me weeping for humanity. Filled with, in some cases literal, trash that any 5th grader could produce.


Very impressive. I am quite confident that next years number one Christmas hit will start like "church bells to electronic beats".


Rearrange that trip through latent space a little, jumping back and forth through different stages of the interpolation in a pattern resembling those customary chorus/verse things and you've got a hit. And you could reuse the exact same rearrange recipe for just about any interpolation between prompt pairs.

This has been called plenty of times before, and it's certainly possible that this won't be the last time it is called, but allow me to declare this the end of the bedroom producer (stage performers remain unaffected). And no two elevators will ever sound the same again, on any day.


Xmd5a is already a real track.

https://www.youtube.com/watch?v=crcqADcAusg


I find it really cool that the "uncanny valley" that's audible on nearly every sample is exactly as I would imagine that the visual artifacts would sound that crop up in most generated art. Not really surprising I guess, but still cool that there's such a direct correlation between completely different mediums!


yeah, it's pretty unsurprising that they both land in the uncanny valley, like a messy circus.


This is one of the most ingenious things I've seen in my life


Things similar to the “interpolation” part (not the generative part) are already used extensively especially for game and movie sound design. Kyma [1] is the absolute leader (it requires expensive hardware though). IMO later iterations on this approach may lead to similar or better results.

FYI, other apps that use more classic but still complex Spectral/Granular algos :

https://www.thecargocult.nz/products/envy

https://transformizer.com/products/

https://www.zynaptiq.com/morph/

[1] https://kyma.symbolicsound.com/


If copyright laws don't catch up, the sampling industry is cooked.

Made this: https://soundcloud.com/obnmusic/ai-sampling-riffusion-waves-...


assuming the first sample was generated by OP method, how did you clean that sample up so nicely?


Yep, the first sample was generated via OP's method. The tl;dr answer is that I used a few plugins in my digital audio workstation to make it sound better.

I made a video specifically in reply to you if you want to see exactly how I did it (3 mins): https://www.youtube.com/watch?v=69Q-cseNCI4


Thank you so much!


great stuff, while it comes with the usual smeary iFFT artifacts that AI-generated sound tends to have the results are surprisingly good. i especially love the nonsense vocals it generates in the last example, which remind me of what singing along to foreign songs felt like in my childhood.


I was thinking about this - what if someone trained a Stable Diffusion type model on all of the world's commercial music? This model would probably produce quite amazing music given enough prompting, and I'm wondering if the music industry would be allowed to claim copyright on works created with such a model. Would it be illegal, or is this just like a musician picking up ideas from hearing the world of music? Is it really right to make learning a crime, even if machines are doing it? I'm conflicted after finding out that for sync licensing the music industry wants a percentage of revenue based on your subscription fees, sometimes as high as 15%-20%! I'm surprised such a huge fee isn't considered some kind of protection racket.


This question has been explored before, see Kolmogorov Music: https://www.youtube.com/watch?v=Qg3XOfioapI


https://soundcloud.com/toastednz/stablediffusiontoddedwards?...

40 sec clip of uk garage/todd edwards style track made with riffusion -> serato studio with todd beats added


>> Prompt - When providing prompts, get creative! Try your favorite artists, instruments like saxophone or violin, modifiers like arabic or jamaican, genres like jazz or rock, sounds like church bells or rain, or any combination. Many words that are not present in the training data still work because the text encoder can associate words with similar semantics. The closer a prompt is in spirit to the seed image And BPM, the better the results. For example, a prompt for a genre that is much faster BPM than the seed image will result in poor, generic audio.

(1) Is there a corpus the keywords were collected from?

(2) Is it possible to model the proximity of the image to keywords and sets of keywords?



I can’t help but see parallels to synesthesia. It’s amazing how capable these models are at encoding arbitrary domain knowledge as long as you can represent it visually w/ reasonable noise margins.


They’ve got a looooong way to go man


I agree but it's better than listening to Ed Sheeran

Edit: To be honest, I find something like 'Band In A Box' to be more impressive and actually useful. I don't understand how I would ever use this or listen to this. To me, it's further proof that Stable Diffusion really just doesn't work that well


Doesn't have enough training data for odd time signature music... or John Cage's 4'33" :D


As a musician, I'll start worrying once an AI can write at the level of sophistication of a Bill Withers song:

https://www.youtube.com/watch?v=nUXgJkkgWCg

Not simply SOUND like a Bill Withers song, but to have the same depth of meaning and feeling.

At that point, even if we lose we win because we'll all be drowning in amazing music. Then we'll have a different class of problem to contend with.


Absolutely incredible - from idea to implementation to output.


Wow, diffusion could be a game changer for audio restoration.


This is awesome! It would be interesting to generate individual stems for each instrument, or even MIDI notes to be rendered by a DAW and VST plugins. It's unfortunate that most musicians don't release the source files for their songs so it's hard to get enough training data. There's a lot of MIDI files out there but they don't usually have information about effects, EQ, mastering, etc.


Let's get in touch. This is precisely what we are working on at Neptunely. https://neptunely.com


Pretty nice, I was just talking to a friend about needing a music version of chatgpt, so thank you for this.

Wondering if it would be possible to create a version of this that you can point at a person's SoundCloud and have it emulate their style / create more music in the style of the original artist. I have a couple albums' worth of downtempo electronic music I would love to point something like this at and see what it comes up with.


We are working on something similar at Neptunely. https://neptunely.com


https://mubert.com/ might be what you're looking for.


Thank you, I will check it out!


Could this concept be inverted, to identify music from a text prompt? As in, I want a particular vibe and it can go tell me what music fits that description. I've always thought the ability to find music you like was very lacking; it shouldn't be bounded by genre. Instead it's usually rhythmic and melodic structures that appeal to you, regardless of what type of music it might be.


What did you use as training data?


I wonder how they got their training data..! The spectrogram trick is genius, but not much use without high quality, diverse data to train on


Really great! I've been using diffusion as well to create sample libraries. My angle is to train models strictly on chord progression annotated data as opposed to the human descriptions so they can be integrated into a DAW plugin. Check it out: https://signalsandsorcery.org/


You can train/fine-tune a Stable Diffusion model on an arbitrary aspect ratio/resolution and the model will start creating coherent images at that size. It would be cool to try fine-tuning/training this model on entire songs by extending the time dimension (though the attention layer at the usual 64x64 latent resolution should probably be removed, or it would eat too much memory).
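A rough sketch of what I mean, using diffusers (the checkpoint id is an assumption on my part, and coherence over long widths would still need the fine-tuning described above):

    import torch
    from diffusers import StableDiffusionPipeline

    # Spectrogram time runs along the x axis, so a wider canvas means a longer
    # clip. Checkpoint id assumed; a vanilla model won't stay coherent this wide.
    pipe = StableDiffusionPipeline.from_pretrained(
        "riffusion/riffusion-model-v1", torch_dtype=torch.float16
    ).to("cuda")
    wide = pipe("acoustic folk fiddle", height=512, width=2048).images[0]
    wide.save("long_clip_spectrogram.png")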


Amazing project! Here is a demo including an input for negative prompt. It's impressive how it works. You can try:

prompt: relaxing jazz melody bass music

negative_prompt: piano music

https://huggingface.co/spaces/radames/spectrogram-to-music
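For anyone who wants to script the same thing instead of using the Space, here is a minimal sketch with diffusers (the checkpoint id is my assumption; the Space may wrap things differently):

    import torch
    from diffusers import StableDiffusionPipeline

    # negative_prompt steers the sample away from a concept, here "piano music".
    pipe = StableDiffusionPipeline.from_pretrained(
        "riffusion/riffusion-model-v1", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe(
        prompt="relaxing jazz melody bass music",
        negative_prompt="piano music",
        num_inference_steps=50,
    ).images[0]
    image.save("spectrogram.png")  # still needs a spectrogram-to-audio step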


I was confused because I hadn't realized that the working web app is at https://www.riffusion.com/. Go there and press the play button to see it in action!


I found this awesome podcast that goes into several AI & music related topics https://open.spotify.com/show/2wwpj4AacVoL4hmxdsNLIo?si=IAaJ...

They even talk specifically about applying stable diffusion and spectrograms.


Awesome work.

Would you be willing to share details about the fine-tuning procedure, such as the initialization, learning rate schedule, batch size, etc.? I'd love to learn more.

Background: I've been playing around with generating image sequences from sliding windows of audio. The idea roughly works, but the model training gets stuck due to the difficulty of the task.


"https://en.wikipedia.org/wiki/Spectrogram - can we already do sound via image? probably soon if not already"

Me in the Stable Diffusion discord, 10/24/2022

The ppl saying this was a genius idea should go check out my other ideas


If only we had a diffusional model that could take your ideas and turn them into reality!


No I want ppl


I wonder if subliminal messaging will somehow make a comeback once we have ai generated audio and video. Something like we type "Fun topic" and those controlling the servers will inject "and praise to our empire/government/tyrant" to the suggestion or something like that.


If such unreasonably good music can be created based on information encoded in an image, I'm wondering what other things we can do with this flow:

1) Write text to describe the problem

2) Generate an image Y that encodes that information

3) Parse that image Y to do X

Example: Y = blueprint, X = Constructing a building with that blueprint


If it can do music, can we train better models for different kinds of music? Or do different models for different instruments make more sense? For different instruments we can get better resolution by making the spectrogram represent different frequency ranges. This is terribly exciting, what a time to be alive.


Multiple folks have asked here and in other forums but I'm going to reiterate: what dataset of paired music-captions was this trained on? It seems strange to put up a splashy demo and repo with model checkpoints but not explain where the model came from... is there something fishy going on?


Today's music generation is putting my Pop Ballad Generator to shame: http://jsat.io/blog/2015/03/26/pop-ballad-generator/


I feel like the next step here is to get a GPT-3-like model to parse everything ever written about every piece of music on the internet (and in every PDF on libgen and scihub) and link it to spectrograms of that music

and then things are going to get wild

I am so blessed to live in this era :)


I know it sounds like I am going to be sarcastic, but I mean all of this in earnest and with good intention. Everything this generates is somehow worse than the thing it generated before it. Like the uncanny valley of audio had never been traversed in such high fidelity. Great work!


Anyone interested in joining an unofficial Riffusion Discord, let's organize here: https://discord.gg/DevkvXMJaa

Would be nice to have a channel where people can share Riffs they come up with.


Anyone know how I could try using this with Elixir Livebook via https://github.com/elixir-nx/bumblebee?

I'm new but this is something that would get me going.


Damn this is insane. I wonder what other things can be encoded as images and generated with SD?


I'm a short fiction writer. Do you think I could get one of these new models to write a good story?

I'd want to train it to include foreshadowing, suspense, relatable characters and perhaps a twist ending that cleverly references the beginning.


I’ve thought about this and tested it a lot, it doesn’t work.

The reason is, the models don’t understand content or even context, they just recognize patterns and can generate similar patterns which we then interpret.

Case in point, a photo of an Astronaut on a horse is not actually an astronaut on a horse, it’s a 2D pixel map of light that our eyes then interpret to mean an astronaut on a horse. It’s even easier to understand this listening to the generated singing. It sounds just like singing - but it’s not language at all and doesn’t mean anything, it’s just a very similar pattern.

These AI models are great at making things that look like a pattern that to us means something, but they don’t generate semantically meaningful content itself.

So when you do this with GPT3 or other models and long-form narrative it falls apart pretty fast as the model can’t keep straight things like characters and their internal personalities and motives, nor the overall plot arc.

But! You can definitely feed in prompts and ask questions to get ideas and boilerplate - descriptions of people and places come out especially well - and then you can edit that and use it to accelerate your writing process.


they can only do about 150 words at a time, so you'd struggle to get anything longform out of it. You can keep asking it over and over, but then it's liable to forget previous information. You'd do better with a prompt that repeatedly reminds the AI of the style it's going for and some basic information about the characters, but its story would likely be quite cliché in many ways. It's something I've experimented with, trying to get it to make DND adventure books


I think it is possible. Have you tried interacting with ChatGPT on a short story?


I read the article:

"If you have a GPU powerful enough to generate stable diffusion results in under five seconds, you can run the experience locally using our test flask server."

Curious what sort of GPU the author was using or what some of the min requirements might be?


RTX 3070 can generate SD results in under 5 seconds, depending. Euler A, 20 steps, 512x512. It can almost do 4 images in 5 seconds with those settings.

It's possible a 3060 might work, depending. In my experience the 3060 is about 50% slower than the 3070, but that may be a bad 3060 in our test rig. Still, a 3060 gets pretty close to 5 seconds for an image, so try it if you have one.

Just tested the prompt "a test pattern for television" on both cards: the 3070 took 1.87s and the 3060 took 2.93s. Similar results for the prompt "an intricate cityscape, like new york".

edit: I should note we're using SD 1.4, not 1.5, although I think that just has to do with the checkpoint of the model, not the algorithm, but I could be wrong.

Also the model is over 14GB, so perhaps the 3070 can't do it after all. I'll test it later as soon as the local admin wakes up and downloads it onto our machine.


Author here: fwiw we are running the app on a10g GPUs, which generally can turn around a 512x512 in 3.5s with 50 inference steps. This time includes converting the image into audio which should be done on the GPU as well for real-time purposes. We did some optimization such as a traced unet, fp16 and removing autocast. There are lots of ways it could be sped up further I'm sure!
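For anyone curious what that looks like in practice, here is a rough sketch along those lines with diffusers and torch.jit (checkpoint id and tensor shapes are my assumptions, not the exact production code):

    import torch
    from diffusers import StableDiffusionPipeline

    # fp16 halves memory and speeds up the UNet, the per-step bottleneck.
    pipe = StableDiffusionPipeline.from_pretrained(
        "riffusion/riffusion-model-v1", torch_dtype=torch.float16
    ).to("cuda")

    # Tracing the UNet removes Python overhead from the 50-step denoising loop.
    # Example shapes: batch of 2 (classifier-free guidance), 64x64 latents,
    # 77x768 text embeddings for an SD 1.x text encoder.
    example = (
        torch.randn(2, 4, 64, 64, dtype=torch.float16, device="cuda"),
        torch.tensor([999.0], dtype=torch.float16, device="cuda"),
        torch.randn(2, 77, 768, dtype=torch.float16, device="cuda"),
    )
    with torch.no_grad():
        traced_unet = torch.jit.trace(
            lambda latents, t, emb: pipe.unet(
                latents, t, encoder_hidden_states=emb, return_dict=False
            )[0],
            example,
        )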


> https://www.riffusion.com/?&prompt=punk+rock+in+11/8

Tried getting something in an odd timing, but still is 4/4.


Very cool, but the music still has a very "rough", almost abrasive tinge to it. My hunch is that it has to do with the phase estimates being off.

Who's going to be first to take this approach and use it to generate human speech instead?
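On the phase point: if the magnitude spectrogram is all you have, the phase has to be estimated, e.g. with Griffin-Lim, and that estimate is only approximate. A small sketch of the round trip with librosa (file path is a placeholder):

    import numpy as np
    import librosa

    # Keep only the STFT magnitude, which is roughly what a generated
    # spectrogram image gives you, then estimate a plausible phase.
    y, sr = librosa.load("clip.wav", sr=22050)
    magnitude = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))
    y_rec = librosa.griffinlim(magnitude, n_iter=64, hop_length=512)
    # y_rec sounds slightly "phasey"/rough even though the magnitude is exact.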


This is amazing, and scary (as a musician) but also reliably kills firefox on iOS!


This is just insane. Sooooo incredible. You don't really realize how far things have come until it hits a domain you're extremely familiar with. I spent 8-9 years in music production and the transition stuff blew me away.


"Uh oh! Servers are behind, scaling up..." - havent' been able to get past that yet. Anyone getting new output?

This is already better than most techno. I can see DJs using this, typing away.


Do you guys think AI creative tools will completely subsume the possibility space of human made music? Or does it open up a new dimension of possibilities orthogonally to it? Hard for me to imagine how AI would be able to create something as unique and human as D'Angelo's Voodoo (esp. before he existed) but maybe it could (eventually).

If I understand these AI algorithms at a high level, they're essentially finding patterns in things that already exist and replicating them with some variation quite well. But a good song is perfect/precise in each moment in time. Maybe we'll only ever be able to get asymptotically closer but never _quite_ there to something as perfectly crafted as a human could make? Maybe there will always be a frontier space only humans can explore?


> Hard for me to imagine how AI would be able to create something as unique and human as D'Angelo's Voodoo (esp. before he existed)

There’s always that immortal randomly typing monkey with a typewriter thing [1]. And, in our case, it seems to be better than random.

So, yes, perhaps. But perhaps we could instead build and create things that are yet unimaginable upon it. We’ll see.

[1]: https://en.wikipedia.org/wiki/Infinite_monkey_theorem


I don't think it's possible to make a compelling song in quite the same way. At Neptunely, we are pursuing a route that keeps the human in the picture. More of a collaboration. https://neptunely.com


Can anyone confirm/deny my theory that AI audio generation has been lagging behind progress in image generation because it’s way easier to get a billion labeled images than a billion labeled audio clips?


Sound is a lot higher fidelity; it's harder to make the information available to a computer without serious downsampling or simplification.

Consider sounds over 12 kHz. On a spectrogram, during a chorus or drop that area is lit up, with so many things changing from millisecond to millisecond. A lot of AI samples really struggle at high frequencies, or even forgo them entirely.

Midi based approaches have been really great though, and an approach like in the OP is fascinating (and impressive).
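To put a rough number on the high-frequency squeeze, assuming a mel-scaled spectrogram like many audio-as-image pipelines use (the parameters here are illustrative):

    import numpy as np
    import librosa

    # Mel spacing is roughly logarithmic, so the range above 12 kHz gets far
    # fewer bands than its share of raw frequency would suggest.
    mel_freqs = librosa.mel_frequencies(n_mels=512, fmin=0, fmax=22050)
    above = int(np.sum(mel_freqs > 12000))
    print(f"{above} of 512 mel bands cover everything above 12 kHz")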


As someone who works in the audio AI space I think an underappreciated reason for audio lagging behind is that it's a lot harder to put impressive audio clips in an academic paper than impressive images.


Also check the similar work on arxiv:

Multi-instrument Music Synthesis with Spectrogram Diffusion:

https://arxiv.org/abs/2206.05408


I propose that while you are GPU limited, you make these changes:

* Don't do the alpha fade to the next prompt - just jump straight to alpha=1.0.

* Pause the playback if the server hasn't responded in time, rather than looping.


Was just watching an interview with Billy Corgan (Smashing Pumpkins) on Rick Beato's YouTube channel[1] last night, where Billy was lamenting the inevitable future in which the "psychopaths" in the music biz will use AI and auto-tune to churn out three-chord non-music mumble rap for the youth of tomorrow, or something to that effect. It was funny because it's the sad truth. It's already here, but new tech will allow them to cut costs even more and increase their margins. No need for musicians. Really cool on one hand, in the same way fentanyl is cool - or the cotton gin - but a bit depressing on the other, if you care about musicians. I and a few others will always pay to go to the symphony, so good players will find a way to get paid, but this is what kids will listen to, because of the profit margin alone.

[1] https://m.youtube.com/watch?v=nAfkxHcqWKI


> new tech will allow them to cut costs even more, and increase their margins

How, when everyone and their dog can generate such music? It's gonna be like stock photography in the age of SD.


They do marketing, not music.


Very cool! I was wondering why there weren't any music diffusion apps out there. It seems more useful because music has stricter copyright and all content creators need some background music...


This is a brilliant idea.

Also, spectrograms will never generate plausible high quality audio. (I think)

So I think the next move is to map the generated audio back over to synthesizers and samples via MIDI …
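One hedged way to start on that audio-to-MIDI step, at least for monophonic lines (polyphony needs a real transcription model; the file path is a placeholder):

    import librosa

    # Estimate a fundamental-frequency track, then quantize it to MIDI notes.
    y, sr = librosa.load("generated_clip.wav")
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7")
    )
    midi_notes = [round(librosa.hz_to_midi(f)) for f in f0 if f == f]  # drop NaNs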


This happened earlier than I expected, and using a much different technique than I expected.

Bracing myself for when major record labels enter the copyright brawl that diffusion technology is sparking.


I wonder if this would be applicable to video game music. Being able to make stuff that's less repetitive but that also smoothly transitions to specific things in response to in-game events.


Coming at this from a layman's perspective, would it be possible to generate a different sort of spectrogram that's designed for SD to iterate upon even more easily?


I got an actual `HTTP 402: PAYMENT_REQUIRED` response (never seen one of those in the wild, according to Mozilla it is experimental). Someone's credit card stopped scaling?


LOL. Yes we had to upgrade our Vercel tier: https://twitter.com/sethforsgren/status/1603425188401467392


I’m curious about the limitations of using spectrograms and transient-heavy sounds like drums.

It seems like you’d need very high resolution spectrograms to get a consistently snappy drum sound.


8GB is enough to do 1080p resolution. The UI I use for SD maxes out at 2048x2048. However, it takes a lot longer than 512x512 to generate: 1m40s versus 1.97s.

I'm guessing if one had access to one of those nvidia backplane rackmount devices one could generate 8k or larger resolution images.


SD can’t generate coherent images if you increase the output size. They’re basically always unusable unless you don’t need any global architecture to them.


I'm not sure what you mean. With the 1.4 checkpoint I can make a scene like "a sky full of blimps" and then img2img that into a space battle where all the blimps in the sky are either fires or spacecraft afterburners, at 2k pixels, and then use the built-in "GAN" stuff to bring it to 4k, 5k, or 8k.

It isn't perfect and it takes a lot of fiddling and sysadmin stuff, but it is way better than nextchar = CHR$(RAND64); color = (rand65535); of yore.


Sounds a bit "clowny" to me, for lack of a better word.


I prefer to make my spectrograms by hand. https://youtu.be/HT0HH_fc4ZU


This is what I've been talking about all year. It is such a relief to see it actually happen.

In summary: The search for AGI is dead. Intelligence was here and more general than we realized this whole time. Humans are not special as far as intelligence goes. Just look how often people predict that an AI cannot X or Y or Z. And then when an AI does one of those things they say, "well it cannot A or B or C".

What is next: This trend is going to accelerate as people realize that AI's power isn't in replacing human tasks with AI agents, but letting the AI operate in latent spaces and domains that we never even thought about trying.


Generated content without filtering/validation is worthless.

I predict some kind of testing, validation or ranking will be developed to filter generated content. Each domain has its own rules - you need to implement validation for code and math, fact checks for text, cross-checking results from multiple solutions for problem solving, and aesthetic scoring for art.

But validation is probably harder than learning to generate in the first place; probably a situation similar to closing the last percent in self driving.


Anyone out there that speaks Arabic, can you let us know if the Arabic Gospel clip contains words? To a speaker do the sounds even sound Arabic?


This is really cool but can someone tell me why we are automating art? Who asked for this? The future seems depressing when I look at all this AI generated art.


Because tons of people want to make art, and a lot of art currently requires years of training to make anything close to "good". Making art more accessible to create is a boon to everyone who's dreamed of being able to make their own paintings and music, but doesn't have the skills required.


That just means there is going to be a whole lot more bad art in the world


Not all of this art will be meant to be shared with the whole world though. A lot of it will be people just using it because they enjoy it.


Everyone asked for this, including artists. If you make a living off of making art, having the best tools to help you do that is a constant, and the tools are finally starting to get properly good. Will "the job" change because of the tools? Of course. Will the nature of what it means for something to be art change? Also of course. Art isn't some static, untouchable thing. It changes as humanity does.


Actually I agree with you, but HN is not really a place where you will find artists defending themselves. However, you will find a lot of people defending the automation of art. Generative art has its place. But ultimately, until humans are extinct, human-generated art is the only thing which really represents the species. Everything else is an advanced form of puppetry or mimicry.


I would say it's not "generated," but interpolated...

It doesn't make anything new or fresh. It doesn't pull any real-life emotions or experiences into a synthesis that a person can relate to. It's more like asking a teenaged comedian to imitate numerous impressions of music styles. e.g. in Clerks when the Russian guy does "metal": https://youtu.be/7gFoHkkCaRE?t=55

Of course the modern conception of music in the West is as an accompaniment to other, mostly drudging, activities, as opposed to something to be paid singular attention. Therefore, there are many "valuable"(*) occasions to produce "impressions" of music, e.g. in advertisements and social media flexes where identity and attitude are the purpose of the music. For these, a shallow interpretation or reflection of loosely amalgamated sound clips will suffice. But we don't just attend concerts or focus sustained energy on sonic impressions. We listen to lyrics and give over our consciousness to composed works because we want to find the secrets others give away in dealing with this crazy thing called life - ideas to succeed, admissions of failure, and what the expected emotional arcs of these trajectories look like. This lofty goal is to date not within the scope of AI stunts.

As Solzhenitsyn said, "Too much art is like candy and not bread."


Because art is the low hanging fruit of "close enough" applications.


I wonder if this is true for music. Our ears are much more discerning than our eyes when it comes to art it seems.


I mean listening to samples on the link above I'd hardly call it music so I'd say you're right.


You can't automate a live performance or an oil painting with AI in this way. This isn't going to replace musicians and artists. If anything, I think a preponderance of AI art would make people appreciate the real stuff more.

As to why, music is fun to create, and this is just a tool.


> You can't automate a live performance or an oil painting with AI in this way.

You'd have to combine it with these guys https://www.youtube.com/watch?v=WqE9zIp0Muk


> Who asked for this?

I did.

> The future seems depressing when I look at all this AI generated art.

You should talk about your concerns with an AI psychotherapist.


We're not automating art, we're creating tools that make it easier for humans to create art. These are nothing more than new and exciting tools. The cream will still rise to the top, same as it ever was.


I will try it, but the name alone deserves praise.


So this is slightly bending my mind again. Somehow image generators were more comprehensible compared to getting coherent music out. This is incredible.


Such a creative application of SD to spectrograms!

...now do stock charts


Very interesting idea :) Unfortunately it breaks when I enter Tarantella or Taranta. Need more training samples from south of Italy :)


This, Chat-GPT and the AI image generation. We're now at a very interesting time where average joes get to start using incredible tools.


Right now it still seems to lack the horsepower for this many users. Hope it gets in a better state soon, but I am bookmarking this right now!


Seems to be victim of its own success:

- No new result available, looping previous clip

- Uh oh! Servers are behind, scaling up

I hope Vercel people can give you some free credits to scale it up.


GPT-3 has 175 billion parameters (says Wikipedia). What is the size of the neural network used in this riffusion project?


Stable Diffusion 'only' has ~1B parameters IIRC.
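That matches a quick count over the sub-models, e.g. (the checkpoint id here is an assumption; any SD 1.x checkpoint gives similar numbers):

    from diffusers import StableDiffusionPipeline

    # UNet ~0.86B, VAE ~0.08B, text encoder ~0.12B: roughly 1B in total.
    pipe = StableDiffusionPipeline.from_pretrained("riffusion/riffusion-model-v1")
    for name in ("unet", "vae", "text_encoder"):
        n = sum(p.numel() for p in getattr(pipe, name).parameters())
        print(f"{name}: {n / 1e6:.0f}M parameters")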


I wonder if it's possible to fine-tune an image upscaling model on spectrograms, in order to clean up the sound?
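No spectrogram-tuned upscaler exists as far as I know, but the plumbing is already in diffusers; a sketch with the stock x4 upscaler (whether this actually helps on spectrograms is pure speculation on my part):

    import torch
    from PIL import Image
    from diffusers import StableDiffusionUpscalePipeline

    # Off-the-shelf x4 upscaler; a spectrogram fine-tune would swap in here.
    pipe = StableDiffusionUpscalePipeline.from_pretrained(
        "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
    ).to("cuda")
    low_res = Image.open("spectrogram.png").convert("RGB").resize((128, 128))
    upscaled = pipe(prompt="clean studio recording", image=low_res).images[0]
    upscaled.save("spectrogram_x4.png")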


It would be interesting to see if this can be used for longer tracks by inpainting the right half of the spectrogram.
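A sketch of that idea with the diffusers inpainting pipeline; the checkpoint id here is hypothetical (you'd want one fine-tuned on spectrograms), and the masking is just PIL:

    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    # Hypothetical spectrogram-tuned inpainting checkpoint.
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "some-org/riffusion-inpaint", torch_dtype=torch.float16
    ).to("cuda")

    spectrogram = Image.open("previous_clip.png").convert("RGB")  # 512x512 clip
    mask = Image.new("L", spectrogram.size, 0)
    mask.paste(255, (256, 0, 512, 512))  # white = regenerate the right half

    out = pipe(prompt="uk garage", image=spectrogram, mask_image=mask).images[0]
    out.save("extended_clip.png")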


Now if only it could generate accurate sheet music, then you've got something. Incredible examples.


Wow, I just learned so much about spectrograms, had no idea that one could reverse one into audio waves!


Now the next step is to use Stable Diffusion to create chemical components via molecular graphs


Wow, absolutely fascinating. AI will continue to revolutionize our current approaches.


impressive stuff. reminds me of when ppl started using image classifier networks on spectrograms in order to classify audio. i would not have thought to apply a similar concept for generative models, but it seems obvious in hindsight.


Can't wait to see this in karaoke, you just sing lyrics and it jams along with music.


"Jamaican rap" - usually the genre (e.g. Sean Paul) is called Dancehall.


It seems that SD covers everything in terms of generative AI. Speaking of music, very interesting paper and demo. Just wondering, in terms of licensing and commercialization, what kind of mess are we expecting here?


I wasn't expecting to see uncanny valley translated to music today.


This works because songs are images in time. FFT analysis does not care.


A network trained on spectrograms only should do much better.


That church bell one is amazing. Very creative transition.


Somehow it made Wesley Willis sound even better.


unfortunately I put in "sonic the hedgehog" and the result was...

... nothing? like, there's no playback at all. Is that expected?


And we broke it.


Someone please train it on John Coltrane.


personalized RL agents that finds aesthetic trajectories through the music latent space... soon, i hope :D


Love this idea. If I had more time I'd want to make a spaceship game where you are flying around the latent space, and model interrogation is used to provide labels to landmarks as you move around.


This website crashes Firefox on iOS


In the end there was the word.


BOOM! Yes!


Did you fine-tune the VAE?


New era of library music.


This is horrible music, but of course there is nothing to feel ashamed about.


Not generate … steal.


Absolutely brilliant!


Wow this is awesome!


impressive. and this is a hobby project.. amazing


why not use an image of the waveform as input?


the problem is it sounds awful, like a 64kbps MP3 or worse

Perhaps AI can be trained to create music in different ways than generating spectrograms and converting them to audio?


It doesn't need to sound good at all for it to be useful. Like with the AI Art creation, it can be a starting point for artists to play around and rapidly try different concepts, and then interpret the concept using high quality tools to create something really quite remarkable.

It's all about empowering artists to explore more possibilities.


I haven't been paying close attention to the AI field but seeing someone painting with NVIDIA Canvas blew me away[1].

I can totally see how it can help prototyping or exploring ideas and visions.

[1]: https://www.youtube.com/watch?v=uLlhyKxygrI


Exactly! UX/Pipelines/Integrations are the next logical step. It's my belief that samples will essentially be 'free' very soon. We will see DAW plugins/integrations that contextually offer samples to the composer. I'm confident in this because that's what I'm working on.


this is just great


It's amazing. They've really got something revolutionary here.


snake jazz


wow!


deleted


I think you'll find plenty of people who find that DAWs and music theory help them better find self-expression and celebrate life through their music. Any tool or framework that opens up new modes of achieving that self-expression should be celebrated, not shunned because it isn't as "pure" as more time and labor intensive methods. Would you rather someone be forced to dedicate a significant amount of time to studying music and art creation just to be able to find that self-expression?


[flagged]


:/ this is just spam. How are more of your comments not flagged?


nice username



