Awesome, there is another project out there that does this on CPU: https://github.com/marcoppasini/musika. Maybe combine the two, i.e. take Musika's initial output, convert it to a spectrogram, and feed that to Riffusion to get more variation. A rough sketch of that pipeline is below.
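Here is a minimal sketch of what that handoff could look like, assuming librosa and diffusers are installed. Everything specific here is an assumption, not something from either project: "musika_output.wav" is a hypothetical file produced by Musika, "riffusion/riffusion-model-v1" is the public Riffusion checkpoint on the Hugging Face Hub, and the mel-spectrogram parameters are placeholders rather than the exact settings Riffusion was trained on.

```python
import numpy as np
import librosa
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

def audio_to_spectrogram_image(path, sr=44100, n_mels=512, size=(512, 512)):
    """Convert an audio clip to a greyscale mel-spectrogram image.

    NOTE: n_mels/size are placeholder values, not Riffusion's actual
    training parameters.
    """
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    db = librosa.power_to_db(mel, ref=np.max)          # log-amplitude scale
    norm = (db - db.min()) / (db.max() - db.min())     # normalize to [0, 1]
    img = Image.fromarray((norm * 255).astype(np.uint8))
    return img.resize(size).convert("RGB")             # SD pipelines expect RGB

# Hypothetical clip generated by Musika beforehand.
init_image = audio_to_spectrogram_image("musika_output.wav")

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "riffusion/riffusion-model-v1"
).to("cuda")

# Lower strength keeps more of Musika's structure; higher adds more variation.
result = pipe(
    prompt="jazzy piano loop",
    image=init_image,
    strength=0.5,
    guidance_scale=7.5,
).images[0]
result.save("varied_spectrogram.png")

# The last step (not shown) would invert the modified spectrogram back to
# audio, e.g. via librosa's Griffin-Lim-based mel inversion.
```

The img2img strength parameter is the interesting knob here: it trades off how much of Musika's original structure survives versus how much Riffusion reimagines it.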
"fine-tuned on images of spectrograms paired with text"
How many paired training images/text captions were used, and what was the source of your training data? Just curious how much fine-tuning was needed to get these results, and how broad the original sources were in order to achieve sufficient musical diversity.