regarding those usual objections, i'd argue that a spectrograph representation o...

Applejinx · on Dec 15, 2022

You would be absolutely correct. The lossiness is in the resolution of the image (512x512 is pretty terrible), but given enough image resolution it's just an FFT, and the only reason that stuff falls short is that people don't give it, in turn, enough resolution. If you went for wild overkill on the resolution of the FFT, you could do anything you wanted with no loss of tone quality. If you turned that into images and ran diffusion on them, you could do AI diffusion at convincing audio quality.

In theory, tone quality is not an objection here. When it sounds bad, it's because it's 512x512, because the FFT resolution isn't up to the task, etc. People cling to very inadequate audio standards for digital processing, but you don't have to.
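
To make the round-trip point concrete, here is a minimal sketch in plain Python. It is only an illustration under some assumptions, not how any real spectrogram-diffusion pipeline works: "resolution" is modeled as the number of levels each stored value can take (pixel depth), the stored values keep both real and imaginary parts (a magnitude-only image would also discard phase), and the helper names `dft`, `idft`, `quantize` and `max_error` are made up for this example. The test signal is a loud tone plus a quiet one at 1% of its amplitude.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), fine for a short demo."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns the real part of the reconstructed signal."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def quantize(spectrum, levels):
    """Round real and imaginary parts onto `levels` evenly spaced steps.

    Assumption: this stands in for storing the spectrum in an image
    with limited pixel depth.
    """
    peak = max(max(abs(c.real), abs(c.imag)) for c in spectrum)
    step = 2 * peak / (levels - 1)
    return [
        complex(round(c.real / step) * step, round(c.imag / step) * step)
        for c in spectrum
    ]


def max_error(a, b):
    """Largest absolute sample difference between two signals."""
    return max(abs(p - q) for p, q in zip(a, b))


n = 256
# Loud tone in bin 5 plus a quiet tone in bin 40 at 1% of its amplitude.
signal = [
    math.sin(2 * math.pi * 5 * t / n) + 0.01 * math.sin(2 * math.pi * 40 * t / n)
    for t in range(n)
]

spectrum = dft(signal)
err_exact = max_error(signal, idft(spectrum))
err_8bit = max_error(signal, idft(quantize(spectrum, 2 ** 8)))
err_16bit = max_error(signal, idft(quantize(spectrum, 2 ** 16)))

print("unquantized round trip error:", err_exact)
print("8-bit quantized error:       ", err_8bit)
print("16-bit quantized error:      ", err_16bit)
```

The unquantized round trip should come back at floating-point rounding error, and the quantized versions should get worse as the number of levels drops. Pixel count (time and frequency bins) is a second axis of resolution that this sketch does not vary.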