We took a look at encoding phase, but it's very chaotic and looks like Gaussian noise; the lack of spatial structure makes it very hard for the model to generate. I think there are tons of promising avenues to improve quality, though.
Phase itself looks random, but what makes the sound blurry is that the phase doesn't line up the way it should across frequencies at transients. Maybe something the model could get a better grip on is phase discontinuity (the deviation from the phase you'd expect given the previous slices), or the relative phase between peaks, encoded as colour?
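For concreteness, here's a rough numpy sketch of that deviation channel (the STFT settings and the toy click-plus-tone signal are just placeholders, not anything from the actual pipeline): a steady partial advances its phase by a predictable amount each hop, so subtracting that expected advance leaves something near zero except at transients.

```python
import numpy as np
import librosa

# Assumed (arbitrary) STFT settings; swap in whatever the pipeline uses.
sr, n_fft, hop = 22050, 1024, 256

# Toy signal: a steady 440 Hz tone with a click halfway through.
t = np.arange(sr) / sr
y = 0.5 * np.sin(2 * np.pi * 440 * t)
y[sr // 2] += 1.0

S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
phase = np.angle(S)

# A steady sinusoid in bin k advances its phase by 2*pi*f_k*hop/sr per frame.
bin_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
expected_advance = 2 * np.pi * bin_freqs * hop / sr

# Deviation from that expected advance, wrapped back into (-pi, pi].
dphase = np.diff(phase, axis=1) - expected_advance[:, None]
dphase = np.angle(np.exp(1j * dphase))

# dphase sits near zero on steady partials and spikes at the click, so it has
# visible structure (unlike raw phase) and could be mapped to a colour channel.
```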
But the same thing could be done as a post-processing step, finding points where the spectrum is changing fast and resetting the phases to make a sharper transient.
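Something like the sketch below, assuming you already have the complex STFT. Using spectral flux as the "spectrum is changing fast" detector and zeroing the phase at those frames are just illustrative choices, not a fixed recipe.

```python
import numpy as np

def sharpen_transients(S, flux_thresh=2.0):
    """S: complex STFT of shape (bins, frames). Returns a modified STFT."""
    mag = np.abs(S)
    phase = np.angle(S)

    # Spectral flux: total positive magnitude change from one frame to the next.
    flux = np.sum(np.maximum(np.diff(mag, axis=1), 0.0), axis=0)
    flux = np.concatenate([[0.0], flux])
    onsets = flux > flux_thresh * (np.median(flux) + 1e-8)

    # Reset every bin to a common (zero) phase at detected onset frames, so the
    # partials start in alignment and the attack comes out sharper on resynthesis.
    phase[:, onsets] = 0.0
    return mag * np.exp(1j * phase)

# e.g. (hypothetical hop size):
#   S = librosa.stft(y, n_fft=1024, hop_length=256)
#   y_sharp = librosa.istft(sharpen_transients(S), hop_length=256)
```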