Well, it's no surprise that it kinda sorta works. Neural networks are very good at learning the underlying structure of things and at coping with suboptimally represented inputs. But if working with images of spectrograms works better than working with raw samples in the time domain, that is a valid and non-obvious finding.
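For anyone who wants to see the two representations side by side, here is a rough sketch of turning time-domain samples into a magnitude spectrogram. Everything in it is my own assumption for illustration (the `magnitude_spectrogram` helper, the frame length and hop, the 1 kHz test tone), not whatever pipeline was actually used, and it runs a naive DFT from the standard library where real code would call an FFT.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_len=64, hop=32):
    """Return a list of frames, each a list of DFT magnitudes for bins 0..frame_len // 2."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        # Naive O(N^2) DFT; fine for a sketch, a real pipeline would use an FFT.
        mags = [
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n, x in enumerate(frame)))
            for k in range(n_bins)
        ]
        frames.append(mags)
    return frames


# Hypothetical input: a 1 kHz tone sampled at 8 kHz for 0.1 s (800 samples).
sample_rate = 8000
samples = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(800)]

# Time-domain view: one flat list of 800 numbers.
# Spectrogram view: a 2-D grid of frames by frequency bins, which is what gets rendered as an image.
spectrogram = magnitude_spectrogram(samples)
```

The network sees the same signal either way. The spectrogram just does the frequency decomposition up front instead of leaving the network to learn it from the raw samples.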