I've trained GANs on raw JPEG coefficients as a pet project, with moderate success. The idea: read the DCT coefficients straight from the file without decompressing and train in that space, then JPEG-decompress the net's output to reconstruct an image in pixel space. There are a few papers doing similar things for supervised learning tasks, iirc.
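For anyone unfamiliar with what "this space" looks like, here is a minimal sketch (not my actual pipeline) of the 8x8 block DCT that JPEG is built on, plus the inverse that takes you back to pixels. It is pure Python with made-up pixel values; `dct2`, `idct2`, and `block` are names I picked for illustration. It skips the level shift, quantization, and entropy coding a real JPEG file has, and a real setup would read the stored coefficients through a libjpeg binding instead of recomputing them.

```python
import math

N = 8  # JPEG works on 8x8 blocks


def _alpha(k):
    # Scaling that makes the DCT-II basis orthonormal.
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


# Basis matrix: C[k][n] = alpha(k) * cos(pi * (2n + 1) * k / (2N))
C = [[_alpha(k) * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
      for n in range(N)] for k in range(N)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


def transpose(a):
    return [list(row) for row in zip(*a)]


def dct2(block):
    # 2-D DCT of one block: C * block * C^T
    return matmul(matmul(C, block), transpose(C))


def idct2(coeffs):
    # Inverse 2-D DCT: C^T * coeffs * C (C is orthonormal)
    return matmul(matmul(transpose(C), coeffs), C)


# Hypothetical 8x8 block of pixel values, just for the round trip.
block = [[float((3 * r + 5 * c) % 17 - 8) for c in range(N)] for r in range(N)]

coeffs = dct2(block)    # what a network would see in "JPEG space"
recon = idct2(coeffs)   # back to pixel space

max_err = max(abs(block[r][c] - recon[r][c])
              for r in range(N) for c in range(N))
print("max round-trip error:", max_err)
```

The network then consumes and produces arrays shaped like `coeffs` (one 8x8 grid per block), and the inverse transform is only applied at the very end to look at the result.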
Yeah, it kinda works when you feed JPEG coefficients into a typical spatial-domain CNN, but mathematically it seems that if you're using frequencies as inputs, your convolutions should become simple element-wise multiplications. Am I wrong?
Yep, you can successfully do it without convolutions. Here is a pointer if you want to dig deeper: https://eng.uber.com/neural-networks-jpeg/ (there's prior art to that, but this one is well written)
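The intuition about multiplications comes from the convolution theorem. A minimal sketch, assuming a 1-D signal and the plain DFT (not JPEG's blockwise DCT): circular convolution in the signal domain gives the same result as element-wise multiplication in the frequency domain. The signal and kernel values are made up. For the DCT the analogous identity holds for symmetric convolution, and in JPEG it would only apply within each 8x8 block, so treat this as the idealized version of the argument.

```python
import cmath


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def circular_conv(x, h):
    # Direct circular convolution: y[t] = sum_m x[m] * h[(t - m) mod n]
    n = len(x)
    return [sum(x[m] * h[(t - m) % n] for m in range(n)) for t in range(n)]


# Hypothetical signal and a small smoothing kernel, zero-padded to the same length.
signal = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 4.0]
kernel = [0.25, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0]

direct = circular_conv(signal, kernel)

# Same result via the frequency domain: multiply the spectra, then invert.
product = [a * b for a, b in zip(dft(signal), dft(kernel))]
via_freq = [v.real for v in idft(product)]

max_diff = max(abs(a - b) for a, b in zip(direct, via_freq))
print("max difference:", max_diff)
```

The two results agree up to floating-point error, which is the sense in which a convolution "becomes" a multiplication once the inputs are frequencies.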