Images are not truly bandlimited, which means they can't be perfectly represented with a finite set of frequencies. So instead there's a compromise where smaller blocks of them are encoded with a mix of frequency domain and spatial domain predictors. But that's the biggest part of it, yes.
Most of the problem is sharp edges. An ideal edge takes an infinite number of frequencies to represent (the Fourier series of a discontinuity never terminates), so leaving some out gets you blurriness or ringing artifacts (the Gibbs phenomenon).
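For what it's worth, here's a minimal sketch of that ringing, assuming a 1-D square wave as a stand-in for a sharp edge (the function names are made up for illustration). The truncated Fourier sum overshoots the edge by roughly 9% of the jump, and keeping more terms only makes the overshoot narrower, not smaller:

```python
import math

def square_wave_partial_sum(x, n_terms):
    """Fourier partial sum of a square wave (+1 on (0, pi), -1 on (pi, 2*pi))."""
    return (4 / math.pi) * sum(
        math.sin((2 * k + 1) * x) / (2 * k + 1) for k in range(n_terms)
    )

def peak_overshoot(n_terms, samples=4000):
    """How far the partial sum rises above the true plateau value of 1."""
    xs = [math.pi * (i + 0.5) / samples for i in range(samples)]
    return max(square_wave_partial_sum(x, n_terms) for x in xs) - 1.0

if __name__ == "__main__":
    for n_terms in (10, 100):
        print(n_terms, "terms -> overshoot", round(peak_overshoot(n_terms), 3))
```

The jump here is 2 (from -1 to +1), so an overshoot near 0.18 is about 9% of the jump.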
The other reason is that a signal built from a finite set of discrete frequencies repeats forever, but realistic images don't - whatever's on the left side of a photo doesn't necessarily predict anything about whatever's on the right side.
A real image is not, but a digital image built up from pixels certainly is band-limited. A sharp edge will require contributions from components across the whole spectrum that can be supported on a matrix the size of the image, the highest of which is actually called the Nyquist frequency.
Not quite. You can tell this isn't true because there are many common images (game graphics, text, pixel art) where upscaling them with a sinc filter obviously produces a visually "wrong" image (blurry or ringing etc), whereas you can reconstruct them at a higher resolution "as intended" with something nonlinear (nearest neighbor interpolation, OCR, emulator filters like scale2x). That means the image contains information that doesn't work like a bandlimited signal does.
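A quick way to see this, as a rough sketch rather than how any real scaler is implemented: take one row of "pixel art" with a hard edge and upscale it 4x both ways. The helper names are hypothetical, and the sinc version assumes everything outside the row is zero.

```python
import math

def sinc(t):
    """Normalized sinc: sin(pi*t) / (pi*t), with sinc(0) = 1."""
    return 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)

def upscale_sinc(row, factor):
    """Whittaker-Shannon interpolation; samples outside the row count as 0."""
    out = []
    for i in range(len(row) * factor):
        t = i / factor  # position in units of original pixels
        out.append(sum(v * sinc(t - n) for n, v in enumerate(row)))
    return out

def upscale_nearest(row, factor):
    """Nearest-neighbor upscaling: repeat every pixel `factor` times."""
    return [v for v in row for _ in range(factor)]

row = [0.0] * 16 + [1.0] * 16      # one hard black-to-white edge
smooth = upscale_sinc(row, 4)
blocky = upscale_nearest(row, 4)

if __name__ == "__main__":
    print("sinc range:   ", round(min(smooth), 3), "to", round(max(smooth), 3))
    print("nearest range:", min(blocky), "to", max(blocky))
```

The sinc result dips below black and overshoots white around the edge (that's the ringing), while nearest neighbor only ever outputs the two original values.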
You could say MIDI is sort of like that for audio but it's used a lot less often.
Yes, or by extending the pixels on the edge out forever. The question is which one is more effective for compression; it turns out doing that for individual blocks rather than the entire image is better.
(With mirroring things could happen like the left edge of the image leaking into the right, and that'd be weird.)
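To put a rough number on the "which one is more effective" question, here's a small sketch under some assumptions of mine: an 8-sample ramp as the block, a DFT standing in for "the block repeats" and an orthonormal DCT-II standing in for "the block is mirrored at its edges". Both reconstructions keep three real numbers; the function names are illustrative only.

```python
import cmath
import math

def dft_lowpass(x, keep):
    """Rebuild x from DC plus the `keep` lowest DFT frequency pairs."""
    n = len(x)
    spectrum = [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    for k in range(n):
        if min(k, n - k) > keep:   # zero everything above the cutoff
            spectrum[k] = 0
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]

def dct_lowpass(x, n_coeffs):
    """Rebuild x from its first `n_coeffs` orthonormal DCT-II coefficients."""
    n = len(x)

    def basis(k, t):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        return scale * math.cos(math.pi * (2 * t + 1) * k / (2 * n))

    coeffs = [sum(x[t] * basis(k, t) for t in range(n)) for k in range(n_coeffs)]
    return [sum(coeffs[k] * basis(k, t) for k in range(n_coeffs)) for t in range(n)]

def rms_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

ramp = [float(v) for v in range(8)]              # a smooth gradient block
dft_err = rms_error(ramp, dft_lowpass(ramp, 1))  # DC + one complex pair
dct_err = rms_error(ramp, dct_lowpass(ramp, 3))  # three real coefficients

if __name__ == "__main__":
    print("DFT (periodic) error:", round(dft_err, 3))
    print("DCT (mirrored) error:", round(dct_err, 3))
```

The periodic version has to spend its coefficients on the fake jump where the end of the ramp wraps around to the start, so its error comes out several times larger than the mirrored one on this block.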