What is really odd is that we are alive to see both. At billions of light-years even a slight extension of the travel time, say a few hundred years, would exceed human perception and probably go unnoticed. For a short-lived object like a nova, we may only ever have one of the two images visible at any one time.
It may be surprising, but if you do the math, this is what we should expect.
According to http://www.marmet.org/cosmology/einsteincross/, the distance between images is on the order of 1.6 arcseconds. That's a bit under 10^(-5) radians. The difference in travel time is proportional to the difference in the cosines of the angles taken, and for a small angle, 1 - cos is proportional to the square of the angle in radians. That means we'd expect a time difference on the order of 10^(-10) of the total travel time. If the distant object is several billion light years away, we would therefore expect gaps in arrival time of no more than months.
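For anyone who wants to check the arithmetic, here is a rough sketch of that estimate. The function names and the 5 billion light year distance are my own placeholders (the text only says "several billion"), and this is only the back-of-the-envelope geometry above, not a real lens model.

```python
import math

ARCSEC_IN_RAD = math.pi / (180 * 3600)  # one arcsecond in radians


def fractional_delay(separation_arcsec):
    """Return 1 - cos(theta) for the given angular separation."""
    theta = separation_arcsec * ARCSEC_IN_RAD
    # 2*sin(theta/2)**2 equals 1 - cos(theta) but avoids cancellation for tiny angles.
    return 2 * math.sin(theta / 2) ** 2


def delay_in_days(separation_arcsec, distance_light_years):
    """Extra travel time in days, treating travel time in years as distance in light years."""
    return fractional_delay(separation_arcsec) * distance_light_years * 365.25


theta = 1.6 * ARCSEC_IN_RAD
fraction = fractional_delay(1.6)
days = delay_in_days(1.6, 5e9)  # 5e9 light years is an assumed stand-in for "several billion"
print(f"angle: {theta:.2e} rad, fraction: {fraction:.1e}, delay: {days:.0f} days")
```

With these assumed inputs the fraction comes out at a few times 10^(-11) and the delay at a couple of months, which is consistent with the order-of-magnitude claim.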
First you need to understand Fermat's principle. Light will follow a particular path from A to B when all nearby paths take approximately the same time as that one. Distant paths may be faster or shorter - consider a straight line vs a mirror. But if nearby paths differ in length, they interfere destructively and no light travels that way.
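The "nearby paths take approximately the same time" condition can be seen numerically with the mirror case. This is a toy sketch under my own assumptions: a flat mirror along y = 0, points A = (0, 1) and B = (4, 1), and path length standing in for travel time. The names are made up for illustration.

```python
import math


def path_length(x, a=(0.0, 1.0), b=(4.0, 1.0)):
    """Length of the path A -> (x, 0) -> B, bouncing off a mirror along y = 0."""
    return math.hypot(x - a[0], a[1]) + math.hypot(b[0] - x, b[1])


def change_for_nudge(x, eps=1e-3):
    """How much the path length changes when the bounce point moves by eps."""
    return path_length(x + eps) - path_length(x)


# x = 2 is the equal-angle reflection point for these A and B.
at_reflection_point = change_for_nudge(2.0)
# x = 1 is an arbitrary bounce point that is not a valid reflection.
away_from_it = change_for_nudge(1.0)

print(f"nudge at x=2: {at_reflection_point:.1e}")
print(f"nudge at x=1: {away_from_it:.1e}")
```

At the true reflection point a small nudge barely changes the length (the change is second order in the nudge), so neighboring paths add up. At any other bounce point the change is first order, thousands of times larger here, and neighboring paths cancel.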
Now suppose that we have a spiral galaxy between us and the distant source, but tipped on its side. And that galaxy is most of the way to us. From our view it is somewhat elliptical. There are five paths from there to here that meet the description of Fermat's principle. They are a straight line through that galaxy, a bent line to either side of the galaxy, and a bent line over the edges of the galaxy. At all other angles and directions, you don't meet Fermat's principle, and therefore light doesn't reach us that way.
However the central image gets blocked out by the lensing galaxy. Therefore you only see the other four.
Somewhat counterintuitively, the short side of the ellipse represents a greater gravity gradient, which bends the light more. Therefore those two images are farther apart and we don't get a perfect cross.
Also the lensing object is never perfectly lined up. This affects the size and placement of the images, as well as the length of time that light takes to get here along each path.
Your understanding matches mine, if you follow this explanation. In the 2015 supernova, the lensing object was slightly off center. This meant that light that passed to the side of the lensing galaxy more or less straight to us had a fairly short route. Likewise, light that passed by the long ends of the ellipse was bent less (because the gravitational gradient is smaller there), so its routes were also short. Therefore the light of the supernova arrived fairly close in time along those three routes. The light for the fourth image, which went around the far side of the center of the lensing galaxy, was bent the most and therefore had a longer route. That is why it arrived with a significant delay after the other three.