I have spotted similar bugs too; they become evident very quickly when working with smaller resolutions, and likewise with a smaller number of gray levels.
> We choose a H×W rectangular grid of points, from which we will draw samples.
An additional thing to keep in mind is how a camera capturing the image operates. It is not sampling in the true theoretical sense of picking points from a continuous signal: the pixels have finite size, which makes grid (2) the closer match to reality. There are additional complications for color images, where the red, green, and blue channels integrate over different regions within the pixel area (see [A] for an example). This makes the real grid differ from even (2) across the color channels. Still, the math suggested by the author should not change.
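To make the distinction concrete, here is a small sketch (with a made-up signal function, not anything from the article) contrasting ideal point sampling at pixel centers with what a camera roughly does, i.e. averaging the signal over each pixel's finite area:

```python
import numpy as np

# Hypothetical continuous signal: brightness varying over the image plane.
def signal(x, y):
    return np.sin(x) * np.cos(y)

H, W = 4, 4

# Ideal point sampling at pixel centers (the (i + 0.5, j + 0.5) convention).
ys, xs = np.mgrid[0:H, 0:W]
point_samples = signal(xs + 0.5, ys + 0.5)

# A camera instead integrates over the finite pixel area; approximate each
# pixel's value by averaging a fine sub-grid of point samples inside it.
sub = 8  # sub-samples per pixel edge
fy, fx = np.mgrid[0:H * sub, 0:W * sub]
fine = signal((fx + 0.5) / sub, (fy + 0.5) / sub)
box_samples = fine.reshape(H, sub, W, sub).mean(axis=(1, 3))

# The two agree only where the signal varies slowly within a pixel.
print(np.abs(point_samples - box_samples).max())
```

The gap between the two grows with the signal's curvature inside a pixel, which is why the finite-aperture effect matters at low resolutions.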
> It seems the mess is unique in the deep learning world.
The title, "Where Are Pixels? -- a Deep Learning Perspective", looks unjustified. What's presented is not a deep learning perspective. It applies generically.
Digital elevation models typically represent terrain elevations in a grid of square cells of some cell size, e.g. 1 meter by 1 meter. Such a model is rarely useful unless it also has georeferencing information, i.e. coordinates for the bounding box of the grid. Such coordinates can of course be arbitrary, but for large-scale mapping some convention is typically used so that adjacent models of the same cell size "line up" without small gaps or overlaps. For example, this can be achieved by making sure the grid corners are on integer coordinates — in my experience, that is the convention in most European countries. In Poland, they instead use the convention that the centers of the corner cells are on integer coordinates, meaning everything is "off" by 0.5 meter (half a cell). This has caused me quite some difficulties at $WORK...
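The half-cell offset between the two conventions reduces to a simple shift. A minimal sketch (function name and coordinate values are hypothetical, assuming 1 m cells):

```python
CELL = 1.0  # cell size in meters

def corner_from_center_aligned(center_x, center_y, cell=CELL):
    """Given the integer coordinates of the corner cell's CENTER (the
    Polish convention described above), return the grid's outer corner
    (the corners-on-integers convention used elsewhere)."""
    return (center_x - cell / 2, center_y - cell / 2)

# A tile whose corner cell is centered at (500000, 250000) actually
# begins at (499999.5, 249999.5) in corner-aligned terms:
print(corner_from_center_aligned(500000.0, 250000.0))
```

Forgetting this shift misregisters every cell by half a cell, which is exactly the kind of off-by-0.5 bug the article is about.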
TL;DR: always think of the upper left pixel's position as (0.5, 0.5).
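That TL;DR can be captured in two tiny helpers (a sketch, assuming the convention that pixel (i, j) in row/column order occupies the unit square with corners (j, i) and (j+1, i+1)):

```python
import math

def pixel_center(i, j):
    """Continuous (x, y) coordinate of the center of pixel (i, j)."""
    return (j + 0.5, i + 0.5)

def containing_pixel(x, y):
    """Pixel (i, j) whose square area contains the continuous point (x, y)."""
    return (math.floor(y), math.floor(x))

print(pixel_center(0, 0))          # the upper-left pixel's center: (0.5, 0.5)
print(containing_pixel(0.9, 0.2))  # still inside the upper-left pixel
```

Keeping both conversions explicit, instead of mentally equating pixel index and position, avoids the half-pixel bugs discussed above.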