Even if your physical sensor is made of little squares, they cease to be squares once you've read them out: each square has been integrated into a single point sample. This is the basis of sampling theory. Continuing to think of pixels as little squares leads to bad intuitions, such as the resampling algorithm we're responding to. Maybe there's a better way to make that point than the cited paper, but I haven't seen it yet.
Whether pixels are on the half-integer points is a completely arbitrary decision, unless you're trying to mix raster and vector graphics. Then the correct solution will become obvious.
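To make the half-integer convention concrete, here is a minimal sketch (the convention itself is standard in graphics APIs, but the helper names are mine): if pixel i covers the vector-space interval [i, i+1), its center sits at i + 0.5, and raster and vector coordinates line up with no off-by-half shift at the image edges.

```python
import math

def pixel_center(i: int) -> float:
    """Center of pixel i under the half-integer convention."""
    return i + 0.5

def vector_to_pixel(x: float) -> int:
    """Pixel index containing vector coordinate x (pixel i covers [i, i+1))."""
    return math.floor(x)

# A 4-pixel-wide raster spans vector coordinates [0.0, 4.0);
# the pixel centers are 0.5, 1.5, 2.5, 3.5.
centers = [pixel_center(i) for i in range(4)]
print(centers)               # [0.5, 1.5, 2.5, 3.5]
print(vector_to_pixel(2.5))  # 2
```

Under this convention a vector rectangle from (0, 0) to (w, h) exactly covers a w-by-h raster, which is why mixing raster and vector graphics forces the choice.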
From the sensor's point of view the square model seems correct. Sensor pixels are photon counters. If you want to turn a 2x2 block of pixels into a single pixel, summing the four counts gives the same result as a sensor with 1/4 of the pixels, each counting photons over that 2x2 area[1]. What we might say instead is that since you have the extra pixels, you can actually do better than a sensor natively at your desired resolution, by applying a more sophisticated filter to the extra data (e.g., avoiding the Moiré artifacts that are common in lower-resolution sensors).
[1] Most sensors use Bayer patterns, so some extra considerations apply to color resolution.
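The summing argument above can be sketched directly; this is an illustrative NumPy fragment (the function name is mine), assuming a monochrome sensor with even dimensions, since a real Bayer sensor needs per-channel handling as footnote [1] notes:

```python
import numpy as np

def downsample_2x2_sum(counts: np.ndarray) -> np.ndarray:
    """Sum photon counts over non-overlapping 2x2 blocks.

    Each output value is what a sensor with 1/4 the pixels would have
    recorded if each of its (larger) wells covered a 2x2 area.
    """
    h, w = counts.shape
    return counts.reshape(h // 2, 2, w // 2, 2).sum(axis=(1, 3))

# A toy 4x4 frame of photon counts downsamples to 2x2 block sums.
counts = np.arange(16).reshape(4, 4)
print(downsample_2x2_sum(counts))  # [[10 18] [42 50]]
```

Because photon counting is linear, the block sum loses nothing relative to the lower-resolution sensor; the point in the comment is that a smarter filter over the full-resolution data can do strictly better.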