This wasn't my understanding. If the decoding happens in hardware, I wouldn't have expected the decoded frames to be passed back to the display server, only to be sent to the GPU again and out to the screen.
My understanding was that some kind of compositing happens in hardware: the display server tells the GPU to show the decoded output within a given rectangle of screen coordinates, but the server itself never sees the actual pixels.
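To make the model I have in mind concrete, here is a toy sketch in Python. Every name in it (`Gpu`, `DisplayServer`, `Rect`, `place_video`, and so on) is made up for illustration and does not correspond to any real driver or display server API. It only shows the division of labour I'm describing: the server hands the GPU an opaque handle and a rectangle, and the pixel data never leaves the GPU side.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rect:
    """Screen rectangle in pixels (hypothetical helper type)."""
    x: int
    y: int
    width: int
    height: int


@dataclass
class Gpu:
    """Toy GPU: it owns the decoded frames and does the scanout itself."""
    _frames: dict = field(default_factory=dict)  # handle -> pixels, private to the GPU
    _planes: dict = field(default_factory=dict)  # handle -> Rect where it is shown
    _next_handle: int = 1

    def decode(self, bitstream: bytes) -> int:
        """Pretend to decode; return an opaque handle, never the pixels."""
        handle = self._next_handle
        self._next_handle += 1
        # Stand-in for real decoding: just keep some bytes on the GPU side.
        self._frames[handle] = bytes(reversed(bitstream))
        return handle

    def set_plane(self, handle: int, rect: Rect) -> None:
        """Tell the hardware where on screen the frame should appear."""
        if handle not in self._frames:
            raise KeyError(f"unknown frame handle {handle}")
        self._planes[handle] = rect

    def scanout(self) -> list:
        """What goes to the screen: one (rect, byte count) entry per plane."""
        return [(rect, len(self._frames[h])) for h, rect in self._planes.items()]


class DisplayServer:
    """Toy display server: it only deals in handles and coordinates."""

    def __init__(self, gpu: Gpu) -> None:
        self.gpu = gpu
        self.bytes_of_video_seen = 0  # never incremented: no pixels pass through here

    def place_video(self, handle: int, rect: Rect) -> None:
        # The server forwards the placement; it never reads the frame contents.
        self.gpu.set_plane(handle, rect)


if __name__ == "__main__":
    gpu = Gpu()
    server = DisplayServer(gpu)
    frame = gpu.decode(b"compressed video data")
    server.place_video(frame, Rect(x=100, y=50, width=640, height=360))
    print(gpu.scanout())
    print(server.bytes_of_video_seen)
```

In this sketch the server's counter stays at zero because nothing in `DisplayServer` ever touches the frame data. Whether a real stack behaves this way is exactly the question I'm asking.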