Well, if we're being pedantic here, raw RGB video as would be displayed on an RGB monitor does indeed take 3 bytes per pixel. Technically, YUV is either an optimization or a backwards-compatibility measure, and hence the adjective "raw" does not apply to it.
Actually, the "raw" data coming from most cameras is often already compressed somehow, otherwise the camera would not be able to write it to storage (or send it over USB) fast enough.
In fact, since decoding also often happens in hardware, the raw size may never materialize anywhere in the pipeline other than in the framebuffer of the video card. Even HDMI has compression now [0].
The author probably chose a screenshot as a benchmark because otherwise it's hard to get your hands on anything compressible but not already compressed.
10-bit 4:4:4 video, with 8,294,400 or 8,847,360 pixels per picture for 4K (TV vs. movie), tops out below 270 Mbit per frame, or about 16 Gbit/s at 60 Hz, so you can fit half a dozen signals down a 100G transceiver, or two down a 40G.
Throw in chroma subsampling (YUV 4:2:0) and you can shift 120 Hz 8K with relatively little trouble.
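The arithmetic above can be sanity-checked with a quick sketch. This is a back-of-the-envelope calculation only: it assumes 3 samples per pixel for 4:4:4 and 1.5 for 4:2:0, and ignores blanking intervals and link overhead, which real transports carry on top.

```python
# Back-of-the-envelope uncompressed video bandwidth.
# Assumes no blanking/overhead; real links carry extra bits.

def gbits_per_second(width, height, bits=10, samples_per_pixel=3.0, fps=60):
    """Raw payload rate for one video signal, in Gbit/s."""
    bits_per_frame = width * height * samples_per_pixel * bits
    return bits_per_frame * fps / 1e9

# 4K (TV, 3840x2160) 10-bit 4:4:4 at 60 Hz
uhd = gbits_per_second(3840, 2160, samples_per_pixel=3.0, fps=60)
print(f"4K 4:4:4 @ 60 Hz: {uhd:.1f} Gbit/s")   # ~14.9, so 6 fit in 100G

# 8K (7680x4320) 10-bit 4:2:0 at 120 Hz
eightk = gbits_per_second(7680, 4320, samples_per_pixel=1.5, fps=120)
print(f"8K 4:2:0 @ 120 Hz: {eightk:.1f} Gbit/s")  # ~59.7, fits in 100G
```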
If you define raw video as RGB with 1 byte for red, 1 byte for green and 1 byte for blue, then yes, it will be 3 bytes per pixel.
But there are clearly other ways to store a pixel of colour information in less than 3 bytes, which is OP's point. It's not an optimization really - it's just a different coding format (just as ASCII isn't an optimization of Unicode).
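To make the "different coding format" point concrete, here's a small sketch comparing bytes per pixel for packed RGB against planar YUV 4:2:0 (assuming 8 bits per sample). 4:2:0 keeps a full-resolution luma plane and subsamples each chroma plane 2x2, averaging 1.5 bytes per pixel without any codec-style compression.

```python
# Bytes per frame under two pixel codings, assuming 8 bits/sample.

def bytes_per_frame(width, height, fmt):
    if fmt == "RGB888":
        # 1 byte each for R, G, B at every pixel.
        return width * height * 3
    if fmt == "YUV420":
        # Full-res luma plane + two chroma planes subsampled 2x2.
        return width * height + 2 * (width // 2) * (height // 2)
    raise ValueError(fmt)

w, h = 1920, 1080
print(bytes_per_frame(w, h, "RGB888") / (w * h))  # 3.0 bytes/pixel
print(bytes_per_frame(w, h, "YUV420") / (w * h))  # 1.5 bytes/pixel
```

Both hold the same kind of colour information per pixel; the YUV layout just spends fewer bytes on chroma, exploiting how the eye is less sensitive to colour detail than to brightness.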