It's not guaranteed to perform well with unknown or new codecs - true. But the implicit assumption is that YouTube will use codecs that preserve what videos look like, not just random codecs. If that assumption holds, the image recognition model will keep working even with new codecs.
That's the thing though: "looks like" breaks down pretty quickly with things that aren't real images. It even breaks down with images that are real, but maybe not so common: https://www.youtube.com/watch?v=r6Rp-uo6HmI
So one question would be: Does your image generation approach preserve a higher information density than big enough pixels?
Why would you assume that the images in my algorithm aren't real images? For example, you could use 256 categories from ImageNet as your keys: an image of a dog is 00000000, a tree is 00000001, a car is 00000010, and so on.
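Roughly, the keying would look something like this - a minimal sketch in which the class names are placeholders and decode() is assumed to receive the classifier's top-1 label for each recovered frame (the recognition model itself is out of scope here):

    # Sketch of "256 ImageNet categories as byte values".
    # First three labels follow the example above; the rest are placeholders.
    CATEGORIES = ["dog", "tree", "car"] + [f"class_{i:03d}" for i in range(3, 256)]

    def encode(data: bytes) -> list[str]:
        # one recognizable "real image" per byte: 0x00 -> dog, 0x01 -> tree, ...
        return [CATEGORIES[b] for b in data]

    def decode(top1_labels: list[str]) -> bytes:
        # map the classifier's top-1 label for each frame back to a byte
        return bytes(CATEGORIES.index(label) for label in top1_labels)

    assert decode(encode(b"hello")) == b"hello"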
I'm not assuming they're not real images; I'm questioning whether, to reach an information density that even outperforms "big enough colorful pixels", you might get into territory where the lossy compression compresses away exactly what you need to unambiguously discriminate between two symbols.
And at that level of density, I do wonder what kind of detailed "real image" it would still be.
If your algorithm were, for example, literally the example you just noted, then the "big enough colorful pixels" from the example video would already massively outperform your algorithm. Of course it won't be exactly that, but you're assuming that the video compression algorithm somehow applies the same meaning of "looks like" in its preservation efforts that your machine learning algorithm does, down to the level where the differences it preserves become so minute that they carry more information than colorful pixels that are just big enough to pass the compression unscathed, maybe with some additional error correction (i.e. exactly what a very high density QR code would do).
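To put rough numbers on that gap, here's a back-of-envelope comparison; the 1080p resolution, 16-pixel blocks and 8-color palette are assumptions of mine to illustrate the scale, not figures from the linked video or from your scheme:

    import math

    # "big enough colorful pixels": 16x16 blocks, 8-color palette, 1080p frame
    width, height, block, palette = 1920, 1080, 16, 8
    blocks_per_frame = (width // block) * (height // block)   # 120 * 67 = 8040
    bits_big_pixels = blocks_per_frame * math.log2(palette)   # 8040 * 3 = 24120 bits

    # one recognized image class out of 256 per frame
    bits_one_class = math.log2(256)                           # 8 bits

    print(f"big colorful pixels: ~{bits_big_pixels:.0f} bits per frame")
    print(f"one image class per frame: {bits_one_class:.0f} bits per frame")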