The format isn't explicit to the network, but the data it's trained on is usually in RGB, so that's probably the reasoning. I found a repo where someone tried different formats, but it's worth noting that this was for discrimination, so just because it can discriminate doesn't mean it does the same thing. Maybe I'll run some experiments. You could use a U-Net for classification and then look at the bottom layer and do the same thing. It would be hard to do with SD (or SDXL) because you'd need to retrain with the new format. Fine-tuning could possibly work, but the network would likely be biased toward understanding the RGB encoding.
> But the data trained on is usually in RGB format, so probably the reasoning.
It's trivial to convert the values for training - basically 0% of the cost of the process. But there's likely more "meaning" in HSV than in RGB. So I don't think that would account for the difference.
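For what it's worth, the conversion really is a one-liner in most pipelines; here's a minimal sketch with OpenCV (the library choice and array shapes are mine, not from the thread):

```python
# Minimal sketch: converting an RGB training image to HSV. The transform is a
# handful of arithmetic ops per pixel, so it adds essentially nothing to the
# cost of a training pipeline.
import cv2
import numpy as np

rgb = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)  # stand-in for a loaded image

# Note: for uint8 inputs OpenCV scales H to [0, 179] and S, V to [0, 255];
# convert to float32 in [0, 1] first if you want H in degrees [0, 360).
hsv = cv2.cvtColor(rgb, cv2.COLOR_RGB2HSV)
```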
ML systems generally do not care about human semantics, and they will not produce them naturally. The VAE works with 16-bit floats per channel, so compression is not an issue either, and even if it were, HSV would be a poor choice too.
ML systems don't care, but humans do, and more semantically meaningful representations in training data usually lead to better results for us. In images you often care about "different colours of similar brightness" rather than "matching levels of 3 colour components", so there's a non-zero chance HSV/HLS would do better than RGB. It has nothing to do with compression.
Does it lead to better results though? For the system, the best representation would be one that it learned - which is the latent representation, 4 channels in this case. Would it learn a "better" representation when fed with HSL instead of RGB? If so, what's the intuition? RGB somewhat resembles human vision, whereas HSL exists for interactive editing, and YCbCr exists for compression. If anything, I would expect YCbCr to outperform.
HSV more closely resembles physical properties for most natural things. Hue and saturation variations are usually meaningful variations in the actual material, while brightness variations often end up being mostly about lighting rather than the material. It can be surprisingly effective for simple segmentation [1], which is why it's usually the first one implemented in computer vision classes.
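As a concrete illustration of that segmentation point, here's a minimal sketch of hue/saturation thresholding in OpenCV; the filename and the colour band bounds are placeholder values I picked, not something from [1]:

```python
# Classroom-style HSV segmentation: threshold on hue/saturation so the mask
# tracks the material rather than the lighting. Bounds are made-up examples.
import cv2
import numpy as np

bgr = cv2.imread("scene.jpg")                # hypothetical input image
hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)   # OpenCV: H in [0, 179] for uint8

# Keep pixels whose hue falls in a "green-ish" band, with enough saturation
# to exclude grey/white pixels; V bounds are left wide so shadows and
# highlights of the same material still pass.
lower = np.array([35, 60, 20])
upper = np.array([85, 255, 255])
mask = cv2.inRange(hsv, lower, upper)

segmented = cv2.bitwise_and(bgr, bgr, mask=mask)
```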
Our eyes have RGB sensors, but I would claim I perceive the colors in my surroundings in something like HSV (although that could very well be from the way I learned colors). And I think this makes sense: if you're looking for something, you want a color perception that's not overly sensitive to lighting conditions, and RGB values are directly tied to them.
The segmentation aspect is interesting, but the problem I have with H is that it is circular, i.e. 0 and 1 represent virtually the same hue, and my intuition is that this lends itself poorly to a NN. The luminosity argument is valid, but that is not unique to HSL, hence my intuition that YCbCr (or related) would outperform.
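For what it's worth, the usual workaround for that circularity is to feed the network sin/cos of the hue angle rather than H itself; a quick sketch of the idea (my own illustration, not from the repo):

```python
# Encode hue as (sin, cos) of the hue angle so 0 and 1 map to the same point
# and the discontinuity disappears.
import numpy as np

def encode_hue(h):
    """h: array of hues in [0, 1). Returns a (..., 2) array of (sin, cos)."""
    angle = 2.0 * np.pi * h
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)

# Hues 0.0 and 0.999 are nearly identical colours, and their encodings agree:
print(encode_hue(np.array([0.0, 0.999])))
```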
Edit: oops, forgot the link
https://github.com/ducha-aiki/caffenet-benchmark/blob/master...