I think the argument is that since most colour space conversions can be expressed with relatively small neural nets (they are mostly weighted sums of the channels), the autoencoder can dedicate a negligible fraction of its parameters to that job, which leaves it free to pick whatever colour space training dictates.
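To make that concrete, here's a rough sketch (assuming PyTorch, and using the BT.601 RGB-to-YCbCr coefficients as the example conversion) of how a colour space conversion is literally just one tiny linear layer:

```python
import torch
import torch.nn as nn

# RGB -> YCbCr (BT.601, full-range, values in 0..1) is a fixed 3x3 matrix
# plus an offset, i.e. a single linear layer with 12 parameters and no
# nonlinearity -- a negligible slice of any autoencoder's capacity.
rgb_to_ycbcr = nn.Linear(3, 3)
with torch.no_grad():
    rgb_to_ycbcr.weight.copy_(torch.tensor([
        [ 0.299,     0.587,     0.114   ],   # Y
        [-0.168736, -0.331264,  0.5     ],   # Cb
        [ 0.5,      -0.418688, -0.081312],   # Cr
    ]))
    rgb_to_ycbcr.bias.copy_(torch.tensor([0.0, 0.5, 0.5]))  # centre the chroma

rgb = torch.rand(8, 3)        # a batch of RGB pixels in 0..1
ycbcr = rgb_to_ycbcr(rgb)     # exact conversion, learnable if the net wants it
```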
I'm not entirely convinced by this idea myself. I have seen a few networks where inputs in the range -1..1 do a lot better than inputs in the range 0..2, even though that translation should be an easy step for the network to figure out. The benefit of preprocessing the inputs seems larger than my common sense says it should be.
I suspect that weight initializations are geared towards inputs being normal random variables with mean 0 and variance 1. Deviating from that makes the learning process unhappy.
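A quick sketch of what I mean (assuming PyTorch's default Linear initialization; the layer sizes and ranges are just illustrative): the same randomly initialized layer produces noticeably different pre-activation statistics depending on whether the inputs are centred or shifted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(256, 256)   # default (Kaiming-uniform-style) initialization

x_centered = torch.rand(1024, 256) * 2 - 1   # inputs in -1..1, mean ~0
x_shifted  = torch.rand(1024, 256) * 2       # inputs in 0..2, mean ~1

for name, x in [("-1..1", x_centered), ("0..2", x_shifted)]:
    h = layer(x)
    print(f"{name}: pre-activation mean={h.mean().item():+.3f} "
          f"std={h.std().item():.3f}")
# The shifted inputs inflate the pre-activation spread relative to what the
# initialization assumes, which is one plausible reason learning starts worse.
```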