To be frank: the problem is not poorly defined; you're just not aware of the definition.
In general in generative models, you have some "true data distribution" P and an estimator distribution Q.
The goal is to make P and Q the same, generally by minimizing some divergence between them.
The actual objective is defined between the full distributions P and Q, but because we only have finitely many data points, we optimize an empirical loss that uses only the observed samples from P. So if the model makes Q simply memorize the training samples, it hasn't actually made Q close to P; it has only minimized the empirical loss.
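Here is a minimal numerical sketch of that gap, under a toy setup I'm assuming for illustration: P is a standard normal, a "memorizing" Q puts narrow Gaussian bumps on the training points, and a single maximum-likelihood Gaussian plays the role of a representative model (the sample sizes and bump width are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# "True" distribution P: standard normal; we only ever see a finite sample.
train = rng.standard_normal(20)
test = rng.standard_normal(20)   # fresh draws from P, never shown to the model

def memorizer_logpdf(x, centers, width=1e-3):
    # Q_mem: narrow Gaussian bumps centered on the training points (memorization).
    comps = norm.logpdf(x[:, None], loc=centers[None, :], scale=width)
    return np.logaddexp.reduce(comps, axis=1) - np.log(len(centers))

def fitted_logpdf(x, data):
    # Q_fit: one Gaussian fit by maximum likelihood (a "representative" model).
    return norm.logpdf(x, loc=data.mean(), scale=data.std())

for name, logq in [("memorizer", lambda x: memorizer_logpdf(x, train)),
                   ("fitted", lambda x: fitted_logpdf(x, train))]:
    print(name,
          "train NLL:", -logq(train).mean().round(2),
          "held-out NLL:", -logq(test).mean().round(2))
```

With this kind of setup the memorizer typically gets an extremely low training NLL but an enormous held-out NLL, while the fitted Gaussian scores about the same on both. The held-out NLL is (up to the constant entropy of P) an estimate of the true cross-entropy between P and Q, which is what actually tracks the divergence you care about.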
One practical way to get around this with GANs is to train a conditional GAN instead of an unconditional one, and then run the conditioned generation task on held-out samples from the validation set. Another, perhaps more general, solution is to train an inference network and generate reconstructions of held-out data points (sketched below). If the reconstructions look totally different from the inputs, then the model is probably not very "representative".
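As a concrete illustration of the second idea, here is a minimal held-out reconstruction check in PyTorch. The generator `G`, inference network `E`, and `val_loader` are hypothetical placeholders for whatever trained modules and held-out data loader you actually have:

```python
import torch

@torch.no_grad()
def heldout_reconstruction_error(G, E, val_loader, device="cpu"):
    # Assumes G: z -> x and E: x -> z are trained nn.Modules, and that
    # val_loader yields batches of held-out samples x (no labels).
    G.eval(); E.eval()
    total, n = 0.0, 0
    for x in val_loader:
        x = x.to(device)
        x_hat = G(E(x))                          # reconstruct via the inferred latent
        total += torch.mean((x - x_hat) ** 2).item() * x.size(0)
        n += x.size(0)
    return total / n                             # mean per-sample reconstruction MSE
```

If this number (or a visual side-by-side of x and x_hat) is much worse on held-out data than on the training set, the model has most likely memorized rather than learned something close to P.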