This sounds similar to the basic idea behind the psy optimizations used in the V...

This sounds similar to the basic idea behind the psy optimizations used in the Vorbis encoder: the ear expects to hear something rather than nothing in a given frequency band, so it's better to play back noise in that band instead of playing nothing at all. In other words, energy preservation is more important than mere accuracy.

Of course, in this case, the logic is being applied to the audio stream as a whole, not just individual frequency bands. Since the voice activity detection is removing background noise, the lack of energy in the audio stream seems odd to the brain, so the noise has to be added to compensate.

The same seems to apply in dealing with images and video: the eye notices a lack of detail (blurring) more than it notices inaccurate detail.