This sounds similar to the basic idea behind the psy optimizations used in the Vorbis encoder: the ear expects to hear something rather than nothing in a given frequency band, so it's better to play back noise in that band instead of playing nothing at all. In other words, energy preservation is more important than mere accuracy.
Of course, in this case, the logic is being applied to the audio stream as a whole, not just individual frequency bands. Since the voice activity detection is removing background noise, the lack of energy in the audio stream seems odd to the brain, so the noise has to be added to compensate.
The same seems to apply in dealing with images and video: the eye notices a lack of detail (blurring) more than it notices inaccurate detail.
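To make the audio side of that concrete: instead of emitting digital silence where the VAD suppressed frames, the receiver synthesizes low-level noise with roughly the right energy. A toy Python sketch of the idea (the level is made up, and real comfort-noise generation also matches the spectral shape of the background, not just its energy):

    import random

    def comfort_noise(num_samples, noise_level_db=-60.0):
        """Fill a VAD-suppressed gap with low-level white noise.

        noise_level_db is an illustrative target level relative to
        full scale; a real implementation would estimate it from the
        caller's actual background noise.
        """
        amplitude = 10 ** (noise_level_db / 20.0)  # dBFS -> linear
        return [random.uniform(-amplitude, amplitude)
                for _ in range(num_samples)]

    # Fill a 20 ms gap at 8 kHz (160 samples) instead of sending zeros.
    gap = comfort_noise(160)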
This was not at all what I thought it was, given the title of the article. That is, I didn't expect this to be related to telecoms in any way.
... and now that we're on the topic of strange subjects involving telecommunications, I've managed to distract myself by listening to recordings of numbers stations again.
FTA:
"The result of receiving total silence, especially for a prolonged period, has a number of unwanted effects on the listener, including the following ..."
Besides what is listed in the article, prolonged silence may result in the media platform forcing a disconnection. For example, that's why a special RFC (RFC 3389) exists for carrying comfort noise over RTP. However, not all user agents (nor all platforms) honor this RFC. Note that I use the term "media platform" for any media gateway or media server that establishes and manages connections between two or more clients.
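For the curious, the RFC 3389 comfort noise payload is tiny: one byte giving the noise level in -dBov, optionally followed by spectral coefficients. A rough Python sketch of building such a packet (field values here are purely illustrative):

    import struct

    RTP_VERSION = 2
    CN_PAYLOAD_TYPE = 13  # static payload type for CN at 8 kHz (RFC 3551)

    def build_cn_packet(seq, timestamp, ssrc, noise_level_dbov):
        """Minimal RTP packet with an RFC 3389 comfort noise payload.

        The payload is a single noise-level byte (0..127, in -dBov);
        real stacks may also append the optional spectral coefficients.
        """
        header = struct.pack(
            "!BBHII",
            RTP_VERSION << 6,   # V=2, no padding/extension/CSRC
            CN_PAYLOAD_TYPE,    # marker bit clear
            seq,
            timestamp,
            ssrc,
        )
        return header + bytes([noise_level_dbov & 0x7F])

    pkt = build_cn_packet(seq=1, timestamp=160,
                          ssrc=0x12345678, noise_level_dbov=70)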
Something similar used to apply to TV. On either a Buffy or a Firefly commentary, Joss Whedon said that they wanted to cut to black, then hang on it for a few seconds for dramatic effect.
The problem was that if they actually cut to black, that would trigger the commercials: the software that controlled when the commercials went on must have monitored the signal, and when the video went black, played the commercial reel. So they had to cut to almost-black to avoid ending the scene prematurely.
This is how commercial detection in MythTV and others worked for a time; I think they use something a little more sophisticated now, thanks to a sort of arms race with the broadcasters.
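A naive version of that kind of detector is trivial. Something like this Python sketch (thresholds invented; I'm not claiming this is MythTV's actual algorithm, which also uses scene-change and logo detection):

    def is_black_frame(luma, threshold=16, fraction=0.98):
        """Flag a frame as 'black' if nearly all of its 8-bit luma
        samples fall below a threshold. Both numbers are illustrative.
        """
        dark = sum(1 for px in luma if px <= threshold)
        return dark >= fraction * len(luma)

    # luma: flat list of 8-bit luma values for one video frame
    assert is_black_frame([5] * 100)
    assert not is_black_frame([5] * 90 + [200] * 10)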
Interesting that the article doesn't mention G.729 (http://en.wikipedia.org/wiki/G.729), which has VAD (voice activity detection) but also specifies a compact noise descriptor. A voice frame contains 10 bytes of data, while a noise frame contains 2. The two-byte frame carries a characterization of the noise, so the other end hears (roughly) equivalent noise. Cisco also implemented the same thing in G.711, but it's proprietary.
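Those two frame types can be told apart purely by size. A toy Python classifier, assuming the frames have already been split out of the RTP payload (which is an assumption; this isn't a full G.729 depacketizer):

    def classify_g729b_frame(frame):
        """Classify a G.729 Annex B frame by its length:
        10 bytes for a voice frame, 2 bytes for a SID frame
        (the descriptor carrying the noise's energy and shape).
        """
        if len(frame) == 10:
            return "voice"
        if len(frame) == 2:
            return "sid"
        raise ValueError("unexpected G.729 frame size: %d" % len(frame))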
I suppose the reasoning behind this is the same as that behind the "fail whale" concept: make sure that your clients/listeners know the network is still live, it's not completely gone.