I think they are assuming a world where you took this existing model but trained it on a dataset of animals vocalizing to each other, so that you could then feed the trained model the vocalization of one animal and the model would produce a continuation of audio with a better-than-zero chance of being a realistic sound coming from another animal. In other words: if dogs have some type of bark that encodes an "I found something yummy" message, and other dogs tend to respond with a bark that encodes "I'm on my way", and we're just oblivious to all of that subtext, then maybe the model would be able to communicate back and forth with an animal in a way that makes "sense" to the animal.
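To make the idea concrete, here's a toy sketch of that call-and-response continuation, with a stand-in bigram model (everything here is hypothetical: the token names, the tiny corpus, and the `continue_call` helper; a real system would autoregress over discretized audio codec tokens, not labeled strings):

```python
import random
from collections import defaultdict

# Toy stand-in: treat each vocalization exchange as a sequence of
# discrete "audio tokens". Hypothetical corpus of call/response
# exchanges between two animals.
corpus = [
    ["bark_A", "bark_B", "reply_C", "reply_D"],
    ["bark_A", "bark_B", "reply_C", "reply_E"],
    ["howl_X", "howl_Y", "reply_Z", "reply_D"],
]

# "Train" a bigram model: count which token tends to follow which.
follows = defaultdict(lambda: defaultdict(int))
for seq in corpus:
    for prev, nxt in zip(seq, seq[1:]):
        follows[prev][nxt] += 1

def continue_call(prompt, length=2, rng=None):
    """Autoregressively extend one animal's call, token by token."""
    rng = rng or random.Random(0)
    out = list(prompt)
    for _ in range(length):
        candidates = follows.get(out[-1])
        if not candidates:
            break
        tokens, weights = zip(*candidates.items())
        out.append(rng.choices(tokens, weights=weights)[0])
    return out

# Feed one animal's call; the model emits a statistically plausible "reply".
print(continue_call(["bark_A", "bark_B"]))
```

The point of the sketch is only the shape of the loop: prompt with a recorded vocalization, sample a continuation, play it back, repeat. Whether the continuation carries any of the animal's actual subtext is exactly the open question.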
Probably substitute dogs for chimps though.
But obviously that doesn't solve for human-understandability at all, unless maybe you train on audio+video together and then ask the model to explain what visuals often accompany a specific type of audio? Maybe the model could learn which sounds accompany violence, or the discovery of a source of water, or something like that.