I get that this is meant to drive the avatar, but I'm curious as to why audio. Video carries stronger signals: I'm certain that even a very low-resolution image would convey movement better than audio does. So either the network is memorizing a lot (which is fine, but limited), or this is an iteration toward something like high-sensitivity, audio-driven 3D for precise sound? Something else? I mean, the Quest has cameras in it, so why not use those? Computation? These aren't big models (the largest is 1.42 GB, the smallest 0.58 GB).