They don't learn by watching. That would be another breakthrough entirely. They learn only by physically completing the task under the complete control of a puppeteer. Still very cool though.
My understanding is that the puppeteer records the same movement over and over, and the resulting dataset of trajectories (pose, speed, acceleration) is then "diffused".
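
For what it's worth, here is a rough sketch of what "diffused" could mean, assuming a standard DDPM-style forward process with a linear noise schedule. That is my assumption, not something confirmed for this system. The demonstrations, the function names and the schedule values are all made up for illustration. The idea is that each recorded trajectory gets blended with Gaussian noise, and a network (not shown) would be trained to predict that noise so it can later denoise random samples into plausible trajectories.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(trajectory, t, alpha_bars, rng):
    """Forward diffusion: blend a clean trajectory with Gaussian noise.

    Returns the noisy trajectory and the noise that was added. A denoising
    network would be trained to predict that noise from the noisy input.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    noise = [[rng.gauss(0.0, 1.0) for _ in step] for step in trajectory]
    noisy = [
        [signal * x + sigma * e for x, e in zip(step, eps)]
        for step, eps in zip(trajectory, noise)
    ]
    return noisy, noise


# Hypothetical demonstrations: each timestep is (pose, speed, acceleration).
demos = [
    [(0.0, 0.0, 1.0), (0.1, 0.5, 1.0), (0.3, 1.0, 0.0)],
    [(0.0, 0.0, 0.9), (0.1, 0.4, 1.1), (0.3, 0.9, 0.1)],
]

rng = random.Random(0)
alpha_bars = make_alpha_bars(num_steps=100)
t = rng.randrange(len(alpha_bars))  # random noise level, as in training
noisy, noise = add_noise(demos[0], t, alpha_bars, rng)
```

The pair (noisy, noise) at step t is one training example. Repeating the same movement many times just gives the model more samples of the distribution it is supposed to learn.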