This research is all beyond me so maybe someone can explain: How does this compare to the state of the art in using simulators to train physical robots? Does using transformers help in any way or can this just as easily be done with other architectures?
To the uninitiated this looks cool as all heck and like yet another step towards the Star Trek future where we do everything in a simulator first and it always kinda just works in the real world (plot requirements notwithstanding).
Although I can also hear the distant sounds of a hundred military R&D labs booting up Metalhead [1] simulators.
Edit: Looks like the previous SOTA was still a manual process where the user had to come up with a reward function that actually rewards the actions they wanted the algorithm to learn. This research uses language models to do that tedious step instead.
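For a concrete picture, here's a rough sketch of what "the LLM writes the reward function" could look like. This is my own reconstruction, not the paper's code; query_llm and the observation keys are made-up placeholders:

```python
# Hypothetical sketch of LLM-driven reward design: describe the task and
# the observation fields, ask the model for a Python reward function, then
# hand that function to the simulator's training loop.

def query_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call; returns a canned candidate here
    # so the sketch runs end to end.
    return (
        "def reward(obs):\n"
        "    # stay upright on the ball, penalize large torques\n"
        "    return 1.0 - abs(obs['base_tilt']) - 0.01 * obs['torque_norm']\n"
    )

prompt = (
    "Task: a quadruped robot balances and walks on top of a yoga ball.\n"
    "Observation keys: base_tilt (rad), torque_norm (N*m).\n"
    "Write a Python function reward(obs) -> float."
)

namespace = {}
exec(query_llm(prompt), namespace)  # research setting; only run trusted output
reward_fn = namespace["reward"]

print(reward_fn({"base_tilt": 0.05, "torque_norm": 3.0}))  # ~0.92
```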
Your edit is roughly correct. It's a bit odd that they're comparing a single human-made baseline with the very best of the DrEureka outputs; in practice, I would expect to write multiple different reward functions and then validate the functions and hyperparameters based on the resulting trained models. However, their comparison isn't necessarily wrong, since it seems (I could be wrong) that the human baseline is the reward function used in prior papers in this area of policy learning.
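That best-of-N selection step is conceptually simple; a hedged sketch, where train_policy and evaluate_policy are placeholder names for whatever simulation training and rollout evaluation you already have:

```python
# Hypothetical selection loop: sample several LLM-written reward functions,
# train a policy against each in simulation, and keep the candidate whose
# *trained policy* scores best on a fixed task metric (e.g. seconds balanced
# on the ball). train_policy and evaluate_policy are placeholders.

def select_best(reward_candidates, train_policy, evaluate_policy):
    scored = []
    for reward_fn in reward_candidates:
        policy = train_policy(reward_fn)            # e.g. PPO in simulation
        scored.append((evaluate_policy(policy), policy, reward_fn))
    return max(scored, key=lambda t: t[0])          # (score, policy, reward_fn)
```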
If you're interested in methods for actually learning policies for these sorts of dynamic motions, note that this paper is simply applying proximal policy optimization (PPO, sketched below). They're pulling in the training and implementation methods from Margolis's [1] and Shan's [2] work.
So, in sum, the contribution of this paper is exclusively the method for generating reward functions (which is still pretty cool!!!!!), not all the learning-based policy stuff.
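For anyone who hasn't seen PPO before, the core of it is the clipped surrogate objective. A bare-bones sketch of that objective (the generic algorithm from Schulman et al., 2017, not this paper's training code):

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective (Schulman et al., 2017), to be minimized."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new / pi_old per action
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()       # negate: optimizer minimizes
```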
Transformers don't care if it is simulation or real data or even if you are using a different robot altogether. At some point it will be possible to show a robot a human demonstration and let it learn from that.
This looks very singularity-ish to me: an LLM can write a reward function, a policy can be trained against it in simulation, and that trained policy works surprisingly well in the physical world.
Some of the videos raised questions in my mind as to whether the leash was doing stabilization work on the robot; it might be. But, if you watch the video where they deflate a yoga ball, you can see the difference between when they're "saving" the robot and when they're just keeping the leash taut. The 'dextrous cube' manipulation video is also pretty compelling.
This sort of 'automate a grad student' work (in this case, coming up with a reasonable reward function) stacks up over time. And I'd bet a lot of people's priors were "this probably won't work," so it's good to see that it can work in some circumstances -- that will save time and effort down the road.
Every single second of every example has a handler holding a leash - and not just holding it, holding it without any slack.
Blindingly obvious opening for interference via the Ouija board effect (the handler unconsciously steadying the robot).
I don't mean to denigrate the work; I believe the researchers are honest, and I hope there are demos beyond the published ones. It's just, at best, an obvious unforced error that leaves a big question open.
EDIT: A replier below shared a gif with failures; tl;dr, this looks like two different experiment protocols, one for successes and one for failures. https://imgur.com/a/DmepBVU
I agree it's hard to tell whether the controller learned with DrEureka would be sufficient without the leash, but I'm at least convinced that the leash is not sufficient to hold a robot on the ball without a decently competent controller.
In a scene reminiscent of the giant boulder rolling after Indiana Jones, a robot dog balances on top of an enormous rubber ball rolling down the streets of some big city, flattening everything in its way.
[1] https://en.wikipedia.org/wiki/Metalhead_(Black_Mirror)