This research is all beyond me so maybe someone can explain: How does this compare to the state of the art in using simulators to train physical robots? Does using transformers help in any way or can this just as easily be done with other architectures?
To the uninitiated this looks cool as all heck and like yet another step towards the Star Trek future where we do everything in a simulator first and it always kinda just works in the real world (plot requirements notwithstanding).
Although I can also hear the distant sounds of a hundred military R&D labs booting up Metalhead [1] simulators.
Edit: Looks like the previous SOTA was still a manual process where the user had to come up with a reward function that actually rewards the actions they wanted the algorithm to learn. This research uses language models to do that tedious step instead.
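For a concrete picture, here's a rough sketch of what "the LLM writes the reward function" could look like. This is my own reconstruction, not the paper's code; query_llm and the observation keys are made-up placeholders:

```python
# Hypothetical sketch of LLM-driven reward design: describe the task and
# the observation fields, ask the model for a Python reward function, then
# hand that function to the simulator's training loop.

def query_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call; returns a canned candidate here
    # so the sketch runs end to end.
    return (
        "def reward(obs):\n"
        "    # stay upright on the ball, penalize large torques\n"
        "    return 1.0 - abs(obs['base_tilt']) - 0.01 * obs['torque_norm']\n"
    )

prompt = (
    "Task: a quadruped robot balances and walks on top of a yoga ball.\n"
    "Observation keys: base_tilt (rad), torque_norm (N*m).\n"
    "Write a Python function reward(obs) -> float."
)

namespace = {}
exec(query_llm(prompt), namespace)  # research setting; only run trusted output
reward_fn = namespace["reward"]

print(reward_fn({"base_tilt": 0.05, "torque_norm": 3.0}))  # ~0.92
```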
Your edit is roughly correct. It's a bit odd that they're comparing a single human-made baseline with the very best of the DrEureka outputs; in practice, I would expect to write multiple different reward functions and then validate the functions and hyperparameters based on the resulting trained models. However, their comparison isn't necessarily wrong, since it seems (I could be wrong) that the human baseline is the reward function used in prior papers in this area of policy learning.
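That best-of-N selection step is conceptually simple; a hedged sketch, where train_policy and evaluate_policy are placeholder names for whatever simulation training and rollout evaluation you already have:

```python
# Hypothetical selection loop: sample several LLM-written reward functions,
# train a policy against each in simulation, and keep the candidate whose
# *trained policy* scores best on a fixed task metric (e.g. seconds balanced
# on the ball). train_policy and evaluate_policy are placeholders.

def select_best(reward_candidates, train_policy, evaluate_policy):
    scored = []
    for reward_fn in reward_candidates:
        policy = train_policy(reward_fn)            # e.g. PPO in simulation
        scored.append((evaluate_policy(policy), policy, reward_fn))
    return max(scored, key=lambda t: t[0])          # (score, policy, reward_fn)
```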
If you're interested in methods for actually learning policies for these sorts of dynamic motions, note that this paper is simply applying proximal policy optimization (PPO, sketched below). They're pulling in the training and implementation methods from Margolis's [1] and Shan's [2] work.
So, in sum, the contribution of this paper is exclusively the method for generating reward functions (which is still pretty cool!!!!!), not all the learning-based policy stuff.
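For anyone who hasn't seen PPO before, the core of it is the clipped surrogate objective. A bare-bones sketch of that objective (the generic algorithm from Schulman et al., 2017, not this paper's training code):

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective (Schulman et al., 2017), to be minimized."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # pi_new / pi_old per action
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()       # negate: optimizer minimizes
```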
Transformers don't care if it is simulation or real data or even if you are using a different robot altogether. At some point it will be possible to show a robot a human demonstration and let it learn from that.
This looks very singularity-ish to me: an LLM can write a reward function, a policy can be trained against it in simulation, and that trained policy works surprisingly well in the physical world.
Some of the videos raised questions in my mind as to whether the leash was doing stabilization work on the robot; it might be. But, if you watch the video where they deflate a yoga ball, you can see the difference between when they're "saving" the robot and when they're just keeping the leash taut. The 'dextrous cube' manipulation video is also pretty compelling.
This sort of 'automate a grad student' work (in this case, coming up with a reasonable reward function) stacks up over time. And I'd bet a lot of people's priors were "this probably won't work," so it's good to see that it can work in some circumstances -- that will save time and effort down the road.
Every single second of every example has a handler holding a leash - and not just holding it, holding it without any slack.
Blindingly obvious opening for interference via the Ouija board effect (the handler unconsciously steadying the robot).
I don't mean to denigrate the work; I believe the researchers are honest, and I hope there are demos beyond the published ones. It's just, at best, an obvious unforced error that leaves a big question open.
EDIT: A replier below shared a gif with failures; tl;dr, this looks like two different experiment protocols, one for successes and one for failures. https://imgur.com/a/DmepBVU
I agree it's hard to tell whether the controller learned with DrEureka would be sufficient without the leash, but I'm at least convinced that the leash is not sufficient to hold a robot on the ball without a decently competent controller.
In a scene reminiscent of the giant boulder rolling after Indiana Jones, a robot dog balances on top of an enormous rubber ball rolling down the streets of some big city, flattening everything in its way.
[1] https://en.wikipedia.org/wiki/Metalhead_(Black_Mirror)