Learning to Estimate 3D Hand Pose from Single RGB Images (uni-freiburg.de)
101 points by hellomichibye on May 4, 2017 | 24 comments



I'm glad they're making inroads here, because gesture and pose calculation and tracking are awesome.

But it also seems like more of the black-box magic math of neural networks. I'm glad it gets results and all, but it just seems inelegant. What's the algorithm really doing underneath? Did it figure out the joints and their degrees of freedom?

The results are spectacular, indeed. But it also seems a bit unscientific. It's more as if an oracle system was able to deduce the results, and we're no closer to understanding how to do this ourselves. We just have a trained system that can.


Maybe the problem itself is not elegant (hence why it hasn't been solved in closed form)? Neural networks learn thousands of different algorithms in parallel. There's probably not a single idea behind it; there may be thousands. It's more like experimental science: the system is there, it works, and someone else needs to analyze it to figure out how it does what it does.


That's my point. Anybody can shove in gigabytes of data, clean it up some, provide easily digestible samples, set up a whole lot of parameters (or let the system decide), and scrape the results out.

How is it working? Under what conditions does it work or fail? Does it work for Black people? Does it work for women's hands? (Or for anyone outside the university?) Does it handle people with hand deformities or missing digits?

That's right... we don't know unless we test it. And only by adding more data can we even answer those questions. We have no clue how it's working, what features it's using, or anything else. Just that it does work, and that it doesn't under conditions we're unsure of.
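
A minimal sketch of the kind of slice testing those questions call for, assuming a labelled test set that carries demographic/condition metadata; the model interface, the sample fields, and the attribute names here are all hypothetical:

    import numpy as np
    from collections import defaultdict

    def mean_keypoint_error(model, samples):
        # Average Euclidean error between predicted and ground-truth 3D keypoints.
        errors = [np.linalg.norm(model(s["image"]) - s["keypoints_3d"], axis=-1).mean()
                  for s in samples]
        return float(np.mean(errors))

    def evaluate_by_group(model, test_set, attribute):
        # Split the test set on one metadata attribute and score each slice separately.
        groups = defaultdict(list)
        for sample in test_set:
            groups[sample["meta"][attribute]].append(sample)
        return {group: mean_keypoint_error(model, samples)
                for group, samples in groups.items()}

    # e.g. evaluate_by_group(model, test_set, "skin_tone")
    #      evaluate_by_group(model, test_set, "missing_digits")

The per-slice numbers only mean something if each slice is actually represented in the test data, which is the point about needing more data.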

Now, this is a great starting point for determining the underlying math. But even the compute cost seems high compared with what it could ideally be, if we understood what was going on.


Crazy brute force idea for analyzing a neural net, from someone who has little experience with neural nets:

Graph the response of the neural network, over the range of the stimuli that you care about. This is going to be a ridiculously huge dataset, but bear with me. Then, use a genetic algorithm to evolve equations that have reasonably similar behavior, perhaps over a much smaller domain.

This collection of equations and their valid input ranges are the raw material for your program. You would simplify them using algebraic solvers when possible, and attempt to hand-optimize them for readability. Then, when you are done, the whole thing gets compiled down to a big switch-case statement in, say, C. From here on, the process looks sort of like yacc.
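
For what it's worth, here's a minimal sketch of the first two steps, with a toy one-dimensional function standing in for the network and a fixed basis of candidate terms standing in for properly evolved equation trees (every name and setting here is a made-up placeholder, just to make the idea concrete):

    import numpy as np

    rng = np.random.default_rng(0)

    def black_box(x):
        # Stand-in for the trained network's response along one input dimension.
        return np.tanh(2.0 * x) + 0.1 * x**2

    # 1. Graph the response over the range of stimuli we care about.
    xs = np.linspace(-3.0, 3.0, 512)
    ys = black_box(xs)

    # 2. Candidate "equations": weighted sums of a small fixed basis.
    B = np.stack([np.ones_like(xs), xs, xs**2, np.tanh(xs), np.sin(xs)], axis=1)

    def fitness(w):
        # Lower is better: mean squared error of the candidate against the samples.
        return np.mean((B @ w - ys) ** 2)

    # 3. Bare-bones mutation-only evolutionary loop over the basis weights.
    pop = rng.normal(size=(64, B.shape[1]))
    for generation in range(200):
        scores = np.array([fitness(w) for w in pop])
        parents = pop[np.argsort(scores)[:16]]                      # keep the best quarter
        children = parents[rng.integers(0, 16, size=48)] \
                   + rng.normal(scale=0.1, size=(48, B.shape[1]))   # mutate random parents
        pop = np.vstack([parents, children])

    best = min(pop, key=fitness)
    print("coefficients for [1, x, x^2, tanh x, sin x]:", np.round(best, 3))

A real attempt would evolve expression trees rather than weights over a fixed basis, and would have to cope with image inputs rather than a scalar, which is where the "ridiculously huge dataset" part bites.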

So, by adding a whole new layer of magic, we get the system to explain itself in a way that a programmer could understand. Come to think of it, this feels sort of like how I do personal introspection. In fact, I'm doing it right now.


This is kind of close to a relatively new OpenAI paper that uses annealing/genetic algorithms to make a great reinforcement-learning algorithm.

I do love the idea of machines figuring out the simplest model with maximum accuracy. When we can do that across different domains using the same algorithm, then we can say we have figured out the secrets of intelligence.


Deep Neural Nets are inelegant now? Have you tried building effective DL models?

In fact it's more like science than you let on. You form a hypothesis about the data you need, test it, evaluate the results, test again, and so on, and end up with optimized weights.

The elegance is in the architecture.


I think this post would be a lot more informative with the original title "Learning to Estimate 3D Hand Pose from Single RGB Images"

The results are astounding for anyone who's tried to do similar work in the past. I'm so pleased that they made the code available.


Agreed, thanks! We've updated the title from “A 2D Image of your hand and TensorFlow tell you everything”.


dang, sctb, et al: Please update the title to match that of the original article, per iaw's comment above. Thanks!


Interesting.

I just finished reading a book called The Hand, by Frank R. Wilson. The author tells how the human hand has shaped our evolution and suggests, among other things, that the development of language came from our hands.

A bit heavy at times, but nevertheless warmly recommended.

http://www.penguinrandomhouse.com/books/191866/the-hand-by-f...


The Vive already has a camera... could this be applied there?

What are some other applications?


Okay, but depth cameras exist, full stop, and they obviate the need for an entire GPU's worth of processing. It's why Uber and Waymo are sinking billions into cheap, solid-state LIDAR. Actual measured sensor data is just so much better.


The challenge with LIDAR (for the foreseeable future) is ubiquity. If you're building a car, then sure, LIDAR, no questions asked. If you're trying to pair this with, say, a structure-from-motion pipeline for photogrammetry targeting mobile phones as capture devices, then this becomes very, very important.

The best data for many applications is often the data you can get at scale.


You'd prefer a $10,000 LIDAR over an $8 mobile GPU?


It's hard to make a small active sensor that produces enough light to work outdoors on a sunny day.


Here's something from six-ish years ago with similar (and faster) results: https://www.youtube.com/watch?v=qok636pe_qw


The current title "A 2D Image of your hand and TensorFlow tell you everything" is a little head-scratching. "2D image" as opposed to "stereo" 2D images (or depth-mapped image?). And "tell you everything" makes it sound like some kind of fortune teller.


I don't think this is all that amazing.

Human fingers only have a very narrow range of motion in a single direction. When the 2D profile of the hand changes, the set of finger movements to arrive at that profile is fairly predictable.


You think so in the first five minutes of thinking about it, then you try it, and then you realize why lots of smart people have struggled to improve this over the last few decades.

Computer vision is full of problems that turn out to be a lot harder than they look. Part of the reason is that our intuition about this is skewed by having access to a very good vision processing unit that we can't reproduce :)


> "Human fingers only have a very narrow range of motion in a single direction"

Each finger has three joints that can bend in and a little bit back. Each finger can move from side to side. There is a small amount of rotation around the base of the finger. The thumb has three joints, the bottom of which is opposable to create a gripping hand. The entire hand can bend, yaw and rotate with respect to the arm, and there is further flexibility in the bones of the palm.
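
A back-of-the-envelope tally of those degrees of freedom makes the point; the joint counts below are the usual simplified hand model (figures in the hand-tracking literature typically land in the 21-27 DOF range), not anything specific to the linked paper:

    fingers = 4 * (3 + 1)   # per finger: 3 flexion joints + 1 side-to-side   = 16
    thumb   = 3 + 2         # 3 flexion joints + 2 DOF at the opposable base  =  5
    wrist   = 3 + 3         # hand pose relative to the arm: rotation + translation
    print(fingers + thumb + wrist)   # 27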

If you doubt how complicated the hand is, download a free 3D figure posing system like Daz Studio [1], instantiate a figure, and then try to make a hand grip a small object, or gesture (e.g. v-sign, 'ok', hitchhiking, vulcan greeting).

[1] https://www.daz3d.com/


A lot of the variables you listed are not relevant. The solution linked here only derives the 3D positions of the fingers, not the other nuances that can be expressed by the hand's muscular and skeletal system.


As someone who's been keeping tabs on the field for the past 20 years since briefly bouncing through it at university, you're dead wrong. It's one of those things that sounds easy, and... just isn't. The shots of predicting the partially occluded pose are a bit special, too.


Have you tried to implement vision algorithms? Especially those prior to the prevalence of GPU-based neural networks?

Even just ten years ago it was non-trivial to predict the shape of a room from a single 2D RGB image.


Try it and see. You're right that skeletal constraints help.



