TensorFlow Graphics: Computer Graphics Meets Deep Learning (medium.com/tensorflow)
239 points by lelf on May 9, 2019 | 25 comments



This is a really sharp approach, where synthetic scenes are rendered and used to train a system so it can better understand real scenes. It reminds me of the bit with Deep Thought in HHGTTG, where "even before the data banks had been connected up it had started from 'I think therefore I am' and got as far as the existence of rice pudding and income tax before anyone managed to turn it off." This isn't quite that, but it's a good step in that direction.


This could be used that way, and there is some precedent, e.g. in RenderGAN (https://arxiv.org/abs/1611.01331). Note the GAN part: that's why the "differentiable" part is important, so you can integrate rendering into a trainable network. Otherwise, you don't really need the "differentiable" part, and in fact using synthetic renderings from traditional, established renderers (both rasterizing and raytracing) has been a pretty active area for a while now.

There are also some approaches like SimGAN (https://machinelearning.apple.com/2017/07/07/GAN.html), but those don't really use differentiable rendering -- they just use traditional rendering, then apply a 2D GAN on top to make the renderings look more realistic (from a CNN's point of view, at least).


To me it sounded really similar to a Generative Adversarial Network (a GAN). With a GAN you have one network whose job is to classify an image (is this picture really a person?) and another whose job is to essentially fool the classifier (generate an image that looks like a person).

This case is a little bit of the reverse, in that it's focused on making the computer vision component (the discriminator) try to match the visual content that has already been generated.

Seems like these types of "adversarial" approaches will be used in lots of different domains, as so far they've produced some pretty amazing results.
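For concreteness, here is a minimal generator-vs-discriminator training step in TensorFlow. It is a generic sketch, not anything from the article; the architectures, sizes, and learning rates are placeholders.

    # Minimal GAN training step (generic sketch; all shapes/architectures are
    # placeholders). The discriminator learns to tell real images from
    # generated ones; the generator learns to fool it.
    import tensorflow as tf

    generator = tf.keras.Sequential([
        tf.keras.layers.Dense(7 * 7 * 64, activation="relu", input_shape=(100,)),
        tf.keras.layers.Reshape((7, 7, 64)),
        tf.keras.layers.Conv2DTranspose(1, 4, strides=4, activation="sigmoid"),
    ])
    discriminator = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1),   # real-vs-fake logit
    ])
    bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    g_opt = tf.keras.optimizers.Adam(1e-4)
    d_opt = tf.keras.optimizers.Adam(1e-4)

    def train_step(real_images):
        noise = tf.random.normal([tf.shape(real_images)[0], 100])
        with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
            fake_images = generator(noise)
            real_logits = discriminator(real_images)
            fake_logits = discriminator(fake_images)
            # Discriminator: push real towards 1 and fake towards 0.
            d_loss = (bce(tf.ones_like(real_logits), real_logits)
                      + bce(tf.zeros_like(fake_logits), fake_logits))
            # Generator: make the fakes score as real.
            g_loss = bce(tf.ones_like(fake_logits), fake_logits)
        d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                                  discriminator.trainable_variables))
        g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                                  generator.trainable_variables))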


Speaking of doing things the other way around, could ML techniques be useful for tuning shader parameters and light placements in order to make a 3d scene modeled by a human look as close as possible to a reference photo?

(If so, it should probably be done with multiple reference photos from different angles, to ensure the shaders and lights aren't adjusted so that the scene only looks good from the single angle the computer was viewing while it was tweaking.)


Would be even nicer if it could be trained on unpaired datasets (à la CycleGAN, https://arxiv.org/abs/1703.10593).


Sure, I can imagine that. Would be surprised if there aren't SIGGRAPH papers to that effect. At least initially, it would be more of an optimization problem, but ML could help as well.
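As a rough sketch of that optimization framing (everything here is hypothetical: `render` stands in for any differentiable renderer, and the light/material parameters are just examples), you could fit the parameters against several reference photos at once so the result doesn't overfit a single viewpoint:

    # Sketch: fit light and material parameters to several reference photos by
    # gradient descent through a differentiable renderer. `render` is a
    # hypothetical stand-in (not a real TensorFlow Graphics API); the
    # parameters are purely illustrative.
    import tensorflow as tf

    def fit_scene_parameters(render, cameras, reference_photos, steps=500):
        """cameras: per-photo camera descriptions; reference_photos: matching images."""
        light_position = tf.Variable([0.0, 2.0, 2.0])   # initial guess
        albedo = tf.Variable([0.5, 0.5, 0.5])           # example material parameter
        optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

        for _ in range(steps):
            with tf.GradientTape() as tape:
                # Average the photometric error over all viewpoints so the fit
                # isn't tuned to look right from only one camera angle.
                loss = 0.0
                for camera, photo in zip(cameras, reference_photos):
                    rendered = render(light_position, albedo, camera)
                    loss += tf.reduce_mean(tf.square(rendered - photo))
                loss = loss / len(cameras)
            variables = [light_position, albedo]
            grads = tape.gradient(loss, variables)
            optimizer.apply_gradients(zip(grads, variables))
        return light_position, albedo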


Pretty cool to see mainstream support arrive for differentiable graphics. It's been incubating for years.


This is exciting as the first "industrially supported" framework for this stuff. There's research code out there, but it really is research code – it was written to publish, not written to build products around.


This looks very cool, but can someone with insight into the topic explain why the graphics part needs to be differentiable?

Couldn't we just automatically generate lots of graphics renderings (that are by definition already perfectly labelled) and use them to train a ML model? That wouldn't require any differentiability, would it?

So how is this approach different?


Differentiable graphics rendering allows you to transform the difference (what you got vs. what you wanted) in the resulting image back into a correction of the underlying scene data that you rendered.

It allows you to get from "these 1000 output pixels were wrong" to "the truck in the scene model actually was two inches to the left compared to what I expected/predicted" or "the texture of that apple should be changed this way to match reality".

You wouldn't use it to tweak labels for image classification tasks, but to learn better underlying models of physical reality and behavior.
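As a toy illustration (with a hypothetical `render` function standing in as the differentiable renderer), the gradient of a pixelwise loss with respect to an object's position is exactly that "move the truck this way" signal:

    # Toy illustration: turn "these pixels are wrong" into "move the object".
    # `render` is a hypothetical differentiable renderer that takes a 3D position.
    import tensorflow as tf

    def position_correction(render, observed_image):
        position = tf.Variable([0.0, 0.0, 0.0])   # current estimate in the scene model
        with tf.GradientTape() as tape:
            rendered = render(position)
            pixel_loss = tf.reduce_mean(tf.square(rendered - observed_image))
        # Gradient w.r.t. the 3D position rather than the pixels: it says which
        # way (and roughly how far) the object should move in the scene.
        return tape.gradient(pixel_loss, position)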


I think I need a higher level explanation.

Is the goal to tweak the rendering parameters until it matches a given input image? If yes, that would be inference, not learning. What am I missing?


Yeah, you can do that as an inference step, like you described, and that has its uses without any learning required. But you can also make the output of the renderer depend (differentiably) on parameters beyond whatever you have as input - and those parameters can be learned via gradient descent. For example, you could have a conventional CNN take an image of, say, a cube as input and predict its 3D pose with respect to the camera, feed that prediction to the renderer, and have the renderer render the cube. Then, during training, a pixelwise error can be computed and backpropagated to the CNN. At the end of the process, ideally you would have a network capable of predicting the pose of a cube (and rendering a close match) in one shot, without iterative parameter tweaking.
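A rough sketch of that cube example in TensorFlow; `render_cube` is a hypothetical differentiable renderer (not a real TensorFlow Graphics call), and the CNN is an arbitrary toy architecture:

    # Sketch of the described setup: a CNN predicts the 3D pose of a cube from an
    # image, a differentiable renderer re-renders the cube from that pose, and the
    # pixelwise error is backpropagated all the way into the CNN. No pose labels
    # are needed. `render_cube` is a hypothetical placeholder.
    import tensorflow as tf

    pose_net = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(128, 128, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(6),   # e.g. 3 rotation + 3 translation parameters
    ])
    optimizer = tf.keras.optimizers.Adam(1e-4)

    def train_step(images, render_cube):
        with tf.GradientTape() as tape:
            poses = pose_net(images)                 # predict a pose per image
            rendered = render_cube(poses)            # differentiable re-render
            loss = tf.reduce_mean(tf.square(rendered - images))   # pixelwise error
        grads = tape.gradient(loss, pose_net.trainable_variables)
        optimizer.apply_gradients(zip(grads, pose_net.trainable_variables))
        return loss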


It sounds like the description of an auto-encoder. You encode the image of a cube into a representation of 3D points, then a decoder transforms it back into an image of a cube. Am I correct?

Edit: never mind, the article does state that it is similar to an autoencoder.


So, in that case, could we learn without needing a differentiable renderer or differentiable graphics operators? Maybe the overhead of communicating with an external renderer is too high compared with the iterative parameter tweaking happening inside the CNN's loss function?


> could we learn without needing a differentiable renderer or differentiable graphics operators?

No. The part of the parent post that says "a pixelwise error can be computed and backpropagated to the CNN" is possible only if the renderer is differentiable.


Got it. So suppose we have an external renderer: we could learn parameters that tweak the rendered scene, get the rendered pixels back, and then calculate pixelwise errors between them and some target image we're trying to optimize for. In that setup, do we still need a differentiable renderer, in your opinion?

Update:

It would require more training cycles and would not be as "atomic" as iterative tweaks, but it seems possible.

At the same time, I wonder whether making the loss function talk to some external renderer would make it possible to mix both approaches.


How would you learn parameters to tweak the rendered scene if the renderer is not differentiable, and you can't backpropagate through the renderer to calculate the appropriate parameter adjustments from the pixelwise errors?

I suppose you theoretically could do it with some trial-and-error method or grid search or something like that, but it's going to be computationally infeasible in the general case; the pixelwise errors only become practically useful if you have an uninterrupted differentiable/'backpropagatable' path from your parameters to the pixels.
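As a concrete illustration of the cost (with hypothetical `render` and `scene_params`; not anything from the article), the closest black-box substitute is a finite-difference estimate, which needs one full extra render per parameter per step:

    # Finite-difference gradient estimate through a non-differentiable, black-box
    # renderer. `render` and `scene_params` are hypothetical; the point is the
    # cost: one extra render per scene parameter, every step.
    import numpy as np

    def finite_difference_grad(render, scene_params, target_image, eps=1e-3):
        base_loss = np.mean((render(scene_params) - target_image) ** 2)
        grad = np.zeros_like(scene_params)
        for i in range(scene_params.size):           # one render per parameter
            perturbed = scene_params.copy()
            perturbed[i] += eps
            loss = np.mean((render(perturbed) - target_image) ** 2)
            grad[i] = (loss - base_loss) / eps
        return grad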


Yes, the overhead would be larger and we would lose the backprop path like you said, but it seems practical in some ways, and it's actually the idea guiding approaches like Meta-Sim (https://nv-tlabs.github.io/meta-sim/).


That's one use case - quoting the article, "[...] analysis by synthesis where the vision system extracts the scene parameters and the graphics system renders back an image based on them. If the rendering matches the original image, the vision system has accurately extracted the scene parameters." - so this allows you to learn an image-analysis system from unlabeled data in an iterative manner, rendering what you "understood", looking for differences, and adjusting your understanding.
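A compact sketch of that loop (with hypothetical `vision_net` and `render` placeholders): take the vision system's parameter estimate, re-render, compare with the original image, and refine the estimate until the rendering matches:

    # Analysis-by-synthesis refinement sketch. `vision_net` and `render` are
    # hypothetical placeholders for the vision system and a differentiable renderer.
    import tensorflow as tf

    def refine_scene_estimate(vision_net, render, image, steps=100):
        params = tf.Variable(vision_net(image))      # initial "understanding"
        optimizer = tf.keras.optimizers.Adam(0.01)
        for _ in range(steps):
            with tf.GradientTape() as tape:
                mismatch = tf.reduce_mean(tf.square(render(params) - image))
            optimizer.apply_gradients([(tape.gradient(mismatch, params), params)])
        return params                                # refined scene parameters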

There are other use cases, for example, rendering "what-if" scenarios for reinforcement learning - e.g. you may have a model that predicts that if an agent does action sequence ABC then it will result in a world state X, and it can render an image Y which it expects to perceive. When the agent actually performs these actions, it obtains a different image Y' ... and needs to update its prediction, so it'd need differentiation/backpropagation through that rendering (and beyond) to figure out how it could have predicted the correct world state; this feature is what allows the predictive model to be learned.


Why does Light and Materials use the Google Keep logo? :)


Nice! I played with OpenDR (http://files.is.tue.mpg.de/black/papers/OpenDR.pdf) a few years ago and got really excited about it. Unfortunately it uses a custom autodiff implementation that made it hard to integrate with other deep learning libraries. PyTorch still seems to be lagging in this area, but there are some interesting repos on GitHub (e.g. https://github.com/daniilidis-group/neural_renderer).


Wonderful work. I’m curious about the representation. Does it take a scene graph and infer a scene graph? Can you predict future scenes, given a sequence of previous scenes?

Seems like a nice way to debug and create more complex models.


Really cool. The "analysis by synthesis" approach can probably increase the performance of networks. I am looking at autonomous driving scenarios where it can remove false object detections.


I wonder if capsule networks would be another good, maybe even better, approach for scene reconstruction, as opposed to pooling convnets.


Could you elaborate more?



