Hacker News
Text to Image Synthesis Using Thought Vectors (github.com/paarthneekhara)
153 points by piyush8311 on Aug 26, 2016 | 21 comments



It's a little tricky getting this to work because you need two separate models working together, but I tried it out. Here are some of the samples I generated:

https://imgur.com/Uwp1wfu

https://imgur.com/yuW9Yre

https://imgur.com/oZ4wzdC some definite weaknesses in the natural language embedding

https://imgur.com/MAupphr roses in general don't seem to work well; there must not have been many in the dataset

You can see that it works better than one would expect, but there are definitely limits to the understanding. The flower and COCO datasets are, ultimately, not that big. What would be exciting is if you could train it on some extremely large and well-annotated dataset like Danbooru.
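For anyone wondering how the two models fit together: the caption goes through a pretrained skip-thought encoder to get a sentence vector, which is compressed and concatenated with noise before being fed to a DCGAN-style generator that outputs a 64 x 64 image. A rough PyTorch sketch of that shape (the repo itself is TensorFlow, and the layer sizes and compression dimension here are illustrative guesses rather than the author's exact values):

    # Rough sketch of the text-conditional generator idea (illustrative only;
    # the repo is TensorFlow and these layer sizes are guesses, not the author's).
    import torch
    import torch.nn as nn

    class TextConditionedGenerator(nn.Module):
        def __init__(self, text_dim=4800, compressed_dim=256, noise_dim=100):
            super().__init__()
            # Compress the large skip-thought vector before conditioning on it.
            self.compress = nn.Sequential(
                nn.Linear(text_dim, compressed_dim),
                nn.LeakyReLU(0.2),
            )
            # DCGAN-style upsampling: 1x1 -> 4x4 -> 8x8 -> 16x16 -> 32x32 -> 64x64.
            self.net = nn.Sequential(
                nn.ConvTranspose2d(compressed_dim + noise_dim, 512, 4, 1, 0),
                nn.BatchNorm2d(512), nn.ReLU(),
                nn.ConvTranspose2d(512, 256, 4, 2, 1),
                nn.BatchNorm2d(256), nn.ReLU(),
                nn.ConvTranspose2d(256, 128, 4, 2, 1),
                nn.BatchNorm2d(128), nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 4, 2, 1),
                nn.BatchNorm2d(64), nn.ReLU(),
                nn.ConvTranspose2d(64, 3, 4, 2, 1),
                nn.Tanh(),
            )

        def forward(self, caption_embedding, noise):
            cond = self.compress(caption_embedding)   # (B, compressed_dim)
            x = torch.cat([cond, noise], dim=1)       # (B, compressed_dim + noise_dim)
            return self.net(x[:, :, None, None])      # (B, 3, 64, 64)

    # A caption embedding would normally come from the pretrained skip-thought
    # encoder; a random vector stands in for it here.
    g = TextConditionedGenerator()
    fake = g(torch.randn(1, 4800), torch.randn(1, 100))
    print(fake.shape)  # torch.Size([1, 3, 64, 64])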


One possible improvement would be training the text embeddings along with the entire model (instead of using pretrained embeddings like skip-thought vectors). It's on my to-do list; I'll try it out.
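In case it helps to picture the change: training the text embeddings along with the model would mean swapping the frozen skip-thought encoder for a trainable one whose parameters receive gradients from the GAN losses. A minimal PyTorch sketch, assuming tokenized captions, with made-up vocabulary and dimension sizes (the linear layer is just a stand-in for the real generator):

    # Sketch of a trainable caption encoder replacing frozen skip-thought vectors.
    # Vocabulary size, embedding dim, and hidden dim are placeholders.
    import torch
    import torch.nn as nn

    class TrainableCaptionEncoder(nn.Module):
        def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

        def forward(self, token_ids):          # (B, T) integer token ids
            emb = self.embed(token_ids)        # (B, T, embed_dim)
            _, h = self.rnn(emb)               # h: (1, B, hidden_dim)
            return h.squeeze(0)                # (B, hidden_dim)

    encoder = TrainableCaptionEncoder()
    generator = nn.Linear(256 + 100, 64 * 64 * 3)  # stand-in for the real G

    # The key difference from using pretrained embeddings: the encoder's
    # parameters sit in the same optimizer as the generator, so the GAN loss
    # shapes the text representation too.
    opt_g = torch.optim.Adam(
        list(generator.parameters()) + list(encoder.parameters()), lr=2e-4
    )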


I think the idea is interesting but I'm not convinced it really "synthesizes ideas" so much as treats the neural network like a database of images that it mixes.

Now, I could be wrong, but the way the results are presented doesn't tell me that it's any good at picking up the meaning of the phrase. The results show a single phrase and the set of images it generates: "white flower with yellow center", and a bunch of images of white flowers.

But if it can synthesize the idea properly, one should be able to generate flowers from a variety of descriptions: yellow flower with blue center, red flower with yellow center, blue flower with black edges and black center, etc.

From the way they describe the functionality, it should be able to do these things, so in a way I don't doubt it; but I want to see how it performs on phrases that combine ideas in ways well outside the training set while still referring to individual ideas within it.


How do you "synthesize ideas" if not by combining parts of your own personal database of images/concepts? Even in your example of "X flower with Y center [and Z features]" you start with a mental picture of a flower you've seen (or a generalisation from many flowers you've seen) and then modify it with your mental picture of colours X and Y and features Z.


>How do you "synthesize ideas" if not by combining parts of your own personal database of images/concepts?

Procedural generation can be far more complex than just linear blending, which is all that a shallow net can do. For example, consider the full generative process that creates a frame of the game No Man's Sky. It is enormously more complex than a simple shallow net that can only do linear blends of previous examples: many, many nonlinear processing steps to go from a small random seed to intermediate databases for terrain and objects and finally down to pixels.

If you look at the actual net design used here, it's only a few layers deep, and not very big. Much, much closer to 'linear blending' than what our brains do (which is presumably vaguely closer to what No Man's Sky does).


This is lovely. As a lazy programmer, I would appreciate this as a web service. Instead of googling for an image to steal as placeholder art, I could request a uniquely generated image.


I think few have grasped how much future output will be machine-generated from existing work, and yet, rather than violating copyright, it will almost be a necessity for ensuring that a copyright somewhere is not broken.


That's an interesting angle that I hadn't considered. I was discussing overfitting with a coworker who is into machine learning: in a flawed implementation, the output could end up perfectly matching the input.


I'm literally working on this now as a side project. If the avenue I'm exploring is successful, the concept of hiring artists will be completely changed.


I can see this being useful for police sketch artists.


There is still a long way to go to be able to do that. The model currently generates 64 x 64 pictures and is trained on a very specific flower image dataset. Nevertheless, it would be a great idea to experiment with such a dataset (of sketches and descriptions) if one is available.


But the thing about machine learning is that once it works at all, "a long way to go" generally means "add more training data" rather than "we require significant conceptual breakthroughs".


That is not generally true. For example, GANs work great on MNIST, pretty well on the flower dataset, and ok on bedrooms.

But the same techniques currently fail on ImageNet, which is actually a much larger dataset. "Add more training data" is not a magic solution that overcomes the limitations of your model.

In particular, if you look at the generative models these GANs map to, it makes sense that they can learn 2D shapes and texture patterns, but rendering a complex 3D scene with significant depth complexity and lighting interactions is an entirely different beast. That problem has been studied deeply in 3D computer graphics, and the generative programs that succeed there are vastly more complex than current GANs.


Seen this? https://arxiv.org/abs/1605.09304 (awesome generated images, from ImageNet)


A GAN would be helpful more for its latent space abilities than for any text-to-image capability. If you look at the GAN papers, you can see that they demonstrate how you can warp an image across multiple semantic dimensions like adding/removing eyeglasses, darkening hair, turning frowns into smiles, etc. This would be great for sketches because you can have the victim walk through the latent space by recognizing which version looks more like the attacker, since people can recognize faces far better than they can verbalize detailed descriptions.
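A toy sketch of what that latent-space walk could look like (the generator is a random stand-in, and the labeled latent codes, the attribute, and the step sizes are entirely hypothetical):

    # Toy sketch of latent-space editing for a face-sketch workflow.
    import torch

    torch.manual_seed(0)
    latent_dim = 100
    W = torch.randn(latent_dim, 64 * 64)
    generator = lambda z: torch.tanh(z @ W)  # stand-in for a trained GAN generator

    # Estimate an attribute direction (e.g. "has glasses") as the difference
    # between the mean latent code of examples with and without the attribute.
    z_with_attr = torch.randn(200, latent_dim)     # latents labeled "glasses"
    z_without_attr = torch.randn(200, latent_dim)  # latents labeled "no glasses"
    direction = z_with_attr.mean(0) - z_without_attr.mean(0)

    # Start from a candidate face and show the victim a few variants along the
    # direction; they pick whichever looks more like the person they remember.
    z = torch.randn(latent_dim)
    candidates = [generator(z + alpha * direction) for alpha in (-1.0, 0.0, 1.0)]

Repeating that pick-and-step loop over several attribute directions is essentially the "walk".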


Make sure your face isn't included in the learning set!


Probably by the time it gets good enough to be useful to them it will be good enough to replace them.


It would be cool to implement text to pizza image synthesis.


Hey, if you're gonna go in that direction why not just implement text-to-pizza synthesis where you say "I want a mushroom and jalapeno pizza with sundried tomatoes" and then it makes one for you.


What GPU was used to train this model?


I trained it on an AWS instance with a GRID K520 GPU.



