Hacker News
Text to Image Synthesis Using Thought Vectors (github.com/paarthneekhara)
153 points by piyush8311 on Aug 26, 2016 | 21 comments



It's a little tricky getting this to work because you need two separate models working together, but I tried it out. Here are some of the samples I generated:

https://imgur.com/Uwp1wfu

https://imgur.com/yuW9Yre

https://imgur.com/oZ4wzdC some definite weaknesses in the natural language embedding

https://imgur.com/MAupphr roses in general don't seem to work well; there must not have been many in the dataset

You can see that it works better than one would expect, but there are definitely limits to the understanding. The flower and COCO datasets are, ultimately, not that big. What would be exciting is if you could train it on some extremely large and well-annotated dataset like Danbooru.
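For anyone wondering how the two models fit together: the caption goes through a pretrained skip-thought encoder to get a sentence vector, which is compressed and concatenated with noise before being fed to a DCGAN-style generator that outputs a 64 x 64 image. A rough PyTorch sketch of that shape (the repo itself is TensorFlow, and the layer sizes and compression dimension here are illustrative guesses rather than the author's exact values):

    # Rough sketch of the text-conditional generator idea (illustrative only;
    # the repo is TensorFlow and these layer sizes are guesses, not the author's).
    import torch
    import torch.nn as nn

    class TextConditionedGenerator(nn.Module):
        def __init__(self, text_dim=4800, compressed_dim=256, noise_dim=100):
            super().__init__()
            # Compress the large skip-thought vector before conditioning on it.
            self.compress = nn.Sequential(
                nn.Linear(text_dim, compressed_dim),
                nn.LeakyReLU(0.2),
            )
            # DCGAN-style upsampling: 1x1 -> 4x4 -> 8x8 -> 16x16 -> 32x32 -> 64x64.
            self.net = nn.Sequential(
                nn.ConvTranspose2d(compressed_dim + noise_dim, 512, 4, 1, 0),
                nn.BatchNorm2d(512), nn.ReLU(),
                nn.ConvTranspose2d(512, 256, 4, 2, 1),
                nn.BatchNorm2d(256), nn.ReLU(),
                nn.ConvTranspose2d(256, 128, 4, 2, 1),
                nn.BatchNorm2d(128), nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 4, 2, 1),
                nn.BatchNorm2d(64), nn.ReLU(),
                nn.ConvTranspose2d(64, 3, 4, 2, 1),
                nn.Tanh(),
            )

        def forward(self, caption_embedding, noise):
            cond = self.compress(caption_embedding)   # (B, compressed_dim)
            x = torch.cat([cond, noise], dim=1)       # (B, compressed_dim + noise_dim)
            return self.net(x[:, :, None, None])      # (B, 3, 64, 64)

    # A caption embedding would normally come from the pretrained skip-thought
    # encoder; a random vector stands in for it here.
    g = TextConditionedGenerator()
    fake = g(torch.randn(1, 4800), torch.randn(1, 100))
    print(fake.shape)  # torch.Size([1, 3, 64, 64])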


One possible improvement would be training the text embeddings along with the entire model (instead of using pretrained embeddings like skip-thought vectors). It's on my to-do list; I'll try it out.
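In case it helps to picture the change: training the text embeddings along with the model would mean swapping the frozen skip-thought encoder for a trainable one whose parameters receive gradients from the GAN losses. A minimal PyTorch sketch, assuming tokenized captions, with made-up vocabulary and dimension sizes (the linear layer is just a stand-in for the real generator):

    # Sketch of a trainable caption encoder replacing frozen skip-thought vectors.
    # Vocabulary size, embedding dim, and hidden dim are placeholders.
    import torch
    import torch.nn as nn

    class TrainableCaptionEncoder(nn.Module):
        def __init__(self, vocab_size=5000, embed_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

        def forward(self, token_ids):          # (B, T) integer token ids
            emb = self.embed(token_ids)        # (B, T, embed_dim)
            _, h = self.rnn(emb)               # h: (1, B, hidden_dim)
            return h.squeeze(0)                # (B, hidden_dim)

    encoder = TrainableCaptionEncoder()
    generator = nn.Linear(256 + 100, 64 * 64 * 3)  # stand-in for the real G

    # The key difference from using pretrained embeddings: the encoder's
    # parameters sit in the same optimizer as the generator, so the GAN loss
    # shapes the text representation too.
    opt_g = torch.optim.Adam(
        list(generator.parameters()) + list(encoder.parameters()), lr=2e-4
    )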


I think the idea is interesting but I'm not convinced it really "synthesizes ideas" so much as treats the neural network like a database of images that it mixes.

Now, I could be wrong, but the way the results are presented doesn't tell me that it's any good at picking up the meaning of the phrase. The results show a single phrase and the set of images it generates: "white flower with yellow center", and a bunch of images of white flowers.

But if it can synthesize the idea properly, one should be able to generate flowers from a variety of descriptions: yellow flower with blue center, red flower with yellow center, blue flower with black edges and black center, etc.

From the way they describe the functionality, it should be able to do these things, so in a way I don't doubt it; but I want to see how it performs on phrases that combine ideas in ways well outside the training set while still referring to individual ideas within it.


How do you "synthesize ideas" if not by combining parts of your own personal database of images/concepts? Even in your example of "X flower with Y center [and Z features]" you start with a mental picture of a flower you've seen (or a generalisation from many flowers you've seen) and then modify it with your mental picture of colours X and Y and features Z.


>How do you "synthesize ideas" if not by combining parts of your own personal database of images/concepts?

Procedural generation can be far more complex than just linear blending, which is all that a shallow net can do. For example, consider the full generative process that creates a frame of the game No Man's Sky. It is enormously more complex than a simple shallow net that can only do linear blends of previous examples: many, many nonlinear processing steps to go from a small random seed to intermediate databases for terrain and objects and finally down to pixels.

If you look at the actual net design used here, it's only a few layers deep, and not very big. Much, much closer to 'linear blending' than what our brains do (which is presumably vaguely closer to what No Man's Sky does).


This is lovely. As a lazy programmer, I would appreciate this as a web service. Instead of googling for an image to steal as placeholder art, I could request a uniquely generated image.


I think few have grasped how much future output will be machine-generated from existing work, and yet, rather than violating copyright, it will almost be a necessity for ensuring that a copyright somewhere is not broken.


That's an interesting angle that I hadn't considered. I was discussing overfitting with a coworker who is into machine learning: in a flawed implementation, the output could end up perfectly matching the input.


I'm literally working on this now as a side project. If the avenue I'm exploring is successful, the concept of hiring artists will be completely changed.


I can see this being useful for police sketch artists.


There is still a long way to go to be able to do that. The model currently generates 64 x 64 pictures and is trained on a very specific flower image dataset. Nevertheless, it would be a great idea to experiment with such a dataset (of sketches and descriptions) if one is available.


But the thing about machine learning is that once it works at all, "a long way to go" generally means "add more training data" rather than "we require significant conceptual breakthroughs".


That is not generally true. For example, GANs work great on MNIST, pretty well on the flower dataset, and ok on bedrooms.

But the same techniques currently fail on ImageNet, which is actually a much larger dataset. "Add more training data" is not a magic solution that overcomes the limitations of your model.

In particular, if you look at the generative models these GANs map to, it makes sense that they can learn 2D shapes and texture patterns, but rendering a complex 3D scene with significant depth complexity and lighting interactions is an entirely different beast. That problem has been studied deeply in 3D computer graphics, and the generative programs that succeed there are vastly more complex than current GANs.


Seen this? https://arxiv.org/abs/1605.09304 (awesome generated images, from ImageNet)


A GAN would be helpful more for its latent space abilities than for any text-to-image capability. If you look at the GAN papers, you can see that they demonstrate how you can warp an image across multiple semantic dimensions like adding/removing eyeglasses, darkening hair, turning frowns into smiles, etc. This would be great for sketches because you can have the victim walk through the latent space by recognizing which version looks more like the attacker, since people can recognize faces far better than they can verbalize detailed descriptions.
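A toy sketch of what that latent-space walk could look like (the generator is a random stand-in, and the labeled latent codes, the attribute, and the step sizes are entirely hypothetical):

    # Toy sketch of latent-space editing for a face-sketch workflow.
    import torch

    torch.manual_seed(0)
    latent_dim = 100
    W = torch.randn(latent_dim, 64 * 64)
    generator = lambda z: torch.tanh(z @ W)  # stand-in for a trained GAN generator

    # Estimate an attribute direction (e.g. "has glasses") as the difference
    # between the mean latent code of examples with and without the attribute.
    z_with_attr = torch.randn(200, latent_dim)     # latents labeled "glasses"
    z_without_attr = torch.randn(200, latent_dim)  # latents labeled "no glasses"
    direction = z_with_attr.mean(0) - z_without_attr.mean(0)

    # Start from a candidate face and show the victim a few variants along the
    # direction; they pick whichever looks more like the person they remember.
    z = torch.randn(latent_dim)
    candidates = [generator(z + alpha * direction) for alpha in (-1.0, 0.0, 1.0)]

Repeating that pick-and-step loop over several attribute directions is essentially the "walk".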


Make sure your face isn't included in the learning set!


Probably by the time it gets good enough to be useful to them it will be good enough to replace them.


It would be cool to implement text to pizza image synthesis.


Hey, if you're gonna go in that direction why not just implement text-to-pizza synthesis where you say "I want a mushroom and jalapeno pizza with sundried tomatoes" and then it makes one for you.


What GPU was used to train this model?


I trained it on an AWS instance with a GRID K520 GPU.



