It's a little tricky getting this to work because you need two separate models working together, but I tried it out. Here are some of the samples I generated:
https://imgur.com/MAupphr Roses in general don't seem to work well; there must not have been many in the dataset.
You can see that it works better than one would expect, but there are definitely limits to its understanding. The flower and COCO datasets are, ultimately, not that big. What would be exciting is training it on an extremely large, well-annotated dataset like Danbooru.
One possible improvement would be training the text embeddings jointly with the rest of the model (instead of using pretrained embeddings like skip-thought vectors). It's on my to-do list; I'll try it out.
https://imgur.com/Uwp1wfu
https://imgur.com/yuW9Yre
https://imgur.com/oZ4wzdC some definite weaknesses in the natural language embedding
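Conceptually, the joint-training idea above comes down to whether gradients flow back into the text embedding table or stop at the generator. Here's a minimal numpy sketch of that difference; all names, dimensions, and the concatenation-based conditioning scheme are made up for illustration, not taken from the actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, emb_dim, z_dim = 10, 4, 3

E = rng.normal(size=(vocab, emb_dim))      # text embedding table (frozen or trainable)
W = rng.normal(size=(emb_dim + z_dim, 1))  # stand-in for the generator's first layer

def forward(token_ids, z):
    # Condition generation on text by concatenating the mean token
    # embedding with the noise vector z, as text-to-image GANs often do.
    text = E[token_ids].mean(axis=0)
    return np.concatenate([text, z]) @ W

tokens = np.array([1, 3, 5])
z = rng.normal(size=z_dim)
y = forward(tokens, z)                     # shape (1,)

# One SGD step on a dummy loss 0.5 * (y - 1)^2.
lr, err = 0.1, y - 1.0
x = np.concatenate([E[tokens].mean(axis=0), z])
grad_text = (W[:emb_dim] @ err) / len(tokens)  # gradient reaching each token embedding
W -= lr * np.outer(x, err)                     # the generator is updated in both setups
E[tokens] -= lr * grad_text                    # this line only exists with joint training
```

With pretrained skip-thought vectors the last line never happens, so the text representation can't adapt to the image-generation task; joint training lets it, at the cost of needing enough captioned data to learn good embeddings from scratch.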