I wonder if this kind of tech will ruin the stock photo industry - all it takes is a recreation of a model like this with a completely open license and someone willing to host it.
"Type what you want to see and get a drawing of it" is very powerful, even if its scope is still limited.
People have been trying this for a really long time and there are just a ton of business considerations that make it impractical.
1. People buy photography, even stock photography for ads or business uses, with some intention to affiliate with the photographer / artist or some notion of an artsy style or aesthetic. Autogenerated art starts right off the bat at a disadvantage for being “commodity” in nature, even compared to repetitive inventory of sites like Shutterstock. Maybe you can get past this for certain niche areas where the real photos are already exceedingly commodity, like backgrounds, office photos, landscapes. But even then, the status of an artist counts for a lot.
2. It’s not actually that cheap to operate generative image models at scale. You have to ensure that pre-generated content is of sufficient quality and covers sufficient subject-matter, compositional, and aesthetic variety. If content is generated on the fly, you’ll be dealing with pretty high throughput on a very resource-intensive model.
3. Competition can replicate your image model pretty easily, so your differentiator comes back to branding and a sense of “not commodity” quality, as well as all accompanying services and support, which is where all your operating costs come from anyway.
I am sure generative inventory will become a bigger trend in stock photography, but I doubt it will be much of a differentiator. If you run a stock photography business with more than a few million images already, you would be better served building ML solutions for search, discovery, keyword or caption annotation, abusive content detection, automated aesthetic enhancements or assisted editing tools & style transfer. There won’t be a “holy grail” of generated inventory. Most customers just won’t care.
It’s a good example of how impressive, exotic ML solutions can seem like they surely must have consumer applications, yet where they meet actual business concerns it just doesn’t matter. Monetizing ML solutions is really hard - much harder than creating the ML solutions in the first place.
Maybe generated images don't add much on top of a large collection of stock photos, but CV neural nets are very important in the post-editing, tweaking, and mixing of photos. A bad picture might become a good picture if you have the right tool.
What I'd like to see in generative models is the generation of diagrams, geometry figures, mind-maps, data and algorithm drawings based off text inputs. I want a super boosted imagination power in a box, a tool to model problems.
Sort of related, but I'm looking for a library or tool of some sort that I can use to generate landscape photos. I don't really have a lot of time to put into a project right now, so I was fantasizing about stumbling across some python library that's a wrapper for one of these models that's basically as simple as
  from landscape_gan import train, load_images  # hypothetical wrapper library

  all_photos = load_images("./photos")
  model = train(all_photos)
  image = model.generate()
StyleGAN was trained on something like 3 million images, and it's very difficult to fine-tune GANs on a much smaller set of images beyond that, unless perhaps your images are very similar to the ones StyleGAN was trained on. You'd likely need at least thousands, if not hundreds of thousands, of images to get the results you probably want. I've been successful with thousands, but not fewer than that.
I found StyleGAN to be pretty plug and play for training on 10,000 panels from a webcomic; here are some sample results (this was a while ago, and I’m sure StyleGAN2 would perform even better): http://hgreer.com/homestuck-gan/fakes007470.png
The article fails to highlight some of the important elements of the current state of image synthesis.
It's quite important to emphasise the dichotomy between the two current approaches to image synthesis today: 1. Implicit distribution learning - VAE or autoregressive based techniques 2. Explicit learning of the distribution - GAN based models. The way these two model the distribution is fundamentally different.
Fundamentally, GANs present huge drawbacks when it comes to actual inference during synthesis. There have been dozens of models with workarounds, but most of these introduce new challenges of their own, with instability and mode collapse being the primary ones.
VQVAE2, as the most advanced VAE-based technique, has eliminated major drawbacks of both VAE and GAN and has produced phenomenal quality [1].
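Roughly, the "VQ" part means the encoder output gets snapped to the nearest entry of a learned codebook before decoding, so a prior can then be learned over discrete codes. A sketch of the quantization step:

  z_q(x) = e_k, \quad k = \arg\min_j \| z_e(x) - e_j \|_2

where z_e(x) is the encoder output and {e_j} are the codebook vectors.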
However, the main challenge in the area is not synthesising just any kind of image; VQVAE2 already does that very well. Where none of the current techniques win today is multi-object image synthesis. That requires a new paradigm in architecture and distribution learning.
You mixed up implicit and explicit models. For anyone interested in the difference - implicit models such as GANs don't allow you to evaluate the probability density over datapoints - you can only sample from some surrogate model of the distribution learned by minimizing some 'distance' between the surrogate and the true empirical distribution.
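For concreteness, the canonical example is the original GAN minimax objective, in which the model density never appears at all:

  \min_G \max_D \; E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p(z)}[\log(1 - D(G(z)))]

You can sample G(z) freely, but there is no p_G(x) you can evaluate for a given image.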
'Explicit' models (I think this term is nonstandard) parameterize the density directly and modify the parameters via maximum likelihood. This allows one in theory to both directly evaluate the density and sample from the learned distribution. VAEs (only give a lower bound on the density), autoregressive models, and normalizing flows all fall under this category.
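As a concrete contrast, an autoregressive model factors the density exactly, while a VAE only gives a lower bound (the ELBO):

  \log p_\theta(x) = \sum_i \log p_\theta(x_i \mid x_{<i})

  \log p_\theta(x) \ge E_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - KL(q_\phi(z \mid x) \,\|\, p(z))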
Note that while it is theoretically possible for 'explicit' models to go in both directions (sample and evaluate), one direction may be much more efficient than the other for certain models. For autoregressive models, for example, the first two pages of [1] give a good explanation of why.
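That asymmetry is easy to see in code. A rough sketch, assuming some causal model that maps a (batch, length) prefix of discrete codes to (batch, length, vocab) logits (a stand-in, not any particular library):

  import torch

  def sample(model, length, bos=0):
      # Sampling is sequential: x_i can only be drawn once x_<i exist,
      # so each new position costs a full forward pass over the prefix.
      x = torch.full((1, 1), bos, dtype=torch.long)
      for _ in range(length):
          logits = model(x)                      # (1, len(x), vocab)
          probs = logits[:, -1].softmax(dim=-1)  # next-position distribution
          nxt = torch.multinomial(probs, num_samples=1)
          x = torch.cat([x, nxt], dim=1)
      return x[:, 1:]

  def log_likelihood(model, x, bos=0):
      # Evaluation is parallel: all conditionals p(x_i | x_<i) are scored in
      # one forward pass, because the conditioning values are already known.
      prefix = torch.cat([torch.full((x.shape[0], 1), bos, dtype=torch.long),
                          x[:, :-1]], dim=1)
      logp = model(prefix).log_softmax(dim=-1)   # (batch, length, vocab)
      return logp.gather(-1, x.unsqueeze(-1)).squeeze(-1).sum(dim=-1)

Generating a 32x32 grid of codes this way takes ~1024 sequential model calls, while scoring the same grid takes one.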
As shown in a figure in the VQGAN section, VQGAN offers superior quality over VQVAE2 for a given compute budget, and since the generator of VQGAN is based on an architecture similar to VQVAE-1/2 (like DALL-E), it does not suffer from the mode collapse or instability you mentioned.
~I think you have these two backwards: "1. Implicit distribution learning - VAE or autoregressive based techniques 2. Explicit learning of the distribution - GAN based models. The way these two model the distribution is fundamentally different."~
Edit: someone else pointed this out hours ago and provided a much more detailed answer.
Thanks for your feedback! Actually, when I wrote the article, I feared this might confuse the reader.
The reason why VQGAN (taming transformers) has good quality (possibly the best, as you said) is precisely because it uses an idea from GANs, not because of the quantized VAE. So this model is not really a VAE; it's a combination of VAE and GAN, just like the DC-VAE model that was also featured in the post.
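Concretely, the first stage of VQGAN is trained with the usual VQ-VAE reconstruction and codebook terms plus a patch-based adversarial loss, roughly

  L_{VQGAN} = L_{rec} + L_{VQ} + \lambda L_{GAN}

and the transformer prior is then learned over the resulting discrete codes.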
If you take a look at the VQGAN section, you can see a comparison between DALL-E and VQGAN: the former uses substantially more resources and no GAN technique, yet the latter shows much better quality, which demonstrates how much the GAN component contributes.
"Type what you want to see and get a drawing of it" is very powerful, even if its scope is still limited.