Perhaps I'm stating the obvious here, but why not train with a bunch of images recorded in front of an actual green screen? That way, you can insert any random background and generate as many new training images as you like.
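To illustrate the idea, here is a minimal sketch of that kind of augmentation: chroma-key the green pixels and paste in a random background. All names and the distance threshold are my own invention, and real pipelines would handle soft edges and green spill rather than a hard per-pixel mask.

```python
import numpy as np

def composite_green_screen(fg, bg, key=(0, 255, 0), tol=60):
    """Replace near-green pixels in `fg` with the matching pixels from `bg`.

    fg, bg: uint8 RGB arrays of the same shape.
    key:    chroma-key color to remove.
    tol:    per-channel color-distance threshold for "green enough".
    """
    fg16 = fg.astype(np.int16)
    # Euclidean distance of each pixel from the key color
    dist = np.linalg.norm(fg16 - np.array(key, dtype=np.int16), axis=-1)
    mask = dist < tol  # True where the pixel is screen background
    out = fg.copy()
    out[mask] = bg[mask]
    return out
```

Each green-screen foreground can then be composited onto arbitrarily many backgrounds to multiply the dataset, which is exactly where the objections below come in.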
The network is being trained for photographs, not CGI. I suspect composited images carry different cues than real photos, and training on them would produce a wildly different network. But the green screen idea is still an interesting and worthy proposal.
1. I don't think they have enough images taken in front of a green screen. Just swapping the background has diminishing returns, because the network may start to memorize the foreground subjects.
2. The network may key on lighting differences and other compositing artifacts, and fail to generalize to real photographs.