This is certainly interesting research, and as an artist I think I'd be hugely frustrated by the amount of non-local change which seems to happen in the video. A fair number of small pen strokes seem to affect a large part of the generated face.
For example, take the difference between 2:20 and 2:27 in the video. The upper half of the drawing hasn't changed, but the generated image has a lot more hair and different ears. While the technology looks impressive as it is, it seems to me that it would be better to leave areas the artist has barely defined as blurred, rather than flickering between various high-resolution features that all match the sketch roughly equally well.
The whole thing works on statistical priors: if I have feature A at location X, there's a 90% chance I should have feature B at location Y. So if the majority of the pictures of beards in my dataset were also, say, wearing sunglasses, then the net will probably output sunglasses whenever I freehand-draw a beard, even if I don't change the eyes!
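As a toy sketch of what I mean (the attribute names and counts here are made up, nothing from the paper), you can see the "prior" fall straight out of co-occurrence counts in the training set:

    # Toy illustration: correlated attributes in the training data become
    # the conditional priors a generator falls back on for unconstrained regions.
    dataset = [
        {"beard", "sunglasses"},
        {"beard", "sunglasses"},
        {"beard"},
        {"clean_shaven"},
        {"clean_shaven", "sunglasses"},
    ]

    def conditional(data, given, target):
        """Estimate P(target | given) by simple counting."""
        with_given = [img for img in data if given in img]
        if not with_given:
            return 0.0
        return sum(target in img for img in with_given) / len(with_given)

    # 2 of the 3 bearded faces also wear sunglasses, so a model fit to this
    # data tends to add sunglasses whenever the sketch implies a beard.
    print(conditional(dataset, "beard", "sunglasses"))        # ~0.67
    print(conditional(dataset, "clean_shaven", "sunglasses")) # 0.5

A real generative net is obviously doing something far richer than counting, but the failure mode the parent describes is the same: the sketch underconstrains a region, so the net fills it with whatever co-occurred most often in training.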
The solution is to ensure that you sample the full data space you wish to reproduce (not trivial). Neural nets do seem to interpolate, but this is a super-high-dimensional space, so it's not always intuitive: there are many orders of magnitude more directions in which you can move to get from point A to point B.
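A quick numpy sketch of the high-dimensional point (my own illustration, not anything from this work): random directions become nearly orthogonal to each other as dimension grows, which is part of why latent-space interpolation behaves so unintuitively.

    import numpy as np

    rng = np.random.default_rng(0)

    for dim in (2, 32, 512):
        # Sample random unit vectors and see how aligned a typical pair is.
        v = rng.standard_normal((1000, dim))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        cos = v @ v[0]                      # cosine similarity to the first vector
        print(dim, np.abs(cos[1:]).mean())  # mean |cos| shrinks toward 0 as dim grows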