Does anyone else notice that the example images they provided look like they included their test data in the training set?
E.g., the picture of the couch where they cut out the dog. How would the network know that there was a dog on the couch? The only explanation is that it has seen the reference image.
You give it a bunch of reference images, then another image with some rectangle removed, and it will fill in the rectangle with information from the reference image.
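The masking step of that setup can be sketched roughly as follows (this is only an illustration of how a rectangle is removed before inference; the function name and shapes are hypothetical, not from the paper):

```python
import numpy as np

def mask_rectangle(image, top, left, height, width):
    """Zero out a rectangular region; return the masked image and a binary mask."""
    masked = image.copy()
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[top:top + height, left:left + width] = True
    masked[mask] = 0
    return masked, mask

# Hypothetical usage: the network would receive `masked` together with the
# reference images and be asked to reconstruct the region covered by `mask`.
img = np.arange(6 * 8 * 3, dtype=np.uint8).reshape(6, 8, 3)
masked, mask = mask_rectangle(img, top=2, left=3, height=2, width=2)
```

The point of the complaint above is that nothing inside `masked` carries information about the removed content, so any correct fill of an object that was only ever visible in the cut-out must come from the reference images.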