> Hair, delicate clothes, tree branches and other fine objects will never be segmented perfectly, even because the ground truth segmentation does not contain these subtleties. The task of separating such delicate segmentation is called matting, and defines a different challenge.
I did an internship where one of my tasks was to automatically remove the background from X-ray images. I spent basically half a year studying image processing and segmentation, reading papers, skimming image processing books, etc., and never came across the word "matting", which was exactly what I was doing.
Somebody should've told me that earlier; it would've probably saved me a month.
Perhaps I'm stating the obvious here, but why not train with a bunch of images recorded in front of an actual green screen? That way, you can insert any random background and generate as many new training images as you like.
The network is being trained for photographs, not CGI. I suspect the different cues will end up producing wildly different trained networks. But the green screen idea is still an interesting and worthy proposal.
1. I don't think they have enough images taken in front of a green screen. Just changing the background has diminishing returns, because the network may start to memorize the foreground images.
2. The network may rely on differences in lighting, etc., and fail to generalize.
> Hair, delicate clothes, tree branches and other fine objects will never be segmented perfectly, even because the ground truth segmentation does not contain these subtleties. The task of separating such delicate segmentation is called matting, and defines a different challenge.
The state of the art in natural image matting already confronts fine details, as do image segmentation and clustering. Copying papers from 10-12 years ago would give much better results than what he shows here.
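For anyone unfamiliar with the term: matting models each pixel as a fractional blend of foreground and background rather than a hard label, roughly I_p = α_p·F_p + (1 − α_p)·B_p with α_p ∈ [0, 1], and solving for that per-pixel α is what recovers hair strands and other fine detail that binary segmentation throws away.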
To get more training data, you could have filmed using a green screen.
Insert whatever background you like (as long as the result looks natural). That way you can automatically get a good pixel map of the subject. And using a video camera at 30fps, you could get thousands of training images in just a few minutes.
Of course, you might run into an issue of overfitting (it might learn what you as an individual look like and not generalize to other people). However, as long as you green-screen a reasonably large number of people this shouldn't be an issue.
Edit: darn. Looks like I wasn’t the only person to think of this idea!
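A rough sketch of that pipeline, assuming OpenCV and an illustrative HSV chroma-key range (the thresholds and morphology below are placeholders you'd tune per lighting setup):

```python
import cv2
import numpy as np

def composite_training_pair(green_frame, background):
    """Chroma-key a green-screen frame, paste the subject onto a new
    background, and return the composite plus its ground-truth mask."""
    hsv = cv2.cvtColor(green_frame, cv2.COLOR_BGR2HSV)
    # Placeholder HSV range for a typical green screen; tune per setup.
    green = cv2.inRange(hsv, (35, 80, 80), (85, 255, 255))
    subject_mask = cv2.bitwise_not(green)        # 255 where the subject is
    # Remove speckles caused by uneven lighting on the screen.
    kernel = np.ones((5, 5), np.uint8)
    subject_mask = cv2.morphologyEx(subject_mask, cv2.MORPH_OPEN, kernel)

    # Resize the new background to the frame size and blend the subject in.
    background = cv2.resize(background, green_frame.shape[1::-1])
    blend = cv2.merge([subject_mask] * 3).astype(np.float32) / 255.0
    composite = (green_frame * blend + background * (1.0 - blend)).astype(np.uint8)
    return composite, subject_mask
```

Each frame/background pairing yields a composite plus a pixel-accurate ground-truth mask, so a few minutes of 30 fps footage against a handful of backgrounds already gives thousands of labeled examples.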
I think this is some excellent work, and anything that can help extract items of interest from the background is going to be useful for a lot of different scenarios. That said, I can't help but imagine the mayhem of this combined with the deepfake stuff, where one might put a target person's face on a porn actor and then transport that scene to a place where the person might actually be found by replacing the background. Scary indeed.
It says later in the article that solving for those sorts of details is out of scope for the project - and is a different problem altogether (matting).
We implemented background removal in an iOS app recently. We went down a similar route, but ended up choosing a user-directed GrabCut (heavily modified).
It would be interesting to take the output of this and use its alpha mask as the starting point for the GrabCut mask.
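A minimal sketch of that combination, assuming the stock OpenCV GrabCut (not our modified version) and a network that outputs an alpha mask in [0, 1]; the 0.1/0.5/0.9 thresholds are just illustrative:

```python
import cv2
import numpy as np

def refine_with_grabcut(image_bgr, alpha, iterations=5):
    """Seed GrabCut's label mask from a predicted alpha matte, then let
    GrabCut refine the uncertain regions around the boundary."""
    mask = np.full(alpha.shape, cv2.GC_PR_BGD, dtype=np.uint8)
    mask[alpha > 0.5] = cv2.GC_PR_FGD   # probably foreground
    mask[alpha > 0.9] = cv2.GC_FGD      # confident foreground
    mask[alpha < 0.1] = cv2.GC_BGD      # confident background

    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(image_bgr, mask, None, bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_MASK)

    # Collapse GrabCut's four labels back into a binary foreground mask.
    fg = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    return np.where(fg, 255, 0).astype(np.uint8)
```

Seeding GC_INIT_WITH_MASK from the network output skips the usual user rectangle or scribbles, so GrabCut only has to clean up the uncertain band around the predicted boundary.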