I think you underestimate the problem, which is not to get an output that says "Bird", but one that says "Specific breed of bird."
Human experts can get enough clues from the bird shape and the context to do that in the sample photos. I doubt your captioning system can.
This is a good example of a standard problem in ML - underestimating the complexity of the problem domain.
You could argue that your system only needs to do the simpler task to be useful, and that's likely true. But if the goal is to approach human expert levels of classification, it needs to improve by at least a few levels.
I suspect getting it there would run into some interesting performance constraints, and possibly some theoretical issues too.
No, ML is very, very good at doing breeds. See, for example, https://arxiv.org/pdf/1603.06765.pdf which gets 88.9% accuracy on the Stanford Dogs dataset and 84.3% on the Caltech Birds dataset.
These are way better than anything a non-expert human can do. For example, such a model can distinguish between the Rhinoceros Auklet and the Parakeet Auklet.
I'm not sure what expert performance is, but around 94% is where humans top out on most tasks.
A single NN can predict more than one class of object. The ImageNet competition (ILSVRC) has 1,000 classes, and the full ImageNet dataset has more than 20,000 categories.
There's also image segmentation, as another poster has pointed out.
In the case of FB face tagging, they'd have to learn an embedding space for faces, and when a new image comes in they'd place it in the embedding space along with all the person's connections and find the nearest neighbors.
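To make the nearest-neighbor idea concrete, here is a rough sketch of the lookup step only. It assumes some model has already turned each face into a vector; the names, the 3-dimensional vectors, the `nearest_connections` helper, and the choice of cosine similarity are all made up for illustration and say nothing about how FB actually does it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_connections(query, connections, k=1):
    """Return the k names whose embeddings are most similar to `query`.

    `connections` maps a name to that person's face embedding
    (hypothetical data; a real system would get these from a trained model).
    """
    ranked = sorted(
        connections,
        key=lambda name: cosine_similarity(query, connections[name]),
        reverse=True,  # highest similarity first
    )
    return ranked[:k]


# Toy embeddings for the uploader's connections (illustrative values only).
friend_embeddings = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.1, 0.8, 0.3],
    "carol": [0.0, 0.2, 0.9],
}

# Embedding of the face found in the newly uploaded photo.
new_face = [0.85, 0.15, 0.05]

print(nearest_connections(new_face, friend_embeddings, k=2))
```

Restricting the search to the person's connections keeps the candidate set small, which is why a brute-force scan like this is plausible in the sketch; a search over all users would need an approximate nearest-neighbor index instead.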
The problem posed in the xkcd is "Check if the photo is of a bird", not to identify the bird in question. Identifying the bird species would probably be harder, because I'm guessing there are very few human experts who could reliably do that across a wide spectrum of species without knowing the context of the photo.