> How can we even begin to go about writing an algorithm that can reason about the scene like I did?
You are doing AI wrong. AI should learn all of that context by itself, from a large amount of stimulus. If it were a good one, it might be able to learn enough in fewer than N years, where N is the age of a human who would laugh at the photo.
If we could make AI study the relationships between objects in images and videos, it could learn a lot of raw common sense. It could also be useful to add a different domain to the mix by cross-referencing that with information extracted from text.
It strikes me that a forum such as this one might make a great training tool for AI. So many good cues for following branched reasoning, dealing with ambiguity, etc. I usually "read" the articles through the comments.