Amazing. I only had a chance to read the README.md, but my question is this: what happens if you ask it questions that it could not possibly answer, as in, if it were given a picture of the man playing tennis and you asked it what the score was? Is it capable of discerning between questions that cannot be answered (given a particular input) and those that can?
Priors from the language play a much bigger role in the predicted answers than the image itself does. For example, if you ask 'What color is ...?', it is more likely to spit out colors as the answer, irrespective of the image. The answers are usually well aligned with the type of question being asked: 'yes/no' for binary questions, 'red/blue/etc.' for 'What color...', 'tennis/baseball/etc.' for 'What sport...', and so on.
There's no catch as such, but it's far from perfect and hardly magical. Its accuracy is ~55% on the VQA (http://visualqa.org/) dataset, which is about 7% short of the state of the art.
Did you look at the examples? Many of them are wrong, and others are guessable from the question alone (e.g. for "What shape is the plate?" I would say "round" without even seeing a picture).
I have seen sites use CAPTCHAs that ask visual questions like these, on the assumption that only a human can answer them. This project really makes me doubt the effectiveness of such techniques.
Accuracy is measured as min((number of humans that provided that answer) / 3, 1), i.e. an answer counts as 100% accurate if at least 3 humans gave that exact answer, as outlined here: http://visualqa.org/evaluation.html.
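If it helps, here is a minimal Python sketch of that metric as described above (the function name and the 10-answers-per-question detail are mine; I believe the official script also does some answer normalization before matching, which is skipped here):

    def vqa_accuracy(predicted_answer, human_answers):
        # Per-question VQA accuracy: min(#humans giving that exact answer / 3, 1).
        # `human_answers` is the list of annotator answers for the question
        # (the VQA dataset collects 10 per question).
        matches = sum(1 for a in human_answers if a == predicted_answer)
        return min(matches / 3.0, 1.0)

    # Example: only 2 of 10 annotators said "round", so the score is ~0.67
    print(vqa_accuracy("round", ["round"] * 2 + ["oval"] * 8))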