Amazing. I only had a chance to read the README.md, but my question is this: what happens if you ask it questions that it could not possibly answer, as in, if it were given a picture of the man playing tennis and you asked it what the score was? Is it capable of discerning between questions that cannot be answered (given a particular input) and those that can?
Priors from the language play a much bigger role in the predicted answers than the image itself does. For example, if you ask 'What color is ...?', it is more likely to spit out colors as the answer, irrespective of the image. The answers are usually well aligned with the type of question being asked: 'yes/no' for binary questions, 'red/blue/etc.' for 'What color...', 'tennis/baseball/etc.' for 'What sport...', and so on.
There's no catch as such, but it's far from perfect and hardly magical. Its accuracy is ~55% on the VQA (http://visualqa.org/) dataset, which is about 7% short of the state of the art.
Did you look at the examples? Many of them are wrong, and others are guessable from the question alone (e.g. for "What shape is the plate?" I would say "round" without even seeing a picture).
I have seen sites use CAPTCHAs that ask visual questions like these, on the assumption that only a human can answer them. This project really makes me doubt the effectiveness of such techniques.
Accuracy is measured as min((number of humans that provided that answer) / 3, 1), i.e. an answer counts as 100% accurate if at least 3 humans gave that exact answer, as outlined here: http://visualqa.org/evaluation.html.
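If it helps, here is a minimal Python sketch of that metric as described above (the function name and the 10-answers-per-question detail are mine; I believe the official script also does some answer normalization before matching, which is skipped here):

    def vqa_accuracy(predicted_answer, human_answers):
        # Per-question VQA accuracy: min(#humans giving that exact answer / 3, 1).
        # `human_answers` is the list of annotator answers for the question
        # (the VQA dataset collects 10 per question).
        matches = sum(1 for a in human_answers if a == predicted_answer)
        return min(matches / 3.0, 1.0)

    # Example: only 2 of 10 annotators said "round", so the score is ~0.67
    print(vqa_accuracy("round", ["round"] * 2 + ["oval"] * 8))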