I tried a softball multiple-choice question, and the results were not very impressive:
> Question: Which is the longest unit of distance? (A) fathom (B) kilometer (C) mile (D) parsec
> Aristo's Answer: (B) kilometer
> Confidence: 81.04%
It seems noteworthy that none of the "reasoners" listed below the answer mentions relative magnitude, except for the "Justification Sentence" listed under "Information Retrieval" (with the tooltip "lucene"). My guess is that the system correctly identifies all four options as units of distance, then breaks the resulting tie with a tf-idf score pulled from some large corpus of documents. A score like that measures how often words co-occur, not which unit is actually longer, so the result is essentially arbitrary.
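To make that guess concrete, here is a toy sketch of tf-idf tie-breaking. Everything in it is an assumption for illustration: the six-sentence `CORPUS`, the helper names, and the simple raw-tf times smoothed-idf formula are all made up, and none of it is Aristo's or Lucene's actual scoring. Each option is scored by the best-matching sentence that mentions it, using the question words plus the option as the query.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus; a stand-in for whatever documents a real system indexes.
CORPUS = [
    "the race was one kilometer long and the distance felt longer",
    "a kilometer is a unit of distance equal to one thousand meters",
    "the longest kilometer of the marathon is the last one",
    "a mile is a unit of distance used in the united states",
    "a fathom is a unit of depth used by sailors",
    "a parsec is a unit used by astronomers",
]

def tokenize(text):
    """Lowercase and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())

def idf(term, docs):
    """Smoothed inverse document frequency over tokenized docs."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0

def score(query_terms, doc, docs):
    """Sum of raw term frequency times idf for each distinct query term."""
    tf = Counter(doc)
    return sum(tf[t] * idf(t, docs) for t in set(query_terms))

def rank_options(question, options, corpus=CORPUS):
    """Score each option by its best-matching document that mentions it."""
    docs = [tokenize(d) for d in corpus]
    scores = {}
    for opt in options:
        query = tokenize(f"{question} {opt}")
        candidates = [d for d in docs if opt in d]
        scores[opt] = max((score(query, d, docs) for d in candidates), default=0.0)
    return max(scores, key=scores.get), scores

best, scores = rank_options(
    "Which is the longest unit of distance?",
    ["fathom", "kilometer", "mile", "parsec"],
)
print(best)
for opt, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{opt}: {s:.2f}")
```

With this made-up corpus, "kilometer" wins only because one sentence happens to contain both "longest" and "kilometer", and "parsec" comes last because astronomy text rarely sits next to the words "of distance". Nothing in the score involves the size of any unit.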