What does it even mean to detect hallucinations? The AI doesn't say anything trivially false. While using GPT4 I have observed that it lies about simple things I didn't expect it to, while it does very well on complex things.
TLDR: It lies about fact-based information that is mentioned in only a handful of places on the internet and isn't repeated much. Short of having a human with the context, how do you even detect it?
Example: Ask it to describe a "Will and Grace" episode with some guest appearance. It will always make up everything, including the episode number and the plot, and the plot seems very believable. If you have not watched the episode and can't find a summary online, it is hard to tell that it is a lie.
There are several ways to detect hallucinations. Basically, either you have the ground-truth answers in an external database, in which case you compare the model's output to them, or you don't have the ground truth, in which case you need to do metamorphic testing. See this article on it: https://www.giskard.ai/knowledge/how-to-test-ml-models-4-met...
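For the ground-truth branch, the check can be as simple as fuzzy-matching the model's answer against a stored reference. A minimal sketch in Python, where the reference table, the question, and the similarity threshold are all just illustrative placeholders:

    from difflib import SequenceMatcher

    # Hedged sketch of the "compare to ground truth" case.
    # The reference table and threshold are purely illustrative.
    reference_answers = {
        "Who created Will and Grace?": "David Kohan and Max Mutchnick",
    }

    def looks_like_hallucination(question, model_answer, threshold=0.5):
        reference = reference_answers.get(question)
        if reference is None:
            return None  # no ground truth available, cannot decide
        similarity = SequenceMatcher(
            None, model_answer.lower(), reference.lower()
        ).ratio()
        return similarity < threshold

The obvious limitation, as the next comment points out, is that this only works for questions your database already covers.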
But GPT4 doesn't hallucinate about things that are popular enough to be repeated many times on the web as knowledge. It hallucinates about things that are much less likely to be repeated. That rules out an external database with true answers, unless the database is supposed to contain all information queryable in all ways, in which case the database is just a better version of GPT-X.
The metamorphic testing approach is interesting and might work.
I've been playing with GPT4 summarization of hard knowledge that has an external database with true answers that GPT knows about, and it's still hallucinating regularly.
Metamorphic testing seems to try to map an output of a model to a ground truth, which I guess is great if you have a database of all the known truths in the universe.
Not exactly: metamorphic testing does not need an oracle. That's actually the reason for its popularity in ML testing. It works by perturbing the input in a way that should produce a predictable variation of the output (or possibly no variation).
Take for example a credit scoring model: you can reasonably expect that if you increase liquidity, the credit score should not decrease. In general it is relatively easy to come up with a set of assumptions about the effect of a perturbation, which lets you evaluate the robustness of a model without knowing the exact ground truth.
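A minimal sketch of that credit-scoring relation, assuming a hypothetical scoring function and a "liquidity" feature; the point is that no true score is needed, only the expected direction of change:

    # Hedged sketch of a metamorphic test: increasing liquidity should not
    # decrease the credit score. The scoring function and feature names are
    # hypothetical stand-ins for a real model.
    def check_liquidity_monotonicity(score_fn, base_features, bump=1000.0):
        baseline = score_fn(base_features)
        perturbed = {**base_features,
                     "liquidity": base_features["liquidity"] + bump}
        # No ground-truth score is compared against, only the relation
        # between the original and perturbed predictions.
        return score_fn(perturbed) >= baseline

    # Toy model standing in for a real one, just to show the call shape:
    toy_model = lambda f: 600 + 0.01 * f["liquidity"] - 0.5 * f["debt_ratio"]
    print(check_liquidity_monotonicity(
        toy_model, {"liquidity": 5000.0, "debt_ratio": 30.0}))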
That is beside the point. My point is that detecting hallucinations seems like a very very hard problem.
The utility is there and has nothing to do with making up episodes instead of quoting a real one. For example, you can ask it to write new episodes with specific settings and specific constraints. Hallucination is not the value add. Nobody is excited because it hallucinates; people are excited despite it, because the rest of the value is so large.
But deliberately requesting and receiving content generation is altogether different from requesting a factual answer and receiving plausible-seeming nonsense. Or at least, it's different to the person asking; it's the same thing as far as the model is concerned.