
> I wonder whether it would make sense to separate the concepts of "simpler" and "interpretable."

Interesting. I was thinking the same, after coming across a preprint proposing a credit-assignment mechanism that seems to make it possible to build deep models that remain interpretable: https://arxiv.org/abs/2211.11754 (please note: the results look interesting/significant to me, but I'm still making my way through the preprint and its accompanying code).

Consider that our brains are incredibly complex organs, yet they are really good at answering questions in a way that other brains find interpretable. Meanwhile, large language models (LLMs) keep getting better and better at explaining their answers with natural language in a way that our brains find interpretable. If you ask ChatGPT to explain its answers, it will generate explanations that a human being can interpret -- even if the explanations are wrong!

Could it be that "model simplicity" and "model interpretability" are actually orthogonal to each other?




Humans give explanations that other humans find convincing, but those explanations can be completely wrong and non-causal. I think human explanations are often mechanistically wrong, describing a process that played no causal role in the actual decision.

As a famous early example, the patient in the study linked below gave unprompted explanations for some of her preferences using only the information available to the conscious part of her brain (via her good eye), even though the actual mechanism was subconscious perception through her blind eye.

https://www.nature.com/articles/336766a0


A key reason we want models to be interpretable, at least for some applications, is to watch out for undesirable features. For example, suppose we want to train a model to decide whether to grant or deny a loan, and we train it to match the decisions of human loan officers. Now, suppose it turns out that many loan officers have unconscious prejudices that cause them to deny loans more often to green people and grant loans more often to blue people (substitute whatever categories you like for blue/green). The model might wind up with an explicit weight that makes this implicit discrimination explicit. If the model is relatively small and interpretable, this weight can be found and perhaps eliminated.
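To make the "explicit weight" point concrete, here is a minimal sketch (scikit-learn on synthetic data; the features, coefficients, and bias term are all invented for illustration) of how a small linear model trained to imitate biased decisions ends up with a weight on the group feature that can simply be read off:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 5000

    # Toy applicant features (names and distributions invented for illustration).
    income = rng.normal(50, 15, n)      # in $1000s
    debt_ratio = rng.uniform(0, 1, n)
    group = rng.integers(0, 2, n)       # 0 = "green", 1 = "blue"

    # Simulated loan-officer decisions: mostly driven by income and debt,
    # plus a small unconscious bias in favour of the "blue" group.
    logit = 0.05 * income - 3.0 * debt_ratio + 0.8 * group - 1.0
    approved = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

    X = np.column_stack([income, debt_ratio, group])
    model = LogisticRegression(max_iter=1000).fit(X, approved)

    # In a small linear model the learned bias shows up as an explicit,
    # inspectable weight on the group feature.
    for name, weight in zip(["income", "debt_ratio", "group"], model.coef_[0]):
        print(f"{name:>10}: {weight:+.2f}")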

But if that model could chat with us it would replicate the speech of the loan officers, many of whom sincerely believe that they treat green people and blue people fairly. So interpretability can't be about somehow asking the model to justify itself. We may need the equivalent of a debugger.


I don't think anyone has come up with an unambiguous definition of "interpretable". I mean, people often assume that a statement like "it's a cat because it has fur, whiskers and pointy ears" is interpretable because it's a logical conjunction of conditions. But a logical conjunction of a thousand vague conditions could easily be completely opaque. It's a bit like the way SQL was initially promoted, years ago, as a "natural language" interface: simple SQL statements do read a bit like natural language, but large SQL statements tend to be more incomprehensible than even ordinary computer programs.
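As a toy illustration of that point (every feature and threshold here is invented), a conjunction of three familiar conditions and a conjunction of a thousand arbitrary ones have exactly the same logical form, but only one of them is interpretable in any useful sense:

    import random

    # Both classifiers below are "just" conjunctions of simple tests,
    # so both are nominally rule-based and interpretable.
    def looks_like_cat_small(x):
        return x["has_fur"] and x["has_whiskers"] and x["pointy_ears"]

    random.seed(0)
    CONDITIONS = [(f"feature_{i}", random.random()) for i in range(1000)]

    def looks_like_cat_large(x):
        # A conjunction of 1000 vague conditions: same logical shape,
        # but no human can say what the rule actually means.
        return all(x[name] > threshold for name, threshold in CONDITIONS)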

> If you ask ChatGPT to explain its answers, it will generate explanations that a human being can interpret -- even if the explanations are wrong!

The funny thing is that, yeah, LLMs often come up with correct method-descriptions for wrong answers and wrong method-descriptions for right answers. Human language is quite slippery, and humans do this too. Human beings tend to start loose but tighten things up over time; LLMs are kind of randomly tight and loose. Maybe this can be tuned, but I think "lack of actual understanding" will make this difficult.


The fact that both humans and LLMs can give interpretable justifications makes me think the intelligence was actually in the language itself. It comes from learning language and solving problems with language, and it gets saved back into language as we validate more of our ideas.


I think you’re on to something. I wonder if there’s anyone working on this idea. I’d be curious to research it more.


I don't understand the abstract. What does it do in plain language?



