
> I wonder whether it would make sense to separate the concepts of "simpler" and "interpretable."

Interesting. I was thinking the same, after coming across a preprint proposing a credit-assignment mechanism that seems to make it possible to build deep models that remain interpretable: https://arxiv.org/abs/2211.11754 (please note: the results look interesting/significant to me, but I'm still making my way through the preprint and its accompanying code).

Consider that our brains are incredibly complex organs, yet they are really good at answering questions in a way that other brains find interpretable. Meanwhile, large language models (LLMs) keep getting better and better at explaining their answers with natural language in a way that our brains find interpretable. If you ask ChatGPT to explain its answers, it will generate explanations that a human being can interpret -- even if the explanations are wrong!

Could it be that "model simplicity" and "model interpretability" are actually orthogonal to each other?




Humans give explanations that other humans find convincing, but those explanations can be completely wrong and non-causal. I think human explanations are often mechanistically wrong, describing a process that played no causal role in the actual decision.

As a famous early example, the patient in the study linked below gave unprompted explanations for some of her preferences using only the information available to the conscious part of her brain (via her good eye), even though the actual mechanism was subconscious perception through her blind eye.

https://www.nature.com/articles/336766a0


A key reason we want models to be interpretable, at least for some applications, is to watch out for undesirable features. For example, suppose we want to train a model to decide whether to grant or deny a loan, and we train it to match the decisions of human loan officers. Now, suppose it turns out that many loan officers have unconscious prejudices that cause them to deny loans more often to green people and grant loans more often to blue people (substitute whatever categories you like for blue/green). The model might wind up with an explicit weight that makes this implicit discrimination explicit. If the model is relatively small and interpretable, this weight can be found and perhaps eliminated.
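To make the "explicit weight" point concrete, here is a minimal sketch (scikit-learn on synthetic data; the features, coefficients, and bias term are all invented for illustration) of how a small linear model trained to imitate biased decisions ends up with a weight on the group feature that can simply be read off:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 5000

    # Toy applicant features (names and distributions invented for illustration).
    income = rng.normal(50, 15, n)      # in $1000s
    debt_ratio = rng.uniform(0, 1, n)
    group = rng.integers(0, 2, n)       # 0 = "green", 1 = "blue"

    # Simulated loan-officer decisions: mostly driven by income and debt,
    # plus a small unconscious bias in favour of the "blue" group.
    logit = 0.05 * income - 3.0 * debt_ratio + 0.8 * group - 1.0
    approved = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

    X = np.column_stack([income, debt_ratio, group])
    model = LogisticRegression(max_iter=1000).fit(X, approved)

    # In a small linear model the learned bias shows up as an explicit,
    # inspectable weight on the group feature.
    for name, weight in zip(["income", "debt_ratio", "group"], model.coef_[0]):
        print(f"{name:>10}: {weight:+.2f}")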

But if that model could chat with us it would replicate the speech of the loan officers, many of whom sincerely believe that they treat green people and blue people fairly. So interpretability can't be about somehow asking the model to justify itself. We may need the equivalent of a debugger.


I don't think anyone has come up with an unambiguous definition of "interpretable". I mean, people often assume that a statement like "it's a cat because it has fur, whiskers and pointy ears" is interpretable because it's a logical conjunction of conditions. But a logical conjunction of a thousand vague conditions could easily be completely opaque. It's a bit like the way SQL was initially promoted, years ago, as a "natural language" interface: simple SQL statements do read a bit like natural language, but large SQL statements tend to be more incomprehensible than even ordinary computer programs.
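As a toy illustration of that point (every feature and threshold here is invented), a conjunction of three familiar conditions and a conjunction of a thousand arbitrary ones have exactly the same logical form, but only one of them is interpretable in any useful sense:

    import random

    # Both classifiers below are "just" conjunctions of simple tests,
    # so both are nominally rule-based and interpretable.
    def looks_like_cat_small(x):
        return x["has_fur"] and x["has_whiskers"] and x["pointy_ears"]

    random.seed(0)
    CONDITIONS = [(f"feature_{i}", random.random()) for i in range(1000)]

    def looks_like_cat_large(x):
        # A conjunction of 1000 vague conditions: same logical shape,
        # but no human can say what the rule actually means.
        return all(x[name] > threshold for name, threshold in CONDITIONS)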

> If you ask ChatGPT to explain its answers, it will generate explanations that a human being can interpret -- even if the explanations are wrong!

The funny thing is that, yeah, LLMs often come up with correct method-descriptions for wrong answers and wrong method-descriptions for right answers. Human language is quite slippery, and humans do this too. Human beings tend to start loose but tighten things up over time; LLMs are kind of randomly tight and loose. Maybe this can be tuned, but I think "lack of actual understanding" will make this difficult.


The fact that both humans and LLMs can give interpretable justifications makes me think the intelligence was actually in the language itself. It comes from learning language and solving problems with language, and it gets saved back into language as we validate more of our ideas.


I think you’re on to something. I wonder if there’s anyone working on this idea. I’d be curious to research it more.


I don't understand the abstract. What does it do in plain language?



