> The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
Unless you constrain your network to have simple, understandable mechanics during training, you won't be able to properly decompose it into a mechanistically understandable explanation, because that isn't how it works under the hood. Parts of the network will be simple enough to be understandable, but a lot of it won't be.
The human mind cannot deal with or understand things that have too much dimensionality.
I agree, but keep in mind that the goal of AI (whatever that is) has never been explanatory power; it's verisimilitude. That's out of necessity: since we don't already understand it, the only way to judge it is to see if it kinda sorta makes sense.
Looked at this way, this project is a thing of beauty. Just like the human mind, it's not going to explain how the LLM came up with an answer; it's going to create plausible arguments for why it did so. How would we know? Do you actually think it'll make a difference whether the AI is correct about any random oddball question, especially when it can make a pretty coherent case for why the answer was correct all along?
It will not.
We are performing a very strange experiment with our fellow humans. Should be fascinating to watch it play out.
What I'd like to see is some algorithm that converts DNNs -> steps undertaken by Turing machines, and then maybe Transformers -> Turing machines. There's definitely a way to do so, at the very least by graph isomorphism.
This is a fascinating article, but I'm also very impressed by how well-communicated and transparent the research is. The writing is impressively clear, simple, and precise. The interactive feature explorer (https://transformer-circuits.pub/2023/monosemantic-features/...) is awesome, and lets me explore the claims of the article myself. The background information filled in a lot of gaps for me where I haven't been following the research closely.
I would love to see whether this analysis can scale to a multi-layer transformer network; there are so many interesting questions to answer!
- For single-layer transformers, the inputs are text and the outputs are distributions over text, so we can directly inspect and interpret the strings that the features are sensitive to, and the predictions the features make. With two layers, that isn't the case any more for the intermediate bit between the two layers. I guess the question here is... can we transfer this approach to a multi-layer context at all, or does it fundamentally not make sense? (A rough sketch of what that might look like follows this list.)
- Are second-layer features fundamentally different in character from first-layer ones? How often do we see the same feature in multiple layers, iteratively refined, vs categorically different features?
- How does the existence of the second layer change the kinds of features learned by the first layer?
- How do features evolve throughout the training process? I often imagine features getting 'compressed downwards' as training progresses, requiring N layers to fully resolve at epoch t, and perhaps N - 1 layers at epoch t + k. Can this approach give us a comprehensive picture of how and when that happens? Could we actually track a single feature as it shifts from the second layer to the first? The bottom-up nature of this approach - vs top-down looking for some specific expected feature - is really compelling here, as it could let us get a sense for how features behave in this regard on a population level.
- Could we identify causal relationships between first-layer features and second-layer features with some extension of this approach?
- Combining the above two points, I could imagine a 'meeting in the middle' of low-level and high-level features: if each takes some span of layers to be computed, is iteratively refined across that span, and at some point those spans become disjoint, that could explain some of the 'phase transition'-like behavior in the training process. It would be amazing to be able to directly observe this occurring!
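To make that per-layer question a bit more concrete, here's a rough sketch (my own simplification, not the paper's exact architecture; all names and hyperparameters are placeholders) of running the same dictionary-learning step over activations pulled from any intermediate layer - a sparse autoencoder with an overcomplete hidden layer, trained on reconstruction plus an L1 sparsity penalty:

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Overcomplete dictionary over one layer's activations (d_hidden >> d_model)."""

        def __init__(self, d_model: int, d_hidden: int):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_hidden)
            self.decoder = nn.Linear(d_hidden, d_model, bias=False)

        def forward(self, x):
            f = torch.relu(self.encoder(x))  # sparse feature activations
            x_hat = self.decoder(f)          # reconstruction of the original activations
            return x_hat, f

    def sae_loss(x, x_hat, f, l1_coeff=1e-3):
        # Reconstruction error plus an L1 penalty pushing feature activations toward zero.
        return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

    # Hypothetical usage: `acts` holds activations collected from some layer of a
    # multi-layer transformer, shape (n_tokens, d_model).
    # sae = SparseAutoencoder(d_model=512, d_hidden=8 * 512)
    # x_hat, f = sae(acts)
    # loss = sae_loss(acts, x_hat, f)

Nothing in that step cares which layer the activations came from; the open question is the interpretation side, since for an intermediate layer the features' 'inputs' and 'outputs' are other activations rather than text.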
Intervening in the direction of a feature seems very powerful for safety, as mentioned. However, because of the interference effects from the overcomplete basis, one might expect other unrelated features to be affected/degraded in subtle ways if we naively intervened on a feature. I wonder if this feature-identification process could be incorporated into training the model somehow - periodically identifying safety-relevant or otherwise important features, and encouraging just those particular features to be disentangled/orthogonal/sparse/aligned with the neuron basis? I'm not sure if this even makes mathematical sense (my linear algebra is quite iffy), but it seems potentially quite interesting to learn a representation space where the basis is somehow a mix of complete (for the 'privileged' features) and overcomplete (for everything else).
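To spell out where I think the interference worry bites, here's a toy sketch (my own, hedged; the decoder matrix and feature index are placeholders for whatever the dictionary-learning step produced) of naively rescaling one learned feature's component of the activations along its decoder direction:

    import torch

    def intervene_on_feature(acts, decoder_dirs, feature_idx, scale=0.0):
        """Rescale one feature's component of `acts` along its decoder direction.

        acts:         (batch, d_model) activations at the layer the dictionary was fit to
        decoder_dirs: (n_features, d_model) learned feature directions
        scale:        0.0 ablates the feature, 1.0 is a no-op, >1.0 amplifies it
        """
        d = decoder_dirs[feature_idx]
        d = d / d.norm()
        coeff = acts @ d  # current strength of the chosen feature, shape (batch,)
        return acts + (scale - 1.0) * coeff.unsqueeze(-1) * d

    # The interference problem: in an overcomplete dictionary, another feature j's
    # direction d_j generally has cos(d, d_j) != 0, so this edit also shifts the
    # apparent activation of feature j by roughly (scale - 1.0) * coeff * cos(d, d_j).

Which is exactly why something like a (near-)orthogonality constraint on a handful of privileged features during training seems appealing - for those features the cross-terms would mostly vanish.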
Sorry for the wall of text, this is just very exciting stuff and I had to get my thoughts out through my fingers somehow!