
> "if it classifies successfully, it must be conditioned on latents about truth"

Yes, agreed: in general, successful classification does not depend on the latents being about truth.

However, successfully classifying between text intended to be read as either:

- deceptive or honest

- farcical or tautological

- sycophantic or sincere

- controversial or anodyne

does depend on the latent representations being about truth (assuming no memorisation, data leakage, or spurious features).

If your position is that this is necessary but not sufficient to demonstrate such a dependence, or that reverse engineering the learned features is necessary for certainty, then I agree.

But I also think this is primarily a semantic disagreement. A representation can be "about something" without representing it in full generality.

So to be more concrete: "The representations produced by LLMs can be used to linearly classify implicit details about a text, and the LLM's representation of those implicit details conditions the sampling of text from the LLM."
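The "linearly classify" part of that claim can be sketched with a minimal linear probe. This is a toy illustration on synthetic vectors, not any actual LLM's hidden states: I assume a hypothetical "honesty direction" baked into the representations, then fit a logistic-regression probe and check that it recovers the label on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 2000

# Hypothetical "truth direction" baked into synthetic representations.
# In a real experiment these X rows would be LLM hidden states.
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = (X @ w_true > 0).astype(float)  # e.g. honest=1 vs deceptive=0

X_tr, X_te = X[:1500], X[1500:]
y_tr, y_te = y[:1500], y[1500:]

# Linear probe: logistic regression fit by plain gradient descent.
w = np.zeros(d)
b = 0.0
lr = 0.5
for _ in range(500):
    p = 1 / (1 + np.exp(-(X_tr @ w + b)))   # sigmoid
    g = p - y_tr                            # gradient of log-loss
    w -= lr * (X_tr.T @ g) / len(y_tr)
    b -= lr * g.mean()

acc = (((X_te @ w + b) > 0) == (y_te > 0.5)).mean()
print(f"held-out probe accuracy: {acc:.2f}")
```

The probe succeeds here precisely because the labels were constructed from a linear direction in the representation; the disagreement above is about whether high probe accuracy on real LLM activations licenses the converse inference.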




