This is the wrong analogy. A transformer block is just code and weights: a set of instructions laying out which operations to run on which numbers. During training, the optimizer adjusts the weights to minimize a loss function; at inference, the code implementing the forward pass just runs. That's what it's doing. It's not doing something else.
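To make that concrete, here's a minimal sketch of that training-then-inference picture (assuming PyTorch-style APIs; the tiny linear model and data here are stand-ins, not an actual transformer):

    import torch

    model = torch.nn.Linear(10, 1)  # stand-in for a transformer block
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Training: the optimizer nudges the weights to reduce the loss.
    for x, y in [(torch.randn(4, 10), torch.randn(4, 1))]:
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Inference: the same forward-pass code just runs, weights frozen.
    with torch.no_grad():
        prediction = model(torch.randn(1, 10))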
If the argument is that a model is a function approximator, then it certainly isn't approximating some function that performs worse at the task at hand, and it certainly isn't approximating a function we can describe in a few hundred words.
There is pretty good reason to think so. If the function could be described explicitly in a few hundred words, it would be extremely unlikely that we'd have seen the jump in capability with model size.