
My intuition is that a significant challenge for LLMs' ability to do arithmetic has to do with tokenization. For instance, `1654+73225` as per the OpenAI tokenizer tool breaks down into `165•4•+•732•25`, meaning the LLM is incapable of considering digits individually; that is, "165" is a single "word", and its relationship to "4" (and, in fact, to every other token representing a numerical value) has to be learned. It can't do simple carry operations (or other arithmetic abstractions humans have access to) in the vast majority of cases because its internal representation of text is not designed for this. Arithmetic is easy to do in base 10 or 2 or 16, but it's a whole lot harder in base ~100k where 99% of the "digits" are words like "cat" or "///////////".
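
To see this concretely, here's a rough sketch using OpenAI's tiktoken library (assuming the cl100k_base encoding; other vocabularies split numbers differently):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era encoding
    ids = enc.encode("1654+73225")
    # Decode each token id on its own to see the actual "digits" the model works with
    print([enc.decode([i]) for i in ids])  # e.g. ['165', '4', '+', '732', '25']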

Compare that to understanding arbitrary base64-encoded strings; that's much harder for humans to do without tools. Tokenization still isn't _the_ greatest fit for it, but it's a lot more tractable, and LLMs can do it no problem. Even understanding ASCII art is impressive, given they have no innate idea of what any letter looks like, and they "see" fragments of each letter on each line.

So I'm not sure if I agree or disagree with you here. I'd say LLMs in fact have very impressive capabilities for learning logical structures. Whether grammar is the problem isn't clear to me, but their internal representation format obviously and enormously influences how much harder seemingly trivial tasks become. Perhaps some effort in hand-tuning vocabularies could improve performance on some tasks, or perhaps something different altogether is necessary, but I don't think it's an impossible hurdle to overcome.




I don't think that's really how it works - sure, this is true at the first layer of the network, but in deep neural networks, after the first few layers the LLM shouldn't be 'thinking' in tokens anymore.

The tokens are just the input - the internal representation can be totally different (and that format isn't tokens).


Please don't act like you "know how it works" when you obviously don't.

The issue is not whether the model "thinks or doesn't think in tokens". The model is forced, at the final sampling/decoding step, to convert its latent representation back into tokens, one token at a time.

The models are fully capable of understanding the premise that they should "output a 5-7-5 syllable haiku", but from the perspective of a model trying to count its own syllables, this is not possible: its vocabulary is tokenized in such a way that not only does the model lack direct phonetic information in its dataset, it also has no analogue for how humans count syllables (e.g. by counting jaw drops as the word is spoken). Models can't reason about the number of characters, or even tokens, used in a reply for the exact same reason.
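
As a toy illustration of that final step, here's a minimal greedy-decoding sketch; the "model" and vocabulary below are made up, the point is only that whatever the internal representation looks like, the output gets committed one whole token at a time:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = ["cat", "165", "4", "+", " hai", "ku", "<eos>"]  # toy vocabulary

    def toy_model(token_ids):
        # Stand-in for a real LLM: returns one logit per vocabulary entry.
        return rng.normal(size=len(vocab))

    ids = []
    for _ in range(8):
        next_id = int(np.argmax(toy_model(ids)))  # latent state collapses to ONE token id
        ids.append(next_id)
        if vocab[next_id] == "<eos>":
            break

    print("".join(vocab[i] for i in ids))
    # The loop never emits characters or syllables directly, only whole vocabulary entries.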

The person you're replying to is broadly right, and you are broadly wrong. The internal format does not matter when the final decoding step forces everything back through tokenization. Please actually use these systems rather than pontificating about them online.


Thank god we aren’t talking about a model counting syllables then.


That requires converting from a weird, unhelpful form into a more helpful form first. So yes, but the tokenisation makes things harder because it adds an extra step - the models need to learn how these things relate while having significant amounts of the structure hidden from them.


This conversion is inherent in the problem of language and maths though - Two, too (misspelt), 2, duo, dos, $0.02, one apple next to another apple, 0b10, and 二 can all represent the (fairly abstract) concept of two.

The conversion to a helpful form is required anyway (also, let's remember that computers don't work in base 10, and there isn't really a reason to believe that base 10 is inherently great for LLMs either).


It is, but there's a reason I teach my son addition like this:

    hundreds | tens | ones

           1      2      3
    +      2      1      5
    ----------------------
           3      3      8
Rather than

unoDOOOOS(third) {}{}{} [512354]_ = three"ate

* replace {}{}{} with addition, {}{} is subtraction unless followed by three spaces in which case it's also addition
* translate and correct any misspellings
* [512354] look up in your tables
* _ is 15
* dotted lines indicate repeated numbers

Technically they're doing the same thing. One of them, we would assume, is harder to learn the fundamental concepts from.
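
To make the comparison concrete: once the representation exposes place value, the column method above is a few lines of code (a quick sketch of my own, not anything an LLM does internally):

    def add_by_columns(a: str, b: str) -> str:
        # Right-align the digits and add column by column with a carry,
        # exactly like the hundreds | tens | ones layout above.
        width = max(len(a), len(b))
        a, b = a.zfill(width), b.zfill(width)
        carry, digits = 0, []
        for da, db in zip(reversed(a), reversed(b)):
            carry, d = divmod(int(da) + int(db) + carry, 10)
            digits.append(str(d))
        if carry:
            digits.append(str(carry))
        return "".join(reversed(digits))

    print(add_by_columns("123", "215"))  # 338, matching the worked example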


Right, which is why testing arithmetic is a good way to test how well LLMs generalize their capabilities to non-text tasks. LLMs can in theory be excellent at it, but they aren't, due to how they are trained.


The tokens are the structure over which the attention mechanism is permutation equivariant. This structure permeates the forward pass; it's important at every layer, and will be until we find something better than attention.
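
For anyone who hasn't seen it spelled out, a small numpy sketch of what that equivariance means (toy single-head self-attention, random made-up weights, no positional encodings): permuting the input tokens permutes the outputs the same way.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 5, 8                          # 5 tokens, model dimension 8
    X = rng.normal(size=(n, d))          # one row per token
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(X):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        return softmax(Q @ K.T / np.sqrt(d)) @ V

    perm = rng.permutation(n)
    # Shuffling the tokens shuffles the outputs identically: no order information
    # survives unless it's injected some other way (e.g. positional encodings).
    print(np.allclose(attention(X[perm]), attention(X)[perm]))  # True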



