> The attention mechanism has quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples-to-apples comparison.
That's ultimately my point. If an alphabet-based model can't achieve nearly the same results as a BPE-based model (even if appropriately scaled up to accommodate the expanded cost), doesn't that suggest that Transformers really are just a neat memorization hack?
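To make the cost argument concrete, here's a rough back-of-the-envelope sketch (my own illustration, not from the parent comment). It compares quadratic attention cost for character-level tokenization against a whitespace split, which I'm using only as a crude stand-in for BPE; real BPE tokens are subwords, but the sequence-length ratio is in the same ballpark.

```python
# Illustrative only: how much the quadratic attention cost grows if you
# tokenize per character instead of per (roughly word-sized) BPE token.

text = "The attention mechanism has quadratic cost in the number of input symbols."

char_tokens = len(text)         # character-level: one token per character
bpe_ish_tokens = len(text.split())  # crude stand-in for a BPE tokenization

def attention_cost(seq_len: int) -> int:
    """Pairwise attention scores scale as seq_len ** 2."""
    return seq_len ** 2

ratio = attention_cost(char_tokens) / attention_cost(bpe_ish_tokens)
print(f"char tokens: {char_tokens}, BPE-ish tokens: {bpe_ish_tokens}")
print(f"approximate attention-cost blow-up: ~{ratio:.0f}x")
```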
BPE tokens aren't even words for the most part. Are all native Chinese authors non-conscious memorization hacks? :)
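If you want to see this for yourself, a quick check with tiktoken (just one concrete BPE implementation; it needs to be installed and fetches its vocabulary on first use) shows that many "tokens" are subword fragments rather than whole words:

```python
import tiktoken

# Inspect how a real BPE vocabulary splits text into subword pieces.
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("unbelievable transformers memorization")
print([enc.decode([t]) for t in token_ids])  # mostly fragments, not whole words
```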