> The attention mechanism has quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples-to-apples comparison.
That's ultimately my point. If an alphabet-based model can't achieve nearly the same results as a BPE-based model (even if appropriately scaled up to accommodate the expanded cost), doesn't that suggest that Transformers really are just a neat memorization hack?
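To make the cost argument concrete, here's a rough back-of-the-envelope sketch (my own illustration, not from the parent comment). It compares quadratic attention cost for character-level tokenization against a whitespace split, which I'm using only as a crude stand-in for BPE; real BPE tokens are subwords, but the sequence-length ratio is in the same ballpark.

```python
# Illustrative only: how much the quadratic attention cost grows if you
# tokenize per character instead of per (roughly word-sized) BPE token.

text = "The attention mechanism has quadratic cost in the number of input symbols."

char_tokens = len(text)         # character-level: one token per character
bpe_ish_tokens = len(text.split())  # crude stand-in for a BPE tokenization

def attention_cost(seq_len: int) -> int:
    """Pairwise attention scores scale as seq_len ** 2."""
    return seq_len ** 2

ratio = attention_cost(char_tokens) / attention_cost(bpe_ish_tokens)
print(f"char tokens: {char_tokens}, BPE-ish tokens: {bpe_ish_tokens}")
print(f"approximate attention-cost blow-up: ~{ratio:.0f}x")
```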
BPE tokens aren't even words for the most part. Are all native Chinese authors non-conscious memorization hacks? :)
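If you want to see this for yourself, a quick check with tiktoken (just one concrete BPE implementation; it needs to be installed and fetches its vocabulary on first use) shows that many "tokens" are subword fragments rather than whole words:

```python
import tiktoken

# Inspect how a real BPE vocabulary splits text into subword pieces.
enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("unbelievable transformers memorization")
print([enc.decode([t]) for t in token_ids])  # mostly fragments, not whole words
```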