The paper doesn’t seem to report perplexity on completions of a large general corpus, i.e. the standard benchmark for language models. It shows benefits on specialized tasks, but as I said, specialized-task training isn’t the goal. It’s always possible to outperform a general model by choosing a sufficiently specialized task.
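For concreteness, by perplexity I mean the usual held-out corpus metric: exp of the mean per-token cross-entropy. A minimal sketch, assuming a HuggingFace-style causal LM (the gpt2 checkpoint is just a placeholder, and real evaluations use more careful striding):

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def perplexity(text, window=512):
        ids = tok(text, return_tensors="pt").input_ids
        nll, count = 0.0, 0
        # slide over the corpus in non-overlapping prediction windows
        for start in range(0, ids.size(1) - 1, window):
            chunk = ids[:, start : start + window + 1]
            with torch.no_grad():
                # labels are shifted internally; loss is mean NLL per predicted token
                loss = model(chunk, labels=chunk).loss
            n = chunk.size(1) - 1
            nll += loss.item() * n
            count += n
        return math.exp(nll / count)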
I don’t know why people feel so strongly that tokenization is a weakness, but ultimately there’s not much choice but to agree to disagree.
That’s obviously the case if you apply much narrower criteria: most benchmarks on existing large datasets aren’t character-level tasks. That said, the synthetic-noise section should be very interesting, even if it isn’t fully representative by your criteria.
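(By synthetic noise I mean character-level perturbations along these lines; the noise types and rates below are made up for illustration, not the paper’s exact setup:)

    import random

    def add_char_noise(text, rate=0.05, seed=0):
        # randomly delete, substitute, or duplicate individual characters
        rng = random.Random(seed)
        out = []
        for ch in text:
            r = rng.random()
            if r < rate / 3:
                continue                                              # deletion
            elif r < 2 * rate / 3:
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))  # substitution
            elif r < rate:
                out.append(ch + ch)                                   # duplication
            else:
                out.append(ch)
        return "".join(out)

    print(add_char_noise("the quick brown fox jumps over the lazy dog"))

Perturbed words fragment into unfamiliar token sequences for a tokenizer-based model, which is what makes this section a natural stress test for tokenization.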
Agree that tokenization isn’t a weakness for most general applications; disagree that it isn’t a weakness for the specific string-manipulation task the blog post is referencing.
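To make that concrete, here’s roughly what the model actually sees; a sketch using tiktoken’s cl100k_base purely as an example encoding:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    word = "strawberry"
    tokens = enc.encode(word)
    print(tokens)                               # a few opaque token ids
    print([enc.decode([t]) for t in tokens])    # e.g. ['str', 'aw', 'berry']
    # The model receives a handful of ids, not ten characters, so a task
    # like "count the r's in strawberry" doesn't map cleanly onto its input.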