I feel like there should be a basic 'curriculum' that every foundation LLM gets trained on first, one that teaches the basics of language. Maybe 100 million files, where the first 10 million are all at a first-grade reading level, the second 10 million are all at a second-grade reading level, and so on.
Ideally this would include a bunch of textbooks. That should give the LLM time to grok language before it starts training on more difficult texts.
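To make the idea concrete, here's a rough sketch of the ordering step in Python. It assumes every document already carries a reading-level label; the `Doc` class, the `grade` field, and `curriculum_stages` are names I made up for illustration, not part of any real training pipeline.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Iterable, Iterator


@dataclass
class Doc:
    text: str
    grade: int  # hypothetical reading-level label, 1 = first grade


def curriculum_stages(docs: Iterable[Doc]) -> Iterator[tuple[int, list[Doc]]]:
    """Yield (grade, docs) buckets ordered from easiest to hardest."""
    # sorted() is stable, so documents keep their original order within a grade.
    ordered = sorted(docs, key=lambda d: d.grade)
    for grade, group in groupby(ordered, key=lambda d: d.grade):
        yield grade, list(group)


# Toy corpus with made-up grade labels.
corpus = [
    Doc("Quantum fields are operator-valued distributions.", 10),
    Doc("The cat sat on the mat.", 1),
    Doc("Plants need water and sunlight to grow.", 2),
    Doc("See Spot run.", 1),
]

stages = list(curriculum_stages(corpus))
for grade, bucket in stages:
    # A real setup would feed each bucket to the trainer here, in this order.
    print(grade, [d.text for d in bucket])
```

In practice the grade label would have to come from somewhere, for example a readability score or the grade printed on a textbook, and that part is left out here.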