"You do this by training an existing model on example input/output pairs that de...

kcorbitt · on Sept 12, 2023

There's no rule that your fine-tuning dataset needs to be split into input/output pairs -- you can of course fine-tune a model to just continue a sequence.

As a practical matter though, most fine-tuning frameworks, including Axolotl (which this guide uses) and HuggingFace's SFTTrainer (the trainer most frameworks use under the hood), assume your data comes in input/output pairs and automatically insert a separator token to let the model know that the input has finished and it should start generating the output. In general, most tasks can be formulated this way, including autocomplete tasks, so I'd recommend going that way unless you have a very strong reason not to.
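To make the separator idea concrete, here is a minimal sketch in plain Python. It is not how Axolotl or SFTTrainer implement it: the `<|sep|>` string and both helper names are hypothetical, and real frameworks apply their own prompt templates and special tokens. It only shows how a pair becomes one training string, and how a plain document can be recast as a pair for an autocomplete task.

```python
from typing import Tuple

# Hypothetical separator string, for illustration only.
SEP = "<|sep|>"


def format_pair(input_text: str, output_text: str, sep: str = SEP) -> str:
    """Join one input/output pair into a single training string."""
    return f"{input_text}{sep}{output_text}"


def split_for_autocomplete(document: str, cut: int) -> Tuple[str, str]:
    """Recast a plain document as an input/output pair by cutting it
    at a character offset: everything before the cut is the input,
    everything after it is the output to be generated."""
    return document[:cut], document[cut:]


prefix, continuation = split_for_autocomplete("def add(a, b): return a + b", 15)
example = format_pair(prefix, continuation)
print(example)
```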

omneity · on Sept 14, 2023

Axolotl takes a lot of formats, and not all of them are in the form of input/output.

"Completion" format only takes a single text value per dataset record. Some other formats are in the form of multiple choice answers, etc.

Take a look below (there are more formats under "see other formats"): https://github.com/OpenAccess-AI-Collective/axolotl#dataset

rrherr · on Sept 12, 2023

“most tasks can be formulated this way, including autocomplete tasks”

For autocomplete tasks, with a corpus of unlabeled documents, would you insert a separator token at an arbitrary space in each document, in order to form input/output pairs?

omneity · on Sept 14, 2023

What you described is basically an input/output pair. The input is the sentence so far, and the output is the next token. You build your dataset by splitting the raw text corpus into sentences, paragraphs or documents, and for each of these chunks you generate input/target pairs by taking the tokens before the Nth token as input and the Nth token itself as output. You do this for each token in every chunk of your corpus.

For further reference you can look up "next-token prediction objective".
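A rough sketch of the pair construction described above, assuming whitespace splitting as a stand-in for a real tokenizer (the function name `next_token_pairs` is made up for this example):

```python
from typing import List, Tuple


def next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """For each position n >= 1, pair the tokens before position n
    (the input) with the token at position n (the target)."""
    pairs = []
    for n in range(1, len(tokens)):
        pairs.append((tokens[:n], tokens[n]))
    return pairs


# Whitespace splitting is an assumption; real models use subword tokenizers.
chunk = "the cat sat down".split()
pairs = next_token_pairs(chunk)
for context, target in pairs:
    print(context, "->", target)
```

In practice, trainers for causal language models usually don't materialize these pairs one by one. They feed the whole chunk in once and compute the loss at every position in a single pass, which amounts to the same objective.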