I want to try fine-tuning for machine translation to and from a fairly niche language (https://en.wikipedia.org/wiki/S'gaw_Karen_language). How much text would I need, and what format would be ideal?

I have a number of book-length texts, most only in the target language, and a few bilingual or multilingual. For the bilingual and multilingual texts, I can script out probably several thousand pairs of "translate the following text from <source_lang> to <target_lang>: <source_lang_text> <target_lang_text>". Do I need to vary the prompt and format, or can I expect the LLM to generalize to different translation requests? Is there value in repeating the material at different lengths: one set at sentence length, another at paragraph length, and another at page or chapter length? Also, what should be done with the monolingual texts, just ignore them?
If you want to fine-tune Llama 2 or similar, embed each pair both together and separately, and store the embeddings. Then use the unlabeled data (the source text without a translation) to query the embeddings for similar matches. You then send in the necessary prompt text with the matches, plus the text to translate. You'll want to do this with a foundation model, like GPT-x.
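A minimal sketch of that retrieval step, assuming sentence-transformers is installed; the model name and prompt wording are placeholders of mine, and a general multilingual embedding model may not cover S'gaw Karen well, so retrieval quality needs checking:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Placeholder multilingual model; verify it actually embeds your
    # language pair sensibly before trusting the retrieved matches.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    # (source_text, target_text) pairs scripted out of the bilingual books.
    pairs = [
        ("Good morning.", "<target_translation_1>"),
        ("Where is the market?", "<target_translation_2>"),
    ]
    pair_vecs = model.encode([src for src, _ in pairs], normalize_embeddings=True)

    def build_prompt(text_to_translate, k=2):
        # Cosine similarity reduces to a dot product on normalized vectors.
        q = model.encode([text_to_translate], normalize_embeddings=True)[0]
        top = np.argsort(pair_vecs @ q)[::-1][:k]
        examples = "\n".join(f"{pairs[i][0]} => {pairs[i][1]}" for i in top)
        return (f"{examples}\n"
                f"Translate the following text in the same way:\n"
                f"{text_to_translate} =>")

    print(build_prompt("Good evening."))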
As noted below, extracting words or key terms might be a good idea, as they could be included in the training set.
The training set would then consist of the prompt, the translation, and the key terms. Since you will want to vet the generated texts anyway, you could then decide whether the foundation model was performing well enough. You could also try running the largest "open" model you can find on the prompts, to see whether those need training as well. There are many Llama models trained for specific language pairs on Hugging Face, so check whether your languages are already covered and test those.
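One plausible shape for such a training record, written to JSONL as many fine-tuning pipelines expect; the field names here are my own guess, not a fixed format:

    import json

    # Hypothetical record layout: prompt + translation + extracted key terms.
    record = {
        "prompt": "Translate the following text from <source_lang> to <target_lang>: <source_lang_text>",
        "completion": "<target_lang_text>",
        "key_terms": {"<source_word>": "<target_word>"},
    }

    with open("train.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")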
I'm building a simple, open-source ML pipeline manager at https://ai.featurebase.com/. I'd be down to help you with this!
Language translation can be tricky because of the underlying nuances in each language, so more context would probably be better. Evaluating performance in multiple steps, starting at the key (word) level, would be a good way to build confidence.
It might be beneficial to start your dataset at the key (word) level: generate embeddings of each key pair in the source and target and stash them, then do the same at the sentence level and, just for fun, the paragraph level. (I believe the sentence level gives you enough context, since a paragraph is just a group of sentences, but it would still be interesting to generate paragraph-level key pairs.)
From there you'd have an embedding for each src:tgt word pair, along with context for how it fits at the sentence and paragraph level, capturing the respective nuances of each language.
Once you have that dataset, you can augment your data with prompts like the ones you're using, but also include contextual references to word pairs and sentence pairs in the prompt, which should steer the LLM down the right path. A sketch of what that augmented prompt might look like is below.
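A minimal sketch of that augmentation step, assuming you already have a lookup table of word pairs and some retrieved sentence pairs; all names and the prompt layout here are placeholders:

    # Hypothetical augmentation: prepend word-pair and sentence-pair context
    # to the basic translation prompt.
    word_pairs = {"<source_word>": "<target_word>"}                # dictionary
    sentence_pairs = [("<source_sentence>", "<target_sentence>")]  # retrieved

    def augment_prompt(source_lang, target_lang, text):
        glossary = "\n".join(f"{s} = {t}" for s, t in word_pairs.items())
        examples = "\n".join(f"{s} => {t}" for s, t in sentence_pairs)
        return (f"Relevant word pairs:\n{glossary}\n"
                f"Example sentences:\n{examples}\n"
                f"Translate the following text from {source_lang} "
                f"to {target_lang}: {text}")

    print(augment_prompt("<source_lang>", "<target_lang>", "<source_lang_text>"))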
Edit: not an expert, so I'll defer if someone smarter comes along.
Oh, yes, pairs of words is a good idea. I also have a bilingual dictionary and can generate a prompt for each entry, something like "here's a word in <lang_a>, write a dictionary definition for it in <lang_b>: <lang_a_word>: <lang_b_definition>".
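Scripting that out is only a few lines, assuming the dictionary is in some parseable form; the tab-separated file name and layout here are made up for illustration:

    import json

    # Assumed format: one "<lang_a_word>\t<lang_b_definition>" entry per line.
    with open("dictionary.tsv", encoding="utf-8") as src, \
         open("dict_prompts.jsonl", "w", encoding="utf-8") as out:
        for line in src:
            if "\t" not in line:
                continue  # skip malformed entries
            word, definition = line.rstrip("\n").split("\t", 1)
            record = {
                "prompt": ("here's a word in <lang_a>, write a dictionary "
                           f"definition for it in <lang_b>: {word}:"),
                "completion": f" {definition}",
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")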