Optimizing LLMs from a Dataset Perspective (sebastianraschka.com)
138 points by alexmolas on Sept 15, 2023 | 24 comments



I have wondered if the very big models trained on a Big Pile of Everything can be used to curate smaller, higher-quality datasets that lead to high-performing models with smaller parameter counts. Not only are smaller models easier to distribute and faster at inference time, but they also offer a licensing escape hatch if future copyright law changes or court rulings make it hard to publicly offer models trained on non-permissively licensed material.

1) Train an initial big model on everything you can get, yielding a capable but tainted-in-some-jurisdictions model. Keep that model private.

2) Use the big tainted model to narrow or distill the source data. One way is by identifying the document subset that can be used freely (old public-domain works, user-generated content uploaded to your own service under your own company's ToS, government documents, works with unrestricted Creative Commons licensing...). The other way is by using it to build "just the facts" distillations from restrictively licensed material (a rough filtering sketch follows step 3).

3) Train an untainted model using just the factual distillations and/or the permissively licensed material.
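A minimal sketch of the step-2 filtering pass, assuming a hypothetical classify_license helper wrapping the private model; the prompt wording, labels, and JSONL corpus format are all illustrative assumptions, not a known pipeline:

    import json

    PROMPT = (
        "Classify the licensing status of the following document as one of: "
        "PUBLIC_DOMAIN, PERMISSIVE, RESTRICTED. Reply with the label only.\n\n{doc}"
    )

    def classify_license(doc, llm):
        # llm() is a stand-in for a call to the private "tainted" model
        return llm(PROMPT.format(doc=doc[:4000])).strip()

    def filter_corpus(in_path, out_path, llm):
        # Keep only the documents the big model judges safe to train on
        with open(in_path) as src, open(out_path, "w") as dst:
            for line in src:
                doc = json.loads(line)["text"]
                if classify_license(doc, llm) in {"PUBLIC_DOMAIN", "PERMISSIVE"}:
                    dst.write(line)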


Sounds vaguely like the paper "Textbooks Are All You Need"? Though they are not explicitly trying to remove the copyright taint.

https://arxiv.org/abs/2306.11644


Not sure on the licensing, but yes, technically you can do that.

Phi-1, and therefore phi-1.5, are partially trained on GPT-3.5-generated synthetic textbooks.


The premise here is specifically not to train on the generated output of the bigger model, but merely to use the bigger model to better curate non-generated (and thereby untainted) inputs for the smaller model's training set.


That's what I proposed in the "Alternative Models" section of my article. Except I wanted to use public-domain works (e.g., Project Gutenberg) for the base model so it's legally clear. Then, for a model with proprietary content: K-12 and college textbooks, encyclopedias, and specialist works licensed for that purpose. Train the base like we train kids. Then, use it to generate or evaluate the rest.

https://heswithjesus.com/tech/exploringai/index.html


I have also wondered if OpenAI are going to train a private model with all the ChatGPT history, and then use that to train a public model.


Doesn't that lead to model collapse?


The trick is training a little, then augmenting with documents using RAG. The idea that a model alone can handle complex use cases is common, but usually wrong.
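A minimal sketch of that RAG step, assuming a hypothetical embed() helper that returns one numpy vector per text; no particular library's API is implied:

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def rag_answer(question, docs, llm, embed, k=3):
        # Embed the corpus (in practice, cache these in a vector store)
        doc_vecs = [embed(d) for d in docs]
        q = embed(question)
        # Pick the k most similar documents
        top = sorted(range(len(docs)),
                     key=lambda i: cosine(q, doc_vecs[i]), reverse=True)[:k]
        context = "\n\n".join(docs[i] for i in top)
        # Ground the model's answer in the retrieved context
        return llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")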


RAG?



What would a good fine-tuning dataset for language translation look like?

I want to try fine-tuning to machine translate to and from a fairly niche language (https://en.wikipedia.org/wiki/S'gaw_Karen_language). How much text would I need, and what format would be ideal?

I have a number of book length texts, most only in the target language, and a few bilingual or multilingual. For the bilingual and multilingual texts, I can script out probably several thousand pairs of "translate the following text from <source_lang> to <target_lang>: <source_lang_text> <target_lang_text>". Do I need to vary the prompt and format, or can I expect the LLM to generalize to different translation requests? Is there value in repeating the material in different lengths? One set of sentence lengths, another paragraph, and another page or chapter length? Also what should be done with the monolingual texts, just ignore them?
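For what it's worth, a sketch of scripting those pairs into a fine-tuning file with a few template variations, so the model doesn't overfit to one phrasing; the templates and JSONL fields are illustrative assumptions:

    import json, random

    TEMPLATES = [  # wording is illustrative, not a known recipe
        "Translate the following text from {src} to {tgt}:\n{text}",
        "Please render this {src} passage in {tgt}:\n{text}",
        "{src} original:\n{text}\n\n{tgt} translation:",
    ]

    def make_examples(pairs, src="English", tgt="S'gaw Karen"):
        # pairs: (source_text, target_text) tuples from your aligned books
        for source_text, target_text in pairs:
            prompt = random.choice(TEMPLATES).format(src=src, tgt=tgt, text=source_text)
            yield {"instruction": prompt, "output": target_text}

    aligned_pairs = [("Hello, how are you?", "...")]  # built from your texts
    with open("translation_sft.jsonl", "w") as f:
        for ex in make_examples(aligned_pairs):
            f.write(json.dumps(ex, ensure_ascii=False) + "\n")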


If you want to fine-tune Llama 2 or similar, then embed each pair together and separately and store them. Then, use the unlabeled data (the source text without a translation) to query the embeddings for similar matches. You then send the necessary prompt text with the matches, plus the text to translate. You'll want to do this with a foundational model, like GPT-x.
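A rough sketch of that retrieval step, again with a hypothetical embed() helper; the few-shot prompt assembly is an illustrative assumption:

    import numpy as np

    def nearest_pairs(query, pairs, pair_vecs, embed, k=3):
        # pairs: (source, target) tuples; pair_vecs: their stored embeddings
        q = embed(query)
        sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in pair_vecs]
        idx = sorted(range(len(pairs)), key=sims.__getitem__, reverse=True)[:k]
        return [pairs[i] for i in idx]

    def build_prompt(query, examples):
        # Few-shot prompt: similar known translations, then the new source text
        shots = "\n\n".join(f"Source: {s}\nTranslation: {t}" for s, t in examples)
        return f"{shots}\n\nSource: {query}\nTranslation:"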

As noted below, extracting words or keyterms would probably be a good idea, as they could be included in the training set.

The training set would then be composed of the prompt, the translation, and the keyterms. As you will want to vet the generated texts anyway, you could then decide whether the foundational model was working well enough. You could also try running the largest "open" model you can find on the prompts, to see whether those need fine-tuning as well. There are many Llama models on Hugging Face trained for language pairs, so check whether your language pair already exists and test those.

I'm building a simple, Open Source ML pipeline manager at https://ai.featurebase.com/. I'd be down to help you with this!


Language translation can be tricky because of the underlying nuances in each language, so more context would probably be better; using multiple steps to evaluate performance at the key (word) level would be a good way to improve confidence.

It might be beneficial to start your dataset at the key (word) level: generate embeddings of each key pair in the source and target and stash them, then do the same at the sentence level and, just for fun, the paragraph level. (You could probably get enough context from the sentence level, since a paragraph is just a group of sentences, but it would still be interesting to generate paragraph-level key pairs.)

From there you'd have a set of embeddings for each word src:tgt pair that also carries context of how it fits at the sentence and paragraph level, with the respective nuances of each language.

Once you have that dataset, you can augment your data with prompts like the ones you're using, but also include some contextual references to word pairs and sentence pairs in your prompt, which should steer the LLM down the right path.
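To make the multi-level idea concrete, a rough sketch that tags each embedded src:tgt pair with its granularity; the aligned inputs and embed() helper are assumptions:

    def build_levels(word_pairs, sentence_pairs, paragraph_pairs, embed):
        # Each *_pairs argument is a list of (source, target) tuples,
        # aligned elsewhere; tag every record with its granularity.
        records = []
        for level, pairs in [("word", word_pairs),
                             ("sentence", sentence_pairs),
                             ("paragraph", paragraph_pairs)]:
            for src, tgt in pairs:
                records.append({
                    "level": level, "src": src, "tgt": tgt,
                    # Embedding the joined pair keeps both languages in one vector
                    "vec": embed(f"{src} ||| {tgt}"),
                })
        return records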

Edit: not an expert so will heed if someone smarter comes along.


Oh, yes, pairs of words is a good idea. I also have a bilingual dictionary and can generate a prompt for each entry, something like: "here's a word in <lang_a>, write a dictionary definition for it in <lang_b>: <lang_a_word>: <lang_b_definition>".
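A short sketch of scripting that, assuming the dictionary can be exported as two-column word,definition rows; the file layout is illustrative:

    import csv, json

    with open("dictionary.csv") as src, open("dict_sft.jsonl", "w") as dst:
        for word, definition in csv.reader(src):
            prompt = (f"Here's a word in <lang_a>, write a dictionary "
                      f"definition for it in <lang_b>: {word}")
            dst.write(json.dumps({"instruction": prompt, "output": definition},
                                 ensure_ascii=False) + "\n")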


Are there any resources available to help me get started with this process? I'm also interested in fine-tuning a model for my native language.


I was hoping that this would go more into the details of dataset selection and what makes for high-quality data, but it seems to be more a prelude to a Lit-GPT advertisement :/


I speculate that high-quality data can be identified as similar via embedding comparisons. By organizing the dataset using features, grouping the vectors by them, and relating the dataset to itself within a given domain via quality tags, the data may become more useful as queries to it become more specific.

I need to train more models to see if this is an accurate claim, but I've been finishing up the storage layer and haven't gotten to that yet.
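A small sketch of the grouping idea using k-means over document embeddings; the cluster count and hand-tagging step are assumptions, not a validated recipe:

    import numpy as np
    from sklearn.cluster import KMeans

    def group_by_embedding(doc_vecs, n_groups=8):
        # Cluster embeddings so similar documents land together; each
        # cluster can then be inspected and given a quality tag by hand
        km = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
        labels = km.fit_predict(np.asarray(doc_vecs))
        return {g: np.flatnonzero(labels == g) for g in range(n_groups)}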


What are methods other than fine-tuning to make LLMs smarter? I'm familiar with RAG (Retrieval-Augmented Generation).


Ensembles, for one. We might ask the model the same keyword-extraction question several times, for example, and then aggregate the results. Here's a horrible example, but it works and runs on a GPU box: https://github.com/FeatureBaseDB/Laminoid/blob/main/sloth/sl...
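A minimal sketch of that aggregation, assuming an llm() callable and majority voting over extracted terms; the vote threshold is an illustrative choice:

    from collections import Counter

    def ensemble_keywords(text, llm, runs=5, min_votes=3):
        # Ask the same extraction question several times; sampling
        # temperature makes the answers vary from run to run
        votes = Counter()
        for _ in range(runs):
            reply = llm(f"List the key terms in this text, comma-separated:\n{text}")
            votes.update(t.strip().lower() for t in reply.split(",") if t.strip())
        # Keep only the terms a majority of runs agreed on
        return [term for term, n in votes.most_common() if n >= min_votes]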

Keyterms can be used in the prompt to drive the LLM to better-grounded responses, as well as to help locate relevant embeddings for RAG when vector search isn't enough.

Another consideration is writing code for processing things that look similar. For example, one might have the LLM write a regex, which is then tested and put into production in a pipeline to parse log files, or write SQL from conversational queries, which is then run against a database.
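A sketch of that test-before-production loop for the regex case; the llm() callable and sample checks are assumptions:

    import re

    def llm_regex(llm, description, samples, max_tries=3):
        # samples: (log_line, should_match) pairs used as the acceptance test
        for _ in range(max_tries):
            proposal = llm(f"Write only a Python regex that matches: {description}")
            try:
                pattern = re.compile(proposal.strip())
            except re.error:
                continue  # not even valid regex syntax; ask again
            if all(bool(pattern.search(line)) == want for line, want in samples):
                return pattern  # passed every test case; safe to deploy
        return None  # never ship an unvetted pattern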


Perhaps not smarter, but you could filter out answers in demonstration data that appear computed but were never actually derived. Also include demonstrations that show the steps leading up to a computation, in a format a runtime can parse and execute, similar to Code Interpreter / Advanced Data Analysis in ChatGPT.

Taking it a step further, I would include in the demonstration a test harness set up with a test suite to prove the proposed implementation.

I would go through each demonstration with a fixed set of criteria, measuring not just passing tests but also a level of complexity and usefulness.

Why? I was looking through Code Llama's demonstration data for fine-tuning and saw answers that were not even checked for correctness or usefulness.
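A rough sketch of that harness for code demonstrations: run each candidate answer against its tests and keep only the ones that pass. The demonstration format is an assumption:

    def passes_tests(code, tests):
        # Execute the candidate answer, then its test suite, in one namespace.
        # NOTE: exec() runs untrusted code; a real harness should sandbox it.
        ns = {}
        try:
            exec(code, ns)
            exec(tests, ns)  # tests are plain asserts against defined names
            return True
        except Exception:
            return False

    def filter_demos(demos):
        # demos: list of {"prompt": ..., "answer": ..., "tests": ...} dicts
        return [d for d in demos if passes_tests(d["answer"], d["tests"])]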


Other than fine-tuning and RAG, Guidance allows you to constrain the output of an LLM within a grammar, for example to guarantee JSON output 100% of the time.

Here's one library to do this https://github.com/guidance-ai/guidance
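guidance's real interface is template-based; as a conceptual sketch only (not its actual API), the trick is to emit the fixed JSON structure yourself and let the model fill only the free spans:

    def constrained_json(llm_gen):
        # llm_gen(prompt, stop) is a hypothetical "generate until stop
        # string" call; libraries like guidance add token masking on top.
        out = '{"name": "'
        out += llm_gen(out, stop='"')   # model fills the string value only
        out += '", "age": '
        raw = llm_gen(out, stop="}")
        out += ("".join(c for c in raw if c.isdigit()) or "0")  # force a number
        out += "}"
        return out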


RLHF is a popular candidate, but the focus is more on "helpfulness" and "safety"; I don't think it necessarily improves LLMs on reasoning benchmarks.


If anything, RLHF makes the model dumber, not smarter.


I think it could potentially make the model smarter, but it comes down to how you collect the data to train the reward models. Currently, companies and papers that use RLHF focus on "safety" rankings, for example. But you could instead collect "smartness" or "correctness" labels and train the reward model on these. (And then use that reward model to finetune the LLM you want to improve.)
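For reference, the standard pairwise reward-model objective is the same regardless of what the labels measure; a minimal PyTorch sketch with the scoring model left abstract:

    import torch
    import torch.nn.functional as F

    def reward_loss(score_chosen, score_rejected):
        # Bradley-Terry pairwise loss: push the preferred response's scalar
        # score above the rejected one's. Swapping safety labels for
        # correctness labels changes only the data, not this objective.
        return -F.logsigmoid(score_chosen - score_rejected).mean()

    # scores would come from a reward-model head over (prompt, response) pairs
    chosen = torch.randn(8, requires_grad=True)
    rejected = torch.randn(8)
    loss = reward_loss(chosen, rejected)
    loss.backward()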



