A couple of months ago I attended a presentation of an on-prem LLM. An audience member asked if it was using OpenAI in any way.
The presenter, somewhat overeagerly, said "Why not ask our new AI?" and went on to type: "Are you an independent model or do you use OpenAI?"
The chatbot answered, in florid language, that it was indeed using ChatGPT as a backend. Which it was not, and which was kind of the whole point of the presentation.
Why not? This seems like a great thing to happen. If the author's heart is really in the project, then this is a huge opportunity to learn something new.
Obviously talking my own book here, but we've helped dozens of customers make the transition from prompted GPT-4 or GPT-3.5 to their own fine-tuned models at OpenPipe.
The most common reaction I get is "wow, I didn't expect that to work so well with so little effort". For most tasks, a fine-tuned Mistral 7B will consistently outperform GPT-3.5 at a fraction of the cost, and for some use cases will even match or outperform GPT-4 (particularly for narrower tasks like classification, information extraction, summarization -- but a lot of folks have that kind of task). Some aggregate stats are in our blog: https://openpipe.ai/blog/mistral-7b-fine-tune-optimized
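For a rough idea of what the fine-tuning side can look like, here's a minimal LoRA sketch using the Hugging Face peft library. This is just an illustration, not OpenPipe's actual pipeline; the training data and training loop are omitted.

    # Attach low-rank adapters to Mistral 7B's attention projections.
    # In practice you'd then feed prompt/completion pairs to a trainer.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = "mistralai/Mistral-7B-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only a small fraction of weights are trainable

The point is that the adapter setup itself is a handful of lines; most of the work is in curating the training examples.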
Were there any tasks like intent/entity detection, etc.? I guess classification/information extraction covers that, but anything more specific?
I fine-tuned an LLM to do technical stuff. It works pretty darn well. What I actually discovered is that evaluating LLMs is surprisingly difficult. And also that GPT-4 isn't that great, in general.
Could you provide more details on this matter? Specifically, I'm interested in knowing which base model you've utilized and the approach you've taken to fine-tune it. Your insights would be greatly appreciated and highly beneficial.
For narrow stuff you can do a better job than a base GPT-4/Mistral/etc. model. Fine-tune it with your very custom data, stuff the base model didn't seem to be trained on, and it will generalize well.
You're not wrong. There's been a lot of drama over licensing and releasing datasets, and a lot of the LLM scene are just pitchmen and promoters with no better grasp over what they're doing than "trust me, it's better".
Like with "prompt engineering", a lot of people are just hiding how much of the heavy lifting is from base models and a fluke of the merge. The past few "secret" set leaks were low/no delta diffs to common releases.
I said it a year ago, but if we want to be wowed, make this a job for MLIS holders and reference librarians. Without thorough, thoughtful curation, these things are just toys in the wrong hands.
Maybe the key to a good universal LLM is having multiple fine-tuned models for various domains. The user thinks they're querying a single model, but really there's some mechanism that selects the best model for their query out of, say, 300 different possibilities.
This also helps distribute traffic as a side effect.
I guess the problem is how the conversation would flow. If the user changes topics from, say, art to quantum physics and then asks a question about both quantum physics and art, I'm not sure what the algorithm should do.
Two users. One user is talking about physics, the other about art. Two different models are utilized.
Load is divided across the two models. Load balancing comes for free, and the division is by subject. Of course, this assumes each model owns its own set of GPUs.
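A toy sketch of that kind of router, just to make the idea concrete. The endpoints are made up, and the keyword matching is only there to keep the sketch self-contained; a real router would use a small classifier or embedding lookup.

    # Dispatch each query to a per-domain model endpoint.
    DOMAIN_MODELS = {
        "physics": "http://gpu-pool-a:8000/v1",  # hypothetical endpoints
        "art": "http://gpu-pool-b:8000/v1",
    }

    def route(query: str) -> str:
        # Crude domain detection for illustration only.
        domain = "physics" if "quantum" in query.lower() else "art"
        return DOMAIN_MODELS[domain]

    print(route("Explain quantum entanglement"))  # -> the physics endpoint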
Running Mistral-Instruct-0.1 for call/email summarization, Mixtral for contract mining, and OpenChat to augment an agentic chatbot equipped with RAG tools (Instruct again).
Experience has been great. INT8 tradeoffs are acceptable until hardware FP8 (FP4 anyone?) becomes more widely and cheaply available. On-prem costs have already been absorbed for a few boxes of A100s and legacy V100s running millions of such interactions.
I think PyTorch recently added FP8, and even Intel's Neural Compressor has experimental support. But yeah, most HF LLM models are currently just loaded in INT8 or FP4/NF4 thanks to bitsandbytes by Tim Dettmers.
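For reference, loading a model in 8-bit (or NF4) with bitsandbytes through transformers looks roughly like this; the model id here is just an example.

    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mistral-7B-Instruct-v0.1"
    # Swap to load_in_4bit=True, bnb_4bit_quant_type="nf4" for 4-bit loading.
    bnb = BitsAndBytesConfig(load_in_8bit=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id,
                                                 quantization_config=bnb,
                                                 device_map="auto")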
Basically augmenting users who parse PDFs and want to prefill values into Excel instead of typing it all out, e.g. liabilities, time periods/frequencies mentioned, owners of clauses, etc.
I've tried CodeLlama with Ollama, along with Continue.dev, and found it to be pretty good. The only downside is that I couldn't "productively" run the 70B version, even on my MBP with an M3 Max and 36GB of RAM (which interestingly should be enough to hold the quantized model weights). It was simply painfully slow. The 34B one works well enough for most of my use cases, so I am happy.
I tried to use CodeLlama 34B and I think it is pretty bad. For example, I asked it to convert a comment into a docstring and it would hallucinate a whole function around it.
What quantization were you using? I've been getting some weird results with 34B quantized to 4 bits -- glitching, dropped tokens, generating Java rather than Python as requested. But 7B, even at 4 bits, works OK. I posted about it earlier this evening: https://www.gilesthomas.com/2024/02/llm-quantisation-weirdne...
Same, CodeLlama 70B is known to suck. DeepSeek is the best for coding so far in my experience; Mixtral 8x7B is another great contender (for most tasks, to be frank). Miqu is generating buzz, but I haven't tested it personally yet.
deepseek-coder 6.7b is seriously impressive for how quickly it runs on an M1 Max. There’s a few spots where it still doesn’t fare quite as well as ChatGPT but it’s a small tradeoff considering that it’s fully local and doesn’t even spin up my laptop’s fans.
I prefer to use local models when running data extraction or processing over 10k or more records. Hosted services would be slow and brittle at this point.
Mistral 7B fine-tunes (OpenChat is my favorite) just chug through the data and get the job done.
Details: using vLLM to run the models. Using ChatGPT-4 to condense information for complex prompts (that the local models will execute).
I think the situation will just keep getting better with each month.
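For anyone curious, the vLLM side of this is roughly the sketch below; the model id, prompt, and records are placeholders for illustration.

    from vllm import LLM, SamplingParams

    records = ["...record 1...", "...record 2..."]   # your 10k+ rows
    llm = LLM(model="openchat/openchat-3.5-0106")    # example model id
    params = SamplingParams(temperature=0.0, max_tokens=256)

    prompts = [f"Extract the key fields from:\n{r}" for r in records]
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(out.outputs[0].text)

Batching everything through one generate() call is what makes it chug through large datasets quickly.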
We support both in our app and enterprise product. The APIs (OpenAI) vs libraries (i.e. llama.cpp for on-device) are so similar that the switch is basically transparent to the user. We're adding support for other platforms' APIs soon, and everything we've looked at so far is as easy to integrate as OpenAI -- except Google, which for some reason complicates everything on Google Cloud.
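To illustrate how thin the switch can be, here's a minimal sketch with llama-cpp-python on one side and the OpenAI client on the other. The model path and names are placeholders, not our actual code.

    # Same chat-completion shape whether we hit OpenAI or a local llama.cpp model.
    from llama_cpp import Llama
    from openai import OpenAI

    def chat(messages, local=False):
        if local:
            llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf")  # placeholder path
            out = llm.create_chat_completion(messages=messages)
            return out["choices"][0]["message"]["content"]
        client = OpenAI()
        out = client.chat.completions.create(model="gpt-4", messages=messages)
        return out.choices[0].message.content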
My 2024 prediction is that we will see far more people moving off of OpenAI once they encounter its cost and latency compared to (less proven/scaled) competitors. It's often a speed versus quality tradeoff, and I've seen multiple providers 3x faster than OpenAI with far more than 1/3 the quality.
When using it for coding I'd rather have quality than garbage 3x faster and cheaper. So I'm not moving. For narrow tasks it makes sense. Also interesting is the idea of having several models working together locally, with maybe an occasional ping to GPT-4.
I greatly prefer to use ChatGPT-4 instead of 3.5 despite the slowness. A really good feature for them to add would be an easy way to re-run a prompt on 4. However, the glitchiness of the service is kind of annoying.
OpenAI currently leads the front on (commercial) AI, so I doubt people will switch. In fact most offerings become outdated pretty fast and everyone else tries to play catch up.
Imagine using a GPT-2 type model when everyone else is using GPT-4. Until the dust settles there's no point in investing in alt models imo, unless you're leading the research.
I tested a bunch of models while building https://double.bot but ended up back on GPT-4. Other models are fun to play with, but it gets frustrating even if they miss 1 in 100 questions that GPT-4 gets. I find that right now I get more value implementing features around the model that fix all the GitHub Copilot papercuts (autocomplete that closes brackets properly, auto-import upon accepting suggestions, disabling suggestions when writing comments to be less distracting, midline completions, etc.).
Hopefully open-source models can catch up to GPT-4 in the next six months, once we've fixed all the low-hanging fruit outside of the model itself.
To add to this question, are there LLMs that I can run on my own data, that also can provide citations similar to the way phind.com does for their results? Even better if they are multilingual.
Mistral 7B was great for flights without wifi! Answers are pretty good for information you need to find, but its step-by-step instructions are hit or miss when it tries to do it for you.
Mixed results. I think Llama 2 in general is pretty bad, especially at anything other than English. I've had very good results with Mixtral for chat.
Of course all of them feel like a Frankenstein compared to actual ChatGPT. They feel similar and work just as well until, sometimes, they put out complete and utter garbage or artifacts and you wonder if they skimped on fine-tuning.
In a certain sense, all open models skimp on fine-tuning. Since ChatGPT gets user feedback, OpenAI theoretically sits on a growing pile of data that can be used for continuous fine-tuning and alignment. For local models, you'd have to look for new checkpoints every few months or track good and bad responses and do your own alignment.
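Tracking your own good and bad responses doesn't have to be fancy. A sketch of the idea: log preference pairs as JSONL for a later DPO-style alignment pass (file name and fields here are arbitrary, though prompt/chosen/rejected is a common shape for preference trainers).

    import json

    def log_feedback(prompt: str, chosen: str, rejected: str,
                     path: str = "preferences.jsonl"):
        # Append one preference pair per line for a later fine-tuning/alignment run.
        with open(path, "a") as f:
            f.write(json.dumps({"prompt": prompt,
                                "chosen": chosen,
                                "rejected": rejected}) + "\n")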
I think the llm utility[0] (the one from Simon, not Google) is probably the best quickstart experience you can find. Gives the option to connect to services via API or install/run local models.
As simple as:

    pip install llm
    # add the local plugin
    llm install llm-gpt4all
    # Download and run a prompt against the Orca Mini 3B model
    llm -m orca-mini-3b-gguf2-q4_0 'What is the capital of France?'
Alternatively, you could use llamafile[1], which is a tiny binary runner that gets packaged on top of the multi-gigabyte models. Download the llamafile and you can launch it through your terminal or a web browser.
From the llamafile page, after you download the file, you can just launch it as:

    ./mistral-7b-instruct-v0.2.Q5_K_M.llamafile -ngl 9999 --temp 0.7 -p '[INST]Write a story about llamas[/INST]'
Follow the guide all the way until you get to "Loading our model in Oobabooga". Then ignore the rest. You can do inference in Ooba under the Notebook tab.
(You can also ignore the "enabling HTTP API" parts, but it's quite handy, it's an OpenAI-compatible API which means you can use any OpenAI-compatible web UI)
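For example, you can point the standard openai client at the local server just by overriding base_url. A small sketch; the port depends on how you launched the API, so 5000 here is an assumption.

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:5000/v1",  # assumed local API port
                    api_key="not-needed")
    resp = client.chat.completions.create(
        model="local",  # model name is largely ignored by local servers
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)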
All of the alternatives support only a fraction of the features and LLM backends that oobabooga does.
If you don't care about the details of how those model servers work, then something that abstracts out the whole process like LM Studio or Ollama is all you need.
However, if you want to get into the weeds of how this actually works, I recommend you look up model quantization and some libraries like ggml[1] that actually do that for you.
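If you just want the intuition without reading ggml's source, block-wise quantization boils down to something like the sketch below. This is heavily simplified; real formats pack the integers tightly and store extra metadata per block.

    import numpy as np

    def quantize_block(weights: np.ndarray, bits: int = 4):
        # One scale per block maps floats onto a small signed-integer grid.
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(weights).max() / qmax
        q = np.round(weights / scale).astype(np.int8)
        return q, scale

    def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    block = np.random.randn(32).astype(np.float32)  # ggml-style formats work on small blocks
    q, s = quantize_block(block)
    print(np.abs(block - dequantize_block(q, s)).max())  # worst-case quantization error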
It's all pretty well put together nowadays, honestly.
Here's a dead simple way: (1) download LM Studio and install it[0], (2) download a model from within the client when prompted, (3) have a ball.
The program is fairly intuitive; it takes care of finding the relevant files, and it can even accept addendum prompts and various ways to flavor or specialize answers.
Learn the basics there, then take what you learn to a more 'industrial' playground later on.