A couple of months ago I attended a presentation of an on-prem LLM. An audience member asked if it was using OpenAI in any way.
The presenter, somewhat overeagerly, said "Why not ask our new AI?" and went on to type: "Are you an independent model or do you use OpenAI?"
The chatbot answered, in florid language, that it was indeed using ChatGPT as a backend. Which it was not, and which was kind of the whole point of the presentation.
Why not? This seems like a great thing to happen. If the author's heart is really in the project, then this is a huge opportunity to learn something new.
Obviously talking my own book here, but we've helped dozens of customers make the transition from prompted GPT-4 or GPT-3.5 to their own fine-tuned models at OpenPipe.
The most common reaction I get is "wow, I didn't expect that to work so well with so little effort". For most tasks, a fine-tuned Mistral 7B will consistently outperform GPT-3.5 at a fraction of the cost, and for some use cases will even match or outperform GPT-4 (particularly for narrower tasks like classification, information extraction, summarization -- but a lot of folks have that kind of task). Some aggregate stats are in our blog: https://openpipe.ai/blog/mistral-7b-fine-tune-optimized
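For a rough idea of what the fine-tuning side can look like, here's a minimal LoRA sketch using the Hugging Face peft library. This is just an illustration, not OpenPipe's actual pipeline; the training data and training loop are omitted.

    # Attach low-rank adapters to Mistral 7B's attention projections.
    # In practice you'd then feed prompt/completion pairs to a trainer.
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = "mistralai/Mistral-7B-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

    lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only a small fraction of weights are trainable

The point is that the adapter setup itself is a handful of lines; most of the work is in curating the training examples.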
Were there any tasks like intent/entity detection, etc.? I guess classification/information extraction covers that, but anything more specific?
I fine-tuned an LLM to do technical stuff. It works pretty darn well. What I actually discovered is that evaluating LLMs is surprisingly difficult. And also that GPT-4 isn't that great, in general.
Could you provide more details on this matter? Specifically, I'm interested in knowing which base model you've utilized and the approach you've taken to fine-tune it. Your insights would be greatly appreciated and highly beneficial.
For narrow stuff you can do a better job than a base GPT-4/Mistral/etc. model. Fine-tune it with your very custom data, stuff the base model didn't seem to be trained on, and it will generalize well.
You're not wrong. There's been a lot of drama over licensing and releasing datasets, and a lot of the LLM scene are just pitchmen and promoters with no better grasp over what they're doing than "trust me, it's better".
Like with "prompt engineering", a lot of people are just hiding how much of the heavy lifting is from base models and a fluke of the merge. The past few "secret" set leaks were low/no delta diffs to common releases.
I said it a year ago, but if we want to be wowed, make this a job for MLIS holders and reference librarians. Without thorough, thoughtful curation, these things are just toys in the wrong hands.
Maybe the key to a good universal LLM is having multiple fine-tuned models for various domains. The user thinks they're querying a single model, but really there's some mechanism that selects the best model for their query out of, say, 300 different possibilities.
This also helps distribute traffic as a side effect.
I guess the problem is how the conversation would flow. If the user changes topics from, say, art to quantum physics and then asks a question about both quantum physics and art, I'm not sure what the algorithm should do.
Two users. One user is talking about physics, the other about art. Two different models are utilized.
Load is divided across the two models. Load balancing comes for free, and the division is by subject. Of course, this assumes each model owns its own set of GPUs.
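A toy sketch of that kind of router, just to make the idea concrete. The endpoints are made up, and the keyword matching is only there to keep the sketch self-contained; a real router would use a small classifier or embedding lookup.

    # Dispatch each query to a per-domain model endpoint.
    DOMAIN_MODELS = {
        "physics": "http://gpu-pool-a:8000/v1",  # hypothetical endpoints
        "art": "http://gpu-pool-b:8000/v1",
    }

    def route(query: str) -> str:
        # Crude domain detection for illustration only.
        domain = "physics" if "quantum" in query.lower() else "art"
        return DOMAIN_MODELS[domain]

    print(route("Explain quantum entanglement"))  # -> the physics endpoint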
Running Mistral-Instruct-0.1 for call/email summarization, Mixtral for contract mining, and OpenChat to augment an agentic chatbot equipped with RAG tools (Instruct again).
Experience has been great. INT8 tradeoffs are acceptable until hardware FP8 (FP4 anyone?) becomes more widely and cheaply available. On-prem costs have already been absorbed for a few boxes of A100s and legacy V100s running millions of such interactions.
I think PyTorch recently added FP8, and even Intel's Neural Compressor has experimental support. But yeah, most HF LLM models are currently just loaded in INT8 or FP4/NF4 thanks to bitsandbytes by Tim Dettmers.
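For reference, loading a model in 8-bit (or NF4) with bitsandbytes through transformers looks roughly like this; the model id here is just an example.

    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "mistralai/Mistral-7B-Instruct-v0.1"
    # Swap to load_in_4bit=True, bnb_4bit_quant_type="nf4" for 4-bit loading.
    bnb = BitsAndBytesConfig(load_in_8bit=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id,
                                                 quantization_config=bnb,
                                                 device_map="auto")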
Basically augmenting users who parse PDFs and want to prefill values into Excel instead of typing it all out, e.g. liabilities, time periods/frequencies mentioned, owners of clauses, etc.
I've tried CodeLlama with Ollama, along with Continue.dev, and found it to be pretty good. The only downside is that I couldn't "productively" run the 70B version, even on my MBP with an M3 Max and 36GB of RAM (which interestingly should be enough to hold the quantized model weights). It was simply painfully slow. The 34B one works well enough for most of my use cases, so I am happy.
I tried to use CodeLlama 34B and I think it is pretty bad. For example, I asked it to convert a comment into a docstring and it would hallucinate a whole function around it.
What quantization were you using? I've been getting some weird results with 34B quantized to 4 bits -- glitching, dropped tokens, generating Java rather than Python as requested. But 7B, even at 4 bits, works OK. I posted about it earlier this evening: https://www.gilesthomas.com/2024/02/llm-quantisation-weirdne...
Same, CodeLlama 70B is known to suck. DeepSeek is the best for coding so far in my experience; Mixtral 8x7B is another great contender (for most tasks, to be frank). Miqu is generating buzz, but I haven't tested it personally yet.
deepseek-coder 6.7b is seriously impressive for how quickly it runs on an M1 Max. There’s a few spots where it still doesn’t fare quite as well as ChatGPT but it’s a small tradeoff considering that it’s fully local and doesn’t even spin up my laptop’s fans.
I prefer to use local models when running data extraction or processing over 10k or more records. Hosted services would be slow and brittle at this point.
Mistral 7B fine-tunes (OpenChat is my favorite) just chug through the data and get the job done.
Details: using vLLM to run the models. Using ChatGPT-4 to condense information for complex prompts (that the local models will execute).
I think the situation will just keep getting better with each month.
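For anyone curious, the vLLM side of this is roughly the sketch below; the model id, prompt, and records are placeholders for illustration.

    from vllm import LLM, SamplingParams

    records = ["...record 1...", "...record 2..."]   # your 10k+ rows
    llm = LLM(model="openchat/openchat-3.5-0106")    # example model id
    params = SamplingParams(temperature=0.0, max_tokens=256)

    prompts = [f"Extract the key fields from:\n{r}" for r in records]
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(out.outputs[0].text)

Batching everything through one generate() call is what makes it chug through large datasets quickly.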
We support both in our app and enterprise product. The APIs (OpenAI) vs libraries (i.e. llama.cpp for on-device) are so similar that the switch is basically transparent to the user. We're adding support for other platforms' APIs soon, and everything we've looked at so far is as easy to integrate as OpenAI -- except Google, which for some reason complicates everything on Google Cloud.
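To illustrate how thin the switch can be, here's a minimal sketch with llama-cpp-python on one side and the OpenAI client on the other. The model path and names are placeholders, not our actual code.

    # Same chat-completion shape whether we hit OpenAI or a local llama.cpp model.
    from llama_cpp import Llama
    from openai import OpenAI

    def chat(messages, local=False):
        if local:
            llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf")  # placeholder path
            out = llm.create_chat_completion(messages=messages)
            return out["choices"][0]["message"]["content"]
        client = OpenAI()
        out = client.chat.completions.create(model="gpt-4", messages=messages)
        return out.choices[0].message.content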
My 2024 prediction is that we will see far more people moving off of OpenAI once they encounter its cost and latency compared to (less proven/scaled) competitors. It's often a speed versus quality tradeoff, and I've seen multiple providers 3x faster than OpenAI with far more than 1/3 the quality.
When using it for coding I'd rather have quality than garbage 3x faster and cheaper. So I'm not moving. For narrow tasks it makes sense. Also interesting is the idea of having several models working together locally, with maybe an occasional ping to GPT-4.
I greatly prefer to use ChatGPT-4 instead of 3.5 despite the slowness. A really good feature for them to add would be an easy way to re-run a prompt on 4. However, the glitchiness of the service is kind of annoying.
OpenAI currently leads the front on (commercial) AI, so I doubt people will switch. In fact most offerings become outdated pretty fast and everyone else tries to play catch up.
Imagine using a GPT-2 type model when everyone else is using GPT-4. Until the dust settles there's no point in investing in alt models imo, unless you're leading the research.
I tested a bunch of models while building https://double.bot but ended up back on GPT-4. Other models are fun to play with, but it gets frustrating even if they miss 1 in 100 questions that GPT-4 gets. I find that right now I get more value implementing features around the model that fix all the GitHub Copilot papercuts (autocomplete that closes brackets properly, auto-import upon accepting suggestions, disabling suggestions when writing comments to be less distracting, midline completions, etc.).
Hopefully open-source models can catch up to GPT-4 in the next six months, once we've fixed all the low-hanging fruit outside of the model itself.
To add to this question, are there LLMs that I can run on my own data, that also can provide citations similar to the way phind.com does for their results? Even better if they are multilingual.
Mistral 7B was great for flights without wifi! Answers are pretty good for information you need to find, but its step-by-step instructions are hit or miss when it tries to do it for you.
Mixed results. I think Llama 2 in general is pretty bad, especially at anything other than English. I've had very good results with Mixtral for chat.
Of course all of them feel like a Frankenstein compared to actual ChatGPT. They feel similar and work just as well until, sometimes, they put out complete and utter garbage or artifacts and you wonder if they skimped on fine-tuning.
In a certain sense, all open models skimp on fine-tuning. Since ChatGPT gets user feedback, OpenAI theoretically sits on a growing pile of data that can be used for continuous fine-tuning and alignment. For local models, you'd have to look for new checkpoints every few months or track good and bad responses and do your own alignment.
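Tracking your own good and bad responses doesn't have to be fancy. A sketch of the idea: log preference pairs as JSONL for a later DPO-style alignment pass (file name and fields here are arbitrary, though prompt/chosen/rejected is a common shape for preference trainers).

    import json

    def log_feedback(prompt: str, chosen: str, rejected: str,
                     path: str = "preferences.jsonl"):
        # Append one preference pair per line for a later fine-tuning/alignment run.
        with open(path, "a") as f:
            f.write(json.dumps({"prompt": prompt,
                                "chosen": chosen,
                                "rejected": rejected}) + "\n")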
I think the llm utility[0] (the one from Simon, not Google) is probably the best quickstart experience you can find. Gives the option to connect to services via API or install/run local models.
As simple as:

    pip install llm
    # add the local plugin
    llm install llm-gpt4all
    # Download and run a prompt against the Orca Mini 3B model
    llm -m orca-mini-3b-gguf2-q4_0 'What is the capital of France?'
Alternatively, you could use llamafile[1], which is a tiny binary runner that gets packaged on top of the multi-gigabyte models. Download the llamafile and you can launch it through your terminal or a web browser.
From the llamafile page, after you download the file, you can just launch it as:

    ./mistral-7b-instruct-v0.2.Q5_K_M.llamafile -ngl 9999 --temp 0.7 -p '[INST]Write a story about llamas[/INST]'
Follow the guide all the way until you get to "Loading our model in Oobabooga". Then ignore the rest. You can do inference in Ooba under the Notebook tab.
(You can also ignore the "enabling HTTP API" parts, but it's quite handy, it's an OpenAI-compatible API which means you can use any OpenAI-compatible web UI)
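For example, you can point the standard openai client at the local server just by overriding base_url. A small sketch; the port depends on how you launched the API, so 5000 here is an assumption.

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:5000/v1",  # assumed local API port
                    api_key="not-needed")
    resp = client.chat.completions.create(
        model="local",  # model name is largely ignored by local servers
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)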
All of the alternatives support only a fraction of the features and LLM backends that oobabooga does.
If you don't care about the details of how those model servers work, then something that abstracts out the whole process like LM Studio or Ollama is all you need.
However, if you want to get into the weeds of how this actually works, I recommend you look up model quantization and some libraries like ggml[1] that actually do that for you.
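If you just want the intuition without reading ggml's source, block-wise quantization boils down to something like the sketch below. This is heavily simplified; real formats pack the integers tightly and store extra metadata per block.

    import numpy as np

    def quantize_block(weights: np.ndarray, bits: int = 4):
        # One scale per block maps floats onto a small signed-integer grid.
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(weights).max() / qmax
        q = np.round(weights / scale).astype(np.int8)
        return q, scale

    def dequantize_block(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    block = np.random.randn(32).astype(np.float32)  # ggml-style formats work on small blocks
    q, s = quantize_block(block)
    print(np.abs(block - dequantize_block(q, s)).max())  # worst-case quantization error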
It's all pretty well put together nowadays, honestly.
Here's a dead simple way: (1) download LM Studio and install it[0], (2) download a model from within the client when prompted, (3) have a ball.
The program is fairly intuitive; it takes care of finding the relevant files, and it can even accept addendum prompts and various ways to flavor or specialize answers.
Learn the basics there, then take what you learn to a more 'industrial' playground later on.