
Sorry but I hate stuff like this

We spend $30k+ per month with OpenAI and Anthropic

Even minor prompt changes between minor model versions can have huge differences in output

We make specific tweaks to all our dozens of production prompts based on the exact model that will be used

Treating LLMs as if they are interchangeable is simply bogus




Hate seems like quite a strong reaction.

My hunch is that quite soon, LLMs will be totally interchangeable, due to the intense competition and the fact that people are basically training on the same base data distribution.

In the tasks I'm using LLMs for, switching one for another makes even less difference than I had predicted.

However, I'm not spending $30k+ per month, so I guess my opinion may be less informed.

What is your use case? Could these micro-optimizations you need to do now be the result of the technology still being quite immature?

I'm working with digital twins of musicians/celebrities. But have also done more analytical stuff with LLMs.

My current side project involves working with the production company of a well-known German soap opera to help them write further episodes. The first thing we did was write a small evaluation system. Could be interesting to test with Unify.


We do content generation and need a high level of consistency and quality

Our prompts get very complex and we have around 3,000 lines of code that do nothing but build prompts based on users' options (using dozens of helpers in other files)
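To give a flavour, here's a made-up toy version of what that kind of option-driven prompt building looks like (not our actual code, and the option names are invented):

    # Made-up toy version of option-driven prompt building; the option
    # names and helper functions are invented for illustration.

    def tone_section(options):
        tone = options.get("tone", "neutral")
        return f"Write in a {tone} tone."

    def length_section(options):
        max_words = options.get("max_words", 500)
        return f"The output must be at most {max_words} words long."

    def audience_section(options):
        audience = options.get("audience")
        return f"The target audience is: {audience}." if audience else ""

    SECTION_BUILDERS = [tone_section, length_section, audience_section]

    def build_prompt(task, options):
        # Assemble the final prompt from the task plus one section per option.
        sections = [task]
        for builder in SECTION_BUILDERS:
            section = builder(options)
            if section:
                sections.append(section)
        return "\n\n".join(sections)

    print(build_prompt("Write a product description for a standing desk.",
                       {"tone": "friendly", "max_words": 150}))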

They aren't going to get more interchangeable, because of their subjective nature

Give five humans the same task and, even if that task is just a little bit complex, you'll get wildly different results. And the more complex it gets, the more different the results will become

It's the same with LLMs. Most of our prompt changes are more significant, but in one recent case it was as simple as changing the word "should" to "must" to get similar behavior between two different models.

One of them basically ignored things we said it "should" do and never performed the thing we wanted it to, whereas the other did it often, despite these being minor version differences of the same model


thank you!

prompt engineering is a thing, and it's not a thing that you get on social media posts with emojis or multiple images.

is it a public-facing product?


If you do test it out, feel free to ping me with any questions!


Thanks for weighing in. I'm sure for your setup right now, our router in its current form would not be useful for you. This is the very first version, and the scope is therefore relatively limited.

On our roadmap, we plan to support:

- an API which returns the neural scores directly, enabling model selection and model-specific prompts to all be handled on the client side

- automatic learning of intermediate prompts for agentic multi-step systems, taking a similar view to DSPy, where all intermediate LLM calls and prompts are treated as latent variables in an optimizable end-to-end agentic system.

With these additions, the subtleties of the model + prompt relationship would be better respected.
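To make the first roadmap item concrete, here's a rough sketch of how client-side selection could look on top of a scores endpoint (the URL, payload, and response fields below are placeholders, not our actual API):

    # Rough sketch of client-side model selection on top of a scores API.
    # The endpoint, payload, and response fields are placeholders, not the
    # actual Unify API.
    import requests

    SCORES_URL = "https://api.example.com/v1/scores"  # placeholder endpoint

    # Model-specific system prompts stay on the client side.
    MODEL_PROMPTS = {
        "gpt-4o": "You must follow the style guide exactly.",
        "claude-3-5-sonnet": "You should follow the style guide exactly.",
    }

    def pick_model(user_prompt):
        # Ask the scoring API for per-model quality predictions, then choose
        # the model (and its model-specific prompt) locally.
        resp = requests.post(SCORES_URL, json={"prompt": user_prompt}, timeout=10)
        resp.raise_for_status()
        scores = resp.json()["scores"]  # e.g. {"gpt-4o": 0.91, "claude-3-5-sonnet": 0.87}
        best = max(scores, key=scores.get)
        return best, MODEL_PROMPTS.get(best, "")

    model, system_prompt = pick_model("Summarize this 40-page contract.")
    print(model, system_prompt)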

I also believe that LLMs will become more robust to prompt subtleties over time. Also, some tasks are less sensitive to these minor subtleties you refer to.

For example, if you have a sales call agent, you might want to optimize for speed on easy dialogue prompts (so the person on the other end isn't left waiting), and spend longer thinking about harder prompts that require the full context of the call.

This is just an example, but my point is that not all LLM applications are the same. Some might be super sensitive to prompt subtleties, others might not be.

Thoughts?


I don't want/need any of that

It's already hard enough to get consistent behavior with a fixed model

If we need to save money we will switch to a cheaper model and adapt our prompts for that

If we are going more for quality we'll use a more expensive model and adapt our prompts for that

I fail to see any use case where I would want a third party choosing which model we are using at run time...

We are adding a new model this week and I've spent dozens of hours personally evaluating output and making tweaks to make it feasible

Making it sound like models are interchangeable is harmful


Makes sense; however, I would clarify that we don't need to make the final decision. If you're using the neural scoring function as an API, then you can just get predictions about how each model will likely perform on your prompt, and then use these predictions however you want (if at all). Likewise, the benchmarking platform [https://youtu.be/PO4r6ek8U6M] can be used just to assess different models on your prompts without doing any routing. Nonetheless, this perspective is very helpful.


Came here to ask a similar question: who is this targeted at? We see very different end-to-end behavior even with different quantization levels of the same model. The idea that we would route across providers on the fly is mind boggling.


One use case is optimizing agentic systems, where a custom router [https://youtu.be/9JYqNbIEac0] is trained end-to-end on the final task (rather than GPT-4-as-a-judge). Both the intermediate prompts and the models used can then be learned from data (similar to DSPy), whilst ensuring the final task performance remains high. This is not supported with v0, but it's on the roadmap. Thoughts?


We do agentic systems. We already optimize for these things. We route between different models based on various heuristics. I absolutely would not want that to be black box. And doing any sort of vector similarity to determine task complexity is not going to work well.
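For context, the kind of thing I mean is simple rule-based routing along these lines (simplified, with made-up task labels, thresholds, and model names, not our real heuristics):

    # Simplified, made-up example of rule-based model routing; the task
    # labels, thresholds, and model names are not our real heuristics.
    def route(task_type, prompt):
        if task_type == "extraction" and len(prompt) < 2000:
            return "gpt-4o-mini"        # cheap model for short, structured tasks
        if task_type == "planning":
            return "claude-3-5-sonnet"  # stronger model for multi-step reasoning
        if len(prompt) > 20000:
            return "gemini-1.5-pro"     # long-context model for big inputs
        return "gpt-4o"                 # sensible default

    print(route("extraction", "Pull the invoice number from this email: ..."))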

I would also not try to emulate DSPy, which is a massively overrated bit of kit and of little use in a production pipeline.


Curious, can you explain why you think DSPy is overrated?


Yeah even for a pure chatbot it doesn't make sense because in our experience users want to choose the exact model they are using...


Interesting, do you have any hunch as to why this is? We've seen in more verticalized apps where the underlying model is hidden from the user (sales call agent, autopilot tool, support agent etc.) that trying to reach high quality on hard prompts and high speed on the remaining prompts makes routing an appealing option.


We charge users different amounts of credits based on the model used. They also just generally have a personal preference for each model. Some people love Claude, some hate it, etc

For something like a support agent why couldn't the company just choose a model like GPT-4o and stick with one? Would they really trust some responses going to 3.5 (or similar)?


Currently the motivation is mainly speed. For the really easy ones like "hey, how's it going?" or "sorry I didn't hear you, can you repeat?" you can easily send to Llama3 etc. Ofc you could do some clever caching or something, but training a custom router directly on the task to optimize the resultant performance metric doesn't require any manual engineering.
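As a toy illustration, the manual version that a learned router would replace looks something like this (patterns and model names made up):

    # Toy illustration of the manual "easy turns go to a fast model" rules
    # that a learned router would replace; patterns and model names are
    # made up.
    import re

    EASY_PATTERNS = [
        r"^(hey|hi|hello)\b",
        r"can you (repeat|say that again)",
        r"^(thanks|thank you)\b",
    ]

    def route_turn(utterance):
        text = utterance.lower().strip()
        if len(text) < 40 and any(re.search(p, text) for p in EASY_PATTERNS):
            return "llama-3-8b"  # fast, cheap model for trivial turns
        return "gpt-4o"          # full model for turns needing the call context

    print(route_turn("hey, how's it going?"))                            # -> llama-3-8b
    print(route_turn("Can we change the delivery date on order 1182?"))  # -> gpt-4o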

Still, I agree that routing in isolation is not thaaat useful in many LLM domains. I think the usefulness will increase when applied to multi-step agentic systems, and when combined with other optimizations such as end-to-end learning of the intermediate prompts (DSPy etc.)

Thanks again for diving deep, super helpful!



