Show HN: Route your prompts to the best LLM

jefftk · 2024-05-23T02:13:21 1716430401

> Pricing-wise, we charge the same rates as the backend providers we route to, without taking any margins. We also give $50 in free credits to all new signups.

What's your plan for making money? Are you planning to eventually take a margin? Negotiate discounts with your backend providers? Mine the data flowing through your system?

cosmojg · 2024-05-23T04:51:45 1716439905

The data flowing through LLM routers is a hot commodity right now. OpenRouter, for example, even provides a flat-rate 1% discount across the board if you agree to let them use your API calls for model training, and rumor has it that they're already profitable. To be fair, they do seem to be collaborating with model providers on some level, so they are likely getting discounted access on top of selling data.

ec109685 · 2024-05-23T07:05:50 1716447950

It’s surprising how these app developers are okay with this much data being shown: https://openrouter.ai/models/mistralai/mixtral-8x7b-instruct...

kromem · 2024-05-23T03:07:06 1716433626

Man, this space would get so much more interesting so quickly if base model providers had a revenue share system in place for routed requests...

runako · 2024-05-23T03:24:37 1716434677

This would quickly erode confidence in the routers themselves...

loceng · 2024-05-23T03:46:40 1716436000

Or create a competitive environment between routers?

danlenton · 2024-05-23T07:29:44 1716449384

Another point here is that some users prefer to use their own API keys for the backend providers (a feature we're releasing soon). Any "discounts" would then be harder to implement. I do generally think it's much cleaner if we route based on the public facing price + performance, so our users don't need to lock into our own SSO if they'd prefer not to.

loceng · 2024-05-24T16:21:54 1716567714

I think the biggest risk with advanced AI is that it's captured [likely by bad actors under the guise of cover stories] and that it strays away from being as free market as is possible.

E.g. I don't think there should be any patents regarding what AI creates and how it can create it - let's not give people monopolies anymore for which all possibilities will come into existence due to passionate people, not due to the possibility of being able to patent something. E.g. telling the system to turn a 2D photo into a 3D rendering and then extrapolating/reverse engineering that to tie into materials and known building code requirements is plainly obvious, as one easy example; a "gold rush" for patents on AI etc only aims to benefit relatively rent-seekers and those in the VC industrial complex, etc.

danlenton · 2024-05-23T07:26:19 1716449179

that's a good point, impartiality would then be questioned

namdnay · 2024-05-23T10:40:56 1716460856

So they end up in the same situation as hotels or airlines, beholden to the middlemen? They’ll never allow that :)

danlenton · 2024-05-23T07:25:37 1716449137

I certainly wouldn't complain about this lol

danlenton · 2024-05-23T07:24:19 1716449059

The idea is that at some point in future, we release new and improved router configurations which do take small margins, but from the user perspective they're still paying less than using a single endpoint. We don't intend to inflate the price when users only use the single-sign-on benefits. Negotiating discounts with backend providers is another possibility, but right now we're just focused on providing value.

Aeolun · 2024-05-23T11:29:22 1716463762

Honestly, I’d feel a lot more secure about building on this if you did take (for example) a small fixed fee every month. Or a 10% commission on any requests (volume discounts on that commission for high-volume users?).

If I start using you now you’ll either disappear in the future or you’ll suddenly start charging more, neither of which I like.

I’m already paying for inference, a little amount on top of that for the convenience of a single API is pretty useful.

bitbuilder · 2024-05-23T16:24:58 1716481498

Agree heavily with this sentiment. It sounds like this could be a useful tool for a personal project of mine, but I wasn't nearly as interested after reading they're not attempting to make money yet. I'm a bit burnt out on that business model. Predictability is just as important as price when I'm deciding how to invest a large portion of my free time. I happily gave OpenRouter $20 for their service, and I've barely dented the credits with thousands of test runs over two months.

On that note, I think I'd be even more likely to pay for Unify.ai if I could opt to bypass the auto-routing and use it the same way I use OpenRouter - a single endpoint to route to any model I want. Sometimes I've already determined the best model for a task, and other times I want redundant models for the same task. It's possible Unify has this option, though I didn't see it while skimming the docs.

But really, all in all, this is a super cool project and I'm happy it was shared.

danlenton · 2024-05-23T12:20:07 1716466807

Makes sense, thanks a lot for the feedback. We're pretty confident that future versions of our router will provide sufficient value where we can take margins here, we therefore don't expect the need to start charging for Single-sign-on (SSO) alone. The SSO benefits are only the beginning in my mind, our main value will come from custom benchmarks across all models + providers and optimizing LLM applications, including agentic workfows. I do very much see your point though. Thankfully, we're very fortunate to have several years of runway, so we don't plan on disappearing anytime too soon!

roynasser · 2024-05-23T12:42:32 1716468152

A common model in some cost cutting software is to charge x% of the total savings... Win/win...just a suggestion... use picks "main LLM" and you calculate the "non optimized cost" based on that. Whatever savings you drive you take a share of the savings.

janekm · 2024-05-23T14:37:13 1716475033

It's tough in this case, because if you incentivise just to save cost, it could always route you to the cheapest LLM but the quality would suffer...

danlenton · 2024-05-23T14:43:31 1716475411

however, as janekm says, we can't charge just based on cost savings. We would need the router points to be sufficiently compelling wrt quality, speed and cost (including our own margins) that users still sometimes opt for these router points. Suffice it to say, if any router configs do start to take margins, then this will be clearly reflected in the overall router cost plotted on the scatter graph. UX will not be affected.

danlenton · 2024-05-23T14:41:01 1716475261

Yeah that's a great point, something we'll keep in mind as we work out the final business model. Thanks!

disagree123 · 2024-05-23T10:06:48 1716458808

[flagged]

danlenton · 2024-05-23T10:24:43 1716459883

lol well sure, that too

disagree123 · 2024-05-23T11:57:20 1716465440

but only as a secondary goal right? ;)

It's all about value for the world!

ianbicking · 2024-05-22T22:46:51 1716418011

I've bumped into a few of these. I use https://openrouter.ai as a model abstraction, but not as a router. https://withmartian.com does the same thing but with a more enterprise feel. Also https://www.braintrustdata.com/ though it's less clear how committed they are to that feature.

That said, while I've really enjoyed the LLM abstraction (making it easy for me to test different models without changing my code), I haven't felt any desire for a router. I _do_ have some prompts that I send to gpt-3.5-turbo, and could potentially use other models, but it's kind of niche.

In part this is because I try to do as much in a single prompt as I can, meaning I want to use a model that's able to handle the hardest parts of the prompt and then the easy parts come along with. As a result there's not many "easy" prompts. The easy prompts are usually text fixup and routing.

My "routing" prompts are at a different level of abstraction, usually routing some input or activity to one of several prompts (each of which has its own context, and the sum of all contexts across those prompts is too large, hence the routing). I don't know if there's some meaningful crossover between these two routing concepts.

Another issue I have with LLM portability is the use of tools/functions/structured output. Opus and Gemini Pro 1.5 have kind of implemented this OK, but until recently GPT was the only halfway decent implementation of this. This seems to be an "advanced" feature, yet it's also a feature I use even more with smaller prompts, as those small prompts are often inside some larger algorithm and I don't want the fuss of text parsing and exceptions from ad hoc output.

But in the end I'm not price sensitive in my work, so I always come back to the newest GPT model. If I make a switch to Opus it definitely won't be to save money! And I'm probably not going to want to fiddle, but instead make a thoughtful choice and switch the default model in my code.

ankrgyl · 2024-05-22T23:15:18 1716419718

Thanks for mentioning Braintrust!

We are very committed to the proxy :)

Although, to your point, we have seen less market pull for routing, and more for (a) supporting the latest LLMs, (b) basic translation (e.g. tool call API b/w Anthropic & OpenAI), and (c) solid infra features like caching/load balancing api keys/secret management. So that's our focus.

danlenton · 2024-05-23T00:06:54 1716422814

Super helpful feedback, thanks for going so deep! I agree that for the really heavy agentic stuff, the router in it's current form might not be the most important innovation.

However, for several use cases speed is really paramount, and can directly hinder the UX. Examples include sales call agents, copilots, auto-complete engines etc. These are some of the areas where we've seen the router really shine, diverting to slow models when absolutely necesary on complex prompts, but using fast models as often as possible to minimize disruption to the UX.

Having said that, another major benefit of the platform is the ability to quickly run objective benchmarks for quality, cost and speed across all models and providers, on your own prompts [https://youtu.be/PO4r6ek8U6M]. We have some users who run benchmarks regularly for different checkpoints of their fine-tuned model, comparing against all other custom fine-tuned models, as well as the various foundation models.

As for the overlap in routing concepts you mentioned, I've thought a lot about this actually. It's our intention to broaden the kinds of routing we're able to handle, where we assume all control flow decision (routing) and intermediate prompts are latent variables (DSPy perspective). In the immediate future there is not crossover though.

I agree cost is often an afterthought. Generally our users either care about improving speed, or they want to know which model or combination of models would be best for their task in terms of output quality (GPT-4, Opus, Gemini? etc.). This is not trivial to guage without performing benchmarks.

As for usually wanting to make a full LLM switch as opposed to routing, what's the primary motivation? Avoiding extra complexity + dependencies in the stack? Perhaps worrying about model-specific prompts no longer working well with a new model? The general loss of control?

lovesdogsnsnow · 2024-05-22T15:51:36 1716393096

This is interesting! Sort of a super mixture of experts model. What's the latency penalty paid with your router in the middle?

The pattern I often see is companies prototyping on the most expensive models, then testing smaller/faster/cheaper models to determine what is actually required for production. For which contexts and products do you foresee your approach being superior?

Given you're just passing along inference costs from backend providers and aren't taking margin, what's your long-term plan for profitability?

danlenton · 2024-05-22T21:18:21 1716412701

Great question! Generally the neural network used for the router takes maybe ~20ms during inference. When deployed on prem, in your own cloud environment, then this is the only latecy. When using the public endpoints with our own intermediate server, it might add ~150ms to the time-to-first-token, but inter-token-latency is not affected.

We generally see the router being useful when the LLM application is being scaled, and cost and speed start to matter a lot. However, in some cases the output quality actually improved, as we're able to squeeze the best of GPT4 and Claude etc.

Long-term plan for profitability would come from some future version of the router, where we save the user time and money, and then charge some overhead for the router, but with the user still paying less than they would be with a single endpoint. Hopefully that makes sense?

Happy to answer any other questions!

jonahx · 2024-05-23T05:52:51 1716443571

Do you save the user data, ie, the searches themselves? What do your TOS guarantee about the use of that data?

danlenton · 2024-05-23T08:43:21 1716453801

We use this data to improve the base router by default. It's fully anonymized, and you can opt out.

ColinHayhurst · 2024-05-23T15:18:11 1716477491

Without opt out it would be a no go, so that's great to hear. What's the downside of opting out?

danlenton · 2024-05-23T17:02:24 1716483744

no down side

nl · 2024-05-23T06:01:07 1716444067

If I was doing this I'd negotiate a volume discount, charge the clients the base rate and pocket the difference.

danlenton · 2024-05-23T07:42:28 1716450148

definitely on the cards, we're keeping our options open here. Right now just focused on creating value though.

swiftlyTyped · 2024-05-22T22:18:39 1716416319

Really great stuff.

People use the same model / server for all queries not because it's sensible, but because it's simple. This brings the same simplicity to the far more optimal solution.

And great startup play too, by definition no incumbent can fill this role.

danlenton · 2024-05-22T22:26:28 1716416788

Thanks - glad to hear the idea resonates!

saberience · 2024-05-23T10:26:39 1716459999

How can you know which LLM is better for every specific question?

That is, how do you quantify "better"?

Imagine I asked: Write me a sad poem about "Yorick" in Hamlet?

You have to predict which LLM is going to be produce the "best" poem, but by what criterion do you evaluate that?

danlenton · 2024-05-23T10:34:20 1716460460

Currently, we simply use GPT4-as-a-judge, with a general system prompt we've written which is task agnostic. This is then used to train the neural scoring function, which predicts quality ahead-of-time. However, it's on our roadmap to add make the judging more flexible, potentially task-specific judge prompts and in-context examples, also perhaps using a jury [https://arxiv.org/pdf/2404.18796].

ceroxylon · 2024-05-23T03:03:48 1716433428

For those who like testing smaller local models, there is a tool in LangChain called Routing and in LlamaIndex called a "router query engine" which achieve a similar thing:

https://python.langchain.com/v0.1/docs/use_cases/query_analy...

https://docs.llamaindex.ai/en/stable/examples/query_engine/R...

danlenton · 2024-05-23T07:37:18 1716449838

Thanks for sharing! These are useful toos, but they are a bit different, more based on similarity search in prompt space (a bit like semantic router: https://github.com/aurelio-labs/semantic-router). Our router uses a neural network for the routing decisions, and it can be trained on your own prompts [https://youtu.be/9JYqNbIEac0]. We're also adding support for on-prem deployment soon!

BrutalCoding · 2024-05-23T12:52:59 1716468779

Heya I found a minor typo here: https://unify.ai/docs/demos/demos/LangChain/RAG_playground/R... under "Usage" step 2.

> "Input your Unify APhttps://github.com/Anteemony/RAG"

Your product looks good in my view, although I have only spend about 10min thus far. The docs look pretty easy to follow along.

I'll probably give this a try soon!

danlenton · 2024-05-23T16:54:31 1716483271

Thanks for sharing, will get this fixed now!

Oras · 2024-05-23T10:32:11 1716460331

I've been using a prompt management service (cloud) for 2 months and am pretty happy that I can define prompts and check quality outside the code, and it helps me to do the routing manually.

Having benchmarks (I assume this is ongoing?) would make it even more interesting, as I wouldn't have to manually manage the routing each time I add a new model.

You mentioned you don't have a margin right now, but how about keeping logs, the dashboard for cost, and benchmarking?

danlenton · 2024-05-23T10:40:30 1716460830

Yes the benchmarks are ongoing, we continually plot the speed and cost across time in our runtime benchmarks [https://unify.ai/benchmarks], and we use this live data when plotted the quality scatter graphs [https://console.unify.ai/dashboard]. The router configurations are "self-improving" in the sense that any given router config will quickly wrap the latest models and providers under the hood. Using a router config is a way of riding the wave of models and providers, whilst just specifying your priorities for quality, speed and cost. We will have some case studies which better explain this soon!

mzl · 2024-05-23T07:59:24 1716451164

Sounds similar to the Composition of Experts from SambaNova (although that is for increased accuracy on single systems mostly, not for decreased cost from various suppliers): https://sambanova.ai/blog/samba-coe-the-power-of-routing-ml-...

danlenton · 2024-05-23T08:15:45 1716452145

Yep it's the same idea. We also see improvements in quality on some benchmarks beyond any specific model. This is common especially common when training a custom router on your own prompt dataset: https://youtu.be/9JYqNbIEac0

Animats · 2024-05-23T10:14:04 1716459244

That makes a lot of sense for anything that's public-facing and free.

If the user is having a dialog with the system, they start out with a trivial question, and are routed to a simple LLM, are they stuck there? Or can the conversation be transitioned to something more powerful?

Something like this will be needed to direct traffic as systems specialize more.

danlenton · 2024-05-23T10:21:51 1716459711

Great question! Nope they're not stuck with the simple LLM after the first question. Chatbots simply aggregate all prior conversation history in the input prompt, and make a new LLM call. The router treats each LLM call seperately, and so a single conversation can occur across many models and providers.

joaquincabezas · 2024-05-22T21:59:51 1716415191

I’ve already heard a few times that the selection of models is seen as a problem (need to benchmark continually, justify changes…), this is an elegant solution.

I don’t know if choosing different models for the same consumer can be problematic (seen as not consistent), but maybe using this approach will force the post-processing code not to be “coupled” with one particular model.

danlenton · 2024-05-22T23:35:54 1716420954

This is a great point. With models becoming more intelligent, they're seeming to become less brittle to the subtleties in the prompts, which might mean decoupling will occur naturally anyway. With regards to customers wanting to stick with "one model", we more often find that they want to stick with "one cloud environment". This is not a problem, the router can be deployed inside custom cloud environments easily. Interesting to see how much auto-prompting such as DSPy generalizes across models though, when you really start to see intermediate prompts as latent variables. Not sure how much these learned prompts might behave differently with different LLMs. I guess you might need to jointly learn the prompts and the routing decisions together for it to work really well. One step at a time though!

aleksiy123 · 2024-05-23T02:59:58 1716433198

I wonder, have you also done any work on actually improving performance by enhancing the prompt or something similar?

I always thought a product like this that could empirically decrease costs for the same performance or increase performance for a small increase in cost would have a fairly simple road of justifying its existence.

danlenton · 2024-05-23T07:33:28 1716449608

As outlined, DSPy is a create tool for this. Currently their focus is on optimizing in-context examples, but their broader vision includes optimising system prompts for intermediate LLM nodes in an agentic system as well. We uploaded an explainder video about DSPy last week! Feel free to check it out: https://youtu.be/kFB8kFchCH4?si=0YKAVcpKJjogP1sX

ramoz · 2024-05-23T03:12:16 1716433936

I think dspy aims to do this: https://github.com/stanfordnlp/dspy

aleksiy123 · 2024-05-23T03:14:14 1716434054

Cool thanks for the link very interesting.

treprinum · 2024-05-23T09:42:37 1716457357

Does the router approach make sense with multi-tasking LLMs? One could execute multiple different tasks with a single prompt (chat response, NER, translation etc.) and with the latest models even do images or video alongside text. Doesn't a router get in the way, unnecessarily increasing latency?

danlenton · 2024-05-23T10:27:06 1716460026

So the neural scoring introduces ~20ms latency, but this only impacts time-to-first-token (not inter-token-latency). When using our public endpoints there is an additional ~150ms latency, but you can deploy the router on-prem in your own cloud, so then it would only be the inference latency. Generally the improvements in ITL outweigh the small addition to TTFT.

aravindputrevu · 2024-05-24T12:29:00 1716553740

Great to know this!

I have come across Portkey's Open-source AI Gateway which kind of does the same.

https://github.com/portkey-ai/gateway

It looks like with more LLM adoption, resiliency and cost related aspects take off sooner than expected unlike other technological trends in the past.

I'm also thinking that there is a chance if something like this could help build a better RAG pipeline or evals for the GenAI App. Because end of the day you want to reduce hallucinations but still get good generative responses.

iknownthing · 2024-05-22T22:51:16 1716418276

> It is trained in a supervised manner on several open LLM datasets, using GPT4 as a judge.

Does this mean GPT4 predictions are used as labels? Is that allowed?

danlenton · 2024-05-22T22:58:01 1716418681

They're not used as labels directly (we're not trainig an LLM which outputs text), they are used as an intermediate step, which is then used to compute a simple score which the neural score function is then trained on. The neural score function takes prompts and latent model representations in, and produces a score from 0-1

iknownthing · 2024-05-22T23:56:44 1716422204

TOS says you cannot "Use Output Data to develop models that compete with OpenAI." You're probably fine but it's close.

danlenton · 2024-05-23T00:09:58 1716422998

To be honest we're thinking of moving away from this soon anyway. Open source models will soon make for perfectly good judges (or juries), with Llama3 etc.: https://arxiv.org/pdf/2404.18796

andrewstetsenko · 2024-05-22T22:04:51 1716415491

How easy is it to integrate our own data for training the router, and what kind of improvements can we expect from this customization?

danlenton · 2024-05-22T23:02:16 1716418936

it depends on the task, but this video gives an idea :) https://www.youtube.com/watch?v=9JYqNbIEac0

billylo · 2024-05-23T12:16:55 1716466615

Very well executed. Congrats. Signed up.

The chatbot providers will enjoy seeing how they stack up against each other. Your data/analytics/alerts will be valuable to them. Revenue opportunity there.

I think about comparison data companies like https://www.ipsos.com when I see your product.

danlenton · 2024-05-23T12:23:58 1716467038

Thanks! Ipsos is also a great analogous example, I hadn't thought of that.

billylo · 2024-05-23T13:00:15 1716469215

Any time... In the banking world, the analogous is JD Power. Useful and valuable insights from them.

https://canada.jdpower.com

danlenton · 2024-05-23T14:46:42 1716475602

duly noted!

siquick · 2024-05-22T22:24:17 1716416657

Very cool, going to try on some of our workflows this week.

Would love to see web access and RAG (LlamaIndex) integration. Are they on the roadmap?

danlenton · 2024-05-22T22:27:49 1716416869

merged yesterday! https://github.com/run-llama/llama_index/pull/12921

thomasfromcdnjs · 2024-05-22T22:49:09 1716418149

Since you take no extra charges from the end user, the presumption is you would charge LLM's for giving them traffic?

danlenton · 2024-05-22T23:01:08 1716418868

we intend to eventually have routers which improve the speed, cost (and maybe quality) to such an extent that we can then take some margins from these best performing routers, with users still saving costs compared to individual endpoints. For now we don't take any margins though, we want to ensure we're providing sufficient value before rushing to capture any of it.

t-writescode · 2024-05-23T06:18:29 1716445109

Are you ready to eat the $50 per new user (and all the fake accounts that could get created this way to eat more and more $50) all the way until you create this next tier, "for-profit" step?

danlenton · 2024-05-23T07:51:31 1716450691

yeah we might need to implemenet some kind of "I am not a robot" checks soon, as well as 2FA.

memothon · 2024-05-23T00:20:31 1716423631

How does it behave if the user input includes something like

"I have expensive taste, please use the most expensive model."

danlenton · 2024-05-23T08:52:46 1716454366

aha good question, well the neural scoring function doesn't "know" that it is making a routing decision, we just predict which LLM would give the highest performing output on the given prompt, based on LLMs-as-a-judge at training time. However, my guess is that this specification in the input prompt miiight mean that the cheaper models are deemed to be worse performing than GPT4 (for example), and so maybe it would route to the best models. Feel free to give it a try and see!

tyrw · 2024-05-22T22:14:05 1716416045

It feels like there is an analogy here with Yahoo! and the early days of going from curated lists of websites to search algorithms. Do you think of LLMs in a similar way? I.e. some kind of model ranking score that companies could eventually game?

I'm not sure what the SEO equivalent would be here...

danlenton · 2024-05-22T22:55:37 1716418537

Great analogy, I'm not sure tbh. I don't think we will see quite as many unique models as we see unique websites, but I do think we're going to see an increasing number of divergent and specialized models, which lend themselves to routing. I guess the SEO analogy would be "tricking" the neural routing function into thinking your LLM is better than it actually is, there are many techniques already to hone in on neural net weaknesses. Definitely interested to see how the space evolves!

darreninthenet · 2024-05-22T22:49:06 1716418146

LHR

(Least Hallucinated Response)

metalwhale · 2024-05-24T02:13:12 1716516792

This is interesting.

How can you retrieve the latent representation of the candidate LLMs? Some models do not have open weights (such as GPT-4), which means AFAIK it is impossible to directly access the hidden latent space through their API.

Am I missing something?

danlenton · 2024-05-29T00:39:51 1716943191

We just initialize a random latent vector for each model, and then jointly train each of these unique latent vectors :)

potatoman22 · 2024-05-23T02:48:33 1716432513

This is super cool! I wonder if you could do a similar thing, but choosing between a collection of prompts for a task based on the input. Similar to dynamic few-shot prompting, but replacing the entire prompt instead of just the examples.

danlenton · 2024-05-23T08:02:39 1716451359

I agree this is an interesting direction, I think this is on the roadmap for DSPy [https://github.com/stanfordnlp/dspy], but right now they mainly focus on optimizing the in-context examples.

_pdp_ · 2024-05-23T06:54:57 1716447297

I think everyone in this space soon or later invest a router. At CBK we did too. I am not sure if this is a big problem to solve but I am thinking we need to open finish our Open Relay soon.

danlenton · 2024-05-23T07:31:31 1716449491

Agreed, we've spoken to tons of users who reach out to us and start the conversation with "we've tried to implement this ourselves".

t_mann · 2024-05-23T08:55:46 1716454546

Sounds interesting, can you explain the conceptual differences between your approach and mixture-of-experts (for someone with minimal understanding of MoE)?

danlenton · 2024-05-23T10:30:59 1716460259

Sure! Basically traditional MoE has several linear layers, and the network learns to route down those paths, based on the training loss (similar to how CNNs learn through max-pooling, which is also non-differentiable). However, MoEs have been shown to specialiaze on tokens, not high-level semantics. This was eloquently explained by Fuzhao Xue, author of OpenMoE, in one of our reading groups: https://www.youtube.com/watch?v=k3QOpJA0A0Q&t=1547s

In contrast, our router sits at a higher level of the stack, sending prompts to different models and providers based on quality on the prompt distribution, speed and cost. Happy to clarify further if helpful!

xena · 2024-05-22T22:34:31 1716417271

Your notification for the launch woke me up several times last night because you had the notification change hourly for 4 hours.

danlenton · 2024-05-22T23:05:48 1716419148

Very sorry about that! I had an issue with my google calendar, I set it to "do not send emails" but for some reason some still came through. Fixed + removed now.

3abiton · 2024-05-22T21:55:13 1716414913

Isn't this what a MoE LLM does already?

sodality2 · 2024-05-22T22:10:25 1716415825

MoE-based models are one model with multiple experts. This solution could use entirely different models with different architectures (and probably supports MoE models itself)

danlenton · 2024-05-22T23:43:22 1716421402

exactly!

danlenton · 2024-05-22T23:09:34 1716419374

MoE LLMs use several "expert" fully connected layers, which are routed to during the forward pass, all trained end-to-end. This approach can also work with black-box LLMs like Opus, GPT4 etc. It's a similar concept but operating at a higher level of abstraction.

santiagobasulto · 2024-05-23T10:10:18 1716459018

Hey Daniel, this is pretty interesting. We're heavy users of LLMs, can I email you some feedback and ideas?

danlenton · 2024-05-23T10:20:19 1716459619

absolutely! You're also welome to schedule a call: https://calendly.com/unify-chat/general

Best email would be: daniel.lenton@unify.ai

Cheers!

ModernMech · 2024-05-23T14:45:49 1716475549

My new test is the goat and boat test. It seems to trip all LLMs up by being too close to a classic logic puzzle, but off so much that the question is actually straightforward.

The question: "You are on the shore of a river with a goat and a boat. How do you get to the other side?"

I expect it to tell me "Use the boat to get to the other side"

Instead it over fits to the solution of the logic puzzle and throws actual logic out the window.

GPT-4-Turbo

  To get to the other side of the river with a goat and a boat, you can follow these steps, assuming the boat can only carry you and one other item (the goat or something else if you had more items) at a time:

  Take the goat across the river: Start by taking the goat in the boat with you and row across the river. Leave the goat on the other side.

  Return alone: Row back across the river alone, leaving the goat on the far side.

  Cross again if needed: If you had more items to bring across (like a wolf and cabbage in the classic puzzle), you'd now take another item across, but since it's just you and the goat, you're done. You and the goat are both on the other side of the river.

  Since the scenario only mentions a goat and a boat, and no other items or animals that could complicate the process (like needing to keep the goat away from a cabbage), the task is straightforward: just take the goat across the river with you in the boat.

llama-3b (this one is hilarious actually)

  What a clever question!

  Since I have a goat and a boat, I can use the goat to help me cross the river. Here's what I'd do:

  First, I'd tie the goat to the boat using a rope or a long piece of twine.
  Then, I'd push the boat into the river and let it drift away from the shore.
  As the boat floats away, the goat would naturally try to follow the boat, since it's tied to it.
  Once the goat is in the middle of the river, I'd swim out to the boat and untie the goat.
  With the goat now free, I'd climb back into the boat and row or paddle it to the other side of the river.
  Once I reach the other side, I'd tie the goat to a tree or a rock to keep it safe and secure.
  Voilà! I've successfully crossed the river with the help of my trusty goat and boat!

Claude-3-Haiku

  To get the goat, yourself, and the boat to the other side of the river, you can follow these steps:

  Take the goat across the river in the boat.
  Leave the goat on the other side and come back in the empty boat.
  Take yourself across the river in the boat.
  Now that you're on the other side, you can bring the boat back to the original side and get the goat.
  This way, you can transport the goat to the other side of the river without leaving it alone with the boat, which could lead to the goat getting lost or the boat drifting away.
  The key is to make multiple trips, transporting one item at a time, to ensure that all the necessary elements (you, the goat, and the boat) end up on the other side of the river safely.

Mistral-7b-Instruct-V0.2 (Mistral-Ai)

  You leave the goat on the shore, go across the river in the boat, and then return to get the goat. Since you can't leave the goat unattended and the boat is too small to carry both the goat and yourself at the same time, this is the only possible solution.

frontman1988 · 2024-05-23T14:57:38 1716476258

ChatGPT-4 gives the answer pretty easily:

To cross the river with the goat using the boat, you would follow these steps:

1. *Get into the boat with the goat.* 2. *Row to the other side of the river.* 3. *Disembark with the goat.*

Now, both you and the goat are on the other side of the river.

ModernMech · 2024-05-23T15:56:11 1716479771

But I didn't ask it how to get the goat to the other side; the goat is a red herring.

Havoc · 2024-05-23T16:34:56 1716482096

It’s LLMs all the way down

dheera · 2024-05-23T03:44:41 1716435881

> dynamic router for LLMs

This is sort of how Mixture-of-Experts models work, actually.

danlenton · 2024-05-23T07:56:40 1716451000

Yep! Although MoE use several "expert" linear layers within a single network, and generally the "routing" is not based on high-level semantics, but token specialization, as disuccussed by Fuzhao Xue, author of OpenMoE, in one of our reading groups: https://www.youtube.com/watch?v=k3QOpJA0A0Q&t=1547s

swyx · 2024-05-23T04:40:33 1716439233

yeah this is very nice, open source Martian. i never really understood the value of routing all the time. you want stability and predictability in models. and models have huge brand value. you're never going to, through routing, construct a "super" model that people want more than one or a few really good brand name models.

danlenton · 2024-05-23T08:47:01 1716454021

Yeah I agree, routing all the time is overkill for some applications, however for others it really is necessary in order to reach the necessary speed and performance for users of your real-time app.

For those who don't want to always route, another core benefit of our platform is simple custom benchmarking on your task across all existing providers: https://youtu.be/PO4r6ek8U6M

If you then just want to use that provider rather than a router config, then that's fine!

Sontho · 2024-05-23T04:18:37 1716437917

It's a good options to test local models I guess.

danlenton · 2024-05-23T08:54:07 1716454447

actually we don't support local deployment quite yet, it's all run in our server. However, fully local deployment is on the roadmap, stay tuned!

weird-eye-issue · 2024-05-23T13:39:18 1716471558

Sorry but I hate stuff like this

We spend $30k+ per month with OpenAI and Anthropic

Even minor prompt changes between minor model versions can have huge differences in output

We make specific tweaks to all our dozens of production prompts based on the exact model that will be used

Treating LLMs as if they are interchangeable is simply bogus

thomashop · 2024-05-23T14:08:16 1716473296

Hate seems like quite a strong reaction.

My hunch is that quite soon, LLMs will be totally interchangeable. Due to the intense competition and the fact that people are basically training on the same base data distribution.

In the tasks I'm using LLMs switching one for another makes less difference than I had predicted even.

However, I'm not spending $30k+ per month, so I guess my opinion may be less informed.

What is your use case? Could these micro-optimizations you need to do now be the result of the technology still being quite immature?

I'm working with digital twins of musicians/celebrities. But have also done more analytical stuff with LLMs.

My current side project involves working with the production company of a well-known German soap opera to help them write further episodes. The first thing we did was write a small evaluation system. Could be interesting to test with Unify.

weird-eye-issue · 2024-05-23T14:26:26 1716474386

We do content generation and need a high level of consistency and quality

Our prompts get very complex and we have around 3,000 lines of code that does nothing but build prompts based on user's options (using dozens of helpers in other files)

They aren't going to get more interchangeable because of the subjective nature of them

Give five humans the same task and even if that task is just a little bit complex you'll get wildly different results. And the more complex it gets the more different the results will become

It's the same with LLMs. Most of our prompt changes are more significant but in one recent case it was a simple as changing the word "should" to "must" to get similar behavior between two different models.

One of them basically ignored things we said it "should" do and never performed the thing we wanted it to whereas the other did it often, despite these being minor version differences of the same model

dilek · 2024-05-23T14:37:35 1716475055

thank you!

prompt engineering is a thing, and it's not a thing that you get on social media posts with emojis or multiple images.

is it a public-facing product?

danlenton · 2024-05-23T16:50:04 1716483004

If you do test it out, feel free to ping me with any questions!

danlenton · 2024-05-23T15:00:48 1716476448

Thanks for weighing in. I'm sure for your setup right now, our router in it's current form would not be useful for you. This is the very first version, and the scope is therefore relatively limited.

On our roadmap, we plan to support:

- an API which returns the neural scores directly, enabling model selection and model-specific prompts to all be handled on the client side

- automatic learning of intermediate prompts for agentic multi-step systems, taking a similar view as DSPy, where all intermediate LLM calls and prompts are treated as latent variables in an optimizable end-to-end agentic system.

With these additions, the subtleties of the model + prompt relationship would be better respected.

I also believe that LLMs will become more robust to prompt subtleties over time. Also, some tasks are less sensitive to these minor subtleties you refer to.

For example, if you have a sales call agent, you might want to optimize UX for easy dialgoue prompts (so the person on the other end isn't left waiting), and take longer thinking about harder prompts requiring the full context of the call.

This is just an example, but my point is that not all LLM applications are the same. Some might be super sensitive to prompt subtleties, others might not be.

Thoughts?

weird-eye-issue · 2024-05-23T16:02:11 1716480131

I don't want/need any of that

It's already hard enough to get consistent behavior with a fixed model

If we need to save money we will switch to a cheaper model and adapt our prompts for that

If we are going more for quality we'll use and more expensive model and adapt our prompts for that

I fail to see any use case where I would want a third party choosing which model we are using at run time...

We are adding a new model this week and I've spent dozens of hours personally evaluating output and making tweaks to make it feasible

Making it sound like models are interchangeable is harmful

danlenton · 2024-05-23T16:48:38 1716482918

Makes sense, however I would clarify that we don't need to make the final decision. If you're using the neural scoring function as an API, then you can just get predictions about how each model will likely perform on your prompt, and then use these predictions however you want (if at all). Likewise, the benchmarking platform [https://youtu.be/PO4r6ek8U6M] can be used to just assess different models on your prompts without needing to do any routing. Nonetheless, this perspective is very helpful.

qeternity · 2024-05-23T13:45:58 1716471958

Came here to ask a similar question: who is this targeted to? We see very different end to end behavior with even different quantization levels of the same model. The idea that we would on the fly route across providers is mind boggling.

danlenton · 2024-05-23T16:59:11 1716483551

One use case is optimizing agentic systems, where a custom router [https://youtu.be/9JYqNbIEac0] is trained end-to-end on the final task (rather than GPT4-as-a-judge). Both the intermediate prompts and the models used can then be learned from data (similar to DSPy), whilst ensuring the final task performance remains high. This is not supported with v0, but it's on the roadmap. Thoughts?

qeternity · 2024-05-24T19:19:29 1716578369

We do agentic systems. We already optimize for these things. We route between different models based on various heuristics. I absolutely would not want that to be black box. And doing any sort of vector similarity to determine task complexity is not going to work well.

I would also not try to emulate DSPy, which is a massively overrated bit of kit and of little use in a production pipeline.

tarasglek · 2024-05-28T12:40:28 1716900028

Curious, can you explain why you think DSPy overrated?

weird-eye-issue · 2024-05-23T14:31:34 1716474694

Yeah even for a pure chatbot it doesn't make sense because in our experience users want to choose the exact model they are using...

danlenton · 2024-05-23T17:09:50 1716484190

Interesting, do you have any hunch as to why this is? We've seen in more verticalized apps where the underlying model is hidden from the user (sales call agent, autopilot tool, support agent etc.) that trying to reach high quality on hard prompts and high speed on the remaining prompts makes routing an appealing option.

weird-eye-issue · 2024-05-24T08:25:09 1716539109

We charge users different amounts of credits based on the model used. They also just generally have a personal preference for each model. Some people love Claude, some hate it, etc

For something like a support agent why couldn't the company just choose a model like GPT-4o and stick with one? Would they really trust some responses going to 3.5 (or similar)?

danlenton · 2024-05-24T10:32:28 1716546748

Currently the motivation is mainly speed. For the really easy ones like "hey, how's it going?" or "sorry I didn't hear you, can you repeat?" you can easily send to Llama3 etc. Ofc you could do some clever caching or something, but training a custom router directly on the task to optimize the resultant performance metric doesn't require any manual engineering.

Still, I agree that routing in isolation is not thaaat useful in many LLM domains. I think the usefulness will increase when applying to multi-step agentic systems, and when combining with other optimizations such as end-to-end learning of the intermediate prompts (DSPy etc.)

Thanks again for diving deep, super helpful!

cchance · 2024-05-23T03:33:42 1716435222

Surprised Gemini Flash isn't included

danlenton · 2024-05-23T07:58:50 1716451130

will be soon!

starwin1159 · 2024-05-23T11:02:18 1716462138

Will Gemini Pro be included?

danlenton · 2024-05-23T11:20:34 1716463234

It's on the roadmap! Hopefully will be added next week

lhousa · 2024-05-23T06:15:11 1716444911

It's always something that "unifies" stuff. Be it cabs or food or services, etc. There's potential for this.

danlenton · 2024-05-23T07:58:17 1716451097

aha maybe we should change our slogan to "Uber for LLMs"

Alifatisk · 2024-05-22T23:26:28 1716420388

Is this like openrouter ai?

danlenton · 2024-05-22T23:41:34 1716421294

definitely similar! I'm a fan of Alex and his work on OpenRouter :)

Some of the main differences would be: - we focus on performance based routing, optimizing speed, cost and quality [https://youtu.be/ZpY6SIkBosE] - we enable custom benchmarks on custom prompts, across all models + providers [https://youtu.be/PO4r6ek8U6M] - we enable custom routers to be trained on custom data [https://youtu.be/9JYqNbIEac0]

Our users often already have LLM apps deployed, and are then looking to take better control of their performance profile, perhaps increasing speed to improve user experience, or improving response quality via clear benchmarking across all models and providers on their particular prompts.

So they are similar, but solving slightly different problems I'd say.