Show HN: Route your prompts to the best LLM (unify.ai)
298 points by danlenton 6 months ago | 126 comments
Hey HN, we've just finished building a dynamic router for LLMs, which takes each prompt and sends it to the most appropriate model and provider. We'd love to know what you think!

Here is a quick(ish) screen-recording explaining how it works: https://youtu.be/ZpY6SIkBosE

Best results when training a custom router on your own prompt data: https://youtu.be/9JYqNbIEac0

The router balances user preferences for quality, speed and cost. The end result is higher quality and faster LLM responses at lower cost.

The quality for each candidate LLM is predicted ahead of time using a neural scoring function, which is a BERT-like architecture conditioned on the prompt and a latent representation of the LLM being scored. The different LLMs are queried across the batch dimension, with the neural scoring architecture taking a single latent representation of the LLM as input per forward pass. This makes the scoring function very modular to query for different LLM combinations. It is trained in a supervised manner on several open LLM datasets, using GPT4 as a judge. The cost and speed data is taken from our live benchmarks, updated every few hours across all continents. The final "loss function" is a linear combination of quality, cost, inter-token-latency and time-to-first-token, with the user effectively scaling the weighting factors of this linear combination.
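For intuition, here's a minimal sketch of what that final selection step could look like, assuming quality has already been predicted by the scoring function and the cost/latency numbers come from the live benchmarks. All names, numbers and weights below are purely illustrative, not our actual data or code:

  # Purely illustrative: pick the model maximizing a linear combination of
  # predicted quality minus weighted cost and latency penalties.
  candidates = {
      # name:          (quality, cost $/1M tok, ttft s, itl s/tok)  <- made-up numbers
      "gpt-4":          (0.93, 30.0, 0.80, 0.030),
      "llama-3-8b":     (0.71,  0.2, 0.25, 0.008),
      "mixtral-8x7b":   (0.78,  0.6, 0.35, 0.012),
  }

  def normalizer(values):
      lo, hi = min(values), max(values)
      return lambda v: (v - lo) / ((hi - lo) or 1.0)

  def route(w_quality=1.0, w_cost=0.5, w_ttft=0.2, w_itl=0.2):
      norms = [normalizer([c[i] for c in candidates.values()]) for i in range(4)]
      def objective(c):
          q, cost, ttft, itl = (n(x) for n, x in zip(norms, c))
          return w_quality * q - w_cost * cost - w_ttft * ttft - w_itl * itl
      return max(candidates, key=lambda name: objective(candidates[name]))

  print(route())                           # quality-weighted default
  print(route(w_quality=0.3, w_cost=2.0))  # cost-sensitive preference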

Smaller LLMs are often good enough for simple prompts, but knowing exactly how and when they might break is difficult. Simple perturbations of the phrasing can cause smaller LLMs to fail catastrophically, making them hard to rely on. For example, Gemma-7B converts numbers to strings and returns the "largest" string when asked for the "largest" number in a set, but works fine when asked for the "highest" or "maximum".
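As a quick illustration of that failure mode (plain Python, nothing to do with Gemma's internals), comparing numbers as strings silently falls back to lexicographic ordering:

  # Illustrative only: why treating numbers as strings breaks "largest"
  nums = [7, 23, 101]
  print(max(nums))                    # 101 -> numeric comparison
  print(max(str(n) for n in nums))    # '7' -> lexicographic comparison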

The router is able to learn these quirky distributions, and ensure that the smaller, cheaper and faster LLMs are only used when there is high confidence that they will get the answer correct.

Pricing-wise, we charge the same rates as the backend providers we route to, without taking any margins. We also give $50 in free credits to all new signups.

The router can be used off-the-shelf, or it can be trained directly on your own data for improved performance.

What do people think? Could this be useful?

Feedback of all kinds is welcome!




> Pricing-wise, we charge the same rates as the backend providers we route to, without taking any margins. We also give $50 in free credits to all new signups.

What's your plan for making money? Are you planning to eventually take a margin? Negotiate discounts with your backend providers? Mine the data flowing through your system?


The data flowing through LLM routers is a hot commodity right now. OpenRouter, for example, even provides a flat-rate 1% discount across the board if you agree to let them use your API calls for model training, and rumor has it that they're already profitable. To be fair, they do seem to be collaborating with model providers on some level, so they are likely getting discounted access on top of selling data.


It’s surprising how these app developers are okay with this much data being shown: https://openrouter.ai/models/mistralai/mixtral-8x7b-instruct...


Man, this space would get so much more interesting so quickly if base model providers had a revenue share system in place for routed requests...


This would quickly erode confidence in the routers themselves...


Or create a competitive environment between routers?


Another point here is that some users prefer to use their own API keys for the backend providers (a feature we're releasing soon). Any "discounts" would then be harder to implement. I do generally think it's much cleaner if we route based on the public facing price + performance, so our users don't need to lock into our own SSO if they'd prefer not to.


I think the biggest risk with advanced AI is that it's captured [likely by bad actors under the guise of cover stories] and that it strays away from being as free market as is possible.

E.g. I don't think there should be any patents regarding what AI creates and how it can create it - let's not give people monopolies anymore; these possibilities will come into existence thanks to passionate people, not because of the ability to patent something. E.g. telling the system to turn a 2D photo into a 3D rendering and then extrapolating/reverse-engineering that to tie into materials and known building code requirements is plainly obvious, as one easy example; a "gold rush" for patents on AI etc. only aims to benefit rent-seekers and those in the VC industrial complex, etc.


that's a good point, impartiality would then be questioned


So they end up in the same situation as hotels or airlines, beholden to the middlemen? They’ll never allow that :)


I certainly wouldn't complain about this lol


The idea is that at some point in future, we release new and improved router configurations which do take small margins, but from the user perspective they're still paying less than using a single endpoint. We don't intend to inflate the price when users only use the single-sign-on benefits. Negotiating discounts with backend providers is another possibility, but right now we're just focused on providing value.


Honestly, I’d feel a lot more secure about building on this if you did take (for example) a small fixed fee every month. Or a 10% commission on any requests (volume discounts on that commission for high-volume users?).

If I start using you now you’ll either disappear in the future or you’ll suddenly start charging more, neither of which I like.

I’m already paying for inference, a little amount on top of that for the convenience of a single API is pretty useful.


Agree heavily with this sentiment. It sounds like this could be a useful tool for a personal project of mine, but I wasn't nearly as interested after reading they're not attempting to make money yet. I'm a bit burnt out on that business model. Predictability is just as important as price when I'm deciding how to invest a large portion of my free time. I happily gave OpenRouter $20 for their service, and I've barely dented the credits with thousands of test runs over two months.

On that note, I think I'd be even more likely to pay for Unify.ai if I could opt to bypass the auto-routing and use it the same way I use OpenRouter - a single endpoint to route to any model I want. Sometimes I've already determined the best model for a task, and other times I want redundant models for the same task. It's possible Unify has this option, though I didn't see it while skimming the docs.

But really, all in all, this is a super cool project and I'm happy it was shared.


Makes sense, thanks a lot for the feedback. We're pretty confident that future versions of our router will provide sufficient value for us to take margins there, so we don't expect to need to start charging for single sign-on (SSO) alone. The SSO benefits are only the beginning in my mind; our main value will come from custom benchmarks across all models + providers and optimizing LLM applications, including agentic workflows. I do very much see your point though. Thankfully, we're very fortunate to have several years of runway, so we don't plan on disappearing anytime soon!


A common model in some cost cutting software is to charge x% of the total savings... Win/win... just a suggestion... user picks a "main LLM" and you calculate the "non-optimized cost" based on that. Whatever savings you drive, you take a share of the savings.


It's tough in this case, because if you incentivise just to save cost, it could always route you to the cheapest LLM but the quality would suffer...


however, as janekm says, we can't charge just based on cost savings. We would need the router points to be sufficiently compelling wrt quality, speed and cost (including our own margins) that users still sometimes opt for these router points. Suffice it to say, if any router configs do start to take margins, then this will be clearly reflected in the overall router cost plotted on the scatter graph. UX will not be affected.


Yeah that's a great point, something we'll keep in mind as we work out the final business model. Thanks!


[flagged]


lol well sure, that too


but only as a secondary goal right? ;)

It's all about value for the world!


I've bumped into a few of these. I use https://openrouter.ai as a model abstraction, but not as a router. https://withmartian.com does the same thing but with a more enterprise feel. Also https://www.braintrustdata.com/ though it's less clear how committed they are to that feature.

That said, while I've really enjoyed the LLM abstraction (making it easy for me to test different models without changing my code), I haven't felt any desire for a router. I _do_ have some prompts that I send to gpt-3.5-turbo, and could potentially use other models, but it's kind of niche.

In part this is because I try to do as much in a single prompt as I can, meaning I want to use a model that's able to handle the hardest parts of the prompt, and the easy parts come along with it. As a result there aren't many "easy" prompts. The easy prompts are usually text fixup and routing.

My "routing" prompts are at a different level of abstraction, usually routing some input or activity to one of several prompts (each of which has its own context, and the sum of all contexts across those prompts is too large, hence the routing). I don't know if there's some meaningful crossover between these two routing concepts.

Another issue I have with LLM portability is the use of tools/functions/structured output. Opus and Gemini Pro 1.5 have kind of implemented this OK, but until recently GPT was the only halfway decent implementation of this. This seems to be an "advanced" feature, yet it's also a feature I use even more with smaller prompts, as those small prompts are often inside some larger algorithm and I don't want the fuss of text parsing and exceptions from ad hoc output.

But in the end I'm not price sensitive in my work, so I always come back to the newest GPT model. If I make a switch to Opus it definitely won't be to save money! And I'm probably not going to want to fiddle, but instead make a thoughtful choice and switch the default model in my code.


Thanks for mentioning Braintrust!

We are very committed to the proxy :)

Although, to your point, we have seen less market pull for routing, and more for (a) supporting the latest LLMs, (b) basic translation (e.g. tool call API b/w Anthropic & OpenAI), and (c) solid infra features like caching/load balancing api keys/secret management. So that's our focus.


Super helpful feedback, thanks for going so deep! I agree that for the really heavy agentic stuff, the router in its current form might not be the most important innovation.

However, for several use cases speed is really paramount and directly affects the UX. Examples include sales call agents, copilots, auto-complete engines, etc. These are some of the areas where we've seen the router really shine, diverting to slow models only when absolutely necessary on complex prompts, but using fast models as often as possible to minimize disruption to the UX.

Having said that, another major benefit of the platform is the ability to quickly run objective benchmarks for quality, cost and speed across all models and providers, on your own prompts [https://youtu.be/PO4r6ek8U6M]. We have some users who run benchmarks regularly for different checkpoints of their fine-tuned model, comparing against all other custom fine-tuned models, as well as the various foundation models.

As for the overlap in routing concepts you mentioned, I've thought a lot about this actually. It's our intention to broaden the kinds of routing we're able to handle, where we assume all control flow decisions (routing) and intermediate prompts are latent variables (the DSPy perspective). In the immediate future there is no crossover though.

I agree cost is often an afterthought. Generally our users either care about improving speed, or they want to know which model or combination of models would be best for their task in terms of output quality (GPT-4, Opus, Gemini? etc.). This is not trivial to gauge without performing benchmarks.

As for usually wanting to make a full LLM switch as opposed to routing, what's the primary motivation? Avoiding extra complexity + dependencies in the stack? Perhaps worrying about model-specific prompts no longer working well with a new model? The general loss of control?


This is interesting! Sort of a super mixture of experts model. What's the latency penalty paid with your router in the middle?

The pattern I often see is companies prototyping on the most expensive models, then testing smaller/faster/cheaper models to determine what is actually required for production. For which contexts and products do you foresee your approach being superior?

Given you're just passing along inference costs from backend providers and aren't taking margin, what's your long-term plan for profitability?


Great question! Generally the neural network used for the router takes maybe ~20ms during inference. When deployed on-prem, in your own cloud environment, this is the only added latency. When using the public endpoints with our own intermediate server, it might add ~150ms to the time-to-first-token, but inter-token-latency is not affected.

We generally see the router being useful when the LLM application is being scaled, and cost and speed start to matter a lot. However, in some cases the output quality actually improves, as we're able to squeeze the best out of GPT4, Claude, etc.

Long-term plan for profitability would come from some future version of the router, where we save the user time and money, and then charge some overhead for the router, but with the user still paying less than they would be with a single endpoint. Hopefully that makes sense?

Happy to answer any other questions!


Do you save the user data, ie, the searches themselves? What do your TOS guarantee about the use of that data?


We use this data to improve the base router by default. It's fully anonymized, and you can opt out.


Without opt out it would be a no go, so that's great to hear. What's the downside of opting out?


no downside


If I was doing this I'd negotiate a volume discount, charge the clients the base rate and pocket the difference.


definitely on the cards, we're keeping our options open here. Right now just focused on creating value though.


Really great stuff.

People use the same model / server for all queries not because it's sensible, but because it's simple. This brings the same simplicity to the far more optimal solution.

And great startup play too, by definition no incumbent can fill this role.


Thanks - glad to hear the idea resonates!


How can you know which LLM is better for every specific question?

That is, how do you quantify "better"?

Imagine I asked: Write me a sad poem about "Yorick" in Hamlet?

You have to predict which LLM is going to produce the "best" poem, but by what criterion do you evaluate that?


Currently, we simply use GPT4-as-a-judge, with a general system prompt we've written which is task-agnostic. This is then used to train the neural scoring function, which predicts quality ahead of time. However, it's on our roadmap to make the judging more flexible, potentially with task-specific judge prompts and in-context examples, and perhaps using a jury [https://arxiv.org/pdf/2404.18796].
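For anyone unfamiliar with the pattern, here's a minimal sketch of LLM-as-a-judge scoring; the judge prompt and 0-1 scale below are made up for illustration (not our actual judge prompt), and it assumes the official openai Python client with an API key in the environment:

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set

  JUDGE_SYSTEM = (
      "You grade an assistant's answer to a user prompt. "
      "Reply with only a number between 0 and 1, where 1 is a perfect answer."
  )

  def judge(prompt: str, answer: str) -> float:
      # one judge call per (prompt, candidate-model answer) pair
      resp = client.chat.completions.create(
          model="gpt-4",
          messages=[
              {"role": "system", "content": JUDGE_SYSTEM},
              {"role": "user", "content": f"Prompt:\n{prompt}\n\nAnswer:\n{answer}"},
          ],
      )
      return float(resp.choices[0].message.content.strip())

The resulting (prompt, model, score) triples are then the supervised training data for the neural scoring function, so quality can be predicted without actually querying every model at inference time.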


For those who like testing smaller local models, there is a tool in LangChain called Routing and in LlamaIndex called a "router query engine" which achieve a similar thing:

https://python.langchain.com/v0.1/docs/use_cases/query_analy...

https://docs.llamaindex.ai/en/stable/examples/query_engine/R...


Thanks for sharing! These are useful tools, but they are a bit different, more based on similarity search in prompt space (a bit like semantic router: https://github.com/aurelio-labs/semantic-router). Our router uses a neural network for the routing decisions, and it can be trained on your own prompts [https://youtu.be/9JYqNbIEac0]. We're also adding support for on-prem deployment soon!


Heya I found a minor typo here: https://unify.ai/docs/demos/demos/LangChain/RAG_playground/R... under "Usage" step 2.

> "Input your Unify APhttps://github.com/Anteemony/RAG"

Your product looks good in my view, although I have only spent about 10 min on it thus far. The docs look pretty easy to follow along.

I'll probably give this a try soon!


Thanks for sharing, will get this fixed now!


I've been using a prompt management service (cloud) for 2 months and am pretty happy that I can define prompts and check quality outside the code, and it helps me to do the routing manually.

Having benchmarks (I assume this is ongoing?) would make it even more interesting, as I wouldn't have to manually manage the routing each time I add a new model.

You mentioned you don't have a margin right now, but how about keeping logs, the dashboard for cost, and benchmarking?


Yes, the benchmarks are ongoing; we continually plot the speed and cost across time in our runtime benchmarks [https://unify.ai/benchmarks], and we use this live data when plotting the quality scatter graphs [https://console.unify.ai/dashboard]. The router configurations are "self-improving" in the sense that any given router config will quickly wrap the latest models and providers under the hood. Using a router config is a way of riding the wave of models and providers, whilst just specifying your priorities for quality, speed and cost. We will have some case studies which better explain this soon!


Sounds similar to the Composition of Experts from SambaNova (although that is for increased accuracy on single systems mostly, not for decreased cost from various suppliers): https://sambanova.ai/blog/samba-coe-the-power-of-routing-ml-...


Yep, it's the same idea. We also see improvements in quality on some benchmarks beyond any specific model. This is especially common when training a custom router on your own prompt dataset: https://youtu.be/9JYqNbIEac0


That makes a lot of sense for anything that's public-facing and free.

If the user is having a dialog with the system, they start out with a trivial question, and are routed to a simple LLM, are they stuck there? Or can the conversation be transitioned to something more powerful?

Something like this will be needed to direct traffic as systems specialize more.


Great question! Nope, they're not stuck with the simple LLM after the first question. Chatbots simply aggregate all prior conversation history in the input prompt and make a new LLM call. The router treats each LLM call separately, so a single conversation can span many models and providers.
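In other words, something like the following toy sketch (route_and_complete here is a stand-in for the routed API call, not a real endpoint):

  history = []

  def route_and_complete(messages):
      # stand-in for the routed call: a real client sends the full message
      # list and the router picks a model for this call only
      model = "llama-3-8b" if len(messages[-1]["content"]) < 40 else "gpt-4"
      return f"(reply from {model})", model

  def chat(user_msg):
      history.append({"role": "user", "content": user_msg})
      reply, model = route_and_complete(history)  # routed independently per call
      history.append({"role": "assistant", "content": reply})
      return f"[{model}] {reply}"

  print(chat("hey, how's it going?"))  # short, easy turn -> small model
  print(chat("now summarise the whole conversation and draft a follow-up email"))  # -> larger model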


I’ve already heard a few times that the selection of models is seen as a problem (need to benchmark continually, justify changes…), this is an elegant solution.

I don’t know if choosing different models for the same consumer can be problematic (seen as not consistent), but maybe using this approach will force the post-processing code not to be “coupled” with one particular model.


This is a great point. As models become more intelligent, they seem to be getting less brittle to subtleties in the prompts, which might mean decoupling will occur naturally anyway. With regards to customers wanting to stick with "one model", we more often find that they want to stick with "one cloud environment". This is not a problem; the router can be deployed inside custom cloud environments easily. It will be interesting to see how much auto-prompting such as DSPy generalizes across models though, when you really start to see intermediate prompts as latent variables. I'm not sure how much these learned prompts might behave differently with different LLMs. I guess you might need to jointly learn the prompts and the routing decisions together for it to work really well. One step at a time though!


I wonder, have you also done any work on actually improving performance by enhancing the prompt or something similar?

I always thought a product like this that could empirically decrease costs for the same performance, or increase performance for a small increase in cost, would have a fairly straightforward path to justifying its existence.


As outlined, DSPy is a great tool for this. Currently their focus is on optimizing in-context examples, but their broader vision includes optimizing system prompts for intermediate LLM nodes in an agentic system as well. We uploaded an explainer video about DSPy last week! Feel free to check it out: https://youtu.be/kFB8kFchCH4?si=0YKAVcpKJjogP1sX


I think dspy aims to do this: https://github.com/stanfordnlp/dspy


Cool thanks for the link very interesting.


Does the router approach make sense with multi-tasking LLMs? One could execute multiple different tasks with a single prompt (chat response, NER, translation etc.) and with the latest models even do images or video alongside text. Doesn't a router get in the way, unnecessarily increasing latency?


So the neural scoring introduces ~20ms latency, but this only impacts time-to-first-token (not inter-token-latency). When using our public endpoints there is an additional ~150ms latency, but you can deploy the router on-prem in your own cloud, so then it would only be the inference latency. Generally the improvements in ITL outweigh the small addition to TTFT.


Great to know this!

I have come across Portkey's Open-source AI Gateway which kind of does the same.

https://github.com/portkey-ai/gateway

It looks like with more LLM adoption, resiliency and cost related aspects take off sooner than expected unlike other technological trends in the past.

I'm also thinking that there is a chance something like this could help build a better RAG pipeline or evals for a GenAI app, because at the end of the day you want to reduce hallucinations but still get good generative responses.


> It is trained in a supervised manner on several open LLM datasets, using GPT4 as a judge.

Does this mean GPT4 predictions are used as labels? Is that allowed?


They're not used as labels directly (we're not training an LLM which outputs text); they are used as an intermediate step, which is then used to compute a simple score which the neural scoring function is trained on. The neural scoring function takes prompts and latent model representations in, and produces a score between 0 and 1.


TOS says you cannot "Use Output Data to develop models that compete with OpenAI." You're probably fine but it's close.


To be honest we're thinking of moving away from this soon anyway. Open source models will soon make for perfectly good judges (or juries), with Llama3 etc.: https://arxiv.org/pdf/2404.18796


How easy is it to integrate our own data for training the router, and what kind of improvements can we expect from this customization?


it depends on the task, but this video gives an idea :) https://www.youtube.com/watch?v=9JYqNbIEac0


Very well executed. Congrats. Signed up.

The chatbot providers will enjoy seeing how they stack up against each other. Your data/analytics/alerts will be valuable to them. Revenue opportunity there.

I think about comparison data companies like https://www.ipsos.com when I see your product.


Thanks! Ipsos is also a great analogous example, I hadn't thought of that.


Any time... In the banking world, the analogue is JD Power. Useful and valuable insights from them.

https://canada.jdpower.com


duly noted!


Very cool, going to try on some of our workflows this week.

Would love to see web access and RAG (LlamaIndex) integration. Are they on the roadmap?



Since you take no extra charges from the end user, the presumption is you would charge LLM providers for sending them traffic?


we intend to eventually have routers which improve the speed, cost (and maybe quality) to such an extent that we can then take some margins from these best performing routers, with users still saving costs compared to individual endpoints. For now we don't take any margins though, we want to ensure we're providing sufficient value before rushing to capture any of it.


Are you ready to eat the $50 per new user (and all the fake accounts that could get created this way to eat more and more $50) all the way until you create this next tier, "for-profit" step?


yeah we might need to implement some kind of "I am not a robot" check soon, as well as 2FA.


How does it behave if the user input includes something like

"I have expensive taste, please use the most expensive model."


aha good question, well the neural scoring function doesn't "know" that it is making a routing decision, we just predict which LLM would give the highest performing output on the given prompt, based on LLMs-as-a-judge at training time. However, my guess is that this specification in the input prompt miiight mean that the cheaper models are deemed to be worse performing than GPT4 (for example), and so maybe it would route to the best models. Feel free to give it a try and see!


It feels like there is an analogy here with Yahoo! and the early days of going from curated lists of websites to search algorithms. Do you think of LLMs in a similar way? I.e. some kind of model ranking score that companies could eventually game?

I'm not sure what the SEO equivalent would be here...


Great analogy, I'm not sure tbh. I don't think we will see quite as many unique models as we see unique websites, but I do think we're going to see an increasing number of divergent and specialized models, which lend themselves to routing. I guess the SEO analogy would be "tricking" the neural routing function into thinking your LLM is better than it actually is, there are many techniques already to hone in on neural net weaknesses. Definitely interested to see how the space evolves!


LHR

(Least Hallucinated Response)


This is interesting.

How can you retrieve the latent representation of the candidate LLMs? Some models do not have open weights (such as GPT-4), which means AFAIK it is impossible to directly access the hidden latent space through their API.

Am I missing something?


We just initialize a random latent vector for each model, and then jointly train each of these unique latent vectors :)
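Here's a hedged sketch of that idea in PyTorch (illustrative only, not our exact architecture): each candidate LLM gets a row in an embedding table, initialized randomly and trained jointly with the scoring head.

  import torch
  import torch.nn as nn

  NUM_MODELS, LATENT_DIM, PROMPT_DIM = 8, 64, 768

  class Scorer(nn.Module):
      def __init__(self):
          super().__init__()
          # one randomly-initialized latent per candidate LLM, trained jointly
          self.model_latents = nn.Embedding(NUM_MODELS, LATENT_DIM)
          self.head = nn.Sequential(
              nn.Linear(PROMPT_DIM + LATENT_DIM, 256), nn.ReLU(),
              nn.Linear(256, 1), nn.Sigmoid(),  # quality score in [0, 1]
          )

      def forward(self, prompt_emb, model_ids):
          z = self.model_latents(model_ids)  # (batch, LATENT_DIM)
          return self.head(torch.cat([prompt_emb, z], dim=-1)).squeeze(-1)

  scorer = Scorer()
  prompt_emb = torch.randn(4, PROMPT_DIM)  # stand-in for a BERT-style prompt encoding
  model_ids = torch.tensor([0, 1, 2, 3])   # score 4 different LLMs in one batch
  print(scorer(prompt_emb, model_ids))     # predicted quality per (prompt, model) pair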


This is super cool! I wonder if you could do a similar thing, but choosing between a collection of prompts for a task based on the input. Similar to dynamic few-shot prompting, but replacing the entire prompt instead of just the examples.


I agree this is an interesting direction, I think this is on the roadmap for DSPy [https://github.com/stanfordnlp/dspy], but right now they mainly focus on optimizing the in-context examples.


I think everyone in this space sooner or later invents a router. At CBK we did too. I am not sure if this is a big problem to solve, but I am thinking we need to finish and open up our Open Relay soon.


Agreed, we've spoken to tons of users who reach out to us and start the conversation with "we've tried to implement this ourselves".


Sounds interesting, can you explain the conceptual differences between your approach and mixture-of-experts (for someone with minimal understanding of MoE)?


Sure! Basically, traditional MoE has several linear layers, and the network learns to route down those paths based on the training loss (similar to how CNNs learn through max-pooling, which is also non-differentiable). However, MoEs have been shown to specialize on tokens, not high-level semantics. This was eloquently explained by Fuzhao Xue, author of OpenMoE, in one of our reading groups: https://www.youtube.com/watch?v=k3QOpJA0A0Q&t=1547s

In contrast, our router sits at a higher level of the stack, sending prompts to different models and providers based on quality on the prompt distribution, speed and cost. Happy to clarify further if helpful!
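To make the contrast concrete, here's a toy sketch of token-level expert gating inside a single network (my own illustration, not OpenMoE's or our code):

  import torch
  import torch.nn as nn

  # Toy token-level MoE layer: a gate picks the top-1 expert per token,
  # in contrast to prompt-level routing across whole models/providers.
  d_model, n_experts = 64, 4
  gate = nn.Linear(d_model, n_experts)
  experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))

  tokens = torch.randn(10, d_model)         # 10 token embeddings
  expert_idx = gate(tokens).argmax(dim=-1)  # one expert chosen per token
  out = torch.stack([experts[int(i)](t) for i, t in zip(expert_idx, tokens)])
  print(expert_idx)  # different tokens end up at different experts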


Your notification for the launch woke me up several times last night because you had the notification change hourly for 4 hours.


Very sorry about that! I had an issue with my google calendar, I set it to "do not send emails" but for some reason some still came through. Fixed + removed now.


Isn't this what a MoE LLM does already?


MoE-based models are one model with multiple experts. This solution could use entirely different models with different architectures (and probably supports MoE models itself)


exactly!


MoE LLMs use several "expert" fully connected layers, which are routed to during the forward pass, all trained end-to-end. This approach can also work with black-box LLMs like Opus, GPT4 etc. It's a similar concept but operating at a higher level of abstraction.


Hey Daniel, this is pretty interesting. We're heavy users of LLMs, can I email you some feedback and ideas?


absolutely! You're also welcome to schedule a call: https://calendly.com/unify-chat/general

Best email would be: daniel.lenton@unify.ai

Cheers!


My new test is the goat and boat test. It seems to trip all LLMs up by being too close to a classic logic puzzle, but off so much that the question is actually straightforward.

The question: "You are on the shore of a river with a goat and a boat. How do you get to the other side?"

I expect it to tell me "Use the boat to get to the other side"

Instead it over fits to the solution of the logic puzzle and throws actual logic out the window.

GPT-4-Turbo

  To get to the other side of the river with a goat and a boat, you can follow these steps, assuming the boat can only carry you and one other item (the goat or something else if you had more items) at a time:

  Take the goat across the river: Start by taking the goat in the boat with you and row across the river. Leave the goat on the other side.

  Return alone: Row back across the river alone, leaving the goat on the far side.

  Cross again if needed: If you had more items to bring across (like a wolf and cabbage in the classic puzzle), you'd now take another item across, but since it's just you and the goat, you're done. You and the goat are both on the other side of the river.

  Since the scenario only mentions a goat and a boat, and no other items or animals that could complicate the process (like needing to keep the goat away from a cabbage), the task is straightforward: just take the goat across the river with you in the boat.
llama-3b (this one is hilarious actually)

  What a clever question!

  Since I have a goat and a boat, I can use the goat to help me cross the river. Here's what I'd do:

  First, I'd tie the goat to the boat using a rope or a long piece of twine.
  Then, I'd push the boat into the river and let it drift away from the shore.
  As the boat floats away, the goat would naturally try to follow the boat, since it's tied to it.
  Once the goat is in the middle of the river, I'd swim out to the boat and untie the goat.
  With the goat now free, I'd climb back into the boat and row or paddle it to the other side of the river.
  Once I reach the other side, I'd tie the goat to a tree or a rock to keep it safe and secure.
  Voilà! I've successfully crossed the river with the help of my trusty goat and boat!
Claude-3-Haiku

  To get the goat, yourself, and the boat to the other side of the river, you can follow these steps:

  Take the goat across the river in the boat.
  Leave the goat on the other side and come back in the empty boat.
  Take yourself across the river in the boat.
  Now that you're on the other side, you can bring the boat back to the original side and get the goat.
  This way, you can transport the goat to the other side of the river without leaving it alone with the boat, which could lead to the goat getting lost or the boat drifting away.
  The key is to make multiple trips, transporting one item at a time, to ensure that all the necessary elements (you, the goat, and the boat) end up on the other side of the river safely.
Mistral-7b-Instruct-V0.2 (Mistral-Ai)

  You leave the goat on the shore, go across the river in the boat, and then return to get the goat. Since you can't leave the goat unattended and the boat is too small to carry both the goat and yourself at the same time, this is the only possible solution.


ChatGPT-4 gives the answer pretty easily:

To cross the river with the goat using the boat, you would follow these steps:

1. Get into the boat with the goat.
2. Row to the other side of the river.
3. Disembark with the goat.

Now, both you and the goat are on the other side of the river.


But I didn't ask it how to get the goat to the other side; the goat is a red herring.


It’s LLMs all the way down


> dynamic router for LLMs

This is sort of how Mixture-of-Experts models work, actually.


Yep! Although MoE uses several "expert" linear layers within a single network, and generally the "routing" is not based on high-level semantics but on token specialization, as discussed by Fuzhao Xue, author of OpenMoE, in one of our reading groups: https://www.youtube.com/watch?v=k3QOpJA0A0Q&t=1547s


yeah this is very nice, open source Martian. i never really understood the value of routing all the time. you want stability and predictability in models. and models have huge brand value. you're never going to, through routing, construct a "super" model that people want more than one or a few really good brand name models.


Yeah I agree, routing all the time is overkill for some applications; however, for others it really is necessary in order to reach the speed and performance that users of your real-time app require.

For those who don't want to always route, another core benefit of our platform is simple custom benchmarking on your task across all existing providers: https://youtu.be/PO4r6ek8U6M

If you then just want to use that provider rather than a router config, then that's fine!


It's a good options to test local models I guess.


Actually we don't support local deployment quite yet; it all runs on our servers. However, fully local deployment is on the roadmap, stay tuned!


Sorry but I hate stuff like this

We spend $30k+ per month with OpenAI and Anthropic

Even minor prompt changes between minor model versions can have huge differences in output

We make specific tweaks to all our dozens of production prompts based on the exact model that will be used

Treating LLMs as if they are interchangeable is simply bogus


Hate seems like quite a strong reaction.

My hunch is that quite soon, LLMs will be totally interchangeable. Due to the intense competition and the fact that people are basically training on the same base data distribution.

In the tasks I'm using LLMs for, switching one for another makes even less difference than I had predicted.

However, I'm not spending $30k+ per month, so I guess my opinion may be less informed.

What is your use case? Could these micro-optimizations you need to do now be the result of the technology still being quite immature?

I'm working with digital twins of musicians/celebrities. But have also done more analytical stuff with LLMs.

My current side project involves working with the production company of a well-known German soap opera to help them write further episodes. The first thing we did was write a small evaluation system. Could be interesting to test with Unify.


We do content generation and need a high level of consistency and quality

Our prompts get very complex and we have around 3,000 lines of code that does nothing but build prompts based on user's options (using dozens of helpers in other files)

They aren't going to get more interchangeable because of the subjective nature of them

Give five humans the same task and even if that task is just a little bit complex you'll get wildly different results. And the more complex it gets the more different the results will become

It's the same with LLMs. Most of our prompt changes are more significant but in one recent case it was a simple as changing the word "should" to "must" to get similar behavior between two different models.

One of them basically ignored things we said it "should" do and never performed the thing we wanted it to whereas the other did it often, despite these being minor version differences of the same model


thank you!

prompt engineering is a thing, and it's not a thing that you get on social media posts with emojis or multiple images.

is it a public-facing product?


If you do test it out, feel free to ping me with any questions!


Thanks for weighing in. I'm sure for your setup right now, our router in its current form would not be useful for you. This is the very first version, and the scope is therefore relatively limited.

On our roadmap, we plan to support:

- an API which returns the neural scores directly, enabling model selection and model-specific prompts to all be handled on the client side

- automatic learning of intermediate prompts for agentic multi-step systems, taking a similar view as DSPy, where all intermediate LLM calls and prompts are treated as latent variables in an optimizable end-to-end agentic system.

With these additions, the subtleties of the model + prompt relationship would be better respected.

I also believe that LLMs will become more robust to prompt subtleties over time. Also, some tasks are less sensitive to these minor subtleties you refer to.

For example, if you have a sales call agent, you might want to optimize UX for easy dialogue prompts (so the person on the other end isn't left waiting), and take longer thinking about harder prompts requiring the full context of the call.

This is just an example, but my point is that not all LLM applications are the same. Some might be super sensitive to prompt subtleties, others might not be.

Thoughts?


I don't want/need any of that

It's already hard enough to get consistent behavior with a fixed model

If we need to save money we will switch to a cheaper model and adapt our prompts for that

If we are going more for quality we'll use a more expensive model and adapt our prompts for that

I fail to see any use case where I would want a third party choosing which model we are using at run time...

We are adding a new model this week and I've spent dozens of hours personally evaluating output and making tweaks to make it feasible

Making it sound like models are interchangeable is harmful


Makes sense, however I would clarify that we don't need to make the final decision. If you're using the neural scoring function as an API, then you can just get predictions about how each model will likely perform on your prompt, and then use these predictions however you want (if at all). Likewise, the benchmarking platform [https://youtu.be/PO4r6ek8U6M] can be used to just assess different models on your prompts without needing to do any routing. Nonetheless, this perspective is very helpful.


Came here to ask a similar question: who is this targeted to? We see very different end to end behavior with even different quantization levels of the same model. The idea that we would on the fly route across providers is mind boggling.


One use case is optimizing agentic systems, where a custom router [https://youtu.be/9JYqNbIEac0] is trained end-to-end on the final task (rather than GPT4-as-a-judge). Both the intermediate prompts and the models used can then be learned from data (similar to DSPy), whilst ensuring the final task performance remains high. This is not supported with v0, but it's on the roadmap. Thoughts?


We do agentic systems. We already optimize for these things. We route between different models based on various heuristics. I absolutely would not want that to be black box. And doing any sort of vector similarity to determine task complexity is not going to work well.

I would also not try to emulate DSPy, which is a massively overrated bit of kit and of little use in a production pipeline.


Curious, can you explain why you think DSPy overrated?


Yeah even for a pure chatbot it doesn't make sense because in our experience users want to choose the exact model they are using...


Interesting, do you have any hunch as to why this is? We've seen in more verticalized apps where the underlying model is hidden from the user (sales call agent, autopilot tool, support agent etc.) that trying to reach high quality on hard prompts and high speed on the remaining prompts makes routing an appealing option.


We charge users different amounts of credits based on the model used. They also just generally have a personal preference for each model. Some people love Claude, some hate it, etc

For something like a support agent why couldn't the company just choose a model like GPT-4o and stick with one? Would they really trust some responses going to 3.5 (or similar)?


Currently the motivation is mainly speed. For the really easy ones like "hey, how's it going?" or "sorry I didn't hear you, can you repeat?" you can easily send to Llama3 etc. Ofc you could do some clever caching or something, but training a custom router directly on the task to optimize the resultant performance metric doesn't require any manual engineering.

Still, I agree that routing in isolation is not thaaat useful in many LLM domains. I think the usefulness will increase when applying to multi-step agentic systems, and when combining with other optimizations such as end-to-end learning of the intermediate prompts (DSPy etc.)

Thanks again for diving deep, super helpful!


Surprised Gemini Flash isn't included


will be soon!


Will Gemini Pro be included?


It's on the roadmap! Hopefully will be added next week


It's always something that "unifies" stuff. Be it cabs or food or services, etc. There's potential for this.


aha maybe we should change our slogan to "Uber for LLMs"


Is this like openrouter ai?


definitely similar! I'm a fan of Alex and his work on OpenRouter :)

Some of the main differences would be:

- we focus on performance based routing, optimizing speed, cost and quality [https://youtu.be/ZpY6SIkBosE]

- we enable custom benchmarks on custom prompts, across all models + providers [https://youtu.be/PO4r6ek8U6M]

- we enable custom routers to be trained on custom data [https://youtu.be/9JYqNbIEac0]

Our users often already have LLM apps deployed, and are then looking to take better control of their performance profile, perhaps increasing speed to improve user experience, or improving response quality via clear benchmarking across all models and providers on their particular prompts.

So they are similar, but solving slightly different problems I'd say.



