Why host your own LLM? (marble.onl)
262 points by andy99 on Aug 15, 2023 | 132 comments



I chose to self-host for a variety of reasons. One, I had been using gpt-3.5-turbo for a hobby project and had an abysmal availability experience -- I received repeated 429s even though my serial requests were far sparser than the limits allowed, at their most frequent coming ten times less often than the recommended threshold. When I backed off further, the 429s kept returning. It reached the point where even llama.cpp models (on 20 cores) were clearly more performant. I received absolutely no response from customer support despite being a paying customer. I imagine OpenAI affiliates here will downvote this, but I found OpenAI's API availability to be among the worst I have ever used.


FWIW, we've been live with GPT-3.5 turbo since May and it's improved a LOT.

Latency is down consistently across the board and we haven't seen a single "429 model overloaded" error in the past month: https://twitter.com/_cartermp/status/1686894576202907651/


It is worth noting that I had the perception, incorrect or not, that larger customers were being given priority access to OpenAI's APIs. It is good to see that you are reporting improvements.


Oh, they undoubtedly were. We have decent volume, but worked very hard in our prompting to minimize input and output tokens, so we don't have a big bill. I just think that in the past months they've added significantly more capacity and done a lot of work on their inference servers to make things run better.


Given Meta's obvious, and most welcome, open source play with LLaMA 2, it behooves OpenAI to be performant and accessible to everyone. For my application, I am finding that 70b LLaMA 2 models are very close to the inference quality of gpt-3.5-turbo.


It's gotten better for everyone in the last few months. It used to be a nightmare, but I haven't seen a timeout or rate limit error in a long time.


Where do you host your model? I'm looking around for somewhere I can deploy one without ruining myself financially.


The easy answer here would be either pure CPU at Hetzner (e.g. a 24 core i9 with 64GB RAM for €84/month) or GPU at lambdalabs (starting at $360/month). Or maybe vast.ai, if you find a trustworthy offering with good uptime.

Though GPU workloads are still an area where building your own server and running it from your basement, or putting it in a colo, can be very attractive.


You could easily host your model on https://beam.cloud (I'm a founder). You just add a decorator to your existing Python code:

    from beam import App, Runtime

    # Define an app whose runtime is a T4 GPU
    app = App(name="gpu-app", runtime=Runtime(gpu="T4"))

    # Expose the function as a REST endpoint that runs on that GPU
    @app.rest_api()
    def inference():
        print("This is running on a GPU")
Then run `beam deploy {your-app}.py` and boom, it's running in the cloud.


An A10G for $1,200 per month will ruin me financially


I think the Beam website should be a lot clearer about how things work[0], but I think Beam is offering to bill you for your actual usage, in a serverless fashion. So, unless you're continuously running computations for the entire month, it won't cost $1200/mo.

If it works the way I think it does, it sounds appealing, but the GPUs also feel a bit small. The A10G only has 24GB of VRAM. They say they're planning to add an A100 option, but... only the 40GB model? Nvidia has offered an 80GB A100 for several years now, which seems like it would be far more useful for pushing the limits of today's 70B+ parameter models. Quantization can get a 70B parameter model running on less VRAM, but it's definitely a trade-off, and I'm not sure how the training side of things works with regards to quantized models.

Beam's focus on Python apps makes a lot of sense, but what if I want to run `llama.cpp`?

Anyways, Beam is obviously a very small team, so they can't solve every problem for every person.

[0]: what is the "time to idle" for serverless functions? is it instant? "Pay for what you use, down to the second" sounds good in theory, but AWS also uses per-second billing on tons of stuff... EC2 instances don't just stop billing you when they go idle, though, you have to manually shut them down and start them up. So, making the lifecycle clearer would be great. Even a quick example of how you would be billed might be helpful.


why did you decide to make such a bad pitch for your product like this?


I found gpu-mart.com but haven't tested yet.

An A4000 for $139 x 12 is not terrible


Last year I invested in a dual RTX 3090 Ryzen self-built tower. It runs fairly cool in the basement, so I literally self-host. I am confident that I have reached, or soon will reach, the point on the cost curve where self-hosting is cheaper, particularly as the two GPUs see very consistent use.


what are you consistently using it for?


I use Runpod, an A4000 is $0.31/hr


I have had similar issues with Azure GPT-3.5: it would respond fast sometimes and just hang for minutes at other times, without even a 429. It just blocked randomly.


How much are you spending to self-host an LLM?


Just listened to Lex Fridman's three hour interview with George Hotz this weekend. He spoke about his new company, Tiny Corp.

Tiny Corp. will be producing the Tiny Box that lets you host your own LLM at home using TinyGrad software.

The tinybox

738 FP16 TFLOPS

144 GB GPU RAM

5.76 TB/s RAM bandwidth

30 GB/s model load bandwidth (big llama loads in around 4 seconds)

AMD EPYC CPU

1600W (one 120V outlet)

Runs 65B FP16 LLaMA out of the box (using tinygrad, subject to software development risks)

$15,000

Hardware startups are extremely difficult. But Hotz's other company Comma.ai is already profitable so it is possible. I find the guy extremely encouraging and he is always doing interesting stuff.


Is $15,000 really an "at home" sort of price?

(If money is no object, why not grab an oxide.computer rack? Assuming you have three-phase power, of course...)


I don't think the $15k price tag is intended for regular consumers. I didn't listen to the Lex Fridman interview (my patience starts to wane after the first two hours), but I did listen to another of Hotz's interviews on another podcast, and I believe he referred to their "Tiny Box" as more of a dev box. It's for the engineers and tinkerers who want a capable machine for AI. So while expensive, it fills a niche for those who don't want to link multiple Nvidia GPUs together and deal with the power and cooling setup that comes with that.


He's definitely thinking bigger than just dev boxes. Lately he's repeatedly mentioned building his own chips somewhere in the future. How plausible that is is another matter.


Right, that’s the long term plan. I’m speaking specifically about the $15k first release of the tiny box.


Right, especially when it's not clear if such a machine will be needed a year or two from now to run a model with the equivalent performance.


At this rate it may not be needed in a few months to run a model of equivalent performance.

At work I've been successfully running llama2 variants on Mac M2 systems with more than adequate throughput for small experiments and examples. I can also boot a 4x Nvidia T4 instance on AWS (g4dn.12xlarge) to run larger models.

I feel you'd need to be doing inference on a 24/7 basis for at least a year to justify the price of this machine.


That's roughly the price of two original Macintosh computers adjusted for inflation.


Does oxide do GPU builds? I thought they're classic server racks (CPU)

> three-phase power,

I think the tinybox runs off single-phase, but in the US the GPUs are power-limited given the 110V grid.


We (oxide) aren't doing a GPU focused product yet, that's right.


A new car, especially electric, costs significantly more than $15,000, yet people buy them and bring them home daily.


Transportation is a universal need, and the time savings over other forms are often worth the premium. Doubtful the same can be said for these.


> Is $15,000 really an "at home" sort of price?

For something that is... going to hallucinate? I don't get it?


I mean, people spend way more than that on random art paintings and sculptures right?

With this, it could always be repurposed if something new and shiny comes along


> With this, it could always be repurposed if something new and shiny comes along

I mean, the oxide.computer specs make me go "this would be the most killer gaming machine in all of existence" ... though it's a server rack meant to host an entire private cloud for an entire corporation, I would still love to glom together as many resources as can fit into a single VM for gaming purposes.


George Hotz misread the NVIDIA Inception pricing as per unit instead of as a rebate, believing the GPUs are 80% cheaper than they actually are.


Why buy this instead of building your own setup with X 4090s?


I think he mentions this in the podcast. One of the challenging problems they're attempting to tackle is enabling the build to work with standard 120 volt wall sockets. He mentions throttling the system (GPUs specifically) to use less power at cost to performance.

That and setting up the software to work w/ an array of GPUs rather than one seems difficult.


I don't know if he actually said this, but power limits on GPUs do work. I've been limiting my GPU to about 60% of max power during long AI workloads and lost only about 10% of performance. A 40% reduction in power usage seems worthwhile.

It probably depends on your GPU though.

Undervolting would probably be even better.


Same here with the 3090 I use: power limited to 60% and I barely notice a difference when experimenting with AI or playing games. Really good to use in an SFF PC.


What OS/software are you using to throttle the cards? I'm interested in doing this on Arch


    ~ > sudo nvidia-smi -pl 300
    Power limit for GPU 00000000:09:00.0 was set to 300.00 W from 450.00 W.


You got the Arch solution. For anyone else on Windows: you can just use MSI Afterburner (regardless of whether you have an MSI card or not).


144GB VRAM vs 48?


And once you buy six 4090s for 6 x 24 = 144GB, and add the other components, you are not that far off from $15k. The 4090s are still a bit cheaper and a more standard setup, but it's close enough to consider the alternative.


Good luck running that on 1600W and keeping it cool.


What podcast (or otherwise) was this?

Was it this one? https://www.youtube.com/watch?v=dNrTrx42DGQ


Yes, that's the podcast I watched. He also talks a lot about comma.ai which is now profitable on $5 million of VC funding. I think that he has raised a similar amount of funding for Tiny Corp.


Think of it like an AC unit: they are expensive but indispensable. I have always envisioned locally trained AI, and now it is happening!


You probably need an AC unit too if you're going to be dumping 2000W of heat into your living space.


why are you valuing George Hotz (or Lex Fridman's) opinions in this at all?


Imagine how many 4090s you could buy and run in a cluster for $15,000 though


I hate to burst your bubble, but more than two 4090s is going to put you at BoM + labor costs around $15k. Especially if you have to upgrade your electrical and hvac.


They seem to have six, either 4090s or 3090s judging by the total amount of VRAM. How many more do you think you could get with $15k considering all the other hardware costs? I doubt you could make it more efficient at this price point.


It might be tough to make more efficient, but $15k seems exactly about the price of "stick six 4090s in a decent box and throw in a couple grand for my troubles", versus any revolutionary hardware configuration. The way it advertises running fp16 Llama 70B feels a bit contrived too, given the prevalence of quantizing to 8-bit at minimum.


In my opinion, the best hardware to run big models is a Mac Studio with the M2 Ultra. You get 192GB of unified RAM, which can run pretty much every available model without losing performance. And it would cost you half that price.


> without losing performance

But isn't M2 ULTRA over 20x slower than this thing? ~30 TFlops vs 738.


By losing performance I meant you don’t need to quantize the model a lot since it fits in RAM. My bad for not clarifying it.


According to this [1] article (current top of hn) memory bandwidth is typically the limiting factor, so as long as your batch size isn't huge you probably aren't losing too much in performance.

https://finbarr.ca/how-is-llama-cpp-possible/


The Tinybox is planned to use AMD, not NVIDIA. The GPU they're using to build Tinybox is the 7900 XTX.


We host our own LLM because:

1. We are not permitted, by rule, to send client data to unapproved third parties.

2. I generally do not trust third parties with our data, even if it falls outside of #1. Just look at the hoopla with Zoom; do you really want OpenAI further solidifying their grip on the industry with your data?

3. We have the opportunity to refine the models ourselves to get better results than offered out of the box.

4. It's fun work to do and there's a confluence of a ton of new and existing nerdy technology to learn and use.


Stupid question from an outsider, but is it possible to grab a pretrained model (most likely one of the "camelids"), feed it your own data (I have about 1k documents that cannot leave my network) and use it for fast-but-sometimes-wrong information retrieval?


What is your experience with quality? Even if you don't have the option to use a third-party LLM, the question is whether your self-hosted solution is good enough that your users (employees) will accept it. While you can forbid external solutions, in the end you can't force them to use your own solution - at least not in the long run.

I'm very curious what your experience is. Do you think self-hosting is good enough that users will accept it?


1. We can and do disallow sharing of client and company data with third party LLMs by policy and, yes, we can ‘force’ them to use our product capabilities.

2. We tested 13 and 30b parameter pretrained LLMs and found them “good enough” within the confines of the original scope.

That said, we are expanding the use cases and with that comes the need for increasing levels of quality. This is where the self-training comes in.

We’re fairly confident it’ll meet our needs for now.


Thanks for your answer. I think I have to research this further. I disagree with your point about forcing your employees. Everyone has GPT as a comparison, if the internal solution is a little bit worse it might fly. If it is too bad people will not put up with it and if you force them they will either ignore the internal solution or leave.

EDIT: Or in the worst case ignore the policy and get fired.


Getting fired could be the least of their concerns depending on the nature of the data shared with third parties.


I think a good example is Microsoft Source Safe. Microsoft had the strongest incentive to use their own product internally and yet even a company this size could not resist the pressure in the long run to use a marginally better product.

We will see how this turns out with LLMs and while I wish companies were more concerned about what data they share with third parties my experience is that apart from military/intelligence related data everything can and will be outsourced.


I host an LLM because it's cheaper for my use case. Too many people focus on how an LLM interfaces with users, but I believe the best, most reliable use for an LLM is analyzing free-form text and having it put that data into quantifiable fields or tags. Things like this that would have taken interns or overseas laborers weeks to months can now finally be automated.
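
To make that concrete, here is a minimal sketch with llama-cpp-python; the model path, the example text and the field names are all made-up placeholders:

    import json
    from llama_cpp import Llama  # pip install llama-cpp-python

    llm = Llama(model_path="./models/llama-2-13b.q4_0.bin")  # placeholder path

    note = "Paid $1,240.50 to Acme Corp on 2023-07-14; invoice was 12 days late."
    prompt = (
        "Extract vendor, amount, date and days_late from the text below "
        "and return a single JSON object.\n"
        f"Text: {note}\nJSON:"
    )

    out = llm(prompt, max_tokens=128, temperature=0.1)
    fields = json.loads(out["choices"][0]["text"])  # expect to need cleanup in practice
    print(fields)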


I've had a similar thought. I want to feed LLMs (and friends) messy data from my house and let it un-mess as best it can. A big hurdle in managing home data (chat logs, emails, browser history, etc) is making use of it. i don't want to have to tag all of my data. LLMs seem really attractive for that to me.

I have this urge to toy with the idea but i also find "Prompt Engineering" to be very unattractive. It feels like something i'd have to re-tailor towards any new model i change to. Not very re-usable and difficult to make "just work".


Ya prompt engineering can be a more difficult than it looks. Especially when dealing with less intelligent models. That's why I recommend having an error checking stage were the model gives a model should be able to return a simple "True" or "Yes" when presented it's last response. This eats up more GPU time but the signal to noise ratio improves drastically.


> That's why I recommend having an error checking stage were the model gives a model should be able to return a simple "True" or "Yes" when presented it's last response.

Mind elaborating on that? Looks like a typo but i'm having difficulty knowing for sure. Thanks!


Would also be interested in learning more about this


Thank you for this hint. I saved a history file from a remote session I did yesterday; there's some interesting stuff in there, but also a lot of clutter from all the mistakes I made. Asking the free ChatGPT to unclutter it for me made it a lot easier to read: it went from 133 lines to 39 lines, with added comments and line spacing.

It's not perfect but it removed a lot of useless stuff like all the times i mistyped '--recurse-submodules' or the times I went into a folder only to realize it was not the correct folder for what I was trying to do.


> I want to feed LLMs (and friends) messy data from my house and let it un-mess as best it can.

What would be your goal for doing that?


To then be able to have secondary steps of lookup. I often want a search engine for "my life". Most commonly from text messages where i discussed something. In a perfect world it would be able to link context between emails, browser history, chat conversations, etc. I'd love a flexible system that could record what i have in boxes, in the fridge, etc.

Sounds a bit silly, but of course it's mostly just for fun. However on the more practical side, i do often find myself needing to dig through old text conversations trying to find that one message. Not having a flexible, deep search behind it sucks. I often find myself wanting to do the same with my browser history. Find that one website i visited, etc.

I have the thought that it would be great to make my data points more rich. Don't just tag my browser history with isolated tags, such as Programming, Rust, etc - but infer meaning from my searching. Be able to see that i'm working on ProjectX actively via CLI Git activity, and that i'm searching for Y. Be able to correlate commit Z with search Y. etcetc

It feels to me there's a ton of small, edge case utility that can be gained by dumping everything to a local server and having it link the data. But i don't want to do any of that manually.

Likewise, i've wanted to manage "Home Inventory" before - what's in boxes, etc. Managing that myself is tedious, though. LLMs seem ripe for figuring out associations - even dumb LLMs. My hope is that i eventually can start wiring things together and having the LLMs start making rich data out of messy untagged data.

Would be neat, /shrug


This would be truly great. I have been pushing back the task of itemizing all of my belongings into a spreadsheet in the event of a natural disaster/fire/theft and having an assistant that I could say, "I have this road bike I built" and have it look through my emails and gather all the components and associated costs, then add it to a spreadsheet, would be a boon to many people. Of course, having it do it automagically from my purchases would be even better.


have you heard of rewind.ai? it sounds like it might be a possible solution for what you're looking for (not affiliated with it though, and also don't have it on my Mac, so not sure how well it works in reality)


Just leaving this, there was an effort to build a truly open source one: https://github.com/dhamaniasad/cytev2


Yea, though it's not local. They claim it is, but then use ChatGPT .. which is odd.

Personally i want to build a fairly dumb system though. Ie make a system which can be useful with LLama2 13B or w/e. Something that doesn't require state of the art GPT4+.

If that means compromising on some features that's fine, but at least then it can be truly and fully local.


Exactly this. It's so fast to spin up classifiers now when it used to take weeks to get something working.

What LLMs are you using?


Absolutely, just look at the number of manual data entry jobs on Upwork. IMO one of the superpowers of LLMs is not generating text or images, it's understanding and transforming unstructured data.


What type of analysis do you do on the text? And how is the performance/cost of running vs more specialized models trained for the task?


This isn't our exact field, but it's something similar. Say some of your clients are old publications, some with articles dating back to the 1800s. Nearly all the work is digitized, but searching through the huge, poorly categorized mess is a nightmare. As most old publications are downsizing, they don't have the manpower to curate their archives, yet they are inundated with research requests nearly 24/7. As a service to help these publications maintain their image as an organized, informative keeper of historical records, you could do the following: 1. have an LLM generate a series of tags for every article; 2. generate a summary for every article to improve search results; 3. provide a service to them, or upsold to their clients, where a question/prompt can be run across every article or a subset of articles.

> how is the performance/cost of running vs more specialized models trained for the task

Most models are GNU-licensed, so that's not an issue. But I imagine you meant the age-old question of hosting yourself vs. using OpenAI. The truth is that, as of now, it is not forecast to beat using one of the less intelligent models on OpenAI. On hardware cost alone, yes, but dev time is very expensive. Luckily we're a small company and our CEO sees this as training. Because LLMs are so new, there really isn't a large labor market for them yet. If our devs and engineers get in this early, we can beat others to market as the technology develops and new opportunities come to light. On top of the possible HIPAA, GDPR, or other data-protection laws to follow, which OpenAI has been very cagey about, we do not want to be at the whim of OpenAI or another SaaS provider for a mission-critical part of a vertical. They have talked about deprecating old models. They have also made content changes to their models to placate political critics, without realizing that this pulls the rug out from under developers who need some sense of stability from their product.


Do you have any suggestions about how to start implementing something like this in-house? I'm sitting on thousands of PDFs (that can be trivially turned into text) and it would be really useful to train an LLM on them for information retrieval.

But the dev and computing cost of this feels so huge that I'm not even sure where to start.


My first way of showcasing this was taking a spare computer sitting around the office and writing a little Python script that used an LLM to parse information out of the file names our finance team used to label rebilling invoices. The invoices included the client, payment date, amount, late payment status, etc., written in a convoluted and completely inconsistent file name. The little office PC had 16GB of RAM, so it was usable for an LLM via the CPU, and I just let it run for about 2 days. I continued with my normal work, and when it finished I had an intern spend one whole day validating just 6% of the data and found it to be 97 percent accurate. I made some obvious changes and was able to fill in that 3% gap. (Later we did find a handful of errors, but overall you could consider the validation 99% accurate.)

While it really resonated with my management, I was worried I wouldn't be able to replicate these kinds of results on other projects.

THE ONLY REAL ADVICE I CAN GIVE ON AI PROJECTS IS . . . don't let your management's expectations of LLMs outweigh their capabilities.

I'm sure I speak for many people here whose non-tech-fluent directors get together and think GPT-4 is some sort of deity. GPT-4 is smart (or used to be, at least), I'll give it that, but small locally hosted 7B/13B LLMs are very limited, and people for whatever reason get AI infatuation: the second they finally see you show direct value with it, they lose their shit over its assumed capabilities. You've got to be direct with them that, no matter what dumb video they saw of Sam Altman, what you are proposing is not that. Be very clear about its possible scope, because there is some idiot in your organization who will assume you can programmatically answer prayers. I actually had a guy from our networking team try to raise a concern about the LLM going sentient and us having a "Skynet" problem. Granted, this was back in March 2023, so AI hysteria was a little more rampant, but still.

tl;dr: my recommendation for your PDF project is to run https://github.com/oobabooga/text-generation-webui. If you can get a 30-series GPU in your company, then run a 13B 4-bit model that can pull info, assign tags, and run minor analysis on your text. Otherwise, find a spare 16GB machine and do the same, just over a longer time scale.

Run a prompt that checks for hallucinations: "does the following text make sense?" + previous prompt + extracted text. If yes, keep it; else, have the intern do it.
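
Roughly, that check is just a second call wrapped around the first. Here the `generate` callable stands in for however you call your model, and the prompt wording is only an example:

    def passes_sanity_check(generate, original_prompt: str, answer: str) -> bool:
        """Second pass: ask the model whether its own answer makes sense.

        `generate` is any callable that takes a prompt string and returns the
        model's text output (a llama.cpp wrapper, an HTTP client, etc.).
        """
        check_prompt = (
            "Does the following answer make sense for the task? Reply Yes or No.\n\n"
            f"Task: {original_prompt}\n\nAnswer: {answer}\n\nReply:"
        )
        reply = generate(check_prompt).strip().lower()
        return reply.startswith("yes") or reply.startswith("true")

    # Anything that fails the check goes into a pile for a human (the intern) to review.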

GPT-J 6B is still one of the best models for this because indexing and categorizing are its main strength. Other models are great, but the core idea behind LLMs is that they're just high-level autocomplete.


How accurate is an LLM for this task? I was thinking of using one for analyzing free form PDF text to find a specific element, but I was worried about hallucinations.


Extractive tasks are part of where LLMs shine, and where you get the least amount of hallucination as long as you fine-tune your model.

By fine-tuning the model to extract a specific desired output from the text you give it, it learns that the output always comes from the input, and so you get less random outputs than just by prompting an instruction-tuned model (which was fine-tuned to find the answer in its weights, instead of copying it from the input).
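
For illustration, the training pairs might look something like this; the field name and JSON layout are just one common way to structure extraction examples, not a requirement of any particular fine-tuning library:

    # Each example pairs raw input text with the exact structure we want copied
    # out of it, so the model learns that the answer always comes from the input.
    examples = [
        {
            "prompt": "Extract the product name as JSON.\n"
                      "Text: ...the ACME FooWidget 3000 was evaluated for...\nJSON:",
            "completion": "{\"product_name\": \"ACME FooWidget 3000\"}",
        },
        {
            "prompt": "Extract the product name as JSON.\n"
                      "Text: This report covers the BarCo Model X7...\nJSON:",
            "completion": "{\"product_name\": \"BarCo Model X7\"}",
        },
    ]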


I'm pretty ignorant on which is the best self hosted LLM for such a task or how to fine-tune it. Do you know of any resources on how to set that up?

It seems like llama2 is the biggest name on HN when it comes to self hosting but I have no idea how it actually performs.


You could just try it out if you have the hardware at home.

Grab KoboldCPP and a GGML model from TheBloke that would fit your RAM/VRAM and try it.

Make sure you follow the prompt structure for the model that you will see on TheBloke's download page for the model (very important).

KoboldCPP: https://github.com/LostRuins/koboldcpp

TheBloke: https://huggingface.co/TheBloke

I would start with a 13b or 7b model quantized to 4-bits just to get the hang of it. Some generic or story telling model.

Just make sure you follow the prompt structure that the model card lists.

KoboldCPP is very easy to use. You just drag the model file onto the executable, wait till it loads and go to the web interface.
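
To give an idea of what "prompt structure" means, here is one common template (Alpaca-style) as a Python string. The exact format for your model is whatever its TheBloke model card lists, so treat this purely as an example:

    # One common prompt template (Alpaca-style). The right template for your model
    # is on its TheBloke model card and may differ (Vicuna, ChatML, etc.).
    ALPACA_TEMPLATE = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:\n"
    )

    prompt = ALPACA_TEMPLATE.format(instruction="Summarize the following article: ...")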


Won't you run out of context size though? The older models only went up to 2000 tokens, newer ones up to 16k.

Ie how do you feed the LLM the text along with your question without it forgetting most of the text? I assume the text you want to feed it is longer than 16,000 words.


For my use-case the PDFs are only a few pages long generally, so I think the 16k word limit would be well within my needs. I'm trying to find a list of device names from an FDA 510k summary (for medical device clearances). Currently I'm doing this manually and it's quite time consuming. I have around 15,000 PDFs to get through manually, but it's pretty slow work.


I assume asking for "quantifiable fields" is akin to requesting "return the data in JSON format", yes?

How do you do the tagging bits, though?


Cheaper than GPT-3? Can you give a comparison of the costs?


GPT-3 is very expensive if you use it frequently, compared to just running on a desktop machine you already have. Of course, if you're buying new hardware just to run a model for yourself locally, that's a different cost analysis, but I had other reasons to have a decent GPU.

If you have a product that uses an LLM and can get away with one of the open source ones, it's probably cheaper (and definitely lower latency/response time) to host it yourself too, somewhere like Azure or AWS.


May I ask what your stack is?


Nothing complex actually but it is a little messy and cobbled together

I run oobabooga's API on docker with a 13B 4bit quantized model. https://github.com/oobabooga/text-generation-webui
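
For reference, a call against that API looks roughly like this; the port, path and field names are from the repo's api-examples as I remember them, so treat them as assumptions and check your version:

    import requests

    # Legacy blocking API exposed when text-generation-webui is started with --api.
    # Endpoint and field names may differ between versions; see the repo's api-examples.
    resp = requests.post(
        "http://localhost:5000/api/v1/generate",
        json={
            "prompt": "Assign 3 topic tags to the following text: ...",
            "max_new_tokens": 64,
            "temperature": 0.1,
        },
        timeout=120,
    )
    print(resp.json())  # the generated text is inside the returned JSON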

We use RTX 3060s because they're the best bang for the buck in terms of VRAM. Our current setup is mostly proof of concept, used for internal office work while we work on scaling and getting a fluid handler built so it can distribute workloads across the multiple GPUs.

Luckily the crypto mining community laid the groundwork for some of the hardware.


> Luckily the crypto mining community laid the groundwork for some of the hardware.

Are you using riser cards to connect the GPUs to the motherboards, then? I thought about trying a setup like yours, but was worried that the riser card interfaces would create a bottleneck. Ideally I'd like to run some cards in a separate box and connect them to my main computer through some kind of cable interface, but I'm not sure if that's possible without seriously affecting performance.


It takes searching and experimenting to figure out what works, and to avoid some of the sketchier stuff (and to lean towards things you could legally use for a startup), but I'm pretty happy with my current home setup, on an old PC with RTX 3090 and 64GB main RAM.

8-bit quantized uncensored Llama 2 13B, doing 50 generated tokens/second, using CPU+GPU including 17GB of 3090's 24GB VRAM.

I also have quantized 70B running currently CPU-only, but I might later be able to speed that up with some CUDA or OpenCL offloading.

This is on Debian Stable (like usual), albeit currently with closed Nvidia CUDA stack, and necessarily with the closed Llama 2 that I can only fine-tune atop. (I'm hoping that some scientific/academic non-profit/govt effort will be able to muster fully open models in the future.)

One of the main reasons I picked Llama 2 was the relatively friendly licensing (and Meta is earning lots of goodwill with that). With this licensing, and the performance I'm getting, in theory, I could even shoestring bootstrap an indie startup with low online LLM demands, from a single consumer hardware box in the proverbial startup garage or kitchen table. (Though I'd try to get affordable cloud compute first.)
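
(For anyone curious what the CPU+GPU split looks like in code, here is a rough llama-cpp-python sketch; this isn't necessarily my exact stack, and the model path and layer count are placeholders:)

    from llama_cpp import Llama  # pip install llama-cpp-python, built with GPU support

    llm = Llama(
        model_path="./models/llama-2-13b.q8_0.bin",  # placeholder: any quantized model file
        n_gpu_layers=40,  # how many transformer layers to offload into the 3090's VRAM
        n_ctx=2048,
    )

    out = llm("Why host your own LLM?", max_tokens=64)
    print(out["choices"][0]["text"])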


I am about to start working on a non-profit project -- not a startup, but similar in terms of resources dedicated to the project and how we hope it will scale.

One of our big questions is whether it makes sense to rent or to buy for training/finetuning/RLHF. The advantage of renting is obvious: I don't think that this phase of the project will last very long, and if it turns out that the idea is a success we'll have no problem securing funding for perma-improvement infra.

The possible advantage of buying is that we would then have the hardware available for inference hosting. We do expect some amount of demand in perpetuity. Having that ongoing cost as small as possible would allow us to continue serving the "clients" we KNOW would benefit a lot from our service with minimal recurring revenue.


Just a suggestion, but they have 4-bit quantized models that are even smaller and faster than the 8-bit ones. Your average 13B 4-bit model is only about 8-9GB of VRAM. Using this I bet you could get a much higher-parameter model on the 3090.


I was using various 4-bit quantized earlier, but decided to go back to 8-bit for 13B, since I had the VRAM anyway, and (at the time, for other reasons) was seeing some quirky behavior.

70B is currently 4-bit on this box, and once I have GPU accel for 70B, I'll see how the quality compares to 13B 8-bit.


Whoa, 50 tokens/second locally sounds amazing. Any recommendations on guides or documentation for setting up the stack to run on hardware like that?


I’m pretty amazed by how good 13B models are since they’ve gotten the orca treatment. This new one released today has the best evaluation performance of all so far and is in some ways comparable or better than the original LLaMA-65b… a bit shocked by that.

https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B

https://huggingface.co/spaces/Open-Orca/OpenOrca-Platypus2-1...


The model has also been quantized if you have less than 26GB of RAM: https://huggingface.co/TheBloke/OpenOrca-Platypus2-13B-GGML


I self-host an LLM (Vicuna 13b) for two reasons. One is cost, and the second is privacy. I don't want OpenAI or any other provider knowing what I'm working on, because they could replicate it. I'm not saying that they would, considering there could be thousands of business cases for using an LLM, but why risk it? By running it locally I have one less thing to worry about.


I built a simple, asynchronous, serialized API on top of llama.cpp for exactly this reason. https://github.com/edfletcher/llama.http/tree/master/example... It can run on low-resource VPSes even, if you have the patience for CPU inference (which will take awhile)!


One benefit of self-hosting LLMs is the wide range of fine-tuned models available, including uncensored models. A popular one over the last weeks was Llama 2 Uncensored by George Sung: https://ollama.ai/blog/run-llama2-uncensored-locally

A few more:

- Wizard Vicuna 13B uncensored

- Nous Hermes Llama 2

- WizardLM Uncensored llama2


How could you _not_?

I'm only waiting for decent LLMs to be usable locally before I start really using them.

No way I'd feed my code, customer data, personal info, secrets, emails, etc., to some dubious cloud machine which is already excellent at extracting valuable or juicy bits from what I'm feeding it.


+1

I have a little quip I use to troll people who say in effect- but but contract law, your contract says your data is yours even when it is on someone's cloud servers- I say- I know, I know, but remember "possession is 9/10ths of the law."

I do believe that in 99.(many 9s)% of cases no admin with visibility cares about any given customer's stuff, but if they do and if it matters, by then it's too late.


I've been thinking about hosting my own LLM to see if I can hyper-customize it to me, basically like an AI companion. My main issue is building the hardware; there's so much fluff in that space, it's hard to know what to get and what works well together.


- Intel Core i9-11900KF 3.5 GHz 8-Core Processor

- Corsair H150i PRO 47.3 CFM Liquid CPU Cooler

- MSI MPG Z590 GAMING EDGE WIFI ATX LGA1200 Motherboard

- G.Skill Ripjaws V 64 GB (4 x 16 GB) DDR4-3600 CL18 Memory

- Samsung 970 Evo Plus 1 TB M.2-2280 PCIe 3.0 X4 NVME Solid State Drive

- MSI GeForce RTX 3090 TI SUPRIM X 24G GeForce RTX 3090 Ti 24 GB Video Card

- Corsair Carbide Series 275R ATX Mid Tower Case

- Corsair RM1000x (2021) 1000 W 80+ Gold Certified Fully Modular ATX Power Supply

- Microsoft Windows 11 Pro

On this setup I've been able to run every model 13B and below with 0 issue. Even been able to fine-tune Llama 2 13B using my own data (emails, SMS, FB messages, WhatsApp, etc.) with pretty fun results!


What was your stack for doing the fine tuning with your own data? I've been looking around at various different approaches and I think I have a general idea of how to approach it (generating embeddings, putting them in a vector DB, somehow linking that to a useable UI, etc.). Would definitely be keen to understand what your solution looks like though!


I am interested in this as well. Pretty please!!!


> On this setup I've been able to run every model 13B and below with 0 issue. Even been able to fine-tune Llama 2 13B using my own data (emails, SMS, FB messages, WhatsApp, etc.) with pretty fun results!

Is it useful? Would you let it draft responses at you? Curious about the fun results :)


I haven't tried self-hosting due to hesitation around the general drag I've experienced in the past trying to host other ML models.

Find a repo. follow the install instructions. What is this weird error? A library issue..? Maybe it's my OS..?

It always seems to be tedious compared to open projects in other domains. Maybe that can't be solved.


I replied to a comment in another post yesterday on this - https://news.ycombinator.com/item?id=37120346

Honestly the easiest way that “just works” is to use LM Studio which you can run locally https://lmstudio.ai/

Obviously you’ll have faster results if you have a fancy gaming GPU or something like the M2 Max/Ultra but you don’t need those to have a play and see if it interests you.


I just use GPT4All [0], both the GUI and the Python bindings [1].

[0] https://gpt4all.io/index.html [1] https://docs.gpt4all.io/


Llama models have been pretty easy to host. StableDiffusion was a real nightmare when it came out (and still is at times).

Using docker has an initial threshold you have to get over but once you do, everything becomes very easy in it. How you end up using docker matters very little once you get the concepts.


In my experience, people specialized in machine learning are usually researchers and mathematicians, not engineers. Writing a package that will work on any random person's hardware and system is a non-trivial engineering task.


You should run your own LLM if you can. Just keep in mind that many hobby users simply cannot do that; they represent the majority of LLM users - not the majority of power users. These people will struggle to use LLMs without some technical support, not because they cannot learn - of course they can - but mostly because it is not their priority. LLMs as a technology need to be made more widely accessible: by making them open source, so folks can run their own instances if they decide to do so, but also by hosting them and providing them as a service for those who simply do not have the skills or desire to run them themselves.


A few things holding me back for now:

- I use LLMs for code generation for a startup and they are not competitive for that yet.

- Most of the popular open models are non-commercial.

- The only practical way I know of to get large custom datasets for training is to have OpenAI's models generate them, and they forbid this in their terms of service.

Having something that's truly open and closer to GPT-4 for code generation will probably happen within less than a year (I hope) and will be a game changer for self-hosting.


ChatGPT is powerful, but it gives you different answers to the same question from one session to the next. And research found that overall performance can vary over time, sometimes for the worse. So you may host your own LLM for reproducibility.

I have not tried public LLMs myself. Do they give reproducible results?


> I have not tried public LLMs myself. Do they give reproducible results?

Public LLMs I don't know but images generated using StableDiffusion are, of course, totally deterministic.

There really is no reason an LLM cannot be deterministic, and if it isn't: fix it (even if this comes at a tiny performance cost).


If you fix the random number seed virtually all LLMs should be deterministic. However, just 1 token difference in the input could produce a very different output, depending on the sampler, model, etc. So, LLMs can be deterministic, but in practice they are pure alchemy.
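
For a self-hosted model this is easy to try; here's a tiny sketch with Hugging Face transformers (gpt2 is just a small stand-in model):

    from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

    set_seed(42)  # fixes the Python/NumPy/PyTorch RNGs so sampling repeats across runs

    tok = AutoTokenizer.from_pretrained("gpt2")            # small stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tok("Why host your own LLM?", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30, do_sample=True, temperature=0.8)
    print(tok.decode(out[0], skip_special_tokens=True))    # same output on every run of this script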


> it gives you different answers to the same question

Sometimes the answer is wrong and then right.

If it is deterministic, then what if it gets "stuck" on the wrong answer?


They do that on purpose. The API gives you a setting (temperature) you can set to 0 for maximum reproducibility or turn up toward 1 for maximum creativity.


That doesn't work for GPT-4 though. https://news.ycombinator.com/item?id=37006224


Except it's never reproducible. It's a bug probably.


I host my own LLM not because I need it now, but to have the luxury of falling back on it if OpenAI runs out of money.


I run Llama 7b with CPU only. It is fun when my Internet goes down and I have nothing else to do.


They are a bit of an "offline" network, in a way.


Whether or not it's "better" to host your own, the positive effects of the open source community trimming state-of-the-art LLMs down to the parts that matter and improving efficiency will be good for everyone.


thanks for writing the article: any recommended links on HOW to host your own LLM?





