Is anyone using self hosted LLM day to day and training it like a new employee
100 points by reachableceo 11 months ago | 72 comments
I have this idea to use an LLM daily. Train it on my emails / notes / chats. Have it draft replies, I edit them as needed, and it learns from that.

Is anyone doing anything like that? I have all of the open source stuff downloaded (models, lollms-webui, promptfoo, etc.) and have been experimenting with the interactive chat stuff. Also txtai for semantic search.

That all seems pretty mature / progressing nicely. A few more months and I expect a clear reference stack will emerge.

What about the assistant stack? I'm investing all these resources to self-host and feed in all my data. I want to maximize the ROI.




The most limiting factor I’ve come across is hitting the context window. Eventually your new eager employee starts to forget what you’ve taught them but they’re too confident to admit it.


> Eventually your new eager employee starts to forget what you’ve taught them but they’re too confident to admit it.

Seems very realistic!


No, it would be realistic if after two weeks on the job they start telling you how to run the company.


It should already have a published series of self-help books by that point


Are there methods to "summarize what they've learned" and then replace the context window with the shorter version? This seems like pretty much what we do as humans anyway... we need to encode our experiences into stories to make any sense of them. A story is a compression and symbolization of the raw data one experiences.


Yeah, that's a fairly well-studied one. Most of these techniques are rather "lossy" compared to actually extending the context window. The most likely "real solution" is using various tricks plus finetuning at longer context lengths to extend the window itself.

Here's a bunch of other related methods (with a rough sketch of the summarization idea after the list):

Summarizing context - https://arxiv.org/abs/2305.14239

continuous finetuning - https://arxiv.org/pdf/2307.02839.pdf

retrieval augmented generation - https://arxiv.org/abs/2005.11401

knowledge graphs - https://arxiv.org/abs/2306.08302

augmenting the network with a side network - https://arxiv.org/abs/2306.07174

another long term memory technique - https://arxiv.org/abs/2307.02738
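
For the "summarize what they've learned" question above, a minimal sketch might look like the following. `llm` is a placeholder for whatever local model call you use (llama.cpp, an OpenAI-compatible endpoint, etc.), and the prompt and thresholds are illustrative; as noted, the compression is lossy:

    def compress_history(llm, turns, keep_recent=6, max_chars=8000):
        # Keep recent turns verbatim; fold older turns into a running summary.
        history = "\n".join(turns)
        if len(history) <= max_chars:
            return turns  # still fits, nothing to do

        older, recent = turns[:-keep_recent], turns[-keep_recent:]
        summary = llm(
            "Summarize the key facts, decisions and preferences from this "
            "conversation so a future assistant can continue it:\n\n"
            + "\n".join(older)
        )
        # The summary replaces the older turns, so nuance can be lost.
        return ["[Summary of earlier conversation]\n" + summary] + recent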


this is a fantastically useful comment. thank you filterfiber :)


Is there a realistic way to actually increase the context window?


Yes! The obvious answer is to just increase the number of positions and train for that. This requires a ton of memory, however (attention cost grows with the square of the context length), so most are currently training at 4k/8k and then finetuning to longer lengths, similar to many of the image models.

However, there's been some work to "get extra mileage" out of the current models, so to speak, with rotary position tricks and a few others. These, in combination with finetuning, are the current method many are using at the moment IIRC (rough sketch at the end of this comment).

Here's a decent overview https://aman.ai/primers/ai/context-length-extension/

RoPE - https://arxiv.org/abs/2306.15595

YaRN (based on RoPE) - https://arxiv.org/pdf/2309.00071.pdf

LongLoRA - https://arxiv.org/pdf/2309.12307.pdf

The bottleneck is quickly going to be inference. Since the current transformer models need memory proportional to the context length squared, the requirements go up very quickly. IIRC a 4090 can _barely_ fit a 4-bit 30B model in memory with a 4096-token context length.

From my understanding some form of RNNs are likely to be the next step for longer context. See RWKV as an example of a decent RNN https://arxiv.org/abs/2305.13048
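
To make the position-interpolation trick mentioned above concrete, here's a rough numpy sketch (not any particular implementation) of how RoPE angles get "squeezed" so a model trained at 4k positions can be finetuned to attend over longer sequences; the dimension and scale factor are illustrative:

    import numpy as np

    def rope_angles(positions, dim=128, base=10000.0, scale=1.0):
        # Standard rotary-embedding frequencies; scale < 1 squeezes positions
        # so longer sequences map back into the range seen during training.
        inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
        return np.outer(positions * scale, inv_freq)  # shape (seq_len, dim/2)

    # Trained at 4096 positions, targeting 16384 -> interpolate positions by 4x.
    angles = rope_angles(np.arange(16384), scale=4096 / 16384)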


I’ve absolutely explored this idea but, similar to lossy compression, sometimes important nuance is lost in the process. There is both an art and science to recalling the gently compacted information and being able to recognize when it needs to be repeated back.


If there were something like Objects in OO programming, but for LLMs, would that solve this?

Like a Topic-based Personality Construct where the model first determines which of its “selves” should answer the question, and then grabs appropriate context given the situation.


look up "frames", it's an old concept and also influenced OOP.


The animal brain equivalent isn't summarize a context window to account for limited working memory. It's to never leave training mode to go into inference-only mode. The learned models in animal brains never stop learning.

There is nothing stopping someone from keeping an LLM in online-training mode forever. We don't do that because it's economically infeasible, not because it wouldn't work.


Putting too much information in the context window is counter-productive in my experience. A low signal/noise ratio tends to increase the likelihood of model hallucinations, and we don't want that!

What works in my experience: structuring the task like a human-driven workflow and breaking it down into small steps. Each step could be driven by a small prompt, relevant document fragments (if RAG is used) and condensed essays/tutorials/guides written by a powerful LLM (ideally, GPT-4 pre-Turbo).

Using this approach, you could stay well below 8k token limit even on the most demanding tasks.

(Big contexts are leaky on all LLMs anyway.)


What about some generation-augmented retrieval-augmented generation set-up, where all your conversations are indexed for regular text search, and then you use the LLM's language knowledge to generate relevant search phrases, the results of which are included in the current prompt?
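
Something like that could be a couple dozen lines. A hedged sketch, where `llm` and `text_search` stand in for your own model call and whatever plain-text index you already have (SQLite FTS, a grep over notes, etc.):

    def answer_with_search(llm, text_search, question, k=5):
        # Let the model write the search queries ("generation-augmented retrieval")
        queries = llm(
            "Write three short search queries (one per line) that would find "
            "notes relevant to this question:\n" + question
        ).splitlines()

        hits = []
        for q in filter(None, (q.strip() for q in queries)):
            hits.extend(text_search(q, limit=k))

        # De-duplicate while keeping order, then include the hits in the prompt
        context = "\n---\n".join(dict.fromkeys(hits))
        return llm("Context:\n" + context + "\n\nQuestion: " + question + "\nAnswer:")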


I would imagine that daily "training" here involves something more like RLHF than just appending to a big prompt.


I think you'll need to save good responses (and bad responses that you fixed?) and regularly run more training passes.


Yeah, especially with a large knowledge base I find it important to keep a log of prompts/responses and perform team reviews of both. It’s honestly making more work than it’s saving at the moment with the hope that it’ll be more helpful down the road. On the plus side it’s made the team more interested in tasks around technical documentation and marketing material, so still a win!


The solution is RAG


Funny but also true in real life :-(

I'm starting to feel like the one-eyed king in the land of the blind.

I even remember sometimes when I told people specific things.


I'm pretty interested in this as well. I have moved from Notion to Obsidian for my personal notes, to-do lists and errata in preparation for this, since Obsidian uses local plaintext files.

What I would love to get working at some point is allowing an LLM access to my schedule, notes and goals and then have it help prompt me at appropriate times. "Hey, TJ I noticed you haven't worked out this week, it's sunny today this might be a good time". That sort of thing.

There seems to be good tooling around agents, prompt engineering, RAG, etc. The 'glue' around getting the LLM to help figure out the appropriate time(s) to check in with me is the bit I am missing, but that's probably mostly down to me being an artist and only a very, very junior hobbyist programmer, though.


Having come from Notion also, I LOVE Obsidian for its non-proprietary markdown file structure. Incredibly powerful plugins, too.

FWIW, it's not really what you're seeking... but there is a plugin that allows you to invoke an LLM from within Obsidian (via Ollama): https://github.com/hinterdupfinger/obsidian-ollama

In short, it allows you to set up prompts to act on selected text directly within a file, e.g. 'Summarize this selection as a markdown formatted list of key points', 'Write a PRD', 'Translate to [Language]', 'Run this as a prompt', etc.


That's a pretty slick plugin, I will have to check it out. Thanks for the suggestion.


> What I would love to get working at some point is allowing an LLM access to my schedule, notes and goals and then have it help prompt me at appropriate times. "Hey, TJ I noticed you haven't worked out this week, it's sunny today this might be a good time". That sort of thing.

If you work in a Microsoft world, this is what GraphAPI is all about: enabling access to all the things using your personal authentication token. This includes emails, calendar, OneNote, One Drive, and essentially everything. I've been working on making an easy to use OneNote provider with GraphAPI underneath it that I can use to work with the LLM.

https://learn.microsoft.com/en-us/graph/overview

https://learn.microsoft.com/en-us/graph/use-the-api


interesting, thank you.


I've been thinking about the 'glue' bit too.

I think a cron job run, say, every hour would be good enough. It would just have to collect all the inputs and make a decision about how "valuable" it would be to check in (haven't worked out in 3 days: very valuable), then it's just about connecting it to the right outputs (email, push noti, ...)

The job itself would be cheap and trivial to host on something like lambda
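
As a sketch of that hourly job: everything here is a placeholder (`gather_signals`, `send_push` and `llm` stand in for whatever data collection, notification channel and model call you actually use), and it assumes the model returns well-formed JSON:

    import json

    def hourly_checkin(llm, gather_signals, send_push, threshold=0.7):
        signals = gather_signals()  # e.g. {"last_workout_days": 3, "weather": "sunny"}
        reply = llm(
            "Given these signals, rate from 0 to 1 how valuable it is to nudge "
            "the user right now and draft a one-line message. Respond as JSON "
            'with keys "value" and "message":\n' + json.dumps(signals)
        )
        decision = json.loads(reply)  # assumes valid JSON from the model
        if decision["value"] >= threshold:
            send_push(decision["message"])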


I’m working on this right now, will be open sourcing soon probably :)


That's pretty cool - let us know if you do!


Just be aware of the security implications. Maybe it's unrealistic but what if someone sends you a calendar invite containing a prompt injection? It may not seem like a big issue but at worst (e.g. with github copilot) something like this may lead to remote code execution.


Yeah, it's a good point. I have a feeling we will see a lot of sneaky security issues around LLMs over the next few years, to say the least.


I like it.


Training a local LLM on individual facts is a tricky one. Typically it’s not possible to train with a limited quantity of data and expect the model to generalize on that data well. In context learning generalizes well, but it’s a bad fit for an “employee” model that’s supposed to learn over a long stretch of time.

If your goal is to bake new concepts into the model weights, your only real option is a dataset with that concept being used in a wide variety of contexts.

A more feasible approach, I think, would be retrieval-augmented generation. You’d essentially store your conversations in a database and calculate embeddings as you go. This would allow you to later do a natural language search of the database and insert the most relevant portion of the conversation into your context window.
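
A bare-bones sketch of that approach, using sentence-transformers and an in-memory list instead of a real database; the model name is just a common default, and `remember`/`recall` are illustrative names:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    memory_texts, memory_vecs = [], []

    def remember(text):
        # Store the conversation snippet alongside its embedding
        memory_texts.append(text)
        memory_vecs.append(model.encode(text, normalize_embeddings=True))

    def recall(query, k=3):
        # Cosine similarity (vectors are normalized), highest first
        if not memory_texts:
            return []
        sims = np.array(memory_vecs) @ model.encode(query, normalize_embeddings=True)
        return [memory_texts[i] for i in np.argsort(sims)[::-1][:k]]

Whatever `recall` returns gets pasted into the context window ahead of the new prompt.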


Yeah, I think training for facts in general is kind of problematic since you often have to overfit and the model may lose capability in other areas. I suspect that the only situations where it really makes sense to train on facts are where the facts are very nuanced and require a lot of interpretation, or more often where the facts are just so extensive that they can't be crammed effectively into the context window you're working with. Otherwise, you're better off with a vector db and a well-written prompt.


What if we just train it to respect facts in general, then couldn't we just supply it a list of facts?

Sort of how they made chatGPT way more likely to obey requests?


You can supply the model with a list of facts already, that’s not the problem. Within the context window the model is able to learn and generalize new information.

Fine tuning is very unintelligent in the sense that it doesn’t take the context of the training samples into account. It’s a dumb optimizer that’s trying to minimize next token loss. Gradient descent is not beholden to the behaviors you taught in the instruct fine tune step.


Embeddings can be tricky. They are just an average semantic vector over a chunk of text.

There is a high chance that a plain similarity search (dot product or cosine distance) will bring a lot of noise and junk into the request. And high noise/signal ratio in the context tends to lead to hallucinations.


It’s not perfect, if you know of a better alternative I would genuinely love to hear about it.


If I absolutely need to avoid hallucinations (e.g. when building marketing/sales assistants for businesses), then I let the LLM control and drive the search for the relevant documents.

On a high level:

(1) Give the LLM the ability and enough information to "expand" the user query into multiple search phrases. The search engine will use them to find the most relevant fragments via a form of embedding search.

(2) Get the highest-ranking document fragments and "show" them to the LLM, saying: "These are the results that were found in the document database using your search phrases via embedding similarity. Refine the search."

(3) Repeat that a couple of times, then rank the final documents and combine them for the final answer synthesis (roughly sketched below).
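
A rough sketch of that loop, with `llm` and `embedding_search` as placeholders for the model call and the document store:

    def llm_driven_search(llm, embedding_search, user_query, rounds=2, k=5):
        # (1) expand the user query into several search phrases
        queries = llm(
            "Expand this user question into several short search phrases, "
            "one per line:\n" + user_query
        ).splitlines()

        fragments = []
        for step in range(rounds):
            for q in filter(None, (q.strip() for q in queries)):
                fragments.extend(embedding_search(q, limit=k))
            if step + 1 < rounds:
                # (2) show the hits back to the model and ask it to refine
                queries = llm(
                    "These fragments were found via embedding similarity using "
                    "your search phrases:\n" + "\n---\n".join(fragments)
                    + "\n\nRefine the search: list better phrases, one per line."
                ).splitlines()

        # (3) combine the ranked fragments for the final answer synthesis
        context = "\n---\n".join(dict.fromkeys(fragments))
        return llm("Using only this context:\n" + context + "\n\nAnswer: " + user_query)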


I would like to extend the question: is anyone building a homelab for the specific purpose of training an LLM on their personal info? The choice of hardware (for speed, cost, and noise concerns) seems important.


For me, nothing fancy: I just added extra RAM to a gaming notebook to get enough speed on answers, since it already has a good Nvidia card, and I keep the API open for access from another laptop I have inside my local network.

I have an extra computer for services like filesharing, samba, nfs, git, firewall, etc., for instance caching the models I'm downloading with a squid proxy so I can test several UIs that download the same model over again. Not every UI offers an easy way to set a single folder to store all GGUF files, or the documentation is lacking.

I'm already having a lot of fun. There are people already doing much more than this. I'm more worried about integrating and gluing things in a way that will become transparent after the new year, local models or not.

Also, how to glue this to obsidian/logseq/neovim/etc. in a way I can use with the fewest keystrokes possible, instead of just uploading a gigantic context or sensitive source code files.


I admittedly have not done this myself, but the hardware choices are less complex than you might think. The only truly important part is the GPU, and the number of GPUs on the consumer market that can handle a local LLM is presently quite small. It's pretty much dual-wielding 3090's or 4090's or bust. If you're running but not training a (relatively) small one, you can do well with just one. But if you want to run inference _and_ training on one large enough to be a consistent daily assistant (read: 70b+), cloud hardware for the training is just far more practical and economical. You're generally not gonna wanna drop the $$ for an A100 or two just to do a small training run every couple of weeks on a dataset made up of the amount of RLHF'd conversations one person might have with an LLM in that time.


I think a good use case for this would be auto generation of code documentation. There are many reasons to not want to upload your source code to a cloud AI service, but having an AI that was trained on your local code base so you could ask it “what does this foobar function do anyways?” would be killer.


I was thinking of doing something along the same lines, but with code:

An LLM that is specifically trained for Software Development, to which I feed the code of all my company's repositories, and which I keep feeding commits/pull requests.

The idea is that I can query it about architectural issues, code improvements, and other technical aspects at different levels of abstraction (code, architecture, business, etc).

So far, I've played a bit with CodeRabbit and it's "just ok", but it is more of a very small window into what "could be" than actually useful.


As a solo developer answering emails that basically point people to various guides and FAQs I’ve published … I need this. Zendesk claims to have an AI component but forces you to input all training data into their own wiki knowledge base. I can see why they don’t want to use prior responses as training (pii concerns), but at least give me some boilerplate responses that I can use to get a head start and further train the model(s).


Why not do it the old fashioned way and hire a human for this? Humans also have the advantage that they don't just make up answers when they don't know something (or at least if you hire good ones).

I've had good experience hiring support folks and working with them on a shared inbox (we use HelpScout).


I considered both options for allaboutberlin.com. I want to offer either better search, or personalised advice.

Cost is a huge problem. An immigration lawyer would continuously eat into my personal income. If I don't get state funding to cover it, it makes zero business sense. It would also come with all the liability of a first employee.

A GPT that cites my website, German law and a dozen official websites would be a game changer. It could not give advice, but it could find answers like Phind does. It's just tech, with lower running costs and virtually no obligations (unlike an employee).

I just don't know if the result would be useful and trustworthy enough, and it's very expensive to try.

My conclusion is that I should focus on building a good knowledge base, and when the time is ripe, I can augment it with fancy tech.


he is the human that does this.


Right, but developers are significantly more expensive than customer support agents. Especially if the answers are already documented, and the rep just needs to be friendly in pointing customers to the right documentation.

If you can free up 5 hours of week of dev time by spending $100/week on a support rep, that's a great rate for dev time.

This assumes you're running a profitable business in the first place. If you're still struggling to find paying customers, then yeah it makes sense to do everything yourself.


What do you think of the quality of support provided by someone who spends 5 hours a week answering questions vs the primary developer? This isn't as straightforward as you think. If he can free up 5 hours a week for $500 and loses 10 customers because they didn't like the support from someone who spends 5 hours a week on this product they've never used, do you think he wins or loses?


>What do you think of the quality of support provided by someone who spends 5 hours a week answering questions vs the primary developer?

In my experience, the quality of support is great!

A year into my business, I brought someone in to do customer support for five hours per week, and it freed up a huge amount of time and mental bandwidth.[0]

Since then, I've expanded support to two non-technical customer service reps and two support engineers. It's rare that I answer a question now because my team can generally answer anything I can.

>If he can free up 5 hours a week for $500 and loses 10 customers because they didn't like the support from someone who spends 5 hours a week on this product they've never used, do you think he wins or loses?

I think this is wildly unlikely in practice.

When you bring on a support rep, you don't just say, "Okay, you're on your own. Do the best that you can." You work to onboard them and make sure they're comfortable answering questions and asking for help when they need it. You also course-correct or jump in if they give someone an incorrect or unhelpful answer.

In my two years of having a support team, I can't think of a single time that a customer said they were dropping us because they were dissatisfied with our support.

Instead, what I've found is that the support team often does a better job than I could have because they have more bandwidth to dig deeply into issues.[1] On about 10% of support requests, customers go out of their way to tell us how satisfied they are with the help we gave them.

Documenting things and onboarding teammates is time-consuming and difficult, but the commenter I was originally responding to said that they've already done that work. They described themselves as "a solo developer answering emails that basically point people to various guides and FAQs I’ve published." If they're already at that point, bringing in a support rep is a no-brainer.

In my experience, the majority of support requests I receive are nontechnical. I sell a physical product, so it's probably higher than a pure SaaS, but still about 60% of questions are about billing, or they deleted their invoice by mistake, or they need to see our tax forms for internal paperwork. That's all stuff that a support rep can handle just as well as I can even if they know nothing about my product.

[0] https://mtlynch.io/solo-developer-year-4/#good-leadership-me...

[1] https://forum.tinypilotkvm.com/-826#post-6


You just need embedding based search.


I've been building workflow assistants that make existing employees more productive or enable entirely new business models. Some of these assistants use selected local models (due to cost or privacy factors)

Currently the stack gravitates around:

- GPT-4 - either to drive the entire workflow OR generate prompts, plans and guidelines for the local models to execute.

- structured knowledge bases (either derived from existing sources OR curated manually by companies to drive AI assistants).

- embedding search indexes, augmented by full-text search. Usually the LLM has access to the search engine and can drive the search as needed, refining the queries if results aren't good enough.

All of that is instrumented with logic to capture user feedback at every single step. This is crucial for the continuous improvement of the model!

A bigger model can use this information once in a while to improve plans and workflow guidelines and make the overall process more efficient.

AMA, if needed!


Do you use a framework to pull it all together? Like Langchain etc


LangChain is good for the demos and learning, but it is too complex and brittle for my taste.

I use a bit of boilerplate code (a couple of Python files) that I copy into new projects.


Everyone I know just uses the hosted ones, because of the sheer performance gap.

For now, you can do all the custom/manual training you want, but gpt4 will almost always outperform it with the right context.

Hopefully that will change in the future. Even then, I don't expect people to want to self-host as in on their own machines. More like custom training, then hosting either on SaaS or PaaS, or their own on-prem if they have it. Giving up a personal laptop's performance isn't worth the hit to everything else it has to do. Again, maybe that will change.


It doesn't need to be an exclusive choice. Hosted and local models can complement each other.

On one of the projects I used ChatGPT-4 to write instructions/tutorials that were then executed by a local model on a large chunk of data (cleaning up product catalogues).

GPT-4 then reviewed some results and fine-tuned the instructions.


interesting, what does the local model do?


One example - a local model (Mistral 7B OpenChat 3.5, which ranks high in my benchmarks) was generating additional search keywords for products at an online marketplace. A second run of the model was cleaning up bad keywords.

The fun part here - ChatGPT-4 reviewed some user searches, product details and comments from the marketing department. Then it generated condensed tutorials on writing good keywords for the products in this system (keywords have to cover unexpected search terms that people would use when searching for a product).

The tutorial was supposed to be for the "junior marketing assistant", but in reality it was fed to Mistral 7B.

The second pass was done similarly. "Hey, ChatGPT, these are some sample keywords that are produced by junior marketing assistant according to your tutorial. Review them and write a short guide on correcting most common problems".

It works nicely.
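
As a rough sketch of that two-pass setup, where `local_llm` stands in for the Mistral 7B call, and `tutorial`/`review_guide` are the texts GPT-4 generated for the "junior marketing assistant":

    def generate_keywords(local_llm, tutorial, product):
        # First pass: follow the GPT-4-written tutorial to propose keywords
        return local_llm(
            tutorial + "\n\nProduct:\n" + product
            + "\n\nList search keywords for this product, one per line."
        ).splitlines()

    def clean_keywords(local_llm, review_guide, product, keywords):
        # Second pass: apply the review guide to drop bad keywords
        return local_llm(
            review_guide + "\n\nProduct:\n" + product
            + "\nKeywords:\n" + "\n".join(keywords)
            + "\n\nReturn only the keywords worth keeping, one per line."
        ).splitlines()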

Other cases where local models work well - custom embeddings (multi-lingual, mapped to the same vector space), custom TTS, custom STT. These are mostly used for specialised personal assistants.


Cool use case, glad to see txtai [1] is helping (I'm the main dev for txtai).

Since you're using txtai, this article I just wrote yesterday might be helpful: https://neuml.hashnode.dev/build-rag-pipelines-with-txtai

Looks like you've received a lot of great ideas here already though!

1 - https://github.com/neuml/txtai
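
For reference, a minimal txtai-flavoured RAG sketch along the lines of that article; the embedding model is just a common default, `llm` is a placeholder for whatever local model call you use, and the note texts are made up:

    from txtai.embeddings import Embeddings

    embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2",
                             "content": True})
    notes = ["Meeting notes from Monday ...", "Draft reply to the supplier ..."]
    embeddings.index([(i, text, None) for i, text in enumerate(notes)])

    def ask(llm, question, k=3):
        # Pull the closest note fragments and prepend them to the prompt
        context = "\n".join(r["text"] for r in embeddings.search(question, k))
        return llm("Answer using this context:\n" + context
                   + "\n\nQuestion: " + question)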


Thanks. I built something similar (didn't know it was called RAG) about 6 months ago. I found the most difficult part was interfacing with the existing systems and extracting the documentation out of them: Google Docs, Notion, Slack. Do you have any advice on easier ways to do this? Are there any libraries around that make this task a bit lighter?


Well for Google Docs, Notion, and Slack they all have APIs that are pretty straightforward to use.


Local models have taken a mind-boggling leap over the past months, so I'm sure we'll be able to add layers soon by ourselves, even on a laptop?

Seriously, this is not far from ChatGPT 3.5 in only 6.7 GB, and it runs on a MacBook Air:

https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF

But yeah current context windows are limiting.


I was testing this one a few days ago. It seems fine to use as a base for extra finetuning, but it failed hard questions that ChatGPT nailed.

One example was trying to use it as an assistant to beat long games without immediate rewards.

I was trying to log and simultaneously get feedback playing Stardew Valley. gpt-3.5-turbo-1106 basically went along with me and my daughter in a coop session giving nice suggestions, sometimes with huge gaps, but easy enough to ask more about after giving more context.

Mistral 7B and 13B were basically mixing up Stardew Valley with WoW and Genshin Impact, even after giving a lot of context about the day I was on, what the NPCs answered, or things that I know about how to solve a certain quest. They straight up made up non-existent towns (Stardew Valley only has one), etc.

I was running the model on a separate gaming notebook, with nvidia, while playing the game on the one I'm using now.


True, and it makes sense that the logic is closing in, but the breadth of the data in 7 GB is too narrow to ask questions about niche topics.

Mistral hasn't released their own official 13B/30B's yet, but i'm really looking forward to what they can do.

What is crazy is that UltraFastBERT, speculative, Jacobi, or lookahead decoding could potentially speed things up by up to 80x depending on size, which could make GPT-4-like models feasible on entry-level Macs/phones if similar wizardry is done memory-wise.

Yes, I'm very optimistic after the insane progress over the last months with models like Mistral, Deepseek, etc.


With RAG and fine tuning (which is cheap), you can fine tune a model on a daily basis so that one isn't trying to stuff everything in the context window.


I think with RAG it's pretty reasonable. Put your corpus in Pinecone or some other vector store, and relevant sections get injected along with your prompt, which lessens the burden on the context window.


I've been using & contributing to Lightrail (https://github.com/lightrail-ai/lightrail). Each instance comes with a local vectorDB and integrates with apps like Chrome & VSCode, so I can read in content like my notes, emails, etc. It doesn't support self-hosted LLMs yet unfortunately!


There's Rewind.ai for macOS, which tracks all audio, video, and text it can see as you work, then lets you query it via its local LLM chat. Works pretty well. Also can summarize meetings and work with your calendar in certain ways. It does not use local documents you have not viewed on-screen; it doesn't index your file directories.


I have a similar goal/desire as you. My current project is ingesting this type of data into Elasticsearch with vector embeddings and using a normal search+knn to generate some context when creating a prompt.

This works reasonably well with gpt4, but my context is almost always too large for self-hosted models so far.
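
As an illustrative sketch of that search+knn context building, assuming Elasticsearch 8.x, a dense_vector field called "embedding" and a text field called "body" (the index and field names are made up):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    def build_context(query_text, query_vector, k=5):
        resp = es.search(
            index="personal-notes",
            knn={"field": "embedding", "query_vector": query_vector,
                 "k": k, "num_candidates": 10 * k},
            query={"match": {"body": query_text}},  # blended with lexical search
            size=k,
        )
        return "\n---\n".join(hit["_source"]["body"] for hit in resp["hits"]["hits"])

The returned context then gets prepended to the prompt.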


This is something I imagine coming out of Autogen or OpenAI Assistants in a few months. You really need multiple agents (as of now) most of the time. IMO multiple GPT-4 agents ARE smart enough to accomplish a lot; it's getting them working together and set up that's the issue.


But you would lose the benefit of self hosting

