Show HN: How we leapfrogged traditional vector based RAG with a 'language map' (twitter.com/mutableai)
162 points by oshams 3 months ago | 55 comments
TL;DR: Vector-based RAG performs poorly for many real-world applications like codebase chats, and you should consider 'language maps'.

Part of our mission at Mutable.ai is to make it much easier for developers to build and understand software. One of the natural ways to do this is to create a codebase chat that answers questions about your repo and helps you build features.

It might seem simple to plug your codebase into a state-of-the-art LLM, but LLMs have two limitations that make human-level assistance with code difficult:

1. They currently have context windows that are too small to accommodate most codebases, let alone your entire organization's codebases.

2. They need to reason immediately to answer any questions without thinking through the answer "step-by-step."

We built a chat about a year ago based on keyword retrieval and vector embeddings. No matter how hard we tried, including training our own dedicated embedding model, we could not get good performance out of the chat.

Here is a typical example: https://x.com/mutableai/article/1813815706783490055/media/18...

If you asked how to do quantization in llama.cpp, the answers were oddly specific and consistently pulled in the wrong context, especially from tests. We could, of course, take countermeasures, but it felt like a losing battle.

So we went back to step 1, let’s understand the code, let’s do our homework, and for us, that meant actually putting an understanding of the codebase down in a document — a Wikipedia-style article — called Auto Wiki. The wiki features diagrams and citations to your codebase. Example: https://wiki.mutable.ai/ggerganov/llama.cpp

This wiki is useful in and of itself for onboarding and understanding the business logic of a codebase, but one of the hopes for constructing such a document was that we’d be able to circumvent traditional keyword and vector-based RAG approaches.

It turns out using a wiki to find context for an LLM overcomes many of the weaknesses of our previous approach, while still scaling to arbitrarily large codebases:

1. Instead of context retrieval through vectors or keywords, the context is retrieved by looking at the sources that the wiki cites.

2. The answers are based both on the section(s) of the wiki that are relevant AND the content of the actual code that we put into memory — this functions as a “language map” of the codebase.
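To make that concrete, here is a minimal sketch of what citation-driven retrieval can look like (illustrative only, not our production code; WikiSection and the two callables are stand-ins):

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class WikiSection:
        title: str
        body: str              # prose explaining one part of the codebase
        citations: list[str]   # file paths (or path:line ranges) the section cites

    def build_context(question: str, wiki: list[WikiSection],
                      pick_sections: Callable, read_source: Callable) -> str:
        """Assemble an LLM prompt from the wiki 'language map' plus the code it cites.

        pick_sections selects the wiki sections relevant to the question (this can
        itself be an LLM call); read_source returns the text of a cited file/range.
        """
        parts = []
        for section in pick_sections(question, wiki):
            parts.append(f"## {section.title}\n{section.body}")
            for cite in section.citations:
                parts.append(f"### Cited source: {cite}\n{read_source(cite)}")
        return "\n\n".join(parts)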

See it in action below for the same query as our old codebase chat:

https://x.com/mutableai/article/1813815706783490055/media/18...

https://x.com/mutableai/article/1813815706783490055/media/18...

The answer cites its sources in both the wiki and the actual code, and gives a step-by-step guide to doing quantization with example code.

The quality of the answer is dramatically improved - it is more accurate, relevant, and comprehensive.

It turns out language models love being given language, not a bunch of text snippets that happen to be nearby in vector space or share certain keywords! We find strong performance consistently across codebases of all sizes. The results from the chat are so good they even surprised us a little bit - you should check it out on a codebase of your own at https://wiki.mutable.ai; we are happy to do this for free for open source code, and private repos start at just $2/mo/repo.

We are introducing evals demonstrating how much better our chat is with this approach, but we were so happy with the results that we wanted to share them with the whole community.

Thank you!




I agree that many AI coding tools have rushed to adopt naive RAG on code.

Have you done any quantitative evaluation of your wiki-style code summaries? My first impression is that they might be too wordy and not deliver valuable context in a token-efficient way.

Aider uses a repository map [0] to deliver code context. Relevant code is identified using a graph optimization on the repository's AST & call graph, not vector similarity as is typical with RAG. The repo map shows the selected code within its AST context.

Aider currently holds the 2nd highest score on the main SWE Bench [1], without doing any code RAG. So there is some evidence that the repo map is effective at helping the LLM understand large code bases.

[0] https://aider.chat/docs/repomap.html

[1] https://aider.chat/2024/06/02/main-swe-bench.html
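For anyone curious what a "graph optimization on the AST & call graph" can look like in miniature, here is a toy, stdlib-only sketch of the general idea (not Aider's actual implementation, which uses tree-sitter and a proper graph ranking over the call graph): rank symbols by how many other files reference them.

    import ast, collections, pathlib

    def rank_symbols(repo_dir: str) -> list[tuple[str, int]]:
        """Rough repo-map ranking: count how many files reference each
        top-level function/class defined elsewhere in the repo."""
        trees = {}
        for path in pathlib.Path(repo_dir).rglob("*.py"):
            try:
                trees[path] = ast.parse(path.read_text(encoding="utf-8"))
            except (SyntaxError, UnicodeDecodeError):
                continue
        defs = {}                                  # symbol name -> defining file
        for path, tree in trees.items():
            for node in tree.body:
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                    defs[node.name] = path
        refs = collections.Counter()               # symbol name -> number of referencing files
        for path, tree in trees.items():
            names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
            for name in names:
                if name in defs and defs[name] != path:
                    refs[name] += 1
        return refs.most_common()                  # most-referenced symbols first

The most-referenced symbols, shown in their AST context, make a compact map that fits in the prompt.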


I've been thinking about this a lot recently. So in Aider, it looks like "importance" is based on just the number of references to a particular file, is that right?

It seems like in a large repo, you'd want to have a summary of, say, each module and its main functions, and allow the LLM to request repo maps of parts of the repo based on those summaries. E.g., in my website project, I have a documentation module, a client-side module, a server-side module, and a deployment module. It seems like it would be good for the AI to be able to determine that a particular request requires changes to the client and server parts, and just request those.


The repo map is computed dynamically, based on the current contents of the coding chat. So "importance" is relative to that, and will pull out the parts of each file which are most relevant to the task at hand.


Interesting, how does Aider decide what’s relevant to the chat?


I had forgotten that Aider uses tree-sitter for syntactic analysis. Happy to find you've got the tree-sitter queries ready to retrieve code information from source. I was researching how to write the queries myself, for exactly the same purpose as Aider.


I tried using Aider, but my codebase is a mix of Clojure, ClojureScript, and Java. I gave up trying to make it work for me, as it created more issues than it solved. What I really hated about Aider was that it made code changes without my approval.


You might be interested in my project Plandex[1]. It’s similar to aider in some ways, but one major difference is that proposed changes are accumulated in a version-controlled sandbox rather than being directly applied to project files.

1 - https://github.com/plandex-ai/plandex


You can give 16x Prompt a try. It's a GUI desktop app designed for AI coding workflows. It also doesn't automatically make code changes.

https://prompt.16x.engineer/


The recommended workflow is to just use /undo to revert any edits that you don’t like.


I've been curious about this use case, so cool to see, and more so, to know it worked!

This is essentially a realization of how graph RAG flavor systems work under the hood. Basically you create hierarchical summary indexes, such as topical cross-document ones, and tune the summaries to your domain. At retrieval time, one question can leverage richer multi-hop concepts that span ideas that are individually and lexically distinct but get used together. Smarter retrievers can choose to dynamically expand on this (agentic: 'follow links') or work more in bulk on the digests ('map/reduce over summaries') without having to run every chunk through the LLM.

Once you understand what is going on in core graph rag, you can even add non-vector relationships to the indexing and retrieval steps, such as from a static code analysis, which afaict is the idea here. For a given domain, likewise, you can do custom templates to tune what is in each summary, like different wiki page styles for different topics. (Note: despite the name & vendor advertising, no graph DB nor knowledge graph is needed for graph RAG, which makes its relationship to autowiki etc concepts less obvious.)
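A single level of that summary hierarchy, stripped to its bones, might look like the sketch below (my own illustration, not louie.ai or Auto Wiki code); for large corpora you nest it, summarizing the summaries:

    def build_digests(docs: dict[str, str], summarize) -> dict[str, str]:
        """One index level: a short digest per document (file, module, page...)."""
        return {name: summarize(text) for name, text in docs.items()}

    def retrieve(question: str, docs: dict[str, str],
                 digests: dict[str, str], choose) -> str:
        """'Map/reduce over summaries': `choose` (typically an LLM call over the
        digests) picks which documents matter, then we pull in their full text."""
        picked = choose(question, digests)          # -> list of document names
        return "\n\n".join(docs[name] for name in picked)

Non-vector relationships, like edges from static analysis, can be threaded in by letting `choose` also follow links between documents.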

We are building out some tech here to deal with core production issues like updating/adding items without reindexing everything and making larger ingests faster and cheaper. E.g., imagine monitoring a heavy feed or a quickly changing repo. If this is of interest to anyone, please ping us - we are putting together design partner cohorts for the RAG phase of louie.ai.


Nice!


I've been working on Webwright[1] for a month, after having prototyped a few different terminal solutions for coding agents. Webwright manifests as a pseudo-terminal in PowerShell or the terminal on macOS.

Using Claude.AI, I determined the `ast` package would be suitable (for Python scanning), so I had Webwright author a new function module to scan the project and assemble a list of functions, function calls, imports, and decorators. I then installed the function module and relaunched the app.
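Roughly the sort of thing such a scanner boils down to (a standalone sketch, not Webwright's actual module):

    import ast, pathlib

    def scan_project(root: str) -> dict[str, dict[str, list[str]]]:
        """Per file, collect function definitions, call targets, imports, and decorators."""
        report = {}
        for path in pathlib.Path(root).rglob("*.py"):
            try:
                tree = ast.parse(path.read_text(encoding="utf-8"))
            except (SyntaxError, UnicodeDecodeError):
                continue
            info = {"functions": [], "calls": [], "imports": [], "decorators": []}
            for node in ast.walk(tree):
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    info["functions"].append(node.name)
                    info["decorators"] += [ast.unparse(d) for d in node.decorator_list]
                elif isinstance(node, ast.Call):
                    info["calls"].append(ast.unparse(node.func))
                elif isinstance(node, (ast.Import, ast.ImportFrom)):
                    info["imports"].append(ast.unparse(node))
            report[str(path)] = info
        return report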

It produced the following as a request to explain how the system works for y'all:

WebWright uses modular function calls to handle various tasks such as user input processing, file operations, git management, and AI interactions. The main application routes commands to specific functions, which in turn perform discrete tasks like reading files, managing containers, and querying AI APIs. This architecture ensures clear separation of concerns, maintainability, and scalability while efficiently handling complex operations asynchronously.

The heavy lifting in WebWright is primarily done in the AI handlers and utility modules:

1. AI Handlers (`lib.aifunc`):

   - The core AI functionality, like processing user queries and interacting with AI APIs (OpenAI, Anthropic), performs the most computationally intensive tasks.

   - Managing asynchronous API calls, error handling, and processing large datasets.

2. Utility Modules:

   - Modules like lib.functions.filesystem, lib.functions.git_*, and lib.functions.manage_app_container handle substantial operations such as file I/O, version control, and Docker container management.

These components shoulder the bulk of the computational and I/O work, ensuring efficient task execution and resource management.

[1] https://github.com/MittaAI/webwright


That's awesome, thanks!


This is literate programming! Why not just put the codebase in the wiki and not have two representations diverging? Why can't we have diagrams and links in code??? We can, like https://observablehq.com notebooks; it's a better representation for understanding.


patience ... all part of our plan :)


Hey, cofounder at Mutable.ai here.

I want to encourage you all to ask the chat some tough questions. You can ask very complex and general questions. Some examples:

- Ask ollama (https://wiki.mutable.ai/ollama/ollama) how to add a new model

- Ask langchain (https://wiki.mutable.ai/langchain-ai/langchain) "How can I build a simple personal assistant using this repo?"

- Ask flash attention (https://wiki.mutable.ai/Dao-AILab/flash-attention) "What are the main benefits of using this code?"

It is also useful for search - for example, if you ask langchain "Where is the code that connects to vector databases?" it will surface all the relevant information.

Very curious to hear what you ask (and whether you find the response helpful)!


I've been building LLM-driven systems for customers for quite some time. We got tired of hallucinations from vector-based and hybrid RAGs last year, eventually arriving at an approach similar to yours.

It is even called Knowledge Mapping [1]. It works really well, and customers can understand it.

Probably the only difference from your approach is that we use different architectural patterns to map domain knowledge into bits of knowledge that LLMs will use to reason (Router, Knowledge Base, Search Scope, Workflows, Assistant, etc.).

My contacts are in the profile, if you want to bounce ideas!

[1] English article: https://www.trustbit.tech/en/wie-wir-mit-knowledge-maps-bess...


Looks similar to what we're doing in Pythagora with CodeMonkey agent (prompt: https://github.com/Pythagora-io/gpt-pilot/blob/main/core/pro..., code: https://github.com/Pythagora-io/gpt-pilot/blob/main/core/age...)

I think everyone who's seriously tackled the "code RAG" problem is aware a naive vector approach doesn't work, and some hybrid approach is needed (see also Paul's comments on Aider).

Intuitively, I expect a combo of lsp/treesitter directed by LLM + vector-RAG over "wiki" / metadata would be a viable approach.

Very exciting to see all the research into this!


This sort of approach always made more sense to me than RAG. I am less likely to try RAG than something that feeds the LLM what it actually needs. RAG risks providing piecemeal information that confuses the LLM.

The approach I thought would work, and would like to try out, is to ask the LLM what info it wants next from an index of contents, like a book's table of contents. That index can be LLM-generated or not. Then backtrack, since you don't need that lookup in your dialogue any more, and insert the result.

It won't work for everything, but it should work for many "small expert" cases, and then you don't need a vector DB; you just do prompts!

Cheap LLMs perhaps make this more viable than it used to be. Use a small open source LLM for the decision making, then a quality open source or proprietary LLM for the chat or code gen.
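A minimal sketch of that loop (ask_llm and fetch_section are placeholders for whatever model call and lookup you use):

    def answer_with_index(question: str, index: dict[str, str],
                          ask_llm, fetch_section, max_lookups: int = 3) -> str:
        """Let the model pick entries from a table of contents, fetch them,
        then answer with only the fetched material in context."""
        toc = "\n".join(f"- {key}: {summary}" for key, summary in index.items())
        gathered = []
        for _ in range(max_lookups):
            choice = ask_llm(
                f"Question: {question}\n\nTable of contents:\n{toc}\n\n"
                f"Already fetched: {[k for k, _ in gathered]}\n"
                "Reply with the single entry you want to read next, or DONE."
            ).strip()
            if choice == "DONE" or choice not in index:
                break
            gathered.append((choice, fetch_section(choice)))
        context = "\n\n".join(f"# {key}\n{text}" for key, text in gathered)
        return ask_llm(f"Use this material to answer.\n\n{context}\n\nQuestion: {question}")

The intermediate "which section next?" turns are dropped from the final prompt, which is the backtracking step.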


It's still RAG, just the R in RAG is not vector-based anymore, no?


You're right. Many people take a mental shortcut and assume that RAG is a vector DB search. Any kind of retrieval is retrieval. You can do keyword search. You can do a PageRank-like query. You can sort content by date and send the most recent items to the LLM. It's all retrieval. That is the R in Retrieval-Augmented Generation.


What you describe sounds like Agentic RAG: https://zzbbyy.substack.com/p/agentic-rag


> The traditional way to do RAG is to find information relevant to a query - and then incorporate it into the LLM prompt together with the question we want it to answer.

Technically this is incorrect. The original RAG paper used a seq2seq generator (BART) and involved two methods: RAG-Sequence and RAG-Token.

RAG-Sequence uses the same retrieved documents for the whole output and appends them to the input query (note, this is different from a decoder-only model). RAG-Token generates each token based on a potentially different document.
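For reference, the distinction comes down to where the marginalization over the retrieved documents z happens (sketched here from memory of the paper, so check the original for the exact notation):

    RAG-Sequence:  p(y|x) ≈ Σ_z p(z|x) · Π_i p(y_i | x, z, y_<i)    (one set of documents for the whole output)
    RAG-Token:     p(y|x) ≈ Π_i Σ_z p(z|x) · p(y_i | x, z, y_<i)    (documents marginalized per token)

where z ranges over the top-k retrieved documents.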

I only nitpick this because if someone is going to invent new fancy-sounding variants of RAG they should at least get the basics right.


Traditional does not equate to original. The original technique was never widely used and cannot be called the traditional way.


Thanks!



Nice job on this, it’s a really interesting approach. I’ve been developing an open-source coding agent over the past year, and RAG just wasn’t working at all. I switched to a repo map approach (which sounds similar to what aider is doing) and that helped a bit but still wasn’t great.

However, a few weeks ago I built an agent that takes in a new GitHub issue and is given a variety of tools to do research on the background information needed to complete the issue. The tools include internet searches and clarifying questions to ask the person who wrote the ticket. But the most useful tool is the ability to look at the codebase and create a detailed markdown file listing the relevant files, explanations of what each file does, relevant code samples or snippets from those files, etc.
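In sketch form, that tool set might be registered something like this (names and structure are just an illustration, not the actual JACoB code):

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Tool:
        name: str
        description: str            # shown to the LLM so it can decide when to call the tool
        run: Callable[[str], str]   # takes the agent's query, returns text for the context

    def make_research_tools(search_web, ask_reporter, digest_codebase) -> list[Tool]:
        """Bundle the research capabilities; the three callables are placeholders."""
        return [
            Tool("internet_search", "Search the web for background information.", search_web),
            Tool("ask_ticket_author", "Ask the issue author a clarifying question.", ask_reporter),
            Tool("codebase_research",
                 "Produce a markdown digest of relevant files, what they do, and key snippets.",
                 digest_codebase),
        ]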

It’s still early, but anecdotally I’ve seen a huge increase in the quality of the code that uses this research as part of the context (along with the repo map and other details). It’s also able to tackle much more complex issues than it could before.

I definitely think you’re on to something here with this wiki approach. I’ll be curious to dig in and see the details of how you are creating these. Here is my research code if you’re interested: https://github.com/jacob-ai-bot/jacob/blob/feature/agent/src...

And here’s an example of the research output (everything past the exit criteria section): https://github.com/kleneway/jacob/issues/62


Wow, I didn't notice this hit the front page until now, will be answering questions momentarily!


These wikis are really interesting. I'm itching to try it on the common framework parts of our work monorepo.

[Update after looking at the Django wiki]

The wiki's structure appears to be very (entirely?) based on the directory structure. Is that right?

It would be interesting to give it the Django documentation in addition to the codebase, or possibly even just the heading structure of the documentation, and let it use that as a scaffold to generate the structure/organization of the wiki.

For a specific example, the Model is one of the most important concepts in Django, but in the auto-generated wiki, it shows up underneath databases.


Out of curiosity, would you mind pointing it at simonw/datasette? I think that might be interesting, particularly due to the plugin system.



sure thing, running it now!


A prompt here is a textual presentation of a data structure; this is an area I am exploring right now. I wonder whether it would be equally effective if the LLM were given a JSON or Python representation instead of the presumed Markdown. But my intuition is that because LLMs are trained on texts meant for human consumption, they will best follow exactly the texts humans find easier, which means they 'want' the same presentation as humans: nicely formatted text or nicely formatted programs, not mixes.


It wants me to log in to ask a question.

I will just keep using phind.

You have VC dollars - sponsor a free public search over open source repos.

Also think about what happens when your question touches multiple repos.

I tried a similar "search GitHub repo with AI" product before, but it led me right back to phind when it couldn't answer a question that required specific information from the repository as well as a Google search.


It's funny to see this because, from my perspective, we've done this on a pretty scrappy budget. That said, we're thinking of adding open access for select repos. Why is logging in so terrible? We want to get to know our users.


It is tough being the little guy.

But that is not my problem. I am just a consumer. I will use whatever is the best and most pragmatic solution to my problem.

I even logged in to phind after a while because it is nice to have access to my old chats.

But you need to provide value first before I bother with this.

Why do I need to explain these basic things?

Please don't get defensive when you get free feedback. Not a good sign.


This comes across as pretty rude... they're already indexing and generating pages for public repos for free, and providing this search for free, just behind a login. If you want to use phind instead go ahead.


Thought Show HN meant being open to feedback, not "please worship my product".

Here you go: your product is the best. You will come out on top in this very crowded market and your stock will be worth millions.

I don't see it as rude to take my time and explain where their product is lacking. I am a power user of these kinds of tools and have tried many.


So no free lunch? Making a detailed wiki of all the code would take several developer-years for us, and we're just a handful of developers.

Or is the wiki generated somehow?


(edited) I think mutable is generating an auto-wiki of your repo.

Separately, I would like to know whether a wiki can be auto-generated from a large corpus of text. That should be a much simpler problem, right? Any answers would be much appreciated!


Given that the company is called Mutable AI and they called the product (?) Auto Wiki then I have to assume that they auto-generated the wiki. But I agree that the wording is ambiguous and could be interpreted as "we manually created the wiki".

> So we went back to step 1, let’s understand the code, let’s do our homework, and for us, that meant actually putting an understanding of the codebase down in a document — a Wikipedia-style article — called Auto Wiki. The wiki features diagrams and citations to your codebase.

(Edit) Homepage makes it clear they're talking about ML-generated wiki https://mutable.ai/


Yeah, I mean, when I read "for us, that meant actually putting an understanding of the codebase down in a document" I assumed that was a person. For me, current AIs don't have any understanding as such.

But yeah, upon closer inspection I see in the sidebar it says "Create your own wiki - AI-generated instantly". So that clears up my confusion.


yes, it's AI-generated, but you can actually revise it manually or with a simple instruction to the AI!


That makes it much more interesting indeed.

The only info I was missing on your homepage was where this data is stored, assuming a private repo?

Also, the price is technically given as a one-off, but I assume it's meant per-month?


i actually think the world has many free lunches for those looking hard enough :)


BTW, if you ask for an open source repo in the comments to be processed by our system, we'll do it for free!


Can you share some basic insight into how your system processes an open source repo to generate a wiki with a hierarchy and structure that maps to the same or similar hierarchy and structure of the codebase?

I can understand the value of marrying that wiki to the codebase and how that would help LLMs better "understand" the codebase.

What I'm lost on is how you can auto-generate that wiki in the first place (in a high-quality way). I presume it's not perfect, but it's a very interesting problem space, and I would love to hear what you've learned and how you are accomplishing this feat!

Thanks for posting btw, this HN comments section has been INCREDIBLE.


Thanks! It's a multi-stage process where we summarize the code and structure it into articles. We also do some self-verification, for example checking for dead links.
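Purely as an illustration of that shape (helper names are hypothetical and the real pipeline obviously has more to it):

    def generate_wiki(files: dict[str, str], summarize, organize) -> dict:
        """Multi-stage sketch: summarize each file, structure the summaries into
        articles, then self-verify that every citation points at a real file."""
        summaries = {path: summarize(text) for path, text in files.items()}   # stage 1
        # stage 2: organize -> {title: {"body": ..., "citations": [paths]}}
        articles = organize(summaries)
        for article in articles.values():                                     # stage 3
            article["citations"] = [c for c in article["citations"] if c in files]
        return articles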


Do you build a syntax tree of the code, then loop over the tree to auto-write an article for each node (or the larger or more material nodes), and then also reference the tree to pull in related nodes/pieces/modules/whatnot of the codebase when auto-writing a documentation article for each node?

This is fascinating, thank you!


yeah! you know, the best way to learn more is to join us; we're hiring: https://www.workatastartup.com/companies/mutable-ai


Greatly appreciate the suggestion! But I recently joined another YC-backed gen AI startup in the fine-tuning space (OpenPipe) :-D

Speaking of, there's a good chance fine-tuned models will be a component of your fully optimized codebase -> wiki automation process at some point in the future, likely increasing the consistency/reliability of LLM responses as clear patterns start to emerge in the process. If y'all decide to layer that on, or even explore it as an optimization strategy, hit us (or me directly) up. We love collaborating with engineers working on problems at the edge like this; aside from how engaging the problems themselves are, it also helps us build our best possible product!

Very excited to follow your journey! just sent you a LI request.

Thanks again so much for sharing your wisdom!


This is amazing. Do you have any information on the implementation?


thanks! we tried to include some info in this post.


[flagged]


ahh, I remember someone on HN commented that the screenshot is the de facto standard of "hypermedia"


Maybe you're mixing that up with the observation that the Internet is mostly just five image-sharing sites, each posting screenshots of the other four.



