Show HN: Repogather – copy relevant files to clipboard for LLM coding workflows (github.com/gr-b)
65 points by grbsh 4 months ago | 33 comments
Hey HN, I wanted to share a simple command-line tool I made that has sped up and simplified my LLM-assisted coding workflow. Whenever possible, I’ve been trying to use Claude as a first pass when implementing new features / changes. But I found that, depending on the type of change I was making, I was spending a lot of thought finding and deciding which source files should be included in the prompt. The need to copy/paste each file individually also becomes a mild annoyance.

First, I implemented `repogather --all` , which unintelligently copies all source files in your repository to the clipboard (delimited by their relative filepaths). To my surprise, for less complex repositories, this alone is often completely workable for Claude — much better than pasting in just the few files you are looking to update. But I never would have done it if I had to copy/paste everything individually. 200k is quite a lot of tokens!
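
For reference, the `--all` path is conceptually just this (a simplified sketch, not the actual implementation — the real tool handles more file types, and the `### path` delimiter here is illustrative):

  # Conceptual sketch of --all (simplified; not repogather's actual code):
  # concatenate source files, delimited by their relative paths, and pipe
  # the result to the macOS clipboard.
  import subprocess
  from pathlib import Path

  root = Path(".")
  chunks = [
      f"### {path.relative_to(root)}\n{path.read_text()}"
      for path in sorted(root.rglob("*.py"))  # one extension, for brevity
  ]
  subprocess.run(["pbcopy"], input="\n\n".join(chunks), text=True, check=True)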

But as soon as the repository grows to a certain complexity level (even if it is under the input token limit), I’ve found that Claude can get confused by different unrelated parts / concepts across the code. It performs much better if you make an attempt to exclude logic that is irrelevant to your current change. So I implemented `repogather "<query here>"` , e.g. `repogather "only files related to authentication"` . This uses gpt-4o-mini with structured outputs to provide a relevance score for each source file (with automatic exclusions for .gitignore patterns, tests, configuration, and other manual exclusions with `--exclude <pattern>` ).
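
The core call is roughly this shape (a simplified sketch — the prompt and schema here are illustrative, not the exact ones in the repo):

  # Sketch of the relevance-scoring call (illustrative prompt and schema,
  # not repogather's exact ones), using OpenAI structured outputs.
  from openai import OpenAI
  from pydantic import BaseModel

  class FileRelevance(BaseModel):
      path: str
      relevance: int  # 0 (irrelevant) to 10 (highly relevant)

  class RelevanceReport(BaseModel):
      files: list[FileRelevance]

  query = "only files related to authentication"
  repo_dump = "..."  # filepath-delimited contents of the candidate files

  client = OpenAI()
  completion = client.beta.chat.completions.parse(
      model="gpt-4o-mini",
      messages=[
          {"role": "system",
           "content": "Score each file 0-10 for relevance to the query."},
          {"role": "user", "content": f"Query: {query}\n\n{repo_dump}"},
      ],
      response_format=RelevanceReport,
  )
  scores = completion.choices[0].message.parsed.files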

gpt-4o-mini is so cheap and fast that, for my ~8-dev startup’s repo, it takes under 5 seconds and costs 3-4 cents (with appropriate exclusions). Plus, you get to watch the output stream while you wait, which always feels fun.

The retrieval isn’t always perfect the first time — but it is fast, which allows you to see what files it returned, and iterate quickly on your command. I’ve found this to be much more satisfying than embedding-search based solutions I’ve used, which seem to fail in pretty opaque ways.

https://github.com/gr-b/repogather

Let me know if it is useful to you! Always love to talk about how to better integrate LLMs into coding workflows.




I usually only edit one function at a time with an LLM on old codebases.

On greenfield projects, I ask Claude Sonnet to write all the function signatures, with return types and so on.

Then I have a script which sends these signatures to Gemini Flash, which writes all the functions for me.

All this happens in parallel.

I've found that if you limit the scope, Gemini Flash writes the best code, and it's ultra fast and cheap.
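
The idea is something like this (a rough sketch, not my exact script; the model name, prompt, and example signatures are illustrative):

  # Rough sketch (not the exact script): fan each Sonnet-written signature
  # out to Gemini Flash in parallel and collect the generated bodies.
  from concurrent.futures import ThreadPoolExecutor
  import google.generativeai as genai

  genai.configure(api_key="YOUR_KEY")
  model = genai.GenerativeModel("gemini-1.5-flash")

  signatures = [
      "def parse_config(path: str) -> dict: ...",
      "def render_report(data: dict) -> str: ...",
  ]

  def write_body(sig: str) -> str:
      prompt = f"Implement this Python function. Return only the code.\n\n{sig}"
      return model.generate_content(prompt).text

  with ThreadPoolExecutor() as pool:
      bodies = list(pool.map(write_body, signatures))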


Interesting - isn't Gemini Flash worse at coding than Sonnet 3.5? I subscribe to Claude for $20/month, but even if the API were free, I'd still want to use the Claude interface for its flexibility, artifacts, and just plain understandability, which is why I don't use available coding assistants like Plandex or Aider.

What if you need to iterate on the functions it gives? Do you just start over with a different prompt, or do you have the ability to do a refinement with Gemini Flash on existing functions?


Claude Sonnet is a more creative coder.

That's why Gemini Flash might appear dumb next to Sonnet. But who writes the dumb functions better, the ones guaranteed to keep working in production for a long time? Gemini.

But Sonnet makes silly mistakes: even when I feed it requirements.txt, it still uses methods which either don't exist or used to exist but don't anymore.

Gemini Flash isn't as creative.

So basically, we use Sonnet for high-level programming and Flash for low-level work (writing functions which are guaranteed to be correct and clean, with no black magic).

The problem with Sonnet is that it's slow. Sometimes you'll be stuck in a loop where it suggests something, removes it when it encounters errors, then suggests the very same thing you tried before.

I am using Claude Sonnet via Cursor.

>What if you need to iterate on the functions it gives?

I can do it via Aider and even modify the prompt it sends to Gemini Flash.


Do you have the script?


This symbolic link broke it:

srtp -> .

  File "repogather/file_filter.py", line 170, in process_directory
    if item.is_file():
       ^^^^^^^^^^^^^^
OSError: [Errno 62] Too many levels of symbolic links: 'submodules/externals/srtp/include/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp'


Thanks for letting me know - I’ll make sure it can support (transitively) circular symlinks soon.
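
The fix will probably be something like remembering resolved paths in process_directory (a sketch of the idea, not the actual patch):

  # Sketch of a cycle guard (not the actual patch): track each directory's
  # resolved path and bail out if a symlink leads somewhere already visited.
  from pathlib import Path

  def process_directory(root: Path, seen: set | None = None):
      seen = set() if seen is None else seen
      real = root.resolve()
      if real in seen:
          return                      # symlink cycle: already been here
      seen.add(real)
      for item in root.iterdir():
          try:
              if item.is_file():
                  ...                 # collect / filter the file as before
              elif item.is_dir():
                  process_directory(item, seen)
          except OSError:             # e.g. ELOOP on a pathological link
              continue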


Do you literally paste a wall of text (source code of the filtered whole repo) into the prompt and ask the LLM to give you a diff patch as an answer to your question?

Example:

Here is my whole project; now implement user authentication with plain username/password.


Yes! And I fought the urge to do this for so long, I think because it _feels_ wasteful for some reason? But Claude handles it like a champ*, and gets me significantly better and easier results than if I manually pasted a file in and described the rest of the context it needs by hand.

* Until the repository gets more complicated, which is why we need the intelligent relevance filtering features of repogather, e.g. `repogather "Only files related to authentication and avatar uploads"`


Yes? I mean, it works for small projects.


Yes


Nice! I built something similar, but in the browser with drag-and-drop at https://files2prompt.com

It doesn’t have all the fancy LLM integration though.


This looks very cool for complex queries!

If your codebase is structured in a very modular way, then this one-liner mostly just works:

  find . -type f -exec echo {} \; -exec cat {} \; | pbcopy


I like this! I originally started with something similar (but this one is much cleaner!), but then wanted to add optional exclusions (like .gitignore, tests, configurations).

Would it be okay if I include this one-liner in the README (with credit) as an alternative?


Absolutely!


There are so many of these popping up! Here's mine - https://github.com/sammcj/ingest


In this thread: nobody using Cursor, embedding documentation, using various RAG techniques…


Cursor doesn’t fit into everyone’s workflow — I subscribe to it, but I’ve found myself preferring the Claude UI for various reasons.

Part of it is that I actually get better results using repogather + Claude UI for asking questions about my code than I get with Cursor’s chat. I suspect the index it creates on my codebase just isn’t very good, and it’s opaque to me.


It's fascinating to see how different frameworks are dealing with the problem of populating context correctly. Aider, for example, asks users to manually add files to context. Claude Dev attempts to grep files based on LLM intent. And Continue.dev uses vector embeddings to find relevant chunks and files.

I wonder if an increase in usable (not advertised) context tokens may obviate many of these approaches.


I've been frustrated with embedding search approaches, because when they fail, they fail opaquely -- I don't know how to iterate on my query in order to get close to what I expected. In contrast, since repogather merely wraps your query in a simple prompt, it's easier to intuit what went wrong, if the results weren't as you expected.

> I wonder if an increase in usable (not advertised) context tokens may obviate many of these approaches.

I've been extremely interested in this question! Will be interesting to see how things develop, but I suspect that relevance filtering is not as difficult as coding, so small, cheap LLMs will make the former a solved, inexpensive problem, while we will continue to build larger and more expensive LLMs to solve the latter.

That said, you can buy a lot of tokens for $150k, so this could be short sighted.


I am really happy with how Aider does it, as it feels like a happy medium. The state of LLMs these days means you really have to break the problem down into digestible parts, and if you are doing that, it's not much more work to specify the files that need to be edited in your request. Aider can also prompt you to add a file if it thinks that file needs to be edited.


I haven't looked into this, but do any of them use modern IDE code-inspection tools? I'd think you would dump as many "find references" and "show definition" outputs for relevant variables into context as possible.


Aider also uses the AST (via tree-sitter): it creates a repo map and sends it to the LLM.
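
Roughly: parse each file and keep just the top-level definitions as a cheap summary (a sketch of the idea using the tree_sitter_languages helper package, not Aider's actual code):

  # Sketch of the repo-map idea (not Aider's actual code): parse each file
  # and keep only the top-level definition headers as a cheap summary.
  from pathlib import Path
  from tree_sitter_languages import get_parser  # ships prebuilt grammars

  parser = get_parser("python")

  for path in Path(".").rglob("*.py"):
      source = path.read_bytes()
      tree = parser.parse(source)
      for node in tree.root_node.children:
          if node.type in ("function_definition", "class_definition"):
              # first line of the definition, i.e. its signature
              header = source[node.start_byte:node.end_byte].split(b"\n")[0]
              print(f"{path}: {header.decode()}")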


I like this approach a lot, especially because it's not opaque like embeddings. Maybe I can add an option for repogather to use this approach instead, if you're cost-sensitive.


How does Aider (CLI) compare to Claude Dev (VS Code plugin)? Anyone have a subjective analysis?


And how does repogather do it? From the README, it looks to me like it might provide the content of each file to the LLM to gauge its relevance. But this would seem prohibitively expensive on anything that isn't a very small codebase (the project I'm working on has on the order of 400k SLOC), even with gpt-4o-mini, wouldn't it?


repogather does indeed, as a last step, stuff everything not already excluded by cheap heuristics into gpt-4o-mini to gauge relevance, so it will get expensive for large projects. On my small 8-dev startup's repo, this operation costs 2-4 cents. I was considering adding an `--intelligence` option, where you could trade off different methods between cost, speed, and accuracy. But I've been very unsatisfied with both embedding-search methods and agentic file-search methods; they seem to regularly fail in very unpredictable ways. In contrast, this method works quite well for the projects I tend to work on.

I think in the future as the cost of gpt-4o-mini level intelligence decreases, it will become increasingly worth it, even for larger repositories, to simply attend to every token for certain coding subtasks. I'm assuming here that relevance filtering is a much easier task than coding itself, otherwise you could just copy/paste everything into the final coding model's context. What I think would make much more sense for this project is to optimize the cost / performance of a small LLM fine-tuned for this source relevance task. I suspect I could do much better than gpt-4o-mini, but it would be difficult to deploy this for free.
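
For a rough sense of scale on the 400k SLOC case (back-of-envelope, assuming ~10 tokens per line and gpt-4o-mini's ~$0.15 per 1M input tokens at the time of writing):

  # Back-of-envelope with assumed figures: ~10 tokens/line, and gpt-4o-mini
  # input pricing of ~$0.15 per 1M tokens (as of this writing).
  sloc = 400_000
  tokens = sloc * 10                  # ~4M input tokens
  cost = tokens / 1_000_000 * 0.15    # ~$0.60 per full-repo relevance pass
  print(f"{tokens:,} tokens -> ${cost:.2f} per query")

So on the order of tens of cents per full-repo query before heuristic exclusions, not dollars.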


Continue.dev's approach sounds like it would provide the most relevant code?


Embeddings are actually generally not that effective for code.


This is what I've found. It's so hard to do embeddings correctly, while it's so easy to search over a large corpus with a cheap LLM! Embeddings are also really inscrutable when they fail, whereas I find myself easily iterating if repogather fails to return the right group of files.

Of course, 'cheap' is relative -- on a large repository, embeddings are 99%+ cheaper than even gpt-4o-mini.


LLMs for coding are a bit meh after the novelty wears off.

I've had problems where the LLM doesn't know which library version I am using. It keeps suggesting methods which do not exist, etc.

As if LLMs are unaware of library versions.

The place where I've found LLMs to be most effective and effortless is the CLI.

My brother made this, but I use it every day: https://github.com/zerocorebeta/Option-K


I agree - it's exciting at first, but then you have experiences where you go down a rabbit hole for an hour trying to fix or make use of LLM-generated code.

So you really have to know when the LLM will be able to cleanly and neatly solve your problem, and when it's going to be frustrating and simpler just to do it character by character. That's why I'm exploring building tools like this, to try to iron out annoyances and improve quality of life for new LLM workflows.

Option-K looks promising! I'll try it out.


When I get frustrated with GPT-4o, I then switch to Sonnet 3.5, usually with good results.

In my limited experience Sonnet 3.5 is more elegant at coding and making use of different frameworks.


For the library problem, I have even tried feeding it requirements.txt (which contains the versions of the libraries I use; I also fed it the Python version I'm using), but no success with that either!

It's definitely better than any other LLM out there.

But it also gets stuck and creates frustration.

At the moment, I am writing code which does video editing. It keeps suggesting some parallel approach; when that fails, it suggests going back to the original approach. Then it suggests the parallel approach again using some other hack, and none of them work!



