Show HN: Repogather – copy relevant files to clipboard for LLM coding workflows (github.com/gr-b)
65 points by grbsh 4 months ago | 33 comments
Hey HN, I wanted to share a simple command-line tool I made that has sped up and simplified my LLM-assisted coding workflow. Whenever possible, I’ve been trying to use Claude as a first pass when implementing new features / changes. But I found that, depending on the type of change I was making, I was spending a lot of thought finding and deciding which source files should be included in the prompt. The need to copy/paste each file individually also becomes a mild annoyance.

First, I implemented `repogather --all` , which unintelligently copies all source files in your repository to the clipboard (delimited by their relative filepaths). To my surprise, for less complex repositories, this alone is often completely workable for Claude — much better than pasting in just the few files you are looking to update. But I never would have done it if I had to copy/paste everything individually. 200k is quite a lot of tokens!
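
For reference, the `--all` path is conceptually just this (a simplified sketch, not the actual implementation — the real tool handles more file types, and the `### path` delimiter here is illustrative):

  # Conceptual sketch of --all (simplified; not repogather's actual code):
  # concatenate source files, delimited by their relative paths, and pipe
  # the result to the macOS clipboard.
  import subprocess
  from pathlib import Path

  root = Path(".")
  chunks = [
      f"### {path.relative_to(root)}\n{path.read_text()}"
      for path in sorted(root.rglob("*.py"))  # one extension, for brevity
  ]
  subprocess.run(["pbcopy"], input="\n\n".join(chunks), text=True, check=True)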

But as soon as the repository grows to a certain complexity level (even if it is under the input token limit), I’ve found that Claude can get confused by different unrelated parts / concepts across the code. It performs much better if you make an attempt to exclude logic that is irrelevant to your current change. So I implemented `repogather "<query here>"` , e.g. `repogather "only files related to authentication"` . This uses gpt-4o-mini with structured outputs to provide a relevance score for each source file (with automatic exclusions for .gitignore patterns, tests, configuration, and other manual exclusions with `--exclude <pattern>` ).
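
The core call is roughly this shape (a simplified sketch — the prompt and schema here are illustrative, not the exact ones in the repo):

  # Sketch of the relevance-scoring call (illustrative prompt and schema,
  # not repogather's exact ones), using OpenAI structured outputs.
  from openai import OpenAI
  from pydantic import BaseModel

  class FileRelevance(BaseModel):
      path: str
      relevance: int  # 0 (irrelevant) to 10 (highly relevant)

  class RelevanceReport(BaseModel):
      files: list[FileRelevance]

  query = "only files related to authentication"
  repo_dump = "..."  # filepath-delimited contents of the candidate files

  client = OpenAI()
  completion = client.beta.chat.completions.parse(
      model="gpt-4o-mini",
      messages=[
          {"role": "system",
           "content": "Score each file 0-10 for relevance to the query."},
          {"role": "user", "content": f"Query: {query}\n\n{repo_dump}"},
      ],
      response_format=RelevanceReport,
  )
  scores = completion.choices[0].message.parsed.files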

gpt-4o-mini is so cheap and fast that, for my ~8-dev startup’s repo, it takes under 5 seconds and costs 3-4 cents (with appropriate exclusions). Plus, you get to watch the output stream while you wait, which always feels fun.

The retrieval isn’t always perfect the first time — but it is fast, which allows you to see what files it returned, and iterate quickly on your command. I’ve found this to be much more satisfying than embedding-search based solutions I’ve used, which seem to fail in pretty opaque ways.

https://github.com/gr-b/repogather

Let me know if it is useful to you! Always love to talk about how to better integrate LLMs into coding workflows.




I usually only edit one function at a time with an LLM on old codebases.

On greenfield projects, I ask Claude Sonnet to write all the function signatures, with return types and so on.

Then I have a script which sends these signatures to Gemini Flash, which writes all the functions for me.

All this happens in parallel.

I've found that if you limit the scope, Gemini Flash writes the best code, and it's ultra fast and cheap.
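
The idea is something like this (a rough sketch, not my exact script; the model name, prompt, and example signatures are illustrative):

  # Rough sketch (not the exact script): fan each Sonnet-written signature
  # out to Gemini Flash in parallel and collect the generated bodies.
  from concurrent.futures import ThreadPoolExecutor
  import google.generativeai as genai

  genai.configure(api_key="YOUR_KEY")
  model = genai.GenerativeModel("gemini-1.5-flash")

  signatures = [
      "def parse_config(path: str) -> dict: ...",
      "def render_report(data: dict) -> str: ...",
  ]

  def write_body(sig: str) -> str:
      prompt = f"Implement this Python function. Return only the code.\n\n{sig}"
      return model.generate_content(prompt).text

  with ThreadPoolExecutor() as pool:
      bodies = list(pool.map(write_body, signatures))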


Interesting - isn't Gemini Flash worse at coding than Sonnet 3.5? I subscribe to Claude for $20/month, but even if the API were free, I'd still want to use the Claude interface for its flexibility, artifacts, and just plain understandability, which is why I don't use available coding assistants like Plandex or Aider.

What if you need to iterate on the functions it gives? Do you just start over with a different prompt, or do you have the ability to do a refinement with Gemini Flash on existing functions?


Claude Sonnet is a more creative coder.

That's why Gemini Flash might appear dumb next to Sonnet. But who writes the dumb functions better, the ones guaranteed to keep working in production for a long time? Gemini.

But Sonnet makes silly mistakes: even when I feed it requirements.txt, it still uses methods which either don't exist or used to exist but don't anymore.

Gemini Flash isn't as creative.

So basically, we use Sonnet for high-level programming and Flash for low-level work (writing functions which are guaranteed to be correct and clean, with no black magic).

The problem with Sonnet is that it's slow. Sometimes you'll be stuck in a loop where it suggests something, removes it when it encounters errors, then suggests the very same thing you tried before.

I am using Claude Sonnet via Cursor.

>What if you need to iterate on the functions it gives?

I can do it via Aider and even modify the prompt it sends to Gemini Flash.


Do you have the script?


This symbolic link broke it:

srtp -> .

  File "repogather/file_filter.py", line 170, in process_directory
    if item.is_file():
       ^^^^^^^^^^^^^^
OSError: [Errno 62] Too many levels of symbolic links: 'submodules/externals/srtp/include/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp/srtp'


Thanks for letting me know - I’ll make sure it can support (transitively) circular symlinks soon.
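
The fix will probably be something like remembering resolved paths in process_directory (a sketch of the idea, not the actual patch):

  # Sketch of a cycle guard (not the actual patch): track each directory's
  # resolved path and bail out if a symlink leads somewhere already visited.
  from pathlib import Path

  def process_directory(root: Path, seen: set | None = None):
      seen = set() if seen is None else seen
      real = root.resolve()
      if real in seen:
          return                      # symlink cycle: already been here
      seen.add(real)
      for item in root.iterdir():
          try:
              if item.is_file():
                  ...                 # collect / filter the file as before
              elif item.is_dir():
                  process_directory(item, seen)
          except OSError:             # e.g. ELOOP on a pathological link
              continue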


Do you literally paste a wall of text (source code of the filtered whole repo) into the prompt and ask the LLM to give you a diff patch as an answer to your question?

Example:

Here is my whole project; now implement user authentication with plain username/password.


Yes! And I fought the urge to do this for so long, I think because it _feels_ wasteful for some reason? But Claude handles it like a champ*, and gets me significantly better and easier results than if I manually pasted a file in and described the rest of the context it needs by hand.

* Until the repository gets more complicated, which is why we need the intelligent relevance filtering features of repogather, e.g. `repogather "Only files related to authentication and avatar uploads"`


Yes? I mean, it works for small projects.


Yes


Nice! I built something similar, but in the browser with drag-and-drop at https://files2prompt.com

It doesn’t have all the fancy LLM integration though.


This looks very cool for complex queries!

If your codebase is structured in a very modular way, then this one-liner mostly just works:

  find . -type f -exec echo {} \; -exec cat {} \; | pbcopy


I like this! I originally started with something similar (but this one is much cleaner!), but then wanted to add optional exclusions (like .gitignore, tests, configurations).

Would it be okay if I include this one-liner in the README (with credit) as an alternative?


Absolutely!


There are so many of these popping up! Here's mine - https://github.com/sammcj/ingest


In this thread: nobody using Cursor, embedding documentation, using various RAG techniques…


Cursor doesn’t fit into everyone’s workflow — I subscribe to it, but I’ve found myself preferring the Claude UI for various reasons.

Part of it is that I actually get better results using repogather + Claude UI for asking questions about my code than I get with Cursor’s chat. I suspect the index it creates on my codebase just isn’t very good, and it’s opaque to me.


It's fascinating to see how different frameworks are dealing with the problem of populating context correctly. Aider, for example, asks users to manually add files to context. Claude Dev attempts to grep files based on LLM intent. And Continue.dev uses vector embeddings to find relevant chunks and files.

I wonder if an increase in usable (not advertised) context tokens may obviate many of these approaches.


I've been frustrated with embedding search approaches, because when they fail, they fail opaquely -- I don't know how to iterate on my query in order to get close to what I expected. In contrast, since repogather merely wraps your query in a simple prompt, it's easier to intuit what went wrong, if the results weren't as you expected.

> I wonder if an increase in usable (not advertised) context tokens may obviate many of these approaches.

I've been extremely interested in this question! Will be interesting to see how things develop, but I suspect that relevance filtering is not as difficult as coding, so small, cheap LLMs will make the former a solved, inexpensive problem, while we will continue to build larger and more expensive LLMs to solve the latter.

That said, you can buy a lot of tokens for $150k, so this could be short sighted.


I am really happy with how Aider does it, as it feels like a happy medium. The state of LLMs these days means you really have to break the problem down into digestible parts, and if you are doing that, it's not much more work to specify the files that need to be edited in your request. Aider can also prompt you to add a file if it thinks that file needs to be edited.


I haven't looked into this, but do any of them use modern IDE code-inspection tools? I'd think you would dump as many "find references" and "show definition" outputs for relevant variables into context as possible.


Aider also uses the AST (via tree-sitter): it creates a repo map and sends it to the LLM.
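
Roughly: parse each file and keep just the top-level definitions as a cheap summary (a sketch of the idea using the tree_sitter_languages helper package, not Aider's actual code):

  # Sketch of the repo-map idea (not Aider's actual code): parse each file
  # and keep only the top-level definition headers as a cheap summary.
  from pathlib import Path
  from tree_sitter_languages import get_parser  # ships prebuilt grammars

  parser = get_parser("python")

  for path in Path(".").rglob("*.py"):
      source = path.read_bytes()
      tree = parser.parse(source)
      for node in tree.root_node.children:
          if node.type in ("function_definition", "class_definition"):
              # first line of the definition, i.e. its signature
              header = source[node.start_byte:node.end_byte].split(b"\n")[0]
              print(f"{path}: {header.decode()}")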


I like this approach a lot, especially because it's not opaque like embeddings. Maybe I can add an option for repogather to use this approach instead, if you're cost-sensitive.


How does Aider (CLI) compare to Claude Dev (VS Code plugin)? Anyone have a subjective analysis?


And how does repogather do it? From the README, it looks to me like it might provide the content of each file to the LLM to gauge its relevance. But this would seem prohibitively expensive on anything that isn't a very small codebase (the project I'm working on has on the order of 400k SLOC), even with gpt-4o-mini, wouldn't it?


repogather does indeed, as a last step, stuff everything not already excluded by cheap heuristics into gpt-4o-mini to gauge relevance, so it will get expensive for large projects. On my small 8-dev startup's repo, this operation costs 2-4 cents. I was considering adding an `--intelligence` option, where you could trade off different methods between cost, speed, and accuracy. But I've been very unsatisfied with both embedding-search methods and agentic file-search methods; they seem to regularly fail in very unpredictable ways. In contrast, this method works quite well for the projects I tend to work on.

I think in the future as the cost of gpt-4o-mini level intelligence decreases, it will become increasingly worth it, even for larger repositories, to simply attend to every token for certain coding subtasks. I'm assuming here that relevance filtering is a much easier task than coding itself, otherwise you could just copy/paste everything into the final coding model's context. What I think would make much more sense for this project is to optimize the cost / performance of a small LLM fine-tuned for this source relevance task. I suspect I could do much better than gpt-4o-mini, but it would be difficult to deploy this for free.
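
For a rough sense of scale on the 400k SLOC case (back-of-envelope, assuming ~10 tokens per line and gpt-4o-mini's ~$0.15 per 1M input tokens at the time of writing):

  # Back-of-envelope with assumed figures: ~10 tokens/line, and gpt-4o-mini
  # input pricing of ~$0.15 per 1M tokens (as of this writing).
  sloc = 400_000
  tokens = sloc * 10                  # ~4M input tokens
  cost = tokens / 1_000_000 * 0.15    # ~$0.60 per full-repo relevance pass
  print(f"{tokens:,} tokens -> ${cost:.2f} per query")

So on the order of tens of cents per full-repo query before heuristic exclusions, not dollars.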


Continue.dev's approach sounds like it would provide the most relevant code?


Embeddings are actually generally not that effective for code.


This is what I've found. It's so hard to do embeddings correctly, while it's so easy to search over a large corpus with a cheap LLM! Embeddings are also really inscrutable when they fail, whereas I find myself easily iterating if repogather fails to return the right group of files.

Of course, 'cheap' is relative -- on a large repository, embeddings are 99%+ cheaper than even gpt-4o-mini.


LLMs for coding are a bit meh after the novelty wears off.

I've had problems where the LLM doesn't know which library version I am using. It keeps suggesting methods which do not exist, etc.

As if LLMs are unaware of library versions.

The place where I've found LLMs to be most effective and effortless is the CLI.

My brother made this, but I use it every day: https://github.com/zerocorebeta/Option-K


I agree - it's exciting at first, but then you have experiences where you go down a rabbit hole for an hour trying to fix or make use of LLM-generated code.

So you really have to know when the LLM will be able to cleanly and neatly solve your problem, and when it's going to be frustrating and simpler just to do it character by character. That's why I'm exploring building tools like this, to try to iron out annoyances and improve quality of life for new LLM workflows.

Option-K looks promising! I'll try it out.


When I get frustrated with GPT-4o, I then switch to Sonnet 3.5, usually with good results.

In my limited experience Sonnet 3.5 is more elegant at coding and making use of different frameworks.


For the library problem, I have even tried feeding it requirements.txt (which contains the versions of the libraries I use; I also fed it the Python version I'm using), but no success with that either!

It's definitely better than any other LLM out there.

But it also gets stuck and creates frustration.

At the moment, I am writing code which does video editing. It keeps suggesting some parallel approach; when that fails, it suggests going back to the original approach. Then it suggests the parallel approach again using some other hack, and none of them work!



