Hacker News
Show HN: Replace "hub" by "ingest" in GitHub URLs for a prompt-friendly extract (gitingest.com)
185 points by cyclotruc 9 days ago | 51 comments
Gitingest is an open-source micro dev-tool that I made over the last week.

It turns any public GitHub repository into a text extract that you can easily give to your favourite LLM.

Today I added this URL trick to make it even easier to use!
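The URL trick is just a string substitution on the hostname; a minimal sketch in Python (the helper name is mine, and the example repo path is just an illustration):

```python
def to_ingest_url(github_url: str) -> str:
    """Turn a GitHub repo URL into its gitingest equivalent by
    replacing "github.com" with "gitingest.com" (i.e. "hub" -> "ingest")."""
    return github_url.replace("github.com", "gitingest.com", 1)

print(to_ingest_url("https://github.com/cyclotruc/gitingest"))
# https://gitingest.com/cyclotruc/gitingest
```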

How I use it myself:
- Quickly generate a README.md boilerplate for a project
- Ask LLMs questions about an undocumented codebase

It is still very much a work in progress, and I plan to add many more options (file size limits, exclude patterns, ...) and a public API.

I hope this tool can help you. Your feedback is very valuable to help me prioritize, and contributions are welcome!






Hi, great tool!

I made https://uithub.com 2 months ago. Its speciality is that viewing a repo's raw extract is just a matter of changing 'g' to 'u'. It also works for subdirectories, so if you just want the docs of Upstash QStash, for example, just go to https://uithub.com/upstash/docs/tree/main/qstash

Great to see this idea keeps proving worthwhile!


That looks awesome. You didn't mention it, but uithub.com also has an API; I can definitely see myself using this for a new tool.

I wonder why nobody uses the JSONL format to represent an entire codebase? It's what I do, and LLMs seem to prefer it. In fact, an LLM suggested this strategy to me. It uses fewer characters, too.

Are you suggesting that there's a correlation between what input formats provide best performance for an LLM input, and what sequence of tokens the same LLM outputs when prompted about what input formats provide best performance? Why would that be?

I don't think there's much difference, but I've read that Markdown codeblocks (or YAML, or XML) are better for code than JSON, for example: https://aider.chat/2024/08/14/code-in-json.html

I think it makes sense.

YAML is shorter and easier to read, and Markdown codeblocks add no syntax between the lines compared to normal code.

But JSON vs JSONL I can't come up with any big advantages for the LLM, it's mostly the same.
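To make the JSON-vs-JSONL comparison concrete, here is a small sketch (the file names and contents are invented for illustration) showing the same two files serialized both ways. JSONL's practical appeal is that each line is a self-contained object, so a codebase extract can be streamed or truncated at any line boundary:

```python
import json

files = {
    "main.py": "print('hello')\n",
    "util.py": "def add(a, b):\n    return a + b\n",
}

# Single JSON document: one nested structure that must be parsed whole.
as_json = json.dumps([{"path": p, "content": c} for p, c in files.items()])

# JSONL: one self-contained object per line, easy to stream or cut off.
as_jsonl = "\n".join(
    json.dumps({"path": p, "content": c}) for p, c in files.items()
)

print(as_jsonl)
```

As the parent says, for the LLM itself the two forms carry the same content; the difference is mostly in how easy the extract is to produce, stream, and truncate.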


Why wouldn't that be? We've had several generations of LLMs since ChatGPT took the world by storm; current models are very much aware of LLMs that came before them, as well as associated discussions on how to best use them.

You can get JSON via the accept parameter of the API; the URL structure remains the same. It also supports YAML, and I found that's easiest for LLMs to read.

Previous example but in JSON:

https://uithub.com/upstash/docs/tree/main/qstash?accept=appl...

Is there any reason to prefer JSONL besides it being more efficient to edit? I'm happy to add it to my backlog if you think it has any advantages for LLMs.


Since the site was hugged to death by HN, this appears to be the repo[0] for anyone wanting to run it locally.

[0] https://github.com/cyclotruc/gitingest


and of course, using the repo as an input for the service renders this[1]

[1] https://gitingest.com/cyclotruc/gitingest


  // Fetch stars when page loads
  fetchGitHubStars();
I do not understand why in the world so much of the code is related to poking the GitHub API to fetch the star count.

Probably generated by AI, prompted by a novice or junior dev. This is my opinion, of course, but it looks like code generated by an LLM.

I know the code is not great, but contributions are very much welcome because there's a lot of low-hanging fruit.

But why did you code it to fetch stars at all? You would have had to go out of your way to do that. If AI has written most of this I suspect people will be less inclined to contribute.

https://uithub.com is also a good one for this. They also have an API with more options.

Nothing against gitingest.com, but this is really the peak of technology: LLMs which require feeding them info via copy &amp; paste. Peak of efficiency too. OMFG.

Great idea to make it just a simple URL change. Reminds me of the YouTube download websites.

I made a similar CLI tool[0] with the added feature that you can pass `--outline` and it'll omit function bodies (while leaving their signatures). I've found it works really well for giving a high-level overview of huge repos.

You can then progressively expand specific functions as the LLM needs to see their implementation, without bloating up your context window.

[0] https://github.com/everestmz/llmcat
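The outline idea can be sketched with Python's ast module. This is not llmcat's actual implementation (llmcat is a separate CLI), just an illustration of stripping function bodies while keeping their signatures:

```python
import ast

def outline(source: str) -> str:
    """Replace every function body with `...`, keeping the signature,
    so an LLM sees the API surface without the implementation."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.body = [ast.Expr(ast.Constant(...))]  # stub out the body
    return ast.unparse(tree)  # requires Python 3.9+

src = "def add(a, b):\n    total = a + b\n    return total\n"
print(outline(src))
# def add(a, b):
#     ...
```

The same trick generalizes to classes and methods, since ast.walk visits nested definitions too.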


Interesting approach! While URL-based extraction is convenient, I've been working on a more comprehensive solution for repository knowledge retrieval (llama-github). The key challenge isn't just extracting code, but understanding the semantic relationships and evolution patterns within repositories.

A few observations from building large-scale repo analysis systems:

1. Simple text extraction often misses critical context about code dependencies and architectural decisions
2. Repository structure varies significantly across languages and frameworks - what works for Python might fail for complex C++ projects
3. Caching strategies become crucial when dealing with enterprise-scale monorepos

The real challenge is building a universal knowledge graph that captures both explicit (code, dependencies) and implicit (architectural patterns, evolution history) relationships. We've found that combining static analysis with selective LLM augmentation provides better context than pure extraction approaches.

Curious about others' experiences with handling cross-repository knowledge transfer, especially in polyrepo environments?


Is Unicode really the best way to display the file structure? The special Unicode box-drawing characters are encoded as 2 tokens each, so I doubt it would function better overall for larger repos.

Also, even if different characters were used, the 2D ASCII-art-style representation of the directory tree strikes me as something that's not going to be easily interpreted by an LLM, which might not have a conception of how characters are laid out in 2D space.

Instead of a copy icon, it would be better to just generate the entire content as plain text in the result (not in an HTML div on a rich HTML page), so the URL could be used as an attachment or its contents piped directly into an agent/tool.

Ctrl-a + ctrl-c would remain fast.


Agreed, it's a missed opportunity not to be able to change a URL from github.com/cyclotruc/gitingest to gitingest.com/cyclotruc/gitingest and simply receive the result as plain text. A very useful little tool nonetheless.

Yeah I'm going to do that very soon with the API :)

for that you can use https://uithub.com (g -> u)

- for browsers it shows HTML
- for curl it returns raw text


Looks neat! From what I understood, it's like zipping up your codebase in a streamlined TXT version for LLMs to ingest better?

What'd you say are the differences with using something like Cursor, which has access to your codebase already?


It's in the same lane, it's just that sometimes you need a quick and handy way to get that streamlined TXT from a public repo without leaving your browser.

Might be good to have some filtering as well. I added a repo that has a heap of localized docs that don't make much sense to ingest into an LLM but probably use up a majority of the tokens.

Hey! OP here: gitingest is getting a lot of love right now, sorry if it's unstable but please tell me what goes wrong so I can fix it!

I wrote a tool some time ago called ingest ... to do exactly this from local directories, files, web URLs, etc., as well as estimating tokens and VRAM usage: https://github.com/sammcj/ingest

I implemented this same idea in bash for local use. Useful but only up to a certain size of codebase.

Does this use the txtar format created for developing the go language?

I actually use txtar with a custom CLI to quickly copy multiple files to my clipboard and paste it into an LLM chat. I try not to get too far from the chat paradigm so I can stay flexible with which LLM provider I use
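For reference, txtar (from the Go tools repo) is a deliberately trivial format: each file is introduced by a `-- name --` marker line, followed by its raw contents. A minimal writer sketch (file names invented, and this is not the commenter's custom CLI):

```python
def to_txtar(files: dict[str, str]) -> str:
    """Serialize files into Go's txtar format: a '-- name --'
    marker line followed by the file's raw contents."""
    parts = []
    for name, content in files.items():
        if not content.endswith("\n"):
            content += "\n"  # txtar file sections are newline-terminated
        parts.append(f"-- {name} --\n{content}")
    return "".join(parts)

print(to_txtar({"a.txt": "hello\n", "b.txt": "world\n"}))
```

Because there is no escaping or nesting, the output pastes cleanly into a chat window, which fits the stay-close-to-the-chat-paradigm workflow described above.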


If I understand correctly, this sounds like https://github.com/simonw/files-to-prompt/.

It's quite useful, with some filtering options (hidden files, gitignore, extensions) and support for Claude-style tags.


For some reason it was giving me a large file instead of reading from the README.


It seems to be broken, getting errors like "Error processing repository: Path ../tmp/pallets-flask does not exist"

Thank you, I'll look into it

Very cool! I will try this over the weekend with a new android app to see what kind of README I can generate.

Do you have any plans to expand it?


Yes I want to add a way to target a token count to control your LLM costs

Isn't there a limit on prompt size? How would you actually use this? I'm not very up to speed on this stuff.

Gemini 1.5 Pro has a 2-million-token context window, which is roughly 1000 pages of code.

Most projects would be way too big to put into a prompt. Even if you're technically within the official context window, those figures are often misleading: the window where input is actually useful is usually much smaller than advertised.

What you can do with something like this is store it in a database and then query it for relevant chunks, which you then feed to the LLM as needed.


I wonder about building a local version of this which resolves the dependency paths of the file you're currently working on, to a certain depth, so the LLM gains more context from related files instead of just the whole repo (which could be insane if you use a monorepo).

Ideally let the LLM chunk it up and figure out when to use those chunks.

It's like a web version of Repomix

The example buttons are a nice touch

Very clever!

It’s dead Jim

Github already has a way to get the raw text files

All of them in one operation? How?

I think he is confusing it with the "plain" or "raw" view, so probably not all of them.

It did not digest https://github.com/torvalds/linux ¯\_(ツ)_/¯

This is really nice, congrats on shipping.

I also really like this idea in general of APIs being domains, eventually making the web a giant supercomputer.

Edit: There is literally nothing wrong with this comment but feel free to keep downvoting, only 5,600 clicks to go!



