Hacker News
Show HN: Replace "hub" by "ingest" in GitHub URLs for a prompt-friendly extract (gitingest.com)
185 points by cyclotruc 9 days ago | 51 comments
Gitingest is an open-source micro dev-tool that I made over the last week.

It turns any public GitHub repository into a text extract that you can easily give to your favourite LLM.

Today I added this URL trick to make it even easier to use!
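The URL trick is just a string substitution on the hostname; a minimal sketch in Python (the helper name is mine, and the example repo path is just an illustration):

```python
def to_ingest_url(github_url: str) -> str:
    """Turn a GitHub repo URL into its gitingest equivalent by
    replacing "github.com" with "gitingest.com" (i.e. "hub" -> "ingest")."""
    return github_url.replace("github.com", "gitingest.com", 1)

print(to_ingest_url("https://github.com/cyclotruc/gitingest"))
# https://gitingest.com/cyclotruc/gitingest
```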

How I use it myself:
- Quickly generate a README.md boilerplate for a project
- Ask LLMs questions about an undocumented codebase

It is still very much a work in progress, and I plan to add many more options (file size limits, exclude patterns, ...) and a public API.

I hope this tool can help you. Your feedback is very valuable to help me prioritize, and contributions are welcome!






Hi, great tool!

I made https://uithub.com 2 months ago. Its speciality is that viewing a repo's raw extract is just a matter of changing 'g' to 'u'. It also works for subdirectories, so if you just want the docs of Upstash QStash, for example, just go to https://uithub.com/upstash/docs/tree/main/qstash

Great to see this idea keeps proving worthwhile!


That looks awesome. You didn't mention it, but uithub.com also has an API; I can definitely see myself using this for a new tool.

I wonder why nobody uses the JSONL format to represent an entire codebase? It's what I do, and LLMs seem to prefer it. In fact, an LLM suggested this strategy to me. It uses fewer characters, too.

Are you suggesting that there's a correlation between what input formats provide best performance for an LLM input, and what sequence of tokens the same LLM outputs when prompted about what input formats provide best performance? Why would that be?

I don't think there's much difference, but I've read that Markdown codeblocks (or YAML, or XML) are better for code than JSON, for example: https://aider.chat/2024/08/14/code-in-json.html

I think it makes sense.

YAML is shorter and easier to read, and Markdown codeblocks add no syntax between the lines compared to normal code.

But JSON vs JSONL I can't come up with any big advantages for the LLM, it's mostly the same.
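To make the JSON-vs-JSONL comparison concrete, here is a small sketch (the file names and contents are invented for illustration) showing the same two files serialized both ways. JSONL's practical appeal is that each line is a self-contained object, so a codebase extract can be streamed or truncated at any line boundary:

```python
import json

files = {
    "main.py": "print('hello')\n",
    "util.py": "def add(a, b):\n    return a + b\n",
}

# Single JSON document: one nested structure that must be parsed whole.
as_json = json.dumps([{"path": p, "content": c} for p, c in files.items()])

# JSONL: one self-contained object per line, easy to stream or cut off.
as_jsonl = "\n".join(
    json.dumps({"path": p, "content": c}) for p, c in files.items()
)

print(as_jsonl)
```

As the parent says, for the LLM itself the two forms carry the same content; the difference is mostly in how easy the extract is to produce, stream, and truncate.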


Why wouldn't that be? We've had several generations of LLMs since ChatGPT took the world by storm; current models are very much aware of LLMs that came before them, as well as associated discussions on how to best use them.

You can get JSON via the accept parameter of the API; the URL structure remains the same. It also supports YAML, and I found that's easiest for LLMs to read.

Previous example but in JSON:

https://uithub.com/upstash/docs/tree/main/qstash?accept=appl...

Is there any reason to prefer JSONL besides it being more efficient to edit? I'm happy to add it to my backlog if you think it has any advantages for LLMs.


Since the site was hugged to death by HN, this appears to be the repo[0] for anyone wanting to run it locally.

[0] https://github.com/cyclotruc/gitingest


and of course, using the repo as an input for the service renders this[1]

[1] https://gitingest.com/cyclotruc/gitingest


  // Fetch stars when page loads
  fetchGitHubStars();
I do not understand why in the world so much of the code is related to poking the GitHub API to fetch the star count.

Probably generated by AI, prompted by a novice or junior dev. This is my opinion, of course, but it looks like code generated by an LLM.

I know the code is not great, but contributions are very much welcome because there's a lot of low-hanging fruit.

But why did you code it to fetch stars at all? You would have had to go out of your way to do that. If AI has written most of this I suspect people will be less inclined to contribute.

https://uithub.com is also a good one for this. They also have an API with more options.

Nothing against gitingest.com, but this is really the peak of technology: LLMs which require feeding them info via copy &amp; paste. Peak of efficiency too. OMFG.

Great idea to make it just a simple URL change. Reminds me of the YouTube download websites.

I made a similar CLI tool[0] with the added feature that you can pass `--outline` and it'll omit function bodies (while leaving their signatures). I've found it works really well for giving a high-level overview of huge repos.

You can then progressively expand specific functions as the LLM needs to see their implementation, without bloating up your context window.

[0] https://github.com/everestmz/llmcat
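The outline idea can be sketched with Python's ast module. This is not llmcat's actual implementation (llmcat is a separate CLI), just an illustration of stripping function bodies while keeping their signatures:

```python
import ast

def outline(source: str) -> str:
    """Replace every function body with `...`, keeping the signature,
    so an LLM sees the API surface without the implementation."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.body = [ast.Expr(ast.Constant(...))]  # stub out the body
    return ast.unparse(tree)  # requires Python 3.9+

src = "def add(a, b):\n    total = a + b\n    return total\n"
print(outline(src))
# def add(a, b):
#     ...
```

The same trick generalizes to classes and methods, since ast.walk visits nested definitions too.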


Interesting approach! While URL-based extraction is convenient, I've been working on a more comprehensive solution for repository knowledge retrieval (llama-github). The key challenge isn't just extracting code, but understanding the semantic relationships and evolution patterns within repositories.

A few observations from building large-scale repo analysis systems:

1. Simple text extraction often misses critical context about code dependencies and architectural decisions
2. Repository structure varies significantly across languages and frameworks - what works for Python might fail for complex C++ projects
3. Caching strategies become crucial when dealing with enterprise-scale monorepos

The real challenge is building a universal knowledge graph that captures both explicit (code, dependencies) and implicit (architectural patterns, evolution history) relationships. We've found that combining static analysis with selective LLM augmentation provides better context than pure extraction approaches.

Curious about others' experiences with handling cross-repository knowledge transfer, especially in polyrepo environments?


Is Unicode really the best way to display the file structure? The special Unicode box-drawing characters are encoded as 2 tokens each, so I doubt it would function better overall for larger repos.

Also, even if different characters were used, the 2D ASCII-art-style representation of the directory tree strikes me as something that's not going to be easily interpreted by an LLM, which might not have a conception of how characters are laid out in 2D space.

Instead of a copy icon, it would be better to just generate the entire content as plain text in the result (not in an HTML div on a rich HTML page), so the URL could be used as an attachment or its contents piped directly into an agent/tool.

Ctrl-a + ctrl-c would remain fast.


Agreed, it's a missed opportunity not to be able to change a URL from github.com/cyclotruc/gitingest to gitingest.com/cyclotruc/gitingest and simply receive the result as plain text. A very useful little tool nonetheless.

Yeah I'm going to do that very soon with the API :)

for that you can use https://uithub.com (g -> u)

- for browsers it shows HTML
- for curl it returns raw text


Looks neat! From what I understood, it's like zipping up your codebase in a streamlined TXT version for LLMs to ingest better?

What'd you say are the differences with using something like Cursor, which has access to your codebase already?


It's in the same lane, it's just that sometimes you need a quick and handy way to get that streamlined TXT from a public repo without leaving your browser.

Might be good to have some filtering as well. I added a repo that has a heap of localized docs that don't make much sense to ingest into an LLM but probably use up a majority of the tokens.

Hey! OP here: gitingest is getting a lot of love right now, sorry if it's unstable but please tell me what goes wrong so I can fix it!

I wrote a tool some time ago called ingest ... to do exactly this from local directories, files, web URLs, etc., as well as estimating tokens and VRAM usage: https://github.com/sammcj/ingest

I implemented this same idea in bash for local use. Useful but only up to a certain size of codebase.

Does this use the txtar format created for developing the go language?

I actually use txtar with a custom CLI to quickly copy multiple files to my clipboard and paste it into an LLM chat. I try not to get too far from the chat paradigm so I can stay flexible with which LLM provider I use
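For reference, txtar (from the Go tools repo) is a deliberately trivial format: each file is introduced by a `-- name --` marker line, followed by its raw contents. A minimal writer sketch (file names invented, and this is not the commenter's custom CLI):

```python
def to_txtar(files: dict[str, str]) -> str:
    """Serialize files into Go's txtar format: a '-- name --'
    marker line followed by the file's raw contents."""
    parts = []
    for name, content in files.items():
        if not content.endswith("\n"):
            content += "\n"  # txtar file sections are newline-terminated
        parts.append(f"-- {name} --\n{content}")
    return "".join(parts)

print(to_txtar({"a.txt": "hello\n", "b.txt": "world\n"}))
```

Because there is no escaping or nesting, the output pastes cleanly into a chat window, which fits the stay-close-to-the-chat-paradigm workflow described above.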


If I understand correctly, this sounds like https://github.com/simonw/files-to-prompt/.

It's quite useful, with some filtering options (hidden files, gitignore, extensions) and support for Claude-style tags.


For some reason it was giving me a large file instead of reading from the README.


It seems to be broken, getting errors like "Error processing repository: Path ../tmp/pallets-flask does not exist"

Thank you, I'll look into it

Very cool! I will try this over the weekend with a new android app to see what kind of README I can generate.

Do you have any plans to expand it?


Yes I want to add a way to target a token count to control your LLM costs

Isn't there a limit on prompt size? How would you actually use this? I'm not very up to speed on this stuff.

Gemini 1.5 Pro has a 2-million-token context window, which is roughly 1000 pages of code.

Most projects would be way too big to put into a prompt. Even if you're technically within the official context window, those figures are often misleading: the window where input is actually useful is usually much smaller than advertised.

What you can do with something like this is store it in a database and then query it for relevant chunks, which you then feed to the LLM as needed.


I wonder about building a local version of this which resolves the dependency paths of the file you're currently working on, to a certain depth, so the LLM gains more context from related files instead of just the whole repo (which could be insane if you use a monorepo).

Ideally let the LLM chunk it up and figure out when to use those chunks.

It's like a web version of Repomix

The example buttons are a nice touch

Very clever!

It’s dead Jim

Github already has a way to get the raw text files

All of them in one operation? How?

I think he is confusing it with the "plain" or "raw" view, so probably not all of them.

It did not digest https://github.com/torvalds/linux ¯\_(ツ)_/¯

This is really nice, congrats on shipping.

I also really like this idea in general of APIs being domains, eventually making the web a giant supercomputer.

Edit: There is literally nothing wrong with this comment but feel free to keep downvoting, only 5,600 clicks to go!



