I really wonder how Claude 100k does on larger workspaces; has anyone tried that? (I don't feel like paying Anthropic another $20 too.) Allegedly it's only marginally better than 3.5-turbo on average, so it'll probably spit out nonsensical code, but maybe the huge context can help.
So I said it's like 50 percent of the way there, implying that it gets things right at a rate of about 50 percent. That's a fuzzy estimate, obviously, so don't get pedantic on me about that number.
When you ask for a large output or give it a large input, you are increasing the sample size, which means it's more likely that some part of the answer is wrong. That's it. Simple statistics, in line with my initial point: with AI we are roughly halfway there at producing correct answers.
If you keep the questions and answers short, you will have a much higher probability of the whole answer being correct.
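To put rough numbers on that, here's a toy sketch in Python. It assumes each line of an answer is independently correct with some fixed probability, which is obviously a simplification, but it shows why short answers are far more likely to be entirely right:

```python
# Toy model: assume each line of an answer is independently correct
# with probability p (a big simplification, just to illustrate).
def prob_fully_correct(p_per_line: float, n_lines: int) -> float:
    """Chance that every line in an n-line answer is correct."""
    return p_per_line ** n_lines

for n in (1, 5, 20, 100):
    print(n, round(prob_fully_correct(0.95, n), 3))
# 1 -> 0.95, 5 -> 0.774, 20 -> 0.358, 100 -> 0.006
```

Even at 95% accuracy per line, the odds of a 100-line answer being entirely correct are under 1%, while a one-line answer is right 95% of the time.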
So that 50k-line program? My claim is that roughly 25k of those lines are usable. But that's a fuzzy claim, because LLMs can probably do much better than 25k; maybe 75% is more realistic, but I'll leave it at 50% so there's a lower bar for the naysayers to attack.
What I would like to do is feed in my entire 50k-line program and get something useful out.