> Additionally, while this wasn’t an issue for GPT, the Llama chat models would often output hundreds of miscellaneous tokens that were unnecessary for the task, further slowing down their inference time (e.g. “Sure! Happy to help…”).
That's the problem I've been facing with Llama 2 as well. It's almost impossible to have it just output the desired text. It will always add something before and after its response. Does anyone know if there's any prompt technique to fix this problem?
It's not useful for code, but you can see the difference in approach with NovelAI's homegrown Kayra model, which is set up to handle a mix of text completion and instruct functionality. It never includes extraneous prefix/suffix text and will smoothly follow instructions embedded in text without interrupting the text.
I wonder if LLMs will have less reasoning power if they simply return the output. AFAIK, they think by writing their thoughts. So forcing an LLM to just return the goddamn code might limit its reasoning skills, leading to poor code. Is that true?
Potentially it could have an impact if it omits a high-level description before writing the code, although obviously things like "Sure! Happy to help" do not help.
In practice I haven't seen it make too much of a difference with GPT. The model can still use comments to express itself.
For non-coding tasks, adding "Think step by step" makes a huge difference (versus YOLOing a single-word reply).
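For illustration, here's roughly what the two styles look like; `ask` is just a hypothetical stand-in for whatever completion call you actually use, and the toy question is made up:

```python
# Toy illustration of "think step by step" vs. a bare one-word answer.
# `ask` is a hypothetical stand-in for whatever completion client you use.
question = (
    "A bat and a ball cost $1.10 together, and the bat costs $1.00 "
    "more than the ball. How much does the ball cost?"
)

yolo_prompt = question + "\nReply with only the final number."
cot_prompt = question + "\nThink step by step, then give the final number on the last line."

# ask(yolo_prompt)  # models often blurt out the tempting-but-wrong 0.10
# ask(cot_prompt)   # with the working written out, 0.05 is much more likely
```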
> although obviously things like "Sure! Happy to help" do not help.
Yes you're right. I'm mostly concerned with the text that actually "computes" something before the actual code begins. Niceties like "sure! happy to help" don't compute anything.
CoT indeed works. Now I've seen people take it to the extreme by having tree of thoughts, forest of thoughts, etc., but I'm not sure how much "reasoning" we can extract from a model that is obviously limited in terms of knowledge and intelligence. CoT already gets us 80% of the way. With some tweaks it can get even better.
I've also seen simulation methods where GPT "agents" talk to each other to form better ideas about a subject. But then again, it's like trying to achieve perpetual motion in physics. One can't get more intelligence from a system than one puts in the system.
> But then again, it's like trying to achieve perpetual motion in physics. One can't get more intelligence from a system than one puts in the system.
Not necessarily the same thing, as you're still putting in more processing power/checking more possible paths. It's kinda like simulated annealing: sure, the system is dumb, but as long as checking whether you have a correct answer is cheap, it still narrows down the search space a lot.
Yeah I get that. We assume there's X amount of intelligence in the LLM and try different paths to tap into that potential. The more paths are simulated, the closer we get to the LLM's intelligence asymptote. But then that's it—we can't go any further.
You can also just parse the text for all valid code blocks and combine them. I have a script which automatically checks the clipboard for this (rough sketch below).
There's no reason to handle this on the LLM side of things, unless you want to try to optimize how many of the tokens are code vs. comments vs. explanations and such. (Though you could also just start a new context window with only your code.)
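A minimal sketch of the code-block extraction, assuming the pyperclip package for clipboard access (swap in whatever text source you like):

```python
# Minimal sketch of the "parse out the code blocks" approach described above.
# Assumes the pyperclip package for clipboard access.
import re
import pyperclip

FENCE_RE = re.compile(r"```[^\n]*\n(.*?)```", re.DOTALL)

def extract_code_blocks(text: str) -> list[str]:
    """Return the contents of every ```fenced``` block, in order."""
    return [m.group(1) for m in FENCE_RE.finditer(text)]

if __name__ == "__main__":
    reply = pyperclip.paste()           # grab the LLM reply from the clipboard
    blocks = extract_code_blocks(reply)
    print("\n\n".join(blocks))          # everything outside the fences is dropped
```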
The model card also has prompt formats for context-aware document Q&A and multi-CoT; using those correctly improves performance on such tasks significantly.
Llama-2-chat models have been fine-tuned so heavily that they behave like this. You can give few-shot prompting a try, but it still doesn't guarantee the desired output. The most reliable way is to fine-tune on a small (~1k examples) dataset and go from there.
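For what it's worth, a few-shot attempt might look something like the sketch below. The [INST]/<<SYS>> markup follows Meta's published Llama-2-chat template; the example question/answer pairs are made up, and as said, this still isn't a guarantee.

```python
# Hedged sketch: few-shot prompting Llama-2-chat to answer with code only.
# The [INST]/<<SYS>> tags follow Meta's chat template; the example pairs
# are invented purely for illustration.
SYSTEM = "Reply with code only. No explanations, no greetings."

EXAMPLES = [
    ("Write a Python function that reverses a string.",
     "def reverse(s):\n    return s[::-1]"),
    ("Write a Python function that squares a number.",
     "def square(x):\n    return x * x"),
]

def build_prompt(question: str) -> str:
    first_q, first_a = EXAMPLES[0]
    parts = [f"<s>[INST] <<SYS>>\n{SYSTEM}\n<</SYS>>\n\n{first_q} [/INST] {first_a} </s>"]
    for q, a in EXAMPLES[1:]:
        parts.append(f"<s>[INST] {q} [/INST] {a} </s>")
    parts.append(f"<s>[INST] {question} [/INST]")
    return "".join(parts)

print(build_prompt("Write a Python function that checks whether a number is prime."))
```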
It depends on what your goal is, but I've had success reproducing specific output formatting by fine-tuning the base LLaMA2 models instead of the RLHF'd models. My use cases were simpler—information extraction/synthesis from text rather than creative writing—so the base models might not be a good fit for your task.
Prompt the model to always output answers/code within ```content``` fences or as JSON. If it's JSON, you can identify where it starts and ends and strip everything outside it.
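Something like this works as a rough post-processing step, assuming you prompted the model to wrap its answer in a single JSON object:

```python
# Rough sketch of the "strip everything outside the json" idea, assuming the
# model was asked to wrap its answer in a single JSON object.
import json

def extract_json(reply: str) -> dict:
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    return json.loads(reply[start:end + 1])  # niceties before/after are discarded

# extract_json('Sure! Happy to help. {"code": "print(1)"} Let me know!')
# -> {'code': 'print(1)'}
```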