With how good gpt-3.5-turbo-0613 is (particularly with system prompt engineering), there's no longer as much of a need to use the GPT-4 API especially given its massive 20x-30x price increase.
The mass adoption of the ChatGPT APIs compared to the old Completion APIs proves my initial blog post on the ChatGPT API correct: developers will immediately switch for a massive price reduction if quality is the same (or better!): https://news.ycombinator.com/item?id=35110998
I run a legal AI startup, and the quality jump from GPT3.5 to GPT4 in this domain is straight-up mind-blowing; GPT3.5 is useless in comparison. But I can see how GPT3.5 offers a more appealing price/performance trade-off in more conversational settings.
I suggested to my wife that ChatGPT would help with her job, and she has found ChatGPT4 to be the same as or worse than ChatGPT3.5. It’s really interesting just how variable the quality can be depending on your particular line of work.
Legal writing is ideal training data: mostly formulaic, based on conventions and rules, well-formed and highly vetted, with much of the best in the public domain.
Medical writing is the opposite, with unstated premises, semi-random associations, and rarely a meaningful sentence.
> Legal writing is ideal training data: mostly formulaic, based on conventions and rules, well-formed and highly vetted, with much of the best in the public domain.
That makes sense. The labor impact research suggests that law will be a domain hit almost as hard as education by language models. Almost nothing happens in court that hasn't occurred hundreds of thousands of times before. A model with GPT-4 power specifically trained for legal matters and fine-tuned by jurisdiction could replace everyone in a courtroom. Well, there's still the bailiff; I think that's about 18 months behind.
My experience is that GPT-3.5 is not better than, or even nearly as good as, GPT-4. Will it work for most use cases? Probably, yes. But GPT-3.5 effectively ignores instructions much more often than GPT-4, and I've found it far, far easier to trip up with things as simple as trailing spaces; it will sometimes exhibit really odd behavior like spelling out individual letters when you give it large amounts of text with missing grammar/punctuation to rewrite. It doesn't seem to matter how I set up the system prompt. I've yet to see GPT-4 do truly strange things like that.
The initial gpt-3.5-turbo was flaky and required significant prompt engineering. The updated gpt-3.5-turbo-0613 fixed all the issues I had, even after stripping out the prompt engineering.
It's definitely gotten better, but yeah, it really doesn't reliably support what I'm currently working on.
My project takes transcripts from YouTube, which don't have punctuation, splits them up into chunks, and passes each chunk to GPT-4, telling it to add punctuation and paragraphs. Part of the instructions tells the model that, if the final sentence of the chunk appears incomplete, it should just try to complete it. Anyway, GPT-3.5-turbo works okay for several chunks but almost invariably hits a case where it either writes a bunch of nonsense or spells out the individual letters of words. I'm sure that there's a programmatic way I can work around this issue, but GPT-4 performs the same job flawlessly.
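For anyone who wants to try the same pipeline, here's a minimal sketch of the chunking step, assuming a fixed word budget per chunk. The function names, the 300-word default, and the prompt wording are my own assumptions rather than the project's actual code, and the model call itself is left out.

```python
def chunk_words(transcript: str, max_words: int = 300) -> list[str]:
    """Split an unpunctuated transcript into chunks of at most max_words words.

    max_words=300 is an arbitrary illustrative default, not a tuned value.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = transcript.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def build_prompt(chunk: str) -> str:
    """Wrap one chunk in instructions paraphrased from the description above."""
    return (
        "Add punctuation and paragraph breaks to the transcript below. "
        "If the final sentence appears incomplete, try to complete it.\n\n"
        + chunk
    )
```

Splitting on a word count means chunks will usually end mid-sentence, which is exactly why the instructions need the "complete the final sentence" clause.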
I've done exactly this for another project. I'd recommend grabbing an open source model and fine-tuning on some augmented data in your domain. For example: I grabbed tech blog posts, turned each post into a collection of phonemes, reconstructed the phonemes into words, added filler words, and removed punctuation+capitalization.
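A rough sketch of the last two augmentation steps (filler words, then stripping punctuation and capitalization), assuming plain English text. The phoneme round-trip is deliberately left out, and the filler list, rate, and function name are hypothetical placeholders, not what was actually used.

```python
import random
import string

# Hypothetical filler list; a real one would come from actual transcripts.
FILLERS = ("um", "uh", "you know")


def make_training_pair(
    text: str, filler_rate: float = 0.1, seed: int = 0
) -> tuple[str, str]:
    """Return (noisy_input, clean_target) for fine-tuning a punctuation model."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    # Drop capitalization and ASCII punctuation (this also removes apostrophes).
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    noisy = []
    for word in stripped.split():
        # Occasionally insert a filler before a word, like spoken transcripts have.
        if rng.random() < filler_rate:
            noisy.append(rng.choice(FILLERS))
        noisy.append(word)
    return " ".join(noisy), text
```

The clean original is kept as the target, so the model learns to both restore punctuation and drop the fillers.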
Sounds interesting. Any chance you could share either the end result that you used for fine-tuning or, even better, the exact steps (i.e. technically how you did each step you already mentioned)?
And which open LLM did you use it with, and how successful have you found it?
If GPT 4 is working for you I wouldn't necessarily bother with this, but this is a great example of where you can sometimes take advantage of how much cheaper 3.5 is to burn some tokens and get a better output. For example, I'd try asking it for something like:
{
"isIncomplete": [true if the chunk seems incomplete]
"completion": [the additional text to add to the end, or undefined otherwise]
"finalOutputWithCompletion": [punctuated text with completion if isIncomplete==true]
}
Technically you're burning a ton of tokens having it state the completion twice, but GPT 3.5 is fast/cheap enough that it doesn't matter as long as 'finalOutputWithCompletion' is good. You can probably add some extra fields to get an even nicer output than 4 would allow cost-wise and time-wise by expanding that JSON object with extra information that you'd ideally input like tone/subject.
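On the consuming side, a minimal sketch of how the reply could be checked, assuming the model has been told to return strict JSON with exactly the three field names from the template above. The function name and the specific validation rules are my own assumptions.

```python
import json


def extract_final_output(raw_reply: str) -> str:
    """Parse the model's JSON reply and return the punctuated text.

    Raises ValueError if the reply is not valid JSON or is missing fields,
    so the caller can retry the chunk.
    """
    data = json.loads(raw_reply)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply is not a JSON object")
    final = data.get("finalOutputWithCompletion")
    if not isinstance(final, str) or not final.strip():
        raise ValueError("missing or empty finalOutputWithCompletion")
    # If the model says the chunk was incomplete, it should have supplied a completion.
    if data.get("isIncomplete") is True and not data.get("completion"):
        raise ValueError("isIncomplete is true but completion is missing")
    return final
```

Since 3.5 is cheap, a failed validation can simply trigger a retry of that chunk.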
I use it to generate nonsense fairytales for my sleep podcast (https://deepdreams.stavros.io/), and it will ignore my (pretty specific) instructions and add scene titles to things, and write the text in dramatic format instead of prose, no matter how much I try.
I mostly use it for generating tests, making documentation, refactoring, code snippets, etc. I use it daily for work along with copilot/x.
In my experience GPT3.5turbo is... rather dumb in comparison. It makes a comment explaining what a method is going to do and what arguments it will have - then misses arguments altogether. It feels like it has poor memory (and we're talking relatively short code snippets, nothing remotely near its context length).
And I don't mean small mistakes - I mean it will say it will do something with several steps, then just miss entire steps.
GPT3.5turbo is reliably unreliable for me, requiring large changes and constant "rerolls".
GPT3.5turbo also has difficulty following the "style/template" from both the prompt and its own response. It'll be consistent then just - change. An example being how it uses bullet points in documentation.
Codex is generally better - but noticeably worse than GPT4 - it's decent as a "smart autocomplete" though. Not crazy useful for documentation.
Meanwhile GPT4 generally nails the results, occasionally needing a few tweaks, generally only with long/complex code/prompts.
tl;dr - In my experience for code GPT3.5turbo isn't even worth the time it takes to get a good result/fix the result. Codex can do some decent things. I just use GPT4 for anything more than autocomplete - it's so much more consistent.
If you're manually interacting with the model, GPT 4 is almost always going to be better.
Where 3.5 excels is with programmatic access. You can ask it for 2x as much text, spending the extra on setup so the end result is well formed, and still get a reply that's cheaper and faster than 4 (for example, ask 3.5 for a response, then ask it to format that response).
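A minimal sketch of that two-pass idea, assuming you already have some function that sends a prompt to the model and returns its text. Here `ask` is a hypothetical stand-in for that call, not a real client API.

```python
from typing import Callable


def answer_then_format(
    ask: Callable[[str], str],
    question: str,
    format_instructions: str,
) -> str:
    """Make two cheap calls: one for the content, one to format that content."""
    draft = ask(question)  # first pass: get the raw answer
    # Second pass: hand the draft back with formatting instructions only.
    return ask(f"{format_instructions}\n\n{draft}")
```

Keeping the model call behind a plain callable also makes it easy to swap 3.5 for 4 on one of the two passes and compare cost and latency.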
I am building an extensive LLM-powered app, and had a chance to compare the two using the API. Empirically, I have found 3.5 to be fairly unusable for the app's use case. How are you evaluating the two models?
It depends on the domain, but chain of thought can get 3.5 to be extremely reliable, especially with the new 16k variant.
I built notionsmith.ai on 3.5: for some time I experimented with GPT 4 but the result was significantly worse to use because of how slow it became, going from ~15 seconds per generated output to a minute plus.
And you could work around that with things like streaming output for some use cases, but that doesn't work for chain of thought. GPT 4 can do some tasks without chain of thought that 3.5 required it for, but there are still many times where it improves the result from 4 dramatically.
For example, I leverage chain of thought in replies to the user when they're in a chat and that results in a much better user experience: It's very difficult to run into the default 'As a large language model' disclaimer regardless of how deeply you probe a generated experience when using it. GPT 4 requires the same chain of thought process to avoid that, but ends up needing several seconds per response, as opposed to 3.5 which is near-instant.
-
I suspect a lot of people are building things on 4 but would get better quality of output if they used more aspects of chain of thought and either settled for a slower output or moved to 3.5 (or a mix of 3.5 and 4)
It depends a lot on the domain, even for CoT. I don't think there are enough NLU evaluations just yet to robustly compare GPT-3.5 w/ CoT/SC vs. GPT-4 wrt domain.
For instance, with the MATH dataset, my own n=500 evaluation showed no difference between GPT-3.5 (w/ and w/o CoT) and GPT-4. I was pretty surprised by that.
I think this is very very use-case dependent, and your use case != everyone's use case.
In my experience, GPT-4 is night and day better than 3.5 turbo for almost everything I use OpenAI for.