From what you said, it sounds like your solution wasn't designed correctly.
You should definitely have been using BM25 and SBERT for something like this, and you definitely should have been asking 3.5 for structured output and doing any math yourself.
If these were answers from a fixed set of documents, there's also a ton of pre-processing you should have been doing.
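For anyone unfamiliar with the combo, here's a rough sketch of what I mean (the libraries and model choice here are my own picks, not a prescription): use BM25 as a cheap lexical filter over the corpus, then re-rank the survivors with SBERT embeddings so only a handful of semantically relevant chunks ever reach the LLM.

    # Hybrid retrieval sketch: BM25 for cheap candidate filtering,
    # SBERT cosine similarity for semantic re-ranking.
    # Assumes `pip install rank_bm25 sentence-transformers`.
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer, util

    docs = ["...chunked contract text...", "...more chunks..."]  # your corpus chunks
    query = "termination for convenience clause"

    # Stage 1: BM25 over whitespace-tokenized chunks.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    scores = bm25.get_scores(query.lower().split())
    candidates = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:50]

    # Stage 2: re-rank the BM25 candidates by embedding similarity.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice is arbitrary here
    doc_emb = model.encode([docs[i] for i in candidates], convert_to_tensor=True)
    q_emb = model.encode(query, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, doc_emb)[0]
    best = [candidates[int(i)] for i in sims.argsort(descending=True)[:5]]
    print([docs[i] for i in best])

The point of the two stages is cost: BM25 is nearly free, so the expensive embedding step only runs on a short candidate list.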
-
I recently helped a friend with a problem of identifying a certain form of language in EDGAR-sourced contracts. He had someone who tried feeding entire documents into LLMs to find it, and the result was a roughly 20-minute search time.
I took a few minutes sitting with him to come up with a demo that used synthetic data and SBERT to process documents in less than 30 seconds. 99% of that was stuff you could do two years ago; the part the LLM helped with was quickly creating buckets of synthetic data, and even that could be done procedurally if you had more time.
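The shape of that demo was roughly the following (this is my reconstruction from memory, and the bucket names and threshold are illustrative): embed a handful of synthetic exemplar sentences per category, then flag any document sentence whose nearest exemplar is close enough.

    # Sketch of the SBERT + synthetic-bucket approach. Each bucket holds
    # synthetic examples (LLM-generated or hand-written) of the language
    # you want to find.
    from sentence_transformers import SentenceTransformer, util

    buckets = {
        "change_of_control": [
            "Upon a change of control, the counterparty may terminate this agreement.",
            "Any merger or acquisition triggers the consent provisions herein.",
        ],
        # ...more buckets of exemplar sentences...
    }

    model = SentenceTransformer("all-MiniLM-L6-v2")
    bucket_embs = {k: model.encode(v, convert_to_tensor=True) for k, v in buckets.items()}

    def flag_sentences(sentences, threshold=0.6):
        """Return (sentence, bucket, score) for sentences near any bucket."""
        sent_embs = model.encode(sentences, convert_to_tensor=True)
        hits = []
        for name, embs in bucket_embs.items():
            # Best-matching exemplar per document sentence.
            sims = util.cos_sim(sent_embs, embs).max(dim=1).values
            for i, s in enumerate(sims):
                if float(s) >= threshold:
                    hits.append((sentences[i], name, float(s)))
        return hits

Embedding every sentence once and comparing against a few dozen exemplars is why this finishes in seconds instead of minutes: no LLM is in the per-document loop at all.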
>> you definitely should have been asking 3.5 for structured output and doing any math yourself.
I suspect that this solution would have required a lot more time given my lack of experience with these technologies, and there are a lot of unknowns for that particular use case. For example, a user could ask to compare legalese from 30 different providers to find all those that met certain conditions; even if you can narrow the input down to just a few sentences per provider, you still need to feed the model 30 times that amount, or else the AI has no way to compare all the providers.
With my solution, GPT only saw a tiny, highly relevant fraction of all the documents. It's possible that the solutions you mentioned would have cut our input size further, but it's not clear by how much given the complexity of the questions and the data.
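To be fair, as I understand the structured-output suggestion, the model never has to see all 30 providers at once: you make one small call per provider asking for a machine-readable verdict, then do the 30-way comparison in ordinary code. Something like this sketch (the prompt, JSON schema, and provider_sentences input are my own assumptions, not what either of us actually built):

    # Sketch: per-provider structured extraction; the comparison is plain Python.
    # Assumes `pip install openai` and OPENAI_API_KEY in the environment.
    import json
    from openai import OpenAI

    client = OpenAI()

    def extract(provider, relevant_sentences):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": (
                    "From the contract excerpts below, answer in JSON only, e.g. "
                    '{"meets_condition": true, "notice_days": 30}.\n\n'
                    + "\n".join(relevant_sentences)
                ),
            }],
        )
        return provider, json.loads(resp.choices[0].message.content)

    # provider_sentences: {provider_name: [retrieved sentences]} from a retrieval step.
    results = [extract(p, sents) for p, sents in provider_sentences.items()]
    matching = [p for p, r in results if r.get("meets_condition")]

Whether that actually beats one big prompt depends on how separable the per-provider questions are, which is exactly the unknown I mentioned above.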