All these "your own GPT-4" products and the like should really not have "own" in the description. They are just someone else using another someone else's API. A better title might be, "Use our wrapper for OpenAI's GPT-4 service to summarize documents."
Thanks for the feedback. I get your point, but with Unriddle, users curate information for the index that sits on top of GPT-4. It's just one doc for now, but in the future it will be multiple; this curation of info sources is where the "own" part comes in.
The title is misleading: it gives the wrong idea or impression. People who work in the field may understand what is actually going on, but I get the feeling that your target market is much broader than that, and I doubt that the majority of that market will interpret "Create your own GPT-4 ..." to actually mean "Unriddle users curate information for the index that sits on top of a GPT-4 API that OpenAI owns and we use on your behalf" (or whatever actually goes on).
As an LLM novice, can someone explain what these "your document" apps are doing? My understanding is that GPT-4 doesn't support fine-tuning, and 50MB is too large to add to the prompt (which would be too expensive anyway).
Hey! I'm the developer of Unriddle - it works using text embeddings. The document is split into small chunks and each chunk is assigned a numerical representation, or "vector", of its semantic meaning and its relation to the other chunks. When a user submits a prompt, it too is assigned a vector, which is then compared against the chunk vectors. The most similar chunks are fed into GPT-4 along with the query, while ensuring the total number of tokens doesn't exceed the context window limit.
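Roughly, the flow looks like this - a minimal sketch, not our production code, written against the pre-1.0 openai Python client; the file name, chunk size and top-k are just placeholders:

```python
import numpy as np
import openai  # assumes OPENAI_API_KEY is set in the environment

def embed(texts):
    # One vector per input string, via OpenAI's embedding endpoint
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

# 1. Split the document into small chunks (naive fixed-size split for illustration)
document = open("paper.txt").read()
chunks = [document[i:i + 1000] for i in range(0, len(document), 1000)]
chunk_vecs = embed(chunks)

# 2. Embed the user's query the same way
query = "What method does the paper propose?"
query_vec = embed([query])[0]

# 3. Rank chunks by cosine similarity and keep the top few
sims = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)
top_chunks = [chunks[i] for i in np.argsort(sims)[::-1][:4]]

# 4. Feed the most similar chunks to GPT-4 along with the query
answer = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Context:\n" + "\n---\n".join(top_chunks)
                                    + "\n\nQuestion: " + query},
    ],
)
print(answer["choices"][0]["message"]["content"])
```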
It's just the GPT-4 API - the chunks are sent as part of a prompt. In that case it won't use data from all chunks but it will try to find any chunks that provide descriptions of the document. I've found with research papers, for example, it fetches parts of the introduction and abstract.
Oh so there is pre-processing to find the useful portions? What are you using for the pre-processing?
I feel that it's inevitable that OpenAI et al. will be able to handle large PDF documents eventually. But until then I'm sure there's a lot of value in this kind of pre-processing/chunking.
Yeah, I think you're right - the 32k context window for GPT-4 (not available to everyone yet) is already enough for research papers. I'm using a library called Langchain; there's also LlamaIndex.
Vectorisation is done via OpenAI's embedding API, and the chunking/querying happens through the Langchain library. But there are a few different ways of doing it - another good library is LlamaIndex.
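For reference, the Langchain version of that pipeline is only a few lines. This is a sketch against the current Langchain API (class names may move between releases), with a placeholder file name and parameters:

```python
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Load and chunk the PDF
docs = PyPDFLoader("paper.pdf").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# Embed the chunks via OpenAI and index them in a local FAISS vector store
db = FAISS.from_documents(chunks, OpenAIEmbeddings())

# At query time: retrieve the most similar chunks and pass them to GPT-4
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-4", temperature=0),
    retriever=db.as_retriever(search_kwargs={"k": 4}),
)
print(qa.run("Summarise the abstract."))
```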
Thanks a lot! Do you _have_ to do vectorization and querying with the same LLM? Can someone do vectorization with one model and then do querying with the relevant chunks using another?
Simply speaking: they chunk the document (make it smaller so that it can be sent to GPT) and then vectorize it (turn each chunk into a numerical vector). The vectors are stored in a vector store; when you query, you first query the vector store for the context (the relevant part of the 50MB file) and then send that context along with the question to GPT.
You are right that GPT-4 doesn't support fine-tuning, but I think (in general) people might be misunderstanding what fine-tuning does. Fine-tuning mainly shapes a model's style and behaviour; it isn't a reliable way to inject new facts, which is why retrieval is used for document knowledge.
Good explanation, thanks! Can the first part, i.e. vectorizing and finding relevant chunks, be done with any LLM (e.g. a self-hosted one), and the second part, i.e. querying with the relevant chunks, be done with OpenAI?
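Not the author, but yes - the two stages are independent. The only constraint is that the chunks and the queries are embedded with the same embedding model. A sketch pairing a self-hosted sentence-transformers embedder with OpenAI for the answer (model names are just examples):

```python
import numpy as np
import openai
from sentence_transformers import SentenceTransformer

# Stage 1: retrieval with a self-hosted embedding model
embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunks = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

query = "What does section 3 say?"
query_vec = embedder.encode([query], normalize_embeddings=True)[0]
best = chunks[int(np.argmax(chunk_vecs @ query_vec))]  # cosine sim, vectors normalized

# Stage 2: answering with a hosted LLM
resp = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Context: {best}\n\nQuestion: {query}"}],
)
print(resp["choices"][0]["message"]["content"])
```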
I've never had these retrieval demos work well. Either the LLM hallucinates content that could plausibly be in the document, or its `reasoning` is so hamstrung by the constraint that it misunderstands the meaning of the specific quote it's referencing.
I just tried this one on the NBA CBA, which I would think is an ideal use case, and it didn't answer a single question correctly. Hopefully we find a more clever way to use LLMs in conjunction with a knowledge-base.
"What's a mid-level exception?" (Initially stated the info wasn't in the provided context, but upon asking again, it got it correctly)
"How long does a team have to wait after receiving a player in a trade before trading the player again?" (It's 60 days under certain circumstances, it told me 6 months)
"What's a traded player exception?" (Claims isn't in the document)
"How are traded player exceptions created?" (Answers correctly)
So after playing with it a bit more I think I understand the sorts of questions it does and doesn't like to answer, but still, I would think fuzzy matches for section titles should work.
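(Even without embeddings, fuzzy-matching a query against the section headings is cheap - a toy sketch with Python's standard library, with made-up titles:)

```python
import difflib

section_titles = ["Salary Cap", "Mid-Level Exception",
                  "Traded Player Exception", "Trade Restrictions"]

query = "what's a mid level exception?"
# Compare lowercased strings; get_close_matches returns the closest titles first
matches = difflib.get_close_matches(
    query.lower(), [t.lower() for t in section_titles], n=2, cutoff=0.3
)
print(matches)  # e.g. ['mid-level exception', ...]
```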
A question for the author: can you perform an ablation study with respect to the chunks? In other words, if you put irrelevant/random chunks from the document into the context, would the quality of answers decrease or stay similar?
A potential issue might be that the chunks just serve to activate GPT-4's massive parametric knowledge and aren't actually used as the basis for an answer. For example, GPT-4 has surely seen Dune in its training corpus and could be answering from memory.
This is an interesting idea. I'll have a think about a way to start measuring it. In Unriddle, any responses given that aren't drawn from the document are prefaced with a message to that effect. The bot usually says something like "I appreciate your curiosity about [query], but as an AI assistant, my primary focus is to provide advice on [document description]."
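One cheap way to start might be a harness like this (hypothetical, not something Unriddle does today): answer each question twice, once with the retrieved chunks and once with random ones, and compare:

```python
import random
import openai

def answer(question, chunks):
    # Build a context-restricted prompt and call GPT-4
    prompt = "Context:\n" + "\n---\n".join(chunks) + f"\n\nQuestion: {question}"
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp["choices"][0]["message"]["content"]

def ablate(question, retrieved_chunks, all_chunks, k=4):
    """Answer once with the retrieved chunks and once with random ones.

    If both answers look equally good, the model is probably drawing on
    its training data rather than the document."""
    return {
        "with_retrieval": answer(question, retrieved_chunks),
        "with_random": answer(question, random.sample(all_chunks, k)),
    }
```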
I'm looking for a more holistic solution that "understands" the entirety of a book (let's say fiction, to drive the point home) and can "reason" about character arcs and how the action evolves throughout the entire narrative.
There are lots of solutions using embeddings, which basically boil down to a text search plus picking some number of sentences/paragraphs around the search results to construct a "context". That may work for a limited set of technical/non-fiction documents but is otherwise of narrow use.
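To make that concrete, the "context construction" step is often as simple as this (a toy sketch):

```python
def neighbourhood(chunks, hit, radius=1):
    """Return the best-matching chunk plus `radius` neighbours on each side."""
    return chunks[max(0, hit - radius):hit + radius + 1]

paragraphs = ["para 0", "para 1", "para 2", "para 3"]
print(neighbourhood(paragraphs, hit=2))  # ['para 1', 'para 2', 'para 3']
```

A window like that has no view of the rest of the narrative, which is exactly why it struggles with character arcs.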
Hey Naveed, this is a great project, well done. I'm curious about the summarisation of long-form content. You mention that you're using Langchain - are you using the MapReduce approach for documents that exceed the 32k context window, or some other approach? 600 pages at ~500 words/tokens a page is roughly 300k tokens, which would mean about $20 to MapReduce through a big doc - that seems crazy, especially if you iterate on those 'summary' prompts. Or are you using embeddings for everything, including summary prompts?
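For anyone following along: the MapReduce approach in Langchain summarizes each chunk and then summarizes the summaries, roughly like this (a sketch; every token passes through the model at least once, which is where the cost comes from):

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

docs = PyPDFLoader("book.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=4000).split_documents(docs)

# "map": summarize each chunk independently;
# "reduce": summarize the chunk summaries into one final summary.
chain = load_summarize_chain(ChatOpenAI(model_name="gpt-4"), chain_type="map_reduce")
print(chain.run(chunks))
```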
I believe these chatbots would be a great solution for reference manuals / user manuals for microcontrollers. Those documents usually have thousands of pages (fewer than 10,000), sometimes logically and sometimes absurdly illogically organised.
But those tools are severely limited in the maximum number of pages. This tool has a max of 300 pages; ChatPDF (or whatever it was called) has a limit of 2,000 pages in the paid version. So I will need to wait until these models can handle more.
I'm just exploring whether it's useful. If people continue using it and/or it evolves into something more useful, I'll likely charge a monthly fee for access to certain features (e.g. longer PDFs, merging multiple PDFs into one bot, having multiple bots). The plan is to keep it partially free, though.
Got it. Thank you
Currently, is there any limit on the PDF size or number of pages you can upload? For example, if I upload my entire book PDF (500 pages), it will cost you a lot.
Can it ingest multiple PDFs in the same 'context', or would I have to assemble it all into one (under the 50mb limit)?
What is it using to
Can we (provoke you to) set the model temperature in the conversation to either minimize the hallucinating or increase the 'conjecture/BS/marketing-claims' factor?
Right now it's just one PDF per bot, but yeah, you could hack it by merging the PDFs and then generating a new bot. Interesting suggestion - did you notice a particular hallucination? What kind of docs would call for high vs. low temperature?
Yes, see this comment [0]. Another HNer with API access tried just ingesting the paper without extra context, with some instructions and model temperature = 0, and got better results.
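For reference, temperature is just a parameter on the completion call (a sketch against the pre-1.0 OpenAI chat API; the system prompt is just an example):

```python
import openai

resp = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0,  # 0 = near-deterministic, less prone to creative filler;
                    # higher values (up to 2) increase variety and "conjecture"
    messages=[
        {"role": "system",
         "content": "If the context does not contain the answer, say \"I don't know\"."},
        {"role": "user", "content": "…context and question here…"},
    ],
)
print(resp["choices"][0]["message"]["content"])
```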
I've also found in my area that it'll happily hallucinate stuff -- after all, it has zero actual understanding; it just predicts the most likely filler in the given context.
Tamping that down and just getting it to cut out the BS/overconfidence response patterns and reply with "I know X and I don't know Y" would be incredibly useful.
When we get back an "IDK", we can probe in a different way, but falsely thinking that we know something when we are actually still ignorant is worse than just knowing we've not yet got an answer.