Show HN: Unriddle – Create your own GPT-4 on top of any document (unriddle.ai)
29 points by naveedjanmo on April 3, 2023 | 51 comments



All these "your own GPT-4" descriptions and the like should really not have "own" in them. They are just someone else using yet another someone's API. A better title might be, "Use our wrapper for OpenAI's GPT-4 service to summarize documents."


I want to use somebody else's gpt-4 API token because I don't have gpt-4 API access.

I asked for whitelist access, didn't get it yet, and now I won't get it because my country banned OpenAI and OpenAI closed my pro account.


"Leverage the power of GPT4 to" something something something sounds nice and advertisey?



*API calls to GPT-4


Thanks for the feedback. I get your point but with Unriddle users curate information for the index that sits on top of GPT-4. It's just one doc for now but in the future it will be multiple - this curation of info sources is where the "own" part comes in.


The title is misleading: it gives the wrong idea or impression. People who work in the field may understand what is actually going on, but I get the feeling that your target market is much broader than that, and I doubt the majority of that market will interpret "Create your own GPT-4 ..." to actually mean "Unriddle users curate information for the index that sits on top of a GPT-4 API that OpenAI owns and we use on your behalf" (or whatever actually goes on).


own vs share your company secrets with the world


As an LLM novice, can someone explain what these "your document" apps are doing? My understanding is that GPT-4 doesn't support fine-tuning, and 50MB is too large to add to the prompt (which would be too expensive anyway).


Hey! I'm the developer of Unriddle - it works using text embeddings. The document is split into small chunks and each chunk is assigned a numerical representation, or "vector", capturing its semantic meaning and relation to the other chunks. When a user submits a prompt, it too is assigned a vector and compared to the chunks. The most similar chunks are then fed into GPT-4 along with the query, ensuring the total number of words doesn't exceed the context window limit.
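In rough Python, that flow looks something like this (a minimal sketch, not Unriddle's actual code; the embedding model, chunk size and top-k value are assumptions):

```python
import numpy as np
import openai  # 0.x-style client, current at the time of this thread

EMBED_MODEL = "text-embedding-ada-002"  # assumed model; not confirmed by the author

def embed(texts):
    """Return one embedding vector per input string via OpenAI's embeddings endpoint."""
    resp = openai.Embedding.create(model=EMBED_MODEL, input=texts)
    return [np.array(d["embedding"]) for d in resp["data"]]

def chunk(text, size=1000, overlap=200):
    """Naive fixed-size character chunking with overlap."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def top_k_chunks(query, chunks, chunk_vecs, k=4):
    """Rank chunks by cosine similarity to the query embedding and keep the top k."""
    q = embed([query])[0]
    sims = [float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v))) for v in chunk_vecs]
    best = sorted(range(len(chunks)), key=lambda i: sims[i], reverse=True)[:k]
    return [chunks[i] for i in best]

# Embed the document once, then retrieve the most relevant chunks per query.
doc_chunks = chunk(open("paper.txt").read())
doc_vecs = embed(doc_chunks)
relevant = top_k_chunks("What is this research about?", doc_chunks, doc_vecs)
```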


//The similar chunks are then fed into GPT-4 along with the query

Since GPT can use things from its context arbitrarily, does it solve the hallucination issue, even for ebooks?


Awesome - I knew about vectorising/embeddings for semantic search, but I hadn't thought of using the search results as a prompt prefix - clever!


Yeah, it's the pattern all these tools are using.

Use SentenceTransformers in Python to write to the database (e.g. Pinecone) and then do the same for queries. Use the results as context.
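A minimal sketch of that write/read pattern (the index name, embedding model and credentials here are placeholders, and the exact Pinecone client calls vary by version):

```python
import pinecone
from sentence_transformers import SentenceTransformer

pinecone.init(api_key="YOUR_KEY", environment="us-west1-gcp")  # placeholder credentials
index = pinecone.Index("docs")  # assumes an index with a matching dimension already exists

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

# Write: embed each chunk and upsert it, keeping the raw text as metadata.
chunks = ["chunk one ...", "chunk two ..."]
vectors = model.encode(chunks).tolist()
index.upsert([(f"chunk-{i}", vec, {"text": txt})
              for i, (vec, txt) in enumerate(zip(vectors, chunks))])

# Read: embed the query the same way and pull the nearest chunks back as context.
query_vec = model.encode("What is a traded player exception?").tolist()
hits = index.query(vector=query_vec, top_k=4, include_metadata=True)
context = "\n\n".join(m["metadata"]["text"] for m in hits["matches"])
```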


What OpenAI API calls allow sending these small chunks?

When you query something like "What is this research about?" is it able to use data from all chunks?


It's just the GPT-4 API - the chunks are sent as part of a prompt. In that case it won't use data from all chunks but it will try to find any chunks that provide descriptions of the document. I've found with research papers, for example, it fetches parts of the introduction and abstract.
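In rough Python, the final call is something like this (a sketch; the system prompt and temperature are assumptions, not Unriddle's actual settings):

```python
import openai

def answer(question, retrieved_chunks):
    """Stuff the retrieved chunks into the prompt and ask GPT-4 to answer from them."""
    context = "\n\n".join(retrieved_chunks)
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the answer is not in the context, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp["choices"][0]["message"]["content"]
```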


Oh so there is pre-processing to find the useful portions? What are you using for the pre-processing?

I feel that it's inevitable that OpenAI et al. will be able to handle large PDF documents eventually. But until then I'm sure there's a lot of value in this kind of pre-processing/chunking.


Yeah I think you're right - the 32k context window for GPT-4 (not available to everyone yet) is already enough for research papers. I'm using a library called Langchain; there's also LlamaIndex.


Can the vectorization of chunks and the search for context close to the query be done with any LLM, and then only the relevant chunks be sent to OpenAI?


Vectorisation is done via OpenAI's embedding API, and the chunking/querying happens through the Langchain library. But there are a few different ways of doing it - another good library is LlamaIndex.
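For reference, the LangChain version of that pipeline is only a few lines (a sketch; class names have moved around between LangChain releases, and FAISS here stands in for whatever vector store is actually used):

```python
# pip install langchain faiss-cpu openai
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

text = open("paper.txt").read()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_text(text)

db = FAISS.from_texts(chunks, OpenAIEmbeddings())  # vectorization via OpenAI's embedding API
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(model_name="gpt-4"),
                                 retriever=db.as_retriever())
print(qa.run("What is this research about?"))
```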


Thanks a lot! Do you _have_ to do vectorization and querying with the same LLM? Can someone do vectorization with one and do the querying with relevant chunks with another?


Simply speaking: they chunk the document (make it smaller so that it can be sent to GPT) and then vectorize it (turn it into numbers / a vector array). From there it's stored in a vector store. Now, when you query, you first query your vector store for the context (the relevant part of the 50MB file) and then send that context along with the question to GPT.

You're right that GPT-4 doesn't support fine-tuning, but I think people (in general) might be misunderstanding what fine-tuning does.


Good explanation. Thanks! Can the first part, i.e. vectorizing and finding relevant chunks, be done with any LLM (e.g. a self-hosted one) and the second part, i.e. querying with the relevant chunks, be done with OpenAI?


I've never had these retrieval demos work well. Either the LLM hallucinates content that could plausibly be in the document, or its `reasoning` is so hamstrung by the constraint that it misunderstands the meaning of the specific quote it's referencing.

I just tried this one on the NBA CBA, which I would think is an ideal use case, and it didn't answer a single question correctly. Hopefully we find a more clever way to use LLMs in conjunction with a knowledge-base.


Can you share the link you used and a few questions/correct answers?


PDF link: https://cosmic-s3.imgix.net/3c7a0a50-8e11-11e9-875d-3d44e94a...

"What's a mid-level exception?" (Initially stated the info wasn't in the provided context, but upon asking again, it got it correctly)

"How long does a team have to wait after receiving a player in a trade before trading the player again?" (It's 60 days under certain circumstances, it told me 6 months)

"What's a traded player exception?" (Claims isn't in the document)

"How are traded player exceptions created?" (Answers correctly)

So after playing with it a bit more I think I understand the sorts of questions it does and doesn't like to answer, but still, I would think fuzzy matches for section titles should work.


A question to the author: can you perform an ablation study with respect to the chunks? In other words, if you put irrelevant/random chunks from the document into the context, would the quality of answers decrease or stay similar?

A potential issue might be that the chunks just serve to activate GPT-4's massive internal knowledge and are not actually used as the basis for an answer. For example, GPT-4 has surely seen Dune in its training corpus and could be answering from memory.
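One cheap way to probe this, reusing the helpers from the retrieval sketches above ("better" still has to be judged by hand or checked against the document):

```python
import random

def ablation(question, chunks, chunk_vecs, k=4):
    """Ask the same question with retrieved, random, and no context to see
    whether the chunks actually drive the answer or GPT-4 answers from memory."""
    retrieved = top_k_chunks(question, chunks, chunk_vecs, k=k)
    shuffled = random.sample(chunks, k=min(k, len(chunks)))
    return {
        "retrieved": answer(question, retrieved),
        "random": answer(question, shuffled),
        "none": answer(question, []),
    }
```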


This is an interesting idea. I'll have a think about a way to start measuring it. In Unriddle, any responses given that aren't drawn from the document are prefaced with a message to that effect. The bot usually says something like "I appreciate your curiosity about [query], but as an AI assistant, my primary focus is to provide advice on [document description]."


I'm looking for a more holistic solution that "understands" the entirety of a book (let's say fiction to drive the point) and can "reason" about character arcs / action evolution throughout the entire narrative.

There are lots of solutions using embeddings which basically boil down to a text search and picking some number of sentences / paragraphs around the search results to construct a "context", which may work with a limited set of technical / non-fiction documents but is otherwise of narrow use.


I was playing around with Dune inside of Unriddle yesterday and it seemed to work pretty well for describing the overarching narrative. https://app.unriddle.ai/bot/55fee905-1174-4b33-8e67-5dfe8301...

But I expect this kind of querying will be much better as the context windows for LLMs increase.


I'd imagine the model already knows about Dune, that's why you were getting usable results. Try with an obscure piece of literature instead.


Hey Naveed, this is a great project well done. I'm curious about the summarisation of longform content. You mention that you're using Langchain - are you using the MapReduce approach for documents that exceed the 32k context window, or some other approach? 600 pages at ~500 words/tokens a page would mean about $20 to mapreduce through a big doc, which seems crazy especially if you iterate on those 'summary' prompts. Or are you using embeddings for everything including summary prompts?
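For anyone unfamiliar, the map-reduce pattern in question boils down to something like this hand-rolled sketch (not necessarily what Unriddle or LangChain's map_reduce chain actually does; using a cheaper model for the map step is one obvious way to cut that cost):

```python
import openai

def summarize(text, instruction="Summarize the following text in a few sentences."):
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # cheaper model for the per-chunk "map" step; an assumption
        temperature=0,
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
    )
    return resp["choices"][0]["message"]["content"]

def map_reduce_summary(chunks):
    """Map: summarize each chunk independently. Reduce: summarize the concatenation."""
    partial = [summarize(c) for c in chunks]
    return summarize("\n\n".join(partial),
                     instruction="Combine these partial summaries into one coherent summary.")
```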


I believe these chatbots would be a great solution for reference manuals / user manuals for microcontrollers. Those documents usually have thousands of pages (fewer than 10,000), sometimes logically and sometimes absurdly illogically organized.

But these tools are severely limited in the maximum number of pages. This tool has a max of 300 pages. ChatPDF (or whatever it was called) has a limit of 2,000 pages in the paid version. So I will need to wait until those models can handle more.


Is there any progress on the Llama front where I can use that model and update/train it against my data? (i.e. all self-hosted)



Oh interesting, I hadn't come across this one. Thanks for sharing


Hey, this looks amazing!

I couldn't find any information about pricing. Is this going to be a free tool?


Thanks! It's free to use for now and will always be at least partially free :)


I am assuming you are calling the OpenAI API, which requires paying based on tokens. What's your business model, since all the users are unpaid users?

Do you cover the costs with ads? Do you charge users?

Thank you


I'm just exploring whether or not it's useful. If people continue using it and/or it evolves into something more useful, I'll likely charge a monthly fee for access to certain features (e.g. longer PDFs, merging multiple PDFs into one bot, having multiple bots). The plan is to keep it partially free though.


Got it, thank you. Currently, is there any limit on the PDF size or number of pages to upload? For example, if I upload my entire book PDF (500 pages), it will cost you a lot.


From your tweet, I saw you now support a 2x file size limit (300 → 600 pages). I am assuming it's costing a lot for the free tier. Best of luck :)


Does it accept only PDF docs at the moment? What's the maximum size of PDF file that I can give it to analyze without breaking it?


Yeah, it's only PDFs for now. What other file types did you want to try? Max file size is 600 pages / 50 MB.


Cool.

Can it ingest multiple PDFs in the same 'context', or would I have to assemble it all into one (under the 50mb limit)?

What is it using to

Can we (provoke you to) set the model temperature in the conversation to either minimize the hallucinating or increase the 'conjecture/BS/marketing-claims' factor?


Right now it's just one pdf per bot but yeah you could hack it by merging the pdfs and then generating a new bot. Interesting suggestion - did you notice a particular hallucination? What kind of docs would be high vs. low temperature?


Yes. See this comment [0]. Another HNer with API access tried just ingesting the paper without context, with some instructions and model-temp=0, and got better results.

I've also found in my area that it'll happily hallucinate stuff -- after all, it has zero actual understanding, it just predicts the most likely filler in the given context.

Tamping that down and just getting it to cut out the BS/overconfidence response patterns and reply with "I know X and I don't know Y" would be incredibly useful.

When we get back an "IDK", we can probe in a different way, but falsely thinking that we know something when we are actually still ignorant is worse than just knowing we've not yet got an answer.

[0] https://news.ycombinator.com/item?id=35392685


Is the software used for vectorizing the documents also an API, or is that open-source?


There's an OpenAI API for vectorizing which I'm accessing through the Langchain library :)


Does not work for me.


Bing AI also does that.





