
Regarding 1), you could watch https://www.youtube.com/watch?v=klTvEwg3oJ4 . Pinecone is a vector database, and an LLM can use one to extend its memory beyond its token limit. Traditionally an LLM can answer only according to what's provided in its context, which is capped by the token limit; with a vector database, it can query the store to retrieve information such as your name.
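
To make the pattern concrete, here is a minimal sketch in Python, assuming the Pinecone client's v2-era init/Index interface and the OpenAI Python SDK; the index name, environment, key, and stored fact are all placeholders of my own:

    import pinecone
    from openai import OpenAI

    oa = OpenAI()  # reads OPENAI_API_KEY from the environment
    pinecone.init(api_key="YOUR_PINECONE_KEY", environment="us-west1-gcp")
    index = pinecone.Index("assistant-memory")  # hypothetical index name

    def embed_one(text: str) -> list[float]:
        # One embedding vector for one piece of text.
        resp = oa.embeddings.create(model="text-embedding-ada-002", input=[text])
        return resp.data[0].embedding

    # Store a fact that would otherwise fall outside the context window.
    fact = "The user's name is Alice."
    index.upsert(vectors=[("fact-1", embed_one(fact), {"text": fact})])

    # Later, retrieve it by semantic similarity to the question.
    hits = index.query(vector=embed_one("What is my name?"),
                       top_k=1, include_metadata=True)
    print(hits["matches"][0]["metadata"]["text"])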



So... if I wrote a book manuscript and wanted an LLM to help me track plot holes by asking it questions about it, I can't do that within token limits (aside from the various summarization tricks people use with ChatGPT), but I could somehow parse the manuscript into a vector database and hook that up to my LLM?


You would partition the manuscript into a sequence of chunks, then call the OpenAI API to calculate a vector embedding for each chunk.
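
A minimal sketch of that indexing step, assuming the OpenAI Python SDK (v1 interface) and a naive fixed-size character chunker of my own; the model name and file path are placeholders:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def chunk_text(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
        # Fixed-size character chunks with a little overlap, so sentences
        # straddling a boundary still appear whole in some chunk.
        chunks, start = [], 0
        while start < len(text):
            chunks.append(text[start:start + size])
            start += size - overlap
        return chunks

    def embed(texts: list[str]) -> list[list[float]]:
        # One embedding vector per input text, in order.
        resp = client.embeddings.create(model="text-embedding-ada-002",
                                        input=texts)
        return [d.embedding for d in resp.data]

    chunks = chunk_text(open("manuscript.txt").read())
    chunk_vectors = embed(chunks)  # keep these around for the query step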

When you want to query against your manuscript, you call the OpenAI API to calculate a vector embedding for your query, locally find the chunks "near" it, concatenate those chunks, then pass this context text along with your query to GPT-3.5-turbo or GPT-4.
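
And a sketch of that query side, continuing the chunks, chunk_vectors, embed, and client names from the indexing sketch above (all my own): rank chunks by cosine similarity to the question, keep the top few, and hand them to the chat model as context:

    import numpy as np

    def cosine(a: list[float], b: list[float]) -> float:
        # Cosine similarity between two embedding vectors.
        a, b = np.asarray(a), np.asarray(b)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def answer(question: str, k: int = 4) -> str:
        qvec = embed([question])[0]
        # Rank chunk indices by similarity to the question, best first.
        ranked = sorted(range(len(chunks)),
                        key=lambda i: cosine(qvec, chunk_vectors[i]),
                        reverse=True)
        context = "\n\n---\n\n".join(chunks[i] for i in ranked[:k])
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system",
                 "content": "Answer using only the provided manuscript excerpts."},
                {"role": "user",
                 "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return resp.choices[0].message.content

    print(answer("Does the timeline of chapter 3 contradict chapter 7?"))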

I have written up small examples for doing this in Swift [1] and Common Lisp [2].

[1] https://github.com/mark-watson/Docs_QA_Swift

[2] https://github.com/mark-watson/docs-qa


And the missing glue is that "vectors closest to the question string" actually produces pretty good results. You won't get Google-level relevance, but for "free", with a really dumb search algorithm, you'll be at the level of an Elasticsearch instance tuned by someone who knows what they're doing.

I think, amid the chaos of all the other cool stuff you can do with these models, people are glossing over that these LLMs close the loop on search built on word- and sentence-embedding techniques like word2vec, GloVe, ELMo, and BERT. The fact that you can generate quality embeddings for arbitrary text that represent its meaning semantically as a whole is cool as shit.
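
A tiny demo of that claim, reusing the embed and cosine helpers sketched upthread (my own names): two paraphrases should score much closer to each other than either does to an unrelated sentence:

    a = "A feline was resting on the rug."
    b = "The cat lay on the carpet."
    c = "Quarterly earnings beat analyst expectations."
    va, vb, vc = embed([a, b, c])

    print(cosine(va, vb))  # expect a noticeably higher score here...
    print(cosine(va, vc))  # ...than for either unrelated pairing
    print(cosine(vb, vc))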


BTW, I am working on an open source project that aims to make these ideas generally usable, or at least useful for me: http://agi-assistant.org

No public code yet, but I will release it under the Apache 2 license when/if it works well enough for my own daily use.



