It's just OpenAI embeddings. We fetch them and push them to Redis with a 30-day TTL. The backing data that gets embedded rarely changes, so we don't need to create new embeddings very often. We batch whatever does need to be embedded.
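Roughly, the caching layer looks like this (a minimal sketch; the Redis setup, key scheme, and hashing are illustrative, not our exact code):

```python
import hashlib
import json

import redis
from openai import OpenAI

r = redis.Redis(host="localhost", port=6379)  # connection details are placeholders
client = OpenAI()

TTL_30_DAYS = 30 * 24 * 60 * 60


def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Return embeddings for texts, hitting OpenAI only for cache misses."""
    results: dict[str, list[float]] = {}
    misses: list[str] = []

    for text in texts:
        key = "emb:" + hashlib.sha256(text.encode()).hexdigest()
        cached = r.get(key)
        if cached is not None:
            results[text] = json.loads(cached)
        else:
            misses.append(text)

    if misses:
        # The embeddings endpoint accepts a list of inputs, so misses go out in one batch.
        resp = client.embeddings.create(model="text-embedding-ada-002", input=misses)
        for text, item in zip(misses, resp.data):
            key = "emb:" + hashlib.sha256(text.encode()).hexdigest()
            r.set(key, json.dumps(item.embedding), ex=TTL_30_DAYS)
            results[text] = item.embedding

    return [results[t] for t in texts]
```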
The full RAG workflow - embedding the user input with ADA, deserializing the cached embeddings, running cosine similarity, and calling gpt-3.5-turbo - takes about 3 seconds end-to-end to get a result.
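In code, that query path is roughly the following (a sketch, assuming chunk embeddings have already been deserialized from the cache; the function names and prompt are illustrative):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def answer(question: str, docs: list[tuple[str, list[float]]], top_k: int = 4) -> str:
    """docs is a list of (chunk_text, embedding) pairs pulled from the cache."""
    # 1. Embed the user input with ADA.
    q_emb = np.array(
        client.embeddings.create(
            model="text-embedding-ada-002", input=[question]
        ).data[0].embedding
    )

    # 2. Rank chunks by cosine similarity against the query embedding.
    ranked = sorted(
        docs, key=lambda d: cosine_similarity(q_emb, np.array(d[1])), reverse=True
    )
    context = "\n\n".join(chunk for chunk, _ in ranked[:top_k])

    # 3. Ask gpt-3.5-turbo to answer from the retrieved context.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```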
OpenAI embeddings are one input per request payload, right? Have you hit any rate limits doing that?
We have a performance budget of ~1 second for the generate-index-search pipeline, which may or may not be feasible. I discounted OpenAI because it seemed like we're guaranteed to hit the rate limit if we flood them with concurrent embedding requests. The typical corpus we need to work with is 20 concurrent documents ranging from ~100 KB to ~2 MB, and chunking those documents to fit the 8k-token context window balloons the request count further.
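To put rough numbers on that fan-out (back-of-the-envelope only; the ~4 bytes/token ratio and 500-token chunk size are assumptions, not measurements):

```python
# Estimate how many embedding calls a corpus generates if each chunk is its own request.
BYTES_PER_TOKEN = 4    # rough average for English text
CHUNK_TOKENS = 500     # assumed chunk size, well under the 8k window


def chunks_per_doc(doc_bytes: int) -> int:
    tokens = doc_bytes / BYTES_PER_TOKEN
    return max(1, round(tokens / CHUNK_TOKENS))


for size in (100_000, 2_000_000):  # ~100 KB and ~2 MB documents
    print(f"{size:>9,} bytes -> ~{chunks_per_doc(size)} chunks")

# ~50 chunks for a 100 KB doc, ~1,000 for a 2 MB doc. Across 20 concurrent documents
# that's anywhere from ~1,000 to ~20,000 embedding calls if sent one chunk per request,
# which is why batching multiple inputs per request matters for rate limits.
```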
You absolutely want to chunk them smaller than 8k. Have you tested different chunking strategies? It can make a huge difference in whether you actually recall useful information in chunks small enough to be usable.
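For reference, here's a simple fixed-size chunker with overlap, one of many possible strategies (the sizes are arbitrary starting points worth tuning):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character-based chunks.

    Sizes are in characters for simplicity; a token-based splitter
    (e.g. using tiktoken) tracks the model's limits more precisely.
    """
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks
```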