LLM applications can benefit from Retrieval-Augmented Generation (RAG) in a similar way that humans benefit from search engines like Google. Therefore, I believe RAG cannot be replaced by prompts or fine-tuning.
MyScaleDB utilizes approximate nearest neighbors (ANN) algorithms such as ScaNN, HNSW, and IVF. As a result, it may not achieve a 100% recall rate. However, depending on the search parameters used, it can attain recall rates of up to 95% or even 99%.
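To make "95% recall" concrete, here is a minimal sketch (plain Python, hypothetical IDs) of how recall@k is usually measured: compare an ANN index's top-k result against the exact top-k from a brute-force scan.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Exact top-10 neighbors for a query vs. what an ANN index (e.g. HNSW with a
# small search parameter) might return: one true neighbor is missed and a
# near-miss is returned in its place.
exact  = [3, 17, 42, 5, 99, 8, 21, 64, 7, 50]
approx = [3, 17, 42, 5, 99, 8, 21, 64, 7, 13]  # 50 missed, 13 returned instead

print(recall_at_k(approx, exact))  # 0.9 -> a "90% recall" search
```

Raising the search parameters (e.g. `ef` in HNSW) explores more of the graph, pushing this number toward 1.0 at the cost of latency.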
Considering that embedding vectors represent a lossy compression of the original text or images, is achieving a 100% recall necessary? I am interested in understanding its practical implications.
> Considering that embedding vectors represent a lossy compression of the original text or images, is achieving a 100% recall necessary?
For the app, maybe not. But as a database absolutist, I think you must be able to dump all rows of a table with something like:
    WITH
      limit_result AS (SELECT *, {similarity} AS metric FROM table ORDER BY {similarity} ASC LIMIT 10),
      dist AS (SELECT MAX(metric) AS max_m FROM limit_result)
    SELECT *, {similarity} AS metric FROM table WHERE {similarity} > (SELECT max_m FROM dist)
    UNION ALL
    SELECT * FROM limit_result
... assuming that the ordered values are unique across the table and fully sortable
A recall of <100% may skip some rows in limit_result; those skipped rows then also won't appear in the main table's scan (their metric is below the cutoff), potentially corrupting a data dump process that relies on sorted output.
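The failure mode is easy to see in a toy simulation (plain Python, hypothetical data): take the first "page" as the top-10 by metric, then scan for everything strictly greater than that page's maximum. If the first page came from an ANN search that silently skipped one row, the follow-up scan excludes it too, and the row vanishes from the dump.

```python
rows = list(range(25))  # 25 rows; the "metric" is just the value itself

exact_page = sorted(rows)[:10]                   # true top-10: rows 0..9
ann_page = [r for r in exact_page if r != 7]     # ANN missed row 7 (90% recall)
max_m = max(ann_page)                            # cutoff = 9, same as exact

rest = [r for r in rows if r > max_m]            # the "rest of the table" scan
dump = ann_page + rest

missing = sorted(set(rows) - set(dump))
print(missing)  # [7] -- row 7 is gone from the supposedly complete dump
```

Note the cutoff itself is unchanged here; the loss comes purely from the skipped row sitting below it, which is why the first page must be exact for this scheme to be safe.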
Since we developed our internal version before the release of usearch, we integrated hnswlib with Faiss's PQ/SQ algorithms. Search performance is typically not a concern in real-world applications, especially for those using LLMs, given their high latency. We highly recommend Google's ScaNN algorithm (integrated into MyScaleDB), as it is much faster for indexing and offers search performance similar to HNSW.
I work on MyScale (https://myscale.com), a fully-managed vector database based on ClickHouse.
Some unique features of MyScale:
1. This solution is built on ClickHouse and offers comprehensive SQL support. Our users leverage vector search for a wide range of interesting OLAP use cases.
2. We utilize a proprietary vector search algorithm called the multi-scale tree graph (MSTG). This algorithm is significantly faster than HNSW for both vector index building and filtered vector searches.
3. We utilize NVMe SSDs for the vector index cache, which greatly reduces the cost of hosting millions of vectors.
We conducted a benchmark to evaluate the precision, throughput (QPS), insert speed, build speed, and cost-effectiveness of Pinecone, Qdrant, MyScale, Weaviate, and Zilliz (Milvus). This information will be valuable for those seeking to choose a vector database for production purposes. The results can be found at https://myscale.github.io/benchmark/.
Then you don't want an LLM at all; exact keyword matching is something we did in the mid-'90s. One of the specific value-adds of an LLM is that it doesn't get stuck when all you have is a half-remembered inexact quote or a vague description.