LLM applications can benefit from Retrieval-Augmented Generation (RAG) in a similar way that humans benefit from search engines like Google. Therefore, I believe RAG cannot be replaced by prompts or fine-tuning.
MyScaleDB utilizes approximate nearest neighbors (ANN) algorithms such as ScaNN, HNSW, and IVF. As a result, it may not achieve a 100% recall rate. However, depending on the search parameters used, it can attain recall rates of up to 95% or even 99%.
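To make "95% recall" concrete, here is a minimal sketch (plain Python, hypothetical IDs) of how recall@k is usually measured: compare an ANN index's top-k result against the exact top-k from a brute-force scan.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Exact top-10 neighbors for a query vs. what an ANN index (e.g. HNSW with a
# small search parameter) might return: one true neighbor is missed and a
# near-miss is returned in its place.
exact  = [3, 17, 42, 5, 99, 8, 21, 64, 7, 50]
approx = [3, 17, 42, 5, 99, 8, 21, 64, 7, 13]  # 50 missed, 13 returned instead

print(recall_at_k(approx, exact))  # 0.9 -> a "90% recall" search
```

Raising the search parameters (e.g. `ef` in HNSW) explores more of the graph, pushing this number toward 1.0 at the cost of latency.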
Considering that embedding vectors represent a lossy compression of the original text or images, is achieving a 100% recall necessary? I am interested in understanding its practical implications.
> Considering that embedding vectors represent a lossy compression of the original text or images, is achieving a 100% recall necessary?
For the app, maybe not. But as a database absolutist, I think you must be able to dump all rows of a table with something like:
    WITH
      limit_result AS (SELECT *, {similarity} AS metric FROM table ORDER BY {similarity} ASC LIMIT 10),
      dist AS (SELECT MAX(metric) AS max_m FROM limit_result)
    SELECT *, {similarity} AS metric FROM table WHERE {similarity} > (SELECT max_m FROM dist)
    UNION ALL
    SELECT * FROM limit_result
... assuming that the ordered values are unique across the table and fully sortable
A recall of <100% may skip some rows in limit_result; those skipped rows then also won't appear in the main table's scan (their metric is below the cutoff), potentially corrupting a data dump process that relies on sorted output.
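The failure mode is easy to see in a toy simulation (plain Python, hypothetical data): take the first "page" as the top-10 by metric, then scan for everything strictly greater than that page's maximum. If the first page came from an ANN search that silently skipped one row, the follow-up scan excludes it too, and the row vanishes from the dump.

```python
rows = list(range(25))  # 25 rows; the "metric" is just the value itself

exact_page = sorted(rows)[:10]                   # true top-10: rows 0..9
ann_page = [r for r in exact_page if r != 7]     # ANN missed row 7 (90% recall)
max_m = max(ann_page)                            # cutoff = 9, same as exact

rest = [r for r in rows if r > max_m]            # the "rest of the table" scan
dump = ann_page + rest

missing = sorted(set(rows) - set(dump))
print(missing)  # [7] -- row 7 is gone from the supposedly complete dump
```

Note the cutoff itself is unchanged here; the loss comes purely from the skipped row sitting below it, which is why the first page must be exact for this scheme to be safe.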
Since we developed our internal version before the release of usearch, we integrated hnswlib with Faiss's PQ/SQ algorithms. Search performance is typically not a concern in real-world applications, especially for those using LLMs, given their high latency. We highly recommend Google's ScaNN algorithm (integrated into MyScaleDB), as it is much faster for indexing and offers search performance similar to HNSW.
I work on MyScale (https://myscale.com), a fully-managed vector database based on ClickHouse.
Some unique features of MyScale:
1. This solution is built on ClickHouse and offers comprehensive SQL support. Our users leverage vector search for a wide range of interesting OLAP use cases.
2. We utilize a proprietary vector search algorithm called the multi-scale tree graph (MSTG). This algorithm is significantly faster than HNSW for both vector index building and filtered vector searches.
3. We utilize NVMe SSDs for the vector index cache, which greatly reduces the cost of hosting millions of vectors.
We conducted a benchmark to evaluate the precision, throughput (QPS), insert speed, build speed, and cost-effectiveness of Pinecone, Qdrant, MyScale, Weaviate, and Zilliz (Milvus). This information will be valuable for those seeking to choose a vector database for production purposes. The results can be found at https://myscale.github.io/benchmark/.
Then you don't want an LLM at all; exact keyword matching is something we did in the mid-'90s. One of the specific value-adds of an LLM is that it doesn't get stuck when all you have is a half-remembered inexact quote or a vague description.