Could you turn this into a psql extension? If this is integrated into an actual ...

jarulraj · on April 30, 2023

Thanks for the helpful suggestion! EVA manages structured data in a SQL database system through SQLAlchemy. It runs on PostgreSQL out of the box. You only need to provide the database connection URL in the EVA configuration file.

Thanks for your candid comment. We take it very seriously. EVA is already being used in production by some collaborators and we would love to support more early adopters :) Please let me know if I can DM you to get more feedback.

startupsfail · on April 30, 2023

Nice.

I skimmed the documentation and this wasn’t clear. It looked like the database was designed from scratch. If this is caching and syntactic sugar over a mix of DB and inference queries, that is interesting and feels a lot less risky.

jarulraj · on April 30, 2023

Thanks for following up on this.

We designed EVA from scratch for managing unstructured data (e.g., video, audio, images, etc.). EVA leverages relational database systems to manage structured data and widely-used libraries to manage feature embeddings (FAISS library [1]). We aim to leverage decades of experience in relational database systems and reduce risk in production deployment.

[1] https://github.com/facebookresearch/faiss

startupsfail · on April 30, 2023

Do you support weighted similarity search? I.e., when I have several embeddings and need to put a weight factor in front of each cosine similarity when I’m performing a query?
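One way to read this question, as a minimal sketch: each item carries several named embeddings, and the query score is a weighted sum of the per-embedding cosine similarities. The function names, the dict layout, and the brute-force ranking below are assumptions made for illustration; this is not EVA, FAISS, or pgvector code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def weighted_similarity(query, item, weights):
    """Weighted sum of per-embedding cosine similarities.

    query, item: dicts mapping an embedding name to a vector.
    weights: dict mapping an embedding name to its weight factor.
    """
    return sum(
        weight * cosine_similarity(query[name], item[name])
        for name, weight in weights.items()
    )


def top_k(query, items, weights, k):
    """Brute-force ranking: return the IDs of the k highest-scoring items."""
    scored = [
        (weighted_similarity(query, embeddings, weights), item_id)
        for item_id, embeddings in items.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]
```

Changing the weights changes which embedding dominates the ranking, which is the behavior the question asks about. A real system would push this scoring into the index rather than scanning every item.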

Faiss seems like an excellent choice. How do you get the vectors into it from the database? Or are they stored separately? I’m currently using pgvector and it’s not GPU optimized. But the advantage is that it enjoys the same levels of data protection as the rest of the database.

Actually, are there any sample vector similarity search queries? I see the feature extractor, but can’t seem to find any similarity search samples.

jarulraj · on May 1, 2023

Great questions!

EVA does not currently support weighted similarity search. We are working on a notebook to illustrate similarity queries. However, EVA already supports queries of this form:

  -- Step 1: Extract objects in Reddit images using the YOLO object detector
  CREATE TABLE reddit_dataset_object (name, data, bboxes) 
  AS SELECT name, data, bboxes FROM reddit_dataset
  JOIN LATERAL UNNEST(YoloV5(data)) AS Obj(labels, bboxes, scores);

  -- Step 2: Build index over features extracted using SIFT
  CREATE INDEX reddit_sift_object_index
  ON reddit_dataset_object (SiftFeatureExtractor(Crop(data, bboxes)))
  USING HNSW;

  -- Step 3: Retrieve the top 10 most similar images
  SELECT id FROM reddit_sift_object_index
  ORDER BY Similarity(SiftFeatureExtractor(Open("input_img_path.jpg")),
           SiftFeatureExtractor(data))
  LIMIT 10;

https://github.com/georgia-tech-db/eva/blob/bfd424fd5beb3cec...

EVA persists the feature vectors directly in a FAISS index. It does not use a relational database system for this purpose. FAISS supports retrieving the original vector by its ID (required for similarity search).
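To illustrate the ID-based retrieval described here, the following is a toy, in-memory stand-in for a vector index. It is not FAISS and not EVA's implementation: the class `TinyVectorIndex` and its methods are hypothetical, and it uses an exhaustive squared-L2 scan instead of HNSW.

```python
class TinyVectorIndex:
    """Toy vector store keyed by integer ID (hypothetical, for illustration)."""

    def __init__(self, dim):
        self.dim = dim
        self._vectors = {}  # maps ID -> stored vector

    def add(self, vec_id, vector):
        """Store a vector under the given ID."""
        if len(vector) != self.dim:
            raise ValueError("vector has the wrong dimensionality")
        self._vectors[vec_id] = list(vector)

    def reconstruct(self, vec_id):
        """Return a copy of the original vector stored under this ID."""
        return list(self._vectors[vec_id])

    def search(self, query, k):
        """Return the IDs of the k nearest vectors by squared L2 distance."""
        def squared_l2(vector):
            return sum((q - x) ** 2 for q, x in zip(query, vector))

        ranked = sorted(self._vectors, key=lambda i: squared_l2(self._vectors[i]))
        return ranked[:k]
```

With such an interface, "find items similar to item 11" becomes `index.search(index.reconstruct(11), k)`: the stored vector is fetched by ID and reused as the query.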

We would love to jointly explore how to support such weighted similarity search queries. Please consider opening an issue with more details on your use case.