Pgvector – vector similarity search for Postgres (github.com/ankane)
143 points by simonpure on April 22, 2021 | 31 comments



Side observation, not meant as a criticism of this project at all (which seems to be consistently doing things the PostgreSQL way, I commend that).

The PostgreSQL world really loves operators - this extension uses <#>, <-> and <=> for example

I've been working with PostgreSQL JSON queries a bit recently, which are also really operator-heavy: https://www.postgresql.org/docs/13/functions-json.html - ->, ->>, #>, #>>

I have a really hard time remembering what any of these do - I personally much prefer to use function equivalents like json_extract_path() if they are available.
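For example (a sketch - the docs table, body column, and connection string here are made up, but both queries are standard Postgres and should return the same value):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder connection string
    cur = conn.cursor()

    # Operator form: #>> extracts the value at the given path as text.
    cur.execute("SELECT body #>> '{user,name}' FROM docs")
    print(cur.fetchall())

    # Function form: json_extract_path_text() does the same thing,
    # but you can tell what it does without looking it up.
    cur.execute("SELECT json_extract_path_text(body, 'user', 'name') FROM docs")
    print(cur.fetchall())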


I agree, I never remember the syntax for the json operators either. I'd rather have functions, which are readable even if you don't remember what they do; operators need to be looked up. This emphasis on terseness over readability was one of the reasons I abandoned Perl long ago.


JSON query operators are a somewhat unusual case because it's very common to need to chain many operators in a row, which would be very verbose if named functions were used instead. Vector similarity is different in this regard, and that terseness isn't necessary.


Yeah, I would love to be able to use functions for these vector distance operations instead of the very magical operators...


https://github.com/ankane/pgvector/blob/master/vector--0.1.0...

Nothing stops you - all the operators are implemented as plain procedure calls.
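For example, something like this should work (a sketch - the items table, embedding column and connection string are placeholders, and l2_distance is the procedure the linked file appears to bind <-> to, so verify the name against the version you install):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder connection string
    cur = conn.cursor()
    q = "[1, 2, 3]"  # pgvector's text representation of a query vector

    # Operator form: <-> is Euclidean (L2) distance.
    cur.execute("SELECT id FROM items ORDER BY embedding <-> %s LIMIT 5", (q,))
    print(cur.fetchall())

    # Same query via the procedure behind the operator (name assumed
    # from vector--0.1.0.sql; check your installed version).
    cur.execute("SELECT id FROM items ORDER BY l2_distance(embedding, %s) LIMIT 5", (q,))
    print(cur.fetchall())

One caveat, as far as I know: the planner matches index scans on operators rather than on the underlying functions, so once an index is involved only the operator form gets accelerated.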


https://github.com/joosephook/sqlite3-numpy

Minimal working example of storing numpy vectors in sqlite3 using Python:

    import sqlite3
    from scipy.spatial.distance import cdist
    import numpy as np

    class Vector:
        def __init__(self, arg):
            self.arg = arg

        def __str__(self):
            return 'Vector({!r})'.format(self.arg)


    def adapter_func(obj: np.ndarray):
        # Serialize the array's raw buffer for storage as a BLOB.
        return obj.tobytes()

    def converter_func(data: bytes):
        # Rebuild a (float64, 1-D) array from the stored bytes.
        return np.frombuffer(data)

    # Adapt ndarray values on INSERT, and convert columns declared as
    # "vector" back into ndarrays on SELECT (the converter name comes from
    # Vector.__name__ and requires detect_types=sqlite3.PARSE_DECLTYPES).
    sqlite3.register_adapter(np.ndarray, adapter_func)
    sqlite3.register_converter(f"{Vector.__name__}", converter_func)

    if __name__ == '__main__':
        with sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES) as con:
            cur = con.cursor()
            cur.execute("create table test(v vector)")
            cur.execute("insert into test(v) values (?)", (np.random.random(1280),))
            cur.execute("insert into test(v) values (?)", (np.random.random(1280),))

            cur.execute('select v from test')
            vectors = []
            for v, in cur.fetchall():
                print(v.shape, v)
                assert isinstance(v, np.ndarray)
                vectors.append(v)

            assert len(vectors) == 2
            print(cdist(vectors[:1], vectors[1:], metric='cosine')[0])

            cur.close()


Our startup made a package powered by SQLite for this very purpose: https://github.com/plasticityai/magnitude

Might be worth checking out :)


I'm not sure how this is relevant - the difficult part is doing the retrieval based on similarity matching in an efficient way. Calling the cdist function to compare against all vectors would be very slow.


I’ve done something similar (heh) in pure Postgres: https://github.com/turbo/pg-costop


How does this do with large-scale searches (10M+ rows)? Any benchmarks on performance?


Since the answer, in minxomat's words, is "terrible," maybe look at Pinecone (https://www.pinecone.io) which makes light work of searching through 10M+ rows. It sits alongside, not inside, your Postgres (or whatever) warehouse.

Disclosure: I work there.


Out of curiosity, what kind of performance do you get with 100M rows with Pinecone? Looking at your pricing tiers, ~100M rows would need ~200GB of memory, and at $0.10/GB/hr that's $20/hr if I'm not mistaken?

Also, can you join with existing SQL tables to do hybrid searches in pinecone?


Our performance is independent of the collection size, thanks to dynamic sharding. You can expect 50-100ms pretty much regardless of size.

We don't support external SQL joins yet. Depending on what you're trying to do, we have an upcoming feature that might do the job.

At 100M vectors you're well into "volume discount" territory. Even more so with 3B vectors, as you mentioned in another comment. The free trial is capped at 10GB but shoot me an email (greg@pinecone.io) to get around it for a proper test and pricing estimate.


Oh, it's terrible - it's more educational than practical. You'd never want to do full-scan cosine matching in production. Use locality-sensitive hashing (with optimizations like amplification, SuperBit or DenseFly) for real-world workloads.
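For anyone curious, the random-hyperplane flavour of LSH for cosine similarity fits in a few lines of numpy. This is just an illustrative sketch (one hash table, nothing tuned) - amplification, SuperBit, DenseFly etc. build on the same basic idea:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_bits = 128, 8

    # Each bit records which side of a random hyperplane a vector falls on;
    # vectors at a small angle from each other tend to agree on most bits.
    planes = rng.standard_normal((n_bits, dim))

    def bucket_of(v):
        bits = (planes @ v) > 0
        return int(bits.dot(1 << np.arange(n_bits)))  # pack the bits into a bucket id

    vectors = rng.standard_normal((10000, dim))
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(bucket_of(v), []).append(i)

    # At query time, compute exact cosine distance only against the handful
    # of vectors in the same bucket instead of all 10,000. Real systems use
    # several hash tables (amplification) to keep recall up.
    query = rng.standard_normal(dim)
    candidates = buckets.get(bucket_of(query), [])
    print(len(candidates))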


Others have pointed this out already, but have a look at Milvus (https://github.com/milvus-io/milvus). I was able to get a simple version of it running with searches over ~3B vectors in under a second on a single machine (with out of the box configuration, and practically no optimization done just yet).


Nice - especially useful on managed services where installing additional extensions may not be an option. If you want wider usage, you may want to add a section on how to install and set this up with a typical PostgreSQL installation. Also, how was the code tested for correctness, and what is the expected performance?


Love it! We need better ANN approaches in databases; using external servers such as Milvus or Faiss is a pita.

related: https://github.com/netrasys/pgANN


Why are they a pain in the...?


They typically load all the data into memory, so you still need separate persistence to handle crashes (two setups). And since the data is typically huge, you need servers with lots of expensive RAM.


Check out Magnitude, we built it to solve that problem: https://github.com/plasticityai/magnitude

It's still loaded from a file, but heavily uses memory-mapping and caching to be speedy and not overload your RAM immediately. And in production scenarios, multiple worker processes can share that memory due to the memory mapping.
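The basic idea is easy to demo with a plain numpy memmap (this isn't Magnitude's actual format - that's SQLite-backed - just the memory-mapping part; file name and shapes are made up):

    import numpy as np

    # Write the vectors to disk once.
    np.random.random((100000, 300)).astype(np.float32).tofile("vectors.f32")

    # Each worker maps the file instead of loading it. The mapping is backed
    # by the OS page cache, so concurrent processes share the same physical
    # pages, and only the rows you actually touch get pulled into RAM.
    vecs = np.memmap("vectors.f32", dtype=np.float32, mode="r").reshape(-1, 300)
    print(vecs[4242][:5])  # reads only the pages containing this row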

Granted it's read-only, so might not be exactly what you are looking for.

Disclaimer: I'm the author.


How about a vector-oriented 'database' instead? Pinecone (https://www.pinecone.io/) does both exact and approximate search, and it's fully managed so you don't have to worry about reliability, availability, etc.

PS: I work there


the usual yet-another-moving-part (YAMP) complexity.


I like the long-running trend of using Postgres as a standard platform to build good tools on.


Can somebody chime in and say whether this would be a useful solution to recommend similar texts if I compute fastText, doc2vec or other embeddings?

What do people use for such vector similarity queries in production when there's no DB support? Surely having the "real" DB like Postgres and a separate one for vectors alongside it would be cumbersome.


Neat!

I guess from the 'ivfflat' keyword that this project is serving vectors produced by FAISS[1] somehow?

[1] https://github.com/facebookresearch/faiss


FAISS doesn't produce vectors; it's simply a vector similarity search engine.
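To make that concrete, typical FAISS usage looks roughly like this - the vectors come from your own embedding model, and FAISS only indexes and searches them (random data stands in for real embeddings here; the IVF-plus-flat-storage structure is what pgvector's 'ivfflat' appears to refer to as well):

    import numpy as np
    import faiss  # pip install faiss-cpu

    d = 128
    xb = np.random.random((10000, d)).astype('float32')  # your embeddings, produced elsewhere
    xq = np.random.random((5, d)).astype('float32')      # query vectors

    # Inverted-file index with flat (uncompressed) vector storage, 100 clusters.
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, 100)
    index.train(xb)  # learn the coarse clustering
    index.add(xb)
    distances, ids = index.search(xq, 5)  # 5 nearest neighbours per query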


It produces vectors, and it also does the (distributed) searching.

Here is a partial list of possible vector types:

https://github.com/facebookresearch/faiss/wiki/Faiss-indexes


Seems that way. Faiss is mentioned in the “Thanks” list.


I have been using Smlar for a while for cosine similarity. Might this project provide a viable alternative? Smlar can be quite slow.

https://github.com/jirutka/smlar


This looks pretty great! Do you have any sense of its performance? Very roughly, what kind of latency can one expect over, say, 1M, 10M, or 100M vectors? Ballpark: milliseconds, seconds, days?


Would this be a good fit for recommending books to users when I have, say, 3M books and 200K users? If not, what should I be looking for?



