Pgvector – vector similarity search for Postgres (github.com/ankane)
143 points by simonpure on April 22, 2021 | 31 comments



Side observation, not meant as a criticism of this project at all (which seems to be consistently doing things the PostgreSQL way, I commend that).

The PostgreSQL world really loves operators - this extension uses <#>, <-> and <=> for example

I've been working with PostgreSQL JSON queries a bit recently, which are also really operator-heavy: https://www.postgresql.org/docs/13/functions-json.html - ->, ->>, #>, #>>

I have a really hard time remembering what any of these do - I personally much prefer to use function equivalents like json_extract_path() if they are available.
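For example (a sketch - the docs table, body column, and connection string here are made up, but both queries are standard Postgres and should return the same value):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder connection string
    cur = conn.cursor()

    # Operator form: #>> extracts the value at the given path as text.
    cur.execute("SELECT body #>> '{user,name}' FROM docs")
    print(cur.fetchall())

    # Function form: json_extract_path_text() does the same thing,
    # but you can tell what it does without looking it up.
    cur.execute("SELECT json_extract_path_text(body, 'user', 'name') FROM docs")
    print(cur.fetchall())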


I agree, I never remember the syntax for the json operators either. I'd rather have functions, which are readable even if you don't remember what they do; operators need to be looked up. This emphasis on terseness over readability was one of the reasons I abandoned Perl long ago.


JSON query operators are a somewhat unusual case because it's very common to need to chain many operators in a row, which would be very verbose if named functions were used instead. Vector similarity is different in this regard, and that terseness isn't necessary.


Yeah, I would love to be able to use functions for these vector distance operations instead of the very magical operators...


https://github.com/ankane/pgvector/blob/master/vector--0.1.0...

Nothing stops you - all the operators are implemented as plain procedure calls.
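For example, something like this should work (a sketch - the items table, embedding column and connection string are placeholders, and l2_distance is the procedure the linked file appears to bind <-> to, so verify the name against the version you install):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder connection string
    cur = conn.cursor()
    q = "[1, 2, 3]"  # pgvector's text representation of a query vector

    # Operator form: <-> is Euclidean (L2) distance.
    cur.execute("SELECT id FROM items ORDER BY embedding <-> %s LIMIT 5", (q,))
    print(cur.fetchall())

    # Same query via the procedure behind the operator (name assumed
    # from vector--0.1.0.sql; check your installed version).
    cur.execute("SELECT id FROM items ORDER BY l2_distance(embedding, %s) LIMIT 5", (q,))
    print(cur.fetchall())

One caveat, as far as I know: the planner matches index scans on operators rather than on the underlying functions, so once an index is involved only the operator form gets accelerated.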


https://github.com/joosephook/sqlite3-numpy

Minimal working example of storing numpy vectors in sqlite3 using Python:

    import sqlite3
    from scipy.spatial.distance import cdist
    import numpy as np

    class Vector:
        def __init__(self, arg):
            self.arg = arg

        def __str__(self):
            return 'Vector({!r})'.format(self.arg)


    def adapter_func(obj: np.ndarray):
        # Serialize the array's raw buffer for storage as a BLOB.
        return obj.tobytes()

    def converter_func(data: bytes):
        # Rebuild a (float64, 1-D) array from the stored bytes.
        return np.frombuffer(data)

    # Adapt ndarray values on INSERT, and convert columns declared as
    # "vector" back into ndarrays on SELECT (the converter name comes from
    # Vector.__name__ and requires detect_types=sqlite3.PARSE_DECLTYPES).
    sqlite3.register_adapter(np.ndarray, adapter_func)
    sqlite3.register_converter(f"{Vector.__name__}", converter_func)

    if __name__ == '__main__':
        with sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES) as con:
            cur = con.cursor()
            cur.execute("create table test(v vector)")
            cur.execute("insert into test(v) values (?)", (np.random.random(1280),))
            cur.execute("insert into test(v) values (?)", (np.random.random(1280),))

            cur.execute('select v from test')
            vectors = []
            for v, in cur.fetchall():
                print(v.shape, v)
                assert isinstance(v, np.ndarray)
                vectors.append(v)

            assert len(vectors) == 2
            print(cdist(vectors[:1], vectors[1:], metric='cosine')[0])

            cur.close()


Our startup made a package powered by SQLite for this very purpose: https://github.com/plasticityai/magnitude

Might be worth checking out :)


I'm not sure how this is relevant - the difficult part is doing the retrieval based on similarity matching in an efficient way. Calling the cdist function to compare against all vectors would be very slow.


I’ve done something similar (heh) in pure Postgres: https://github.com/turbo/pg-costop


How does this do with large-scale searches (10M+ rows)? Any benchmarks on performance?


Since the answer, in minxomat's words, is "terrible," maybe look at Pinecone (https://www.pinecone.io) which makes light work of searching through 10M+ rows. It sits alongside, not inside, your Postgres (or whatever) warehouse.

Disclosure: I work there.


Out of curiosity, what kind of performance do you get with 100M rows with Pinecone? Looking at your pricing tiers, ~100M rows would need ~200GB of memory, and at $0.10/GB/hr that's $20/hr if I'm not mistaken?

Also, can you join with existing SQL tables to do hybrid searches in pinecone?


Our performance is independent of the collection size, thanks to dynamic sharding. You can expect 50-100ms pretty much regardless of size.

We don't support external SQL joins yet. Depending on what you're trying to do, we have an upcoming feature that might do the job.

At 100M vectors you're well into "volume discount" territory. Even more so with 3B vectors, as you mentioned in another comment. The free trial is capped at 10GB but shoot me an email (greg@pinecone.io) to get around it for a proper test and pricing estimate.


Oh, it's terrible - it's more educational than practical. You'd never want to do full-scan cosine matching in production. Use locality-sensitive hashing (with optimizations like amplification, SuperBit or DenseFly) for real-world workloads.
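For anyone curious, the random-hyperplane flavour of LSH for cosine similarity fits in a few lines of numpy. This is just an illustrative sketch (one hash table, nothing tuned) - amplification, SuperBit, DenseFly etc. build on the same basic idea:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_bits = 128, 8

    # Each bit records which side of a random hyperplane a vector falls on;
    # vectors at a small angle from each other tend to agree on most bits.
    planes = rng.standard_normal((n_bits, dim))

    def bucket_of(v):
        bits = (planes @ v) > 0
        return int(bits.dot(1 << np.arange(n_bits)))  # pack the bits into a bucket id

    vectors = rng.standard_normal((10000, dim))
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(bucket_of(v), []).append(i)

    # At query time, compute exact cosine distance only against the handful
    # of vectors in the same bucket instead of all 10,000. Real systems use
    # several hash tables (amplification) to keep recall up.
    query = rng.standard_normal(dim)
    candidates = buckets.get(bucket_of(query), [])
    print(len(candidates))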


Others have pointed this out already, but have a look at Milvus (https://github.com/milvus-io/milvus). I was able to get a simple version of it running with searches over ~3B vectors in under a second on a single machine (with out of the box configuration, and practically no optimization done just yet).


Nice - especially useful on managed services where installing additional extensions may not be an option. If you want wider usage, you may want to add a section on how to install and set this up with a typical PostgreSQL installation. Also, how was the code tested for correctness, and what is the expected performance?


Love it! We need better ANN approaches in databases; using external servers such as Milvus or Faiss is a pita.

related: https://github.com/netrasys/pgANN


Why are they a pain in the...?


They typically load all the data into memory, so you still need separate persistence to handle crashes (two setups). And since the data is typically huge, you need servers with lots of expensive RAM.


Check out Magnitude, we built it to solve that problem: https://github.com/plasticityai/magnitude

It's still loaded from a file, but heavily uses memory-mapping and caching to be speedy and not overload your RAM immediately. And in production scenarios, multiple worker processes can share that memory due to the memory mapping.
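The basic idea is easy to demo with a plain numpy memmap (this isn't Magnitude's actual format - that's SQLite-backed - just the memory-mapping part; file name and shapes are made up):

    import numpy as np

    # Write the vectors to disk once.
    np.random.random((100000, 300)).astype(np.float32).tofile("vectors.f32")

    # Each worker maps the file instead of loading it. The mapping is backed
    # by the OS page cache, so concurrent processes share the same physical
    # pages, and only the rows you actually touch get pulled into RAM.
    vecs = np.memmap("vectors.f32", dtype=np.float32, mode="r").reshape(-1, 300)
    print(vecs[4242][:5])  # reads only the pages containing this row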

Granted it's read-only, so might not be exactly what you are looking for.

Disclaimer: I'm the author.


How about a vector-oriented 'database' instead? Pinecone (https://www.pinecone.io/) does both exact and approximate search, and it's fully managed so you don't have to worry about reliability, availability, etc.

PS: I work there


the usual yet-another-moving-part (YAMP) complexity.


I like the long-running trend of using Postgres as a standard platform to build good tools on.


Can somebody chime in and say whether this would be a useful solution to recommend similar texts if I compute fastText, doc2vec or other embeddings?

What do people use for such vector similarity queries in production when there's no DB support? Surely having the "real" DB like Postgres and a separate one for vectors alongside it would be cumbersome.


Neat!

I guess from the 'ivfflat' keyword that this project is serving vectors produced by FAISS[1] somehow?

[1] https://github.com/facebookresearch/faiss


FAISS doesn't produce vectors; it's simply a vector similarity search engine.
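To make that concrete, typical FAISS usage looks roughly like this - the vectors come from your own embedding model, and FAISS only indexes and searches them (random data stands in for real embeddings here; the IVF-plus-flat-storage structure is what pgvector's 'ivfflat' appears to refer to as well):

    import numpy as np
    import faiss  # pip install faiss-cpu

    d = 128
    xb = np.random.random((10000, d)).astype('float32')  # your embeddings, produced elsewhere
    xq = np.random.random((5, d)).astype('float32')      # query vectors

    # Inverted-file index with flat (uncompressed) vector storage, 100 clusters.
    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFFlat(quantizer, d, 100)
    index.train(xb)  # learn the coarse clustering
    index.add(xb)
    distances, ids = index.search(xq, 5)  # 5 nearest neighbours per query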


It produces vectors, and it also does the (distributed) searching.

Here is a partial list of possible vector types:

https://github.com/facebookresearch/faiss/wiki/Faiss-indexes


Seems that way. Faiss is mentioned in the “Thanks” list.


I have been using Smlar for a while for cosine similarity. Might this project provide a viable alternative? Smlar can be quite slow.

https://github.com/jirutka/smlar


This looks pretty great! Do you have any sense of its performance? Very roughly, what kind of latency can one expect over, say, 1M, 10M, or 100M vectors? Ballpark: milliseconds, seconds, days?


Would this be a good fit for recommending books to users when I have, say, 3M books and 200K users? If not, what should I be looking for?



