I have a really hard time remembering what any of these do. I personally much prefer to use function equivalents like json_extract_path() if they're available.
I agree; I never remember the syntax for the JSON operators either. I'd rather use functions, which stay readable even if you don't remember exactly what they do; operators have to be looked up. This emphasis on terseness over readability was one of the reasons I abandoned Perl long ago.
JSON query operators are somewhat unusual in that it's very common to chain many of them in a row, which would get very verbose if named functions were used instead (compare the two forms below). Vector similarity is different in this regard, so that terseness isn't necessary.
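For illustration, here's roughly what that looks like against a hypothetical docs table with a jsonb column named data (the table and column names are made up for this example):

    -- Operator form: drill into {"a": {"b": [{"c": 42}]}}, one step per operator
    SELECT data -> 'a' -> 'b' -> 0 ->> 'c' FROM docs;

    -- Function form: one call, with the whole path spelled out as arguments
    SELECT jsonb_extract_path_text(data, 'a', 'b', '0', 'c') FROM docs;

The operator form is shorter once you know it, but the function form reads aloud without a trip to the docs.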
I'm not sure how this is relevant. The difficult part is doing the retrieval based on similarity matching in an efficient way; calling a function like cdist to compare the query against all vectors would be very slow.
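To make the cost concrete, here's what brute-force nearest-neighbor retrieval looks like in SQL. This is a sketch assuming pgvector-style syntax; the items table and the toy 3-dimensional vectors are made up:

    CREATE EXTENSION vector;

    CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(3));

    -- With no index, this evaluates the cosine distance between the query
    -- vector and every single row before sorting: O(n) distance computations.
    SELECT id
    FROM items
    ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
    LIMIT 10;

That full scan is exactly what approximate indexes are meant to avoid.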
Since the answer, in minxomat's words, is "terrible," maybe look at Pinecone (https://www.pinecone.io), which makes light work of searching through 10M+ rows. It sits alongside, not inside, your Postgres (or whatever) warehouse.
Out of curiosity, what kind of performance do you get with 100M rows with Pinecone? Looking at your pricing tiers, ~100M rows would need ~200GB of memory, and at $0.10/GB/hr that's $20/hr, if I'm not mistaken?
Also, can you join with existing SQL tables to do hybrid searches in Pinecone?
Our performance is independent of the collection size, thanks to dynamic sharding. You can expect 50-100ms pretty much regardless of size.
We don't support external SQL joins yet. Depending on what you're trying to do, we have an upcoming feature that might do the job.
At 100M vectors you're well into "volume discount" territory. Even more so with 3B vectors, as you mentioned in another comment. The free trial is capped at 10GB but shoot me an email (greg@pinecone.io) to get around it for a proper test and pricing estimate.
Oh, it's terrible; it's more educational than practical. You'd never want to do full-scan cosine matching in production. Use locality-sensitive hashing (with optimizations like amplification, SuperBit, or DenseFly) for real-world workloads.
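One way to sketch the LSH idea in plain SQL: store a short bit signature per row (the sign bits of a few random hyperplane projections, computed client-side on insert), index it, and run exact distances only on rows whose signature matches the query's. Everything here is hypothetical, reusing the pgvector-style syntax from above:

    -- Variant of the earlier items table with a precomputed signature column
    CREATE TABLE items_lsh (
        id         bigserial PRIMARY KEY,
        embedding  vector(3),
        lsh_bucket integer  -- sign bits of k random hyperplane projections
    );
    CREATE INDEX ON items_lsh (lsh_bucket);

    -- Step 1: cheap candidate filtering via the bucket (an index scan).
    -- Step 2: exact cosine re-ranking, but only over those candidates.
    SELECT id
    FROM items_lsh
    WHERE lsh_bucket = 5  -- the query vector's bucket, hashed the same way
    ORDER BY embedding <=> '[0.1, 0.2, 0.3]'
    LIMIT 10;

In practice you'd hash into several tables (that's the amplification) and union the candidates, trading recall against scan cost.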
Others have pointed this out already, but have a look at Milvus (https://github.com/milvus-io/milvus). I was able to get a simple version of it running with searches over ~3B vectors in under a second on a single machine (with the out-of-the-box configuration and practically no optimization done just yet).
Nice; this is especially useful on managed services where installing additional extensions may not be an option.
If you want wider usage, you may want to add a section on how to install and set this up with a typical PostgreSQL deployment?
Also, how was the code tested for correctness, and what performance should be expected?
They typically load all the data into memory, so you still need separate persistence to handle crashes (two setups to maintain). And since the data is typically huge, you need servers with lots of expensive RAM.
It's still loaded from a file, but it makes heavy use of memory-mapping and caching to stay fast without immediately overloading your RAM. And in production scenarios, multiple worker processes can share that memory thanks to the memory mapping.
Granted it's read-only, so might not be exactly what you are looking for.
How about a vector-oriented 'database' instead? Pinecone (https://www.pinecone.io/) does both exact and approximate search, and it's fully managed, so you don't have to worry about reliability, availability, etc.
Can somebody chime in and say whether this would be a useful solution for recommending similar texts if I compute fastText, doc2vec, or other embeddings?
What do such vector similarity queries use in production when there's no DB support? Surely having the "real" DB like Postgres and a separate one for vectors alongside it would be cumbersome.
This looks pretty great! Do you have any sense of its performance? Very roughly, what kind of latency can one expect over, say, 1M, 10M, or 100M vectors? Ballpark: milliseconds, seconds, days?
The PostgreSQL world really loves operators. This extension uses <#>, <->, and <=>, for example.
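For reference, here's how those read in a query, assuming pgvector-style semantics (<-> Euclidean distance, <#> negative inner product, <=> cosine distance) and the same hypothetical items table sketched above:

    SELECT id FROM items ORDER BY embedding <-> '[1, 2, 3]' LIMIT 5;  -- Euclidean (L2) distance
    SELECT id FROM items ORDER BY embedding <#> '[1, 2, 3]' LIMIT 5;  -- negative inner product
    SELECT id FROM items ORDER BY embedding <=> '[1, 2, 3]' LIMIT 5;  -- cosine distance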
I've been working with PostgreSQL JSON queries a bit recently, which are also really operator-heavy (->, ->>, #>, #>>): https://www.postgresql.org/docs/13/functions-json.html