Can someone smarter than me explain why an embedding is or isn't a poor man's form of encryption? You can retrieve the closeness to something using cosine similarity, but you can't go back to the original source (unless you store the embedding side by side, which is what most people do with those vectors).
But, isn't using an embedding model a good way to cloak your original data and still make it comparable?
Text embeddings should not be used as a form of encryption. They are optimized to contain as much information as possible from the text they represent and can often be decoded into their input: https://arxiv.org/abs/2310.06816
Seems like a better comparison is to a one-way hash function.
Given a set of vectors resulting from an embedding model, it’s cheap & easy to check if a certain document is similar to the original source, as you just run the embedding on the comparison document and choose your favorite similarity metric.
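That similarity check is just one dot product once the vectors are normalized. A minimal sketch (the vectors below are made-up stand-ins, not output from any real embedding model):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the norms.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors; a real pipeline would get these from an embedding model.
doc_vec = np.array([0.2, 0.7, 0.1])
query_vec = np.array([0.25, 0.6, 0.15])
score = cosine_similarity(doc_vec, query_vec)  # close to 1.0 = very similar
```

Checking a candidate document against a stored vector is one embedding call plus this function, which is why the "forward" direction is cheap.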
However, it's very hard to recreate the source itself. As far as I can tell, you'd basically have to run a very expensive blind search (there are no gradients to follow, since you only see similarity scores): generate multiple candidate texts, run the embedding on each, check similarity, pick the closest one, and repeat with that text as the starting point.
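A toy sketch of that loop, with an invented letter-frequency "embedding" standing in for a real model (a real attack would have to query the actual embedder, and every name here is illustrative):

```python
import random
import numpy as np

def toy_embed(text):
    # Toy stand-in for a real embedding model: a normalized
    # letter-frequency vector. Real embeddings preserve far more.
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def similarity(a, b):
    # Vectors are already unit-length, so the dot product is cosine similarity.
    return float(np.dot(a, b))

def naive_invert(target_vec, length, steps=2000, seed=0):
    # Blind hill climb: mutate one character at a time and keep the
    # candidate whose embedding is closest to the target vector.
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    guess = "".join(rng.choice(letters) for _ in range(length))
    best = similarity(toy_embed(guess), target_vec)
    for _ in range(steps):
        i = rng.randrange(length)
        cand = guess[:i] + rng.choice(letters) + guess[i + 1:]
        score = similarity(toy_embed(cand), target_vec)
        if score > best:
            guess, best = cand, score
    return guess, best

secret = "embedding"
target = toy_embed(secret)          # this is all the "attacker" gets
recovered, score = naive_invert(target, len(secret))
```

Because this toy embedding only sees letter frequencies, the best it can recover is an anagram of the source. How much a real model leaks depends on how much information its embedding preserves, which is exactly what makes the reconstruction results in the vec2text paper alarming.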
Maybe someone can correct me if there exists an efficient way of reconstructing the original text. I would be very interested to know.
Edit: I appear to have just described a super naive and slow implementation of vec2text (see sibling comment). Will have to read that paper in more detail but it looks really cool.
It is possible to reconstruct the text from an embedding, but more importantly, I believe that by "embedding" the OP means sending the computed embedding vectors instead of the input. If the embedding map is linear and known to the receiver, recovering the input is a simple matrix inversion problem.
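To make the matrix-inversion point concrete, here's a sketch under the strong (illustrative, not real-world) assumption that the "embedding" is a known fixed linear map over a bag-of-words count vector. Real embedding models are deep and nonlinear, so this only shows why a linear cloak offers no protection at all:

```python
import numpy as np

# Illustrative assumption: the "embedding" is a known linear map E applied
# to a bag-of-words count vector x, so the transmitted vector is y = E @ x.
rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 8          # more dimensions than vocabulary terms
E = rng.normal(size=(embed_dim, vocab_size))

x = np.array([2.0, 0.0, 1.0, 0.0, 3.0, 1.0])  # hidden term counts
y = E @ x                                      # the "cloaked" vector sent out

# Anyone who knows E can undo it with ordinary least squares.
x_recovered, *_ = np.linalg.lstsq(E, y, rcond=None)
```

With more embedding dimensions than vocabulary terms and a full-rank map, the least-squares solution recovers the original counts exactly; no cryptanalysis is involved.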
Just to add: it's not easily defeated. I am even hosting a bounty for whoever can break this one (the model weights and tokenizer are available on huggingface) :D
Also to clarify: privacy-preserving here means protecting against the model inference provider. Sort of like: if OpenAI trained a GPT-4 using this scheme and gave the government the key, then the government could use it safely even while it's hosted on OpenAI's servers, and OpenAI, for its part, would not need to share the model weights with the government.