Great article. Can someone smarter than me explain why an embedding is or isn't ...

tripplyons · 2024-06-16T02:04:30

Text embeddings should not be used as a form of encryption. They are optimized to contain as much information as possible from the text they represent and can often be decoded into their input: https://arxiv.org/abs/2310.06816

xrd · 2024-06-16T02:13:49

Great answer! And your blog is great, too!

diwank · 2024-06-16T05:41:49

Yep, and embeddings can be decoded back into meaningful text if you have the model weights and the tokenizer.

didgeoridoo · 2024-06-16T02:18:07

Seems like a better comparison is to a one-way hash function.

Given a set of vectors resulting from an embedding model, it’s cheap & easy to check if a certain document is similar to the original source, as you just run the embedding on the comparison document and choose your favorite similarity metric.
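A minimal sketch of that check, for anyone who wants to see it concretely. Note that `toy_embed` is a hypothetical stand-in (hashed character trigrams), not a real embedding model; in practice you would call whatever model produced the stored vectors. The similarity metric here is cosine similarity, which is just one common choice.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Hypothetical stand-in for a real embedding model.

    Counts hashed character trigrams into a fixed-size vector.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vec


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# The "leaked" vector: all the checker has is this, not the text.
stored = toy_embed("the quick brown fox jumps over the lazy dog")

# Embed candidate documents and compare them against the stored vector.
near = cosine_similarity(stored, toy_embed("the quick brown fox jumped over a lazy dog"))
far = cosine_similarity(stored, toy_embed("quarterly revenue grew by nine percent"))
print(f"near-duplicate: {near:.3f}  unrelated: {far:.3f}")
```

The point is only that the check costs one embedding call plus a dot product per candidate, so guessing-and-checking against a leaked vector is cheap.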

However, it's very hard to recreate the source itself. As far as I can tell, you'd basically have to run a very expensive form of blind search (closer to gradient-free hill climbing than to actual gradient descent): generate multiple texts, run embedding on them, check similarity, pick the closest one, and repeat with that text as the starting point.

Maybe someone can correct me if there exists an efficient way of reconstructing the original text. I would be very interested to know.

Edit: I appear to have just described a super naive and slow implementation of vec2text (see sibling comment). Will have to read that paper in more detail but it looks really cool.
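For reference, the naive loop described above looks roughly like this. It is a toy sketch and not vec2text itself: `toy_embed` is a hypothetical stand-in for a real model, and the attacker is assumed to already know a small candidate vocabulary and the length of the text, which is a big simplification.

```python
import hashlib
import math
import random


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """Hypothetical stand-in for a real embedding model (hashed trigrams)."""
    vec = [0.0] * dim
    padded = f"  {text.lower()} "
    for i in range(len(padded) - 2):
        digest = hashlib.md5(padded[i:i + 3].encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def naive_invert(target, vocab, length, steps=2000, seed=0):
    """Gradient-free hill climbing: mutate one word, keep it if similarity improves."""
    rng = random.Random(seed)
    best = [rng.choice(vocab) for _ in range(length)]
    best_score = cosine(toy_embed(" ".join(best)), target)
    for _ in range(steps):
        candidate = list(best)
        candidate[rng.randrange(length)] = rng.choice(vocab)  # random single-word edit
        score = cosine(toy_embed(" ".join(candidate)), target)
        if score > best_score:  # only ever accept strict improvements
            best, best_score = candidate, score
    return " ".join(best), best_score


# Assumed attacker knowledge: a small vocabulary and the word count.
VOCAB = ["cat", "dog", "mat", "sat", "ran", "on", "the", "under", "fox", "hat"]
target = toy_embed("the cat sat on the mat")

guess, score = naive_invert(target, VOCAB, length=6)
print(f"guess: {guess!r}  similarity: {score:.3f}")
```

Every step costs a full embedding call, and the search space grows exponentially with vocabulary size and text length, which is why this approach is so slow on anything realistic.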

diwank · 2024-06-16T05:44:58

It is possible to reconstruct the text from an embedding. More importantly, though, I believe that by "embedding" the OP means sending the computed embedding matrix instead of the input, and recovering the input from that is a simple matrix inversion problem.

Computing output embeddings is different.
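If this is about the per-token input embedding lookup (my reading of the comment, not something the OP confirmed), a toy sketch shows why it hides nothing. Each row sent over the wire is a row of the embedding table, so anyone holding the table can match it back to a token id. Strictly speaking this is a nearest-row lookup rather than a matrix inversion. The vocabulary, table size, and values below are all made up for illustration.

```python
import random

# Hypothetical tiny vocabulary and embedding table (one row per token).
VOCAB = ["the", "cat", "sat", "on", "mat"]
_rng = random.Random(0)
EMBEDDING_TABLE = [[_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in VOCAB]


def lookup(token_ids: list[int]) -> list[list[float]]:
    """The 'embedding' step: replace each token id with its table row."""
    return [EMBEDDING_TABLE[i] for i in token_ids]


def squared_distance(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def invert_lookup(rows: list[list[float]]) -> list[int]:
    """Recover token ids by finding the closest table row for each vector."""
    return [
        min(range(len(EMBEDDING_TABLE)),
            key=lambda i: squared_distance(EMBEDDING_TABLE[i], row))
        for row in rows
    ]


ids = [0, 1, 2, 3, 0, 4]  # "the cat sat on the mat"
recovered = invert_lookup(lookup(ids))
print(" ".join(VOCAB[i] for i in recovered))
```

Output embeddings, where the whole text is mixed into one vector by the model, do not have this one-row-per-token structure, which is why recovering text from them takes the heavier methods discussed elsewhere in this thread.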