
Seems like a better comparison is to a one-way hash function.

Given a set of vectors produced by an embedding model, it's cheap and easy to check whether a given document is similar to the original source: you just run the embedding model on the comparison document and apply your favorite similarity metric.
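A minimal sketch of that similarity check (the model name and the 0.9 cutoff are illustrative assumptions, not anything specific to this thread):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # You already hold the source's vector; embed the candidate and compare.
    source_vec = model.encode("the original document")
    candidate_vec = model.encode("a document you suspect matches the source")
    print(cosine(source_vec, candidate_vec) > 0.9)  # assumed similarity cutoff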

However, it's very hard to recreate the source itself. As far as I can tell, you'd basically have to run a very expensive form of blind gradient descent: generate multiple candidate texts, embed each one, check its similarity to the target vector, pick the closest, and repeat with that text as the starting point.
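Something like this naive hill-climbing loop, reusing `model` and `cosine` from the sketch above; `mutate` and its word list are hypothetical stand-ins for a real perturbation strategy:

    import random

    VOCAB = ["the", "a", "secret", "document", "text", "embedding"]  # assumed word list

    def mutate(text):
        # Hypothetical perturbation: replace one random word.
        words = text.split()
        words[random.randrange(len(words))] = random.choice(VOCAB)
        return " ".join(words)

    def reconstruct(target_vec, seed_text, steps=1000, n_candidates=32):
        best, best_score = seed_text, cosine(model.encode(seed_text), target_vec)
        for _ in range(steps):  # each step costs n_candidates embedding calls
            for cand in (mutate(best) for _ in range(n_candidates)):
                score = cosine(model.encode(cand), target_vec)
                if score > best_score:  # keep the closest text seen so far
                    best, best_score = cand, score
        return best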

Maybe someone can correct me if there exists an efficient way of reconstructing the original text. I would be very interested to know.

Edit: I appear to have just described a super naive and slow implementation of vec2text (see sibling comment). Will have to read that paper in more detail but it looks really cool.

It is possible to reconstruct the text from an embedding, but more importantly, I believe that by "embedding" the OP means sending the computed embedding matrix instead of the input text. That is a simple matrix-inversion problem.
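A minimal sketch of that inversion, under one concrete assumption: the transmitted matrix is the model's input embeddings, so every row comes straight out of the embedding table, and a nearest-neighbor lookup (one practical way to "invert" it) recovers the tokens. All names here are illustrative:

    import numpy as np

    def invert_input_embeddings(sent, embedding_table, id_to_token):
        # sent: (seq_len, d) matrix transmitted instead of the raw text
        # embedding_table: (vocab_size, d) input-embedding weights of the model
        tokens = []
        for row in sent:
            dists = np.linalg.norm(embedding_table - row, axis=1)  # distance to every vocab entry
            tokens.append(id_to_token[int(np.argmin(dists))])      # closest row = original token
        return tokens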

Computing output embeddings is different.
