For those unaware, OpenAI recently announced [0] an API change where they said their newer embedding models use Matryoshka Representation Learning to support shortening embeddings. Basically, you can use a shorter prefix of the full representation to do query/lookup more cheaply without losing much quality. Quote:
“Native support for shortening embeddings:
Using larger embeddings, for example storing them in a vector store for retrieval, generally costs more and consumes more compute, memory and storage than using smaller embeddings.
Both of our new embedding models were trained with a technique [Matryoshka Representation Learning] that allows developers to trade-off performance and cost of using embeddings. Specifically, developers can shorten embeddings (i.e. remove some numbers from the end of the sequence) without the embedding losing its concept-representing properties by passing in the dimensions API parameter. For example, on the MTEB benchmark, a text-embedding-3-large embedding can be shortened to a size of 256 while still outperforming an unshortened text-embedding-ada-002 embedding with a size of 1536.”
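The shortening is exposed through the dimensions parameter mentioned in the quote. A minimal sketch with the official Python client, assuming an OPENAI_API_KEY in the environment (the input string is just an illustrative placeholder):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Ask the API for a 256-dimensional embedding directly, instead of
    # truncating the full 3072-d text-embedding-3-large vector yourself.
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input="Matryoshka embeddings nest smaller vectors inside larger ones.",
        dimensions=256,
    )
    vector = resp.data[0].embedding
    print(len(vector))  # 256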
Context: OpenAI was "caught" using Matryoshka embeddings in their new release. They apologized and added references to the paper in their release notes.
You can put just a prefix of the dimensions in your vector database, saving a lot of memory and compute when retrieving nearest neighbors.
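If you already have full-length embeddings, the client-side version of this is just truncation plus renormalization before the similarity search. A minimal numpy sketch with made-up sizes (the re-normalization matters because truncation changes vector norms, and cosine similarity assumes unit vectors):

    import numpy as np

    def truncate_and_normalize(embeddings: np.ndarray, dims: int) -> np.ndarray:
        # Keep only the first `dims` dimensions and rescale to unit length.
        truncated = embeddings[:, :dims]
        norms = np.linalg.norm(truncated, axis=1, keepdims=True)
        return truncated / norms

    # Hypothetical corpus of 10k full-size (3072-d) embeddings plus one query.
    corpus = np.random.randn(10_000, 3072).astype(np.float32)
    query = np.random.randn(1, 3072).astype(np.float32)

    # Store and search only the first 256 dimensions.
    corpus_256 = truncate_and_normalize(corpus, 256)
    query_256 = truncate_and_normalize(query, 256)

    # Cosine similarity reduces to a dot product on unit vectors.
    scores = (corpus_256 @ query_256.T).ravel()
    top_k = np.argsort(-scores)[:10]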
They declare by fiat that retaining only the first few dimensions should lead to low classification error and construct the training loss accordingly. (Equation 1 on page 4 of the PDF.)
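Roughly, that loss is an ordinary classification loss applied to several nested prefixes of the same embedding and summed. A minimal PyTorch sketch under my own assumptions (the prefix sizes and uniform weighting here are illustrative; the paper's Equation 1 also allows per-prefix weights c_m):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MatryoshkaLoss(nn.Module):
        # One linear classifier per nested prefix length of the embedding.
        def __init__(self, embed_dim: int, num_classes: int,
                     prefix_dims=(64, 128, 256, 512)):
            super().__init__()
            assert max(prefix_dims) <= embed_dim
            self.prefix_dims = prefix_dims
            self.heads = nn.ModuleList(
                [nn.Linear(m, num_classes) for m in prefix_dims]
            )

        def forward(self, z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
            # Sum cross-entropy over prefixes so that the first m dimensions
            # must already be a usable representation for every m in prefix_dims.
            loss = z.new_zeros(())
            for m, head in zip(self.prefix_dims, self.heads):
                logits = head(z[:, :m])  # use only the first m dimensions
                loss = loss + F.cross_entropy(logits, labels)
            return loss

    # Toy usage: a batch of 32 embeddings of size 512 and 10 classes.
    z = torch.randn(32, 512, requires_grad=True)
    labels = torch.randint(0, 10, (32,))
    loss = MatryoshkaLoss(512, 10)(z, labels)
    loss.backward()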
Both are methods to reduce the overall size of your embeddings, but from what I understand, quantization is generally better than dimensionality reduction, especially if training is quantization-aware.
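For concreteness, here is a generic int8 scalar-quantization sketch (a plain illustration of the idea, not OpenAI's or the paper's method): float32 embeddings shrink 4x, and the round trip shows how much precision is lost.

    import numpy as np

    def quantize_int8(x: np.ndarray):
        # Symmetric per-vector int8 quantization: 4x smaller than float32.
        scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
        q = np.round(x / scale).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
        return q.astype(np.float32) * scale

    emb = np.random.randn(1_000, 1536).astype(np.float32)
    q, scale = quantize_int8(emb)
    recon = dequantize(q, scale)
    print("max abs reconstruction error:", np.abs(emb - recon).max())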
More seriously, it looks like a potential reduction in the cost to train a neural network.
You can think of every meaningful step forward in deep learning as a reduction in the cost of training, or an improvement in the ability of the signal to propagate.
The paper talks about a specific type of model called an embedding model that produces vectors for datapoints that are useful for downstream tasks.
Normally you have to choose a single dimensionality for the vectors, and the storage and search cost of your database scale with that dimensionality.
The method described here trains a model that offers multiple “options” for the length of its embeddings: you can use the smaller vectors to save space, but they don’t work quite as well. The paper analyzes this trade-off in more depth.
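One practical way to pick a size is to measure, on your own corpus, how often the truncated search still finds the same neighbors as the full-size search. A small sketch of that check, entirely my own suggestion rather than anything from the paper or the API:

    import numpy as np

    def _normalize(x: np.ndarray) -> np.ndarray:
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    def truncated_recall(corpus: np.ndarray, queries: np.ndarray,
                         dims: int, k: int = 10) -> float:
        # Fraction of each query's full-vector top-k neighbors that are still
        # retrieved when corpus and queries are truncated to `dims` dimensions.
        full_c, full_q = _normalize(corpus), _normalize(queries)
        short_c, short_q = _normalize(corpus[:, :dims]), _normalize(queries[:, :dims])
        hits = 0
        for fq, sq in zip(full_q, short_q):
            full_top = set(np.argsort(-(full_c @ fq))[:k])
            short_top = set(np.argsort(-(short_c @ sq))[:k])
            hits += len(full_top & short_top)
        return hits / (k * len(queries))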
[0] https://openai.com/blog/new-embedding-models-and-api-updates