Distillation is usually used today to tame its resource problems at scale - you run BERT to squeeze out maximum signal from your training data and then distill the model e.g. into cheap CNN for inference.
Distillation reduces accuracy and removes the contextual precision. For example reducing a whole document to some N (1k or so) dimensions have worked very poorly in my experiments for short queries - typically making the relevance worse than basic keyword search.
You seem to be talking about dimensionality reduction, that's not what I was meant. Distillation is training a different model with a cheaper architecture (CNN, LSTM) on the outputs of an expensive teacher model like BERT. This has nothing to do with dimensions.
You might try vector quantization (instead of PCA) if you just need your 768 features to be smaller. ML features tend to be robust to some perturbation.
Well it’s one problem or another. If you compress too much you lose the value, and if you leave it too large you have the size problem.
Inverted indices are very efficient. How much of that can you give up at what trade off? If I’m only going to be better for 10% of queries, is that a cost effective solution? What if I spend the same amount of time tuning a traditional engine a bit more and get better accuracy for 5% of queries? Tradoffs rule the world of practical search implementations.