I've been working on using BERT for search, for research and training development, with not so great results.
Note the quote "when it comes to ranking results, BERT will help Search better understand one in 10 searches". This is because of the "keywordese" point they noted earlier in the article. Most searches are 1 or 2 words - there isn't enough to grab onto for meaningful ranking with short queries and a similarity function for longer text documents.
Also, try keeping the systems afloat to handle search like this. BERT is not practical to use for search results by anyone without the scale of a company like Google. You need to have a server farm of GPUs to translate all your documents into tensors - and then keep them around somehow! A document with 10k of text will balloon to ~1MB when converted to a multi-token vector representation. BERT uncased has 768 features - that's 768 floats per token you need to keep around. If you compress it using PCA or averaging across tokens, you lose all the juicy context that you need for the matching and ranking. Also, there currently isn't a good way to keep this stuff around (though there are active projects to get this into Lucene [1],[2]).
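To put rough numbers on the storage point, here is a back-of-envelope sketch. It assumes uncompressed float32 storage (4 bytes per value), and the token counts are made up for illustration; only the 768 features per token comes from the model itself.

```python
# Back-of-envelope storage for keeping one BERT vector per token.
# Assumptions (mine, for illustration): float32 values, no compression,
# and the example token counts in the loop below.
HIDDEN_SIZE = 768          # features per token for BERT base
BYTES_PER_FLOAT32 = 4


def embedding_bytes(num_tokens, hidden_size=HIDDEN_SIZE,
                    bytes_per_value=BYTES_PER_FLOAT32):
    """Raw bytes needed to store one vector per token."""
    return num_tokens * hidden_size * bytes_per_value


for tokens in (128, 512, 2048):
    size = embedding_bytes(tokens)
    print(f"{tokens:>5} tokens -> {size / 1e6:.2f} MB")
```

Under those assumptions each token costs 3,072 bytes, so a single 512-token passage already takes about 1.5 MB before any index overhead.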
I think this is definitely a great achievement in NLP - but it needs breakthroughs in other areas to be useable by product teams implementing search, with any reasonably large content size.
Distillation is usually used today to tame its resource problems at scale - you run BERT to squeeze out maximum signal from your training data and then distill the model, e.g. into a cheap CNN for inference.
Distillation reduces accuracy and removes the contextual precision. For example, reducing a whole document to some N (1k or so) dimensions has worked very poorly in my experiments for short queries - typically making the relevance worse than basic keyword search.
You seem to be talking about dimensionality reduction; that's not what I meant. Distillation is training a different model with a cheaper architecture (CNN, LSTM) on the outputs of an expensive teacher model like BERT. This has nothing to do with dimensions.
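A minimal sketch of what the student is trained on, assuming the common soft-target formulation (KL divergence between temperature-softened teacher and student outputs). The function names, the temperature and the logits are made up for illustration; a real setup would compute this inside a training loop in whatever framework you use.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives softer targets."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over the softened output distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical relevance logits for one query/document pair.
teacher = [2.5, 0.3, -1.0]   # from the expensive model (e.g. BERT)
student = [1.8, 0.6, -0.4]   # from the cheap model (e.g. a small CNN)
print(distillation_loss(teacher, student))
```

Note that the size of any embedding never shows up here: the student only has to reproduce the teacher's outputs.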
You might try vector quantization (instead of PCA) if you just need your 768 features to be smaller. ML features tend to be robust to some perturbation.
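As a rough illustration of the size/fidelity trade, here is a sketch of plain 8-bit scalar quantization. This is a simpler relative of vector quantization (no learned codebook), not the real thing, and the random vector is only a stand-in for an actual 768-feature embedding.

```python
import random


def quantize(vector):
    """Map each float to one of 256 evenly spaced levels (one byte each)."""
    lo, hi = min(vector), max(vector)
    scale = (hi - lo) / 255 or 1.0          # avoid zero scale for constant vectors
    codes = bytes(min(255, round((x - lo) / scale)) for x in vector)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original floats."""
    return [lo + c * scale for c in codes]


random.seed(0)
vec = [random.gauss(0, 1) for _ in range(768)]   # stand-in for a BERT vector
codes, lo, scale = quantize(vec)
restored = dequantize(codes, lo, scale)
max_err = max(abs(a - b) for a, b in zip(vec, restored))

print(len(codes), "bytes instead of", 768 * 4)
print("worst-case per-feature error:", max_err)
```

That is a 4x reduction over float32, with per-feature error bounded by half a quantization step. Whether ranking survives that perturbation is something you would have to measure on your own queries.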
Well it’s one problem or another. If you compress too much you lose the value, and if you leave it too large you have the size problem.
Inverted indices are very efficient. How much of that can you give up, and at what tradeoff? If I’m only going to be better for 10% of queries, is that a cost-effective solution? What if I spend the same amount of time tuning a traditional engine a bit more and get better accuracy for 5% of queries? Tradeoffs rule the world of practical search implementations.
Haha I wish! Too much fidelity has been lost already. The model would just be guessing.
The sniff test is if a person can’t do it, then a model can’t either. Lots of queries look fine for matching, but you really have no idea what the intent or information need of the searcher is.
I’m not sure what you mean. Keywords are keywords. The meaning behind what the user wants is in their head. You can’t turn keywords into a sentence without guessing what they meant.
[1] https://arxiv.org/abs/1910.10208 & https://github.com/castorini/anserini/blob/master/docs/appro...
[2] https://github.com/o19s/hangry