The idea of the transformer being a trainable key-value store is kind of abstract and weird, and it has little to do with the actual mathematics. The one mathematical piece of it is that the dot product measures similarity between vectors: a query that points in the same direction as a key scores high against it. Beyond that it really is an "if you get it, you get it" kind of thing.
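For what it's worth, here is a rough sketch of the dot-product part, written as a "soft" key-value lookup in plain Python. The names (`soft_lookup`, `keys`, `values`, `query`) and the toy numbers are made up for illustration, and it leaves out things a real attention layer has, such as learned projections and scaling the scores by the square root of the dimension.

```python
import math

def dot(a, b):
    # Dot product: large when the two vectors point the same way.
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    # Turn raw scores into positive weights that sum to 1.
    # Subtracting the max is for numerical stability only.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_lookup(query, keys, values):
    # Score the query against every key, then return a weighted
    # average of the values instead of a single exact match.
    scores = [dot(query, k) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Hypothetical toy "store": three keys, each paired with a one-number value.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]

# This query points the same way as the first key.
query = [5.0, 0.0]
output, weights = soft_lookup(query, keys, values)
print(weights)
print(output)
```

Almost all of the weight lands on the first key, so the output is close to that key's value. A regular dictionary would either hit or miss; here every key contributes a little, and that is what makes the lookup something you can train.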