It's for KV caching, which in most contexts means inference. But reinforcement learning involves sampling sequences from the model, and KV caching can speed up that sampling phase too, so that's one case where training gets a slight boost.
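To make the mechanism concrete, here's a minimal sketch in PyTorch of a toy single-head attention layer with a KV cache: each decoding step feeds in only the newest token and reuses the cached keys/values from earlier positions instead of recomputing them. The names (`TinyAttention`, `kv_cache`) are illustrative, not from any particular library.

```python
import torch

class TinyAttention(torch.nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model)
        self.k_proj = torch.nn.Linear(d_model, d_model)
        self.v_proj = torch.nn.Linear(d_model, d_model)
        self.scale = d_model ** -0.5

    def forward(self, x, kv_cache=None):
        # x: (batch, 1, d_model) -- only the newest token during decoding.
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)
        if kv_cache is not None:
            # Reuse keys/values computed at earlier positions.
            k = torch.cat([kv_cache[0], k], dim=1)
            v = torch.cat([kv_cache[1], v], dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v
        # Return the updated cache so the next step can extend it.
        return out, (k, v)

# Decoding loop: per-step cost grows linearly with sequence length
# instead of recomputing attention over the whole prefix from scratch.
layer = TinyAttention(d_model=16)
cache = None
token = torch.randn(1, 1, 16)  # stand-in for an embedded prompt token
for _ in range(5):
    out, cache = layer(token, kv_cache=cache)
    token = out  # in a real model: more layers, then sampling the next token
```

The same loop is what runs during RL rollouts (e.g. PPO-style sampling), which is why caching helps there even though the overall procedure is training.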