It's called hallucination. Because the model is trained on noisy, weakly supervised data, such errors do occasionally happen: the model picks up that certain phrases occur in translations and inserts them even when they do not appear in the source audio. This is described in the paper.
I came across it during a silent/instrumental portion of the song I was testing. I asked only because I'm curious how frequently the error shows up; I don't expect it to be very common. The model produces phrase-level rather than word-level timestamps, which is going to make it hard to tokenize music. I asked simply because the parent comment also tested on Japanese.
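If anyone wants to flag these silent-section hallucinations after the fact, here's a rough sketch of one way to do it, not anything from the paper: it assumes the openai-whisper Python package and its per-segment "no_speech_prob" and "avg_logprob" fields; the file name and thresholds are just illustrative guesses.

    import whisper

    # Sketch: flag segments that look like hallucinated text over silence.
    # "song.mp3", the thresholds, and the translate/Japanese settings are
    # assumptions for illustration, not taken from the comments above.
    model = whisper.load_model("base")
    result = model.transcribe("song.mp3", task="translate", language="ja")

    for seg in result["segments"]:
        # High no-speech probability plus low average log-probability often
        # means the model is "filling in" text over a silent passage.
        suspicious = seg["no_speech_prob"] > 0.6 and seg["avg_logprob"] < -1.0
        flag = " [possible hallucination]" if suspicious else ""
        print(f'[{seg["start"]:.1f}-{seg["end"]:.1f}] {seg["text"].strip()}{flag}')

This only flags segments rather than dropping them, since a quiet verse can trip the same heuristic as true silence.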