It would work better if you could feed speech embeddings to the translation model directly, since it has more language knowledge to choose what the original was more likely to say.
(Of course that might just lead to picking more common translated phrases.)
(Of course that might just lead to picking more common translated phrases.)