
Really interesting! Reading the paper, it sounds like the core of it breaks down into two things:

1. Encoding speech sounds into an IPA-like representation, then decoding that representation into the target language

2. Extracting "tone color" (emotion, accent, rhythm, pauses, intonation), removing it from the IPA-like representation, then adding it back in at the target-language layer

So as a result: I'm a native English speaker, but I could hear "my" voice speaking Chinese with a tone color similar to my own! I wonder, if I recorded that output and then actually learned to speak Chinese fluently, how similar would the two be? I also wonder whether some kind of "tone color translator" is needed to map the tone color markers of American English onto the relevant ones for other languages; if so, how does that work? Or is that mapping already learned as part of the model?
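
To make the decomposition concrete, here's a toy Python sketch of how I'm picturing stages 1 and 2 fitting together. Every function name is hypothetical, and the linear subtract/add of a speaker embedding is purely illustrative; the actual model presumably learns these mappings rather than doing anything this simple:

    import numpy as np

    # Toy stand-ins for the paper's learned components. Treat each frame
    # of the encoder output as content + speaker ("tone color") embedding.

    def extract_tone_color(frames: np.ndarray) -> np.ndarray:
        """Speaker embedding as the mean over time (toy stand-in)."""
        return frames.mean(axis=0)

    def remove_tone_color(frames: np.ndarray, tone: np.ndarray) -> np.ndarray:
        """Strip the speaker embedding, leaving neutral IPA-like content."""
        return frames - tone

    def add_tone_color(frames: np.ndarray, tone: np.ndarray) -> np.ndarray:
        """Re-apply the speaker's tone color to target-language frames."""
        return frames + tone

    # Round trip: encode -> strip tone color -> (translate) -> re-add it.
    rng = np.random.default_rng(0)
    source_frames = rng.normal(size=(100, 16))  # stand-in for encoder output
    tone = extract_tone_color(source_frames)
    content = remove_tone_color(source_frames, tone)
    # ... a real system would map `content` to target-language phonemes here ...
    target_frames = add_tone_color(content, tone)  # fed to the decoder/vocoder

If a "tone color translator" really is needed, it would presumably sit in the gap between the remove and add steps, mapping English tone-color markers into the target language's space.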
