I was comparing a batch of transcriptions between these models and vosk, and noticed that the medium.en model produces some weird results compared to the others. I've seen a number of loops in which one word or a short sequence of words repeats several times. It seems more prone than the others to output that reads like nonsense.
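If anyone wants to flag these loops in their own batches, a crude n-gram count works as a first pass. This is just a sketch; the n-gram size and repeat threshold are arbitrary choices, not tuned values:

    from collections import Counter

    def repeated_ngrams(text, n=3, min_repeats=3):
        # Count every n-word window; a transcript stuck in a loop shows
        # the same window far more often than natural speech would.
        words = text.lower().split()
        grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
        return {gram: count for gram, count in grams.items() if count >= min_repeats}

    # Example: flags the single-word loop in a degenerate transcript.
    print(repeated_ngrams("so so so so so so anyway", n=1))
    # {('so',): 6}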
More troubling is a short audio clip that got a few full sentences back, several times the text length that comes back from the other models or from vosk. The content of the sentences is extremely far from the audio content. The best alignment I can find is that the first word of medium.en's interpretation is somewhat phonetically similar to the audio.
The small.en model doesn't show these behaviors, at least in this data set.
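For anyone who wants to reproduce the comparison, here's a minimal sketch of the kind of harness I mean, assuming the openai-whisper and vosk Python packages. The clip path and the vosk model directory are placeholders, not my actual paths:

    import json
    import wave

    import whisper  # pip install openai-whisper
    from vosk import Model, KaldiRecognizer  # pip install vosk

    AUDIO = "clip.wav"  # placeholder path: 16 kHz mono PCM

    def whisper_text(name):
        # Transcribe the clip with one Whisper checkpoint.
        return whisper.load_model(name).transcribe(AUDIO)["text"].strip()

    def vosk_text(model_dir):
        # Feed the same clip through vosk in chunks.
        wf = wave.open(AUDIO, "rb")
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            rec.AcceptWaveform(data)
        return json.loads(rec.FinalResult()).get("text", "")

    for name in ("small.en", "medium.en"):
        text = whisper_text(name)
        print(name, len(text.split()), "words:", text)
    print("vosk:", vosk_text("vosk-model"))  # placeholder model directory

On the problem clips, medium.en's word count comes back several times the small.en and vosk counts for the same audio.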
The whole value of this model is in its 680,000 hours of training data, and to reuse that value you need the large model, not the smaller ones. The smaller versions just don't have enough capacity to represent the training data properly.
I get that. I'm saying the medium.en model specifically seems to have some weird edges to its behavior that aren't present in the models up or down the scale from it, or in the similarly sized plain 'medium' model.
It's the only one that seems to occasionally spit out significant chunks of training data rather than something that resembles the audio.