This is a comparison on a text-generation task.

It also depends on how the HMMs are trained: are they trained by minimizing a loss over each document, or by just counting the frequencies of all the needed transitions?
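For the frequency option, a minimal sketch (the corpus and state names are made up):

```python
# Maximum-likelihood HMM transition estimates obtained purely by
# counting state bigrams and normalizing -- the "frequency" option.
from collections import Counter, defaultdict

def count_transitions(state_sequences):
    """Estimate P(next | current) by normalizing raw transition counts."""
    counts = defaultdict(Counter)
    for seq in state_sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
        for cur, nxts in counts.items()
    }

# Toy "documents" of hidden states:
probs = count_transitions([["A", "B", "A", "A"], ["B", "A", "B"]])
print(probs)  # {'A': {'B': 0.667, 'A': 0.333}, 'B': {'A': 1.0}}
```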

Training with a joint loss will be much more effective for this problem. Training conditional random fields and then sampling from them would also work extremely well here, and it allows arbitrary features for each letter.
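A sketch of the per-letter-features point, assuming the third-party sklearn_crfsuite package; the features themselves (case, neighbors, vowel flag) are invented for illustration:

```python
# Arbitrary per-letter features feeding a linear-chain CRF.
import sklearn_crfsuite

def letter_features(word, i):
    c = word[i]
    return {
        "char": c.lower(),
        "is_upper": c.isupper(),
        "is_vowel": c.lower() in "aeiou",
        "prev": word[i - 1].lower() if i > 0 else "<bos>",
        "next": word[i + 1].lower() if i < len(word) - 1 else "<eos>",
    }

# One training sequence per word; labels are per-letter tags (toy case tags).
X = [[letter_features(w, i) for i in range(len(w))] for w in ["Cat", "dog"]]
y = [["U", "L", "L"], ["L", "L", "L"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```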

RNNs do have great representational power, but the lack of joint training makes them as local as CNNs: the representational power saves the day, yet it never really captures long-distance dependencies.
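To make that concrete, here is the standard char-RNN training step, sketched in PyTorch with arbitrary toy sizes; note the per-step softmax, i.e. a product of local conditionals rather than one jointly normalized sequence score:

```python
# Standard locally normalized char-RNN training (PyTorch, toy sizes).
# The model factorizes P(x) = prod_t P(x_t | x_<t) via a softmax at
# every step -- there is no single global normalization over sequences.
import torch
import torch.nn as nn

vocab, hidden = 50, 64
emb = nn.Embedding(vocab, 32)
rnn = nn.GRU(32, hidden, batch_first=True)
out = nn.Linear(hidden, vocab)
loss_fn = nn.CrossEntropyLoss()  # softmax over vocab at every step

x = torch.randint(0, vocab, (8, 20))      # batch of toy sequences
inputs, targets = x[:, :-1], x[:, 1:]     # predict the next char
h, _ = rnn(emb(inputs))
logits = out(h)                           # (8, 19, vocab)
loss = loss_fn(logits.reshape(-1, vocab), targets.reshape(-1))
loss.backward()
print(float(loss))
```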

Both approaches suffer from the so-called label bias (if the HMM is trained by frequency counting).

CRFs wouldn't have that problem, and RNNs somehow seem to avoid it through sheer representational power, even though their locally normalized training implies that label bias exists there too.
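A toy numeric illustration of the difference (all arc scores are made up): under local per-state normalization, a state with a single outgoing arc passes all its probability mass along no matter how badly the evidence scores that arc, while global normalization lets the evidence kill the whole path:

```python
# Label bias on a toy lattice. State A has ONE outgoing arc, state B
# has two. Under local (per-state) normalization the arc out of A
# always gets probability 1, so the step-2 evidence cannot penalize
# paths through A; global normalization has no such blind spot.
import math

# Unnormalized arc scores for one fixed input; suppose the
# observation strongly disfavors the A -> C transition.
arcs = {
    ("start", "A"): 2.0, ("start", "B"): 1.0,
    ("A", "C"): 0.001,                 # evidence hates this arc...
    ("B", "C"): 1.0, ("B", "D"): 3.0,
}

def local_prob(path):
    """MEMM-style: normalize over the successors of each state."""
    p = 1.0
    for cur, nxt in zip(path, path[1:]):
        z = sum(s for (c, _), s in arcs.items() if c == cur)
        p *= arcs[(cur, nxt)] / z
    return p

def global_prob(path, all_paths):
    """CRF-style: one normalization over whole-path scores."""
    score = lambda pth: math.prod(arcs[e] for e in zip(pth, pth[1:]))
    return score(path) / sum(score(p) for p in all_paths)

paths = [["start", "A", "C"], ["start", "B", "C"], ["start", "B", "D"]]
for p in paths:
    print(p[1:], f"local={local_prob(p):.4f}", f"global={global_prob(p, paths):.4f}")
```

Running it shows the path through A dominating under local normalization (about 0.67) despite the near-zero evidence for A -> C, and collapsing to about 0.0005 under global normalization.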



