I think the jury is still out on whether these will actually scale to ultra-long language-understanding sequences. RWKV, for example, is still trained like GPT, but is architected so it can be run as an RNN at inference time. This is awesome, but it's unclear whether the training regime will limit the effective use of long-range recurrent context.
Training as GPT vs RNN gives you numerically identical results with RWKV; they're just two ways of computing the same thing. It's trained in GPT-mode because that's cheaper -- you can parallelize over the sequence length. In practice it isn't going to be any different from training with back-propagation through time over the same sequence length.
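To make that concrete, here's a toy sketch of the idea (my simplification, not the actual RWKV kernel: scalar decay, no bonus term, per-channel details dropped). It computes the same decay-weighted mixing twice, once as a parallel matrix op over the whole sequence ("GPT-mode") and once as a step-by-step recurrence ("RNN-mode"), and checks that the outputs match:

```python
# Toy sketch of GPT-mode vs RNN-mode equivalence for a decay-weighted mix.
# Simplified from RWKV's WKV: scalar decay w, no bonus term.
import numpy as np

T, C = 16, 8                      # sequence length, channels
rng = np.random.default_rng(0)
k = rng.normal(size=(T, C))       # "keys"
v = rng.normal(size=(T, C))       # "values"
w = 0.5                           # positive decay rate (scalar for simplicity)

# --- Parallel ("GPT-mode"): build the full causal decay matrix at once ---
t_idx = np.arange(T)
decay = np.where(t_idx[:, None] >= t_idx[None, :],                 # causal mask
                 np.exp(-w * (t_idx[:, None] - t_idx[None, :])), 0.0)
num = np.einsum('ti,ic->tc', decay, np.exp(k) * v)   # weighted sum of values
den = np.einsum('ti,ic->tc', decay, np.exp(k))       # normalizer
out_parallel = num / den

# --- Recurrent ("RNN-mode"): carry a running numerator/denominator state ---
a = np.zeros(C)                   # running sum of exp(-w*dt + k_i) * v_i
b = np.zeros(C)                   # running sum of exp(-w*dt + k_i)
out_recurrent = np.zeros((T, C))
for t in range(T):
    a = np.exp(-w) * a + np.exp(k[t]) * v[t]
    b = np.exp(-w) * b + np.exp(k[t])
    out_recurrent[t] = a / b

# Same quantity either way, up to floating-point error.
print(np.allclose(out_parallel, out_recurrent))      # True
```

The parallel form is what you train with (one big batched computation over the sequence), the recurrence is what you run at inference; the gradients you get from the parallel form are the same ones BPTT over the recurrence would give, because it's literally the same function.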