I've only read the abstract, not the full paper, but isn't the whole point of the KL divergence loss to achieve the best possible compression, which is in turn equivalent to Bayesian learning? I don't see what's novel here; people were doing essentially this with Markov-chain models back in the 90s (PPM-style context models feeding an arithmetic coder come to mind).
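
For what it's worth, the connection I have in mind is the standard one; this is my own sketch, not anything taken from the paper, and the symbols ($p$ for the data distribution, $q$ for the model, $\pi$ for a prior) are my notation:

$$\mathbb{E}_{x \sim p}\!\left[-\log q(x)\right] \;=\; H(p) + D_{\mathrm{KL}}(p \,\|\, q),$$

so minimizing the KL term is the same as minimizing the expected code length of a code built on $q$ (an arithmetic coder spends about $-\log_2 q(x)$ bits on $x$), up to the irreducible entropy $H(p)$. The Bayesian side is the usual prequential/MDL identity: the code length of a sequence under the Bayesian mixture is the negative log marginal likelihood,

$$-\log \int p_\theta(x_{1:n})\,\pi(\theta)\,d\theta \;=\; \sum_{t=1}^{n} -\log p(x_t \mid x_{<t}),$$

i.e. compressing sequentially with the posterior predictive is exactly Bayesian updating.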