
I didn't read through the paper (just the abstract), but isn't the whole point of the KL divergence loss to get the best compression, which is equivalent to Bayesian learning? I don't really see how this is novel; I'm sure people were doing this with Markov chains back in the 90s.
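To make the compression link concrete, here's a small sketch (my own illustration, not from the paper; the distributions p and q and the helper names are made up for the example). The point is the identity E_p[-log2 q(x)] = H(p) + KL(p || q): the expected code length when you compress p-distributed data with a code built for model q exceeds the source entropy by exactly the KL divergence, so minimizing the KL term is the same as minimizing the excess bits.

    import math

    def entropy(p):
        # Shannon entropy of the true source, in bits per symbol
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def cross_entropy(p, q):
        # Expected code length when coding p-distributed symbols with a code optimal for q
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    def kl(p, q):
        # KL divergence D(p || q) in bits: the excess code length over the entropy
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.7, 0.2, 0.1]  # "true" source distribution (made up)
    q = [0.5, 0.3, 0.2]  # model distribution (made up)

    # cross_entropy(p, q) == entropy(p) + kl(p, q), up to floating-point error
    assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-9
    print(cross_entropy(p, q), entropy(p), kl(p, q))

The Bayesian side of the equivalence is sequential prediction: a Bayesian mixture over models assigns a probability P(x_1..x_n) to the data, and -log2 P(x_1..x_n) is the code length an arithmetic coder achieves with that predictor, which is exactly what 90s-era context-model compressors like PPM and context-tree weighting exploited.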



In fact, it is nothing new.



