
I didn't read through the paper (just the abstract), but isn't the whole point of the KL divergence loss to get the best compression, which is equivalent to Bayesian learning? I don't really see how this is novel; I'm sure people were doing this with Markov chains back in the 90s.
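To make the compression link concrete, here's a small sketch (my own illustration, not from the paper; the distributions p and q and the helper names are made up for the example). The point is the identity E_p[-log2 q(x)] = H(p) + KL(p || q): the expected code length when you compress p-distributed data with a code built for model q exceeds the source entropy by exactly the KL divergence, so minimizing the KL term is the same as minimizing the excess bits.

    import math

    def entropy(p):
        # Shannon entropy of the true source, in bits per symbol
        return -sum(pi * math.log2(pi) for pi in p if pi > 0)

    def cross_entropy(p, q):
        # Expected code length when coding p-distributed symbols with a code optimal for q
        return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

    def kl(p, q):
        # KL divergence D(p || q) in bits: the excess code length over the entropy
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.7, 0.2, 0.1]  # "true" source distribution (made up)
    q = [0.5, 0.3, 0.2]  # model distribution (made up)

    # cross_entropy(p, q) == entropy(p) + kl(p, q), up to floating-point error
    assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-9
    print(cross_entropy(p, q), entropy(p), kl(p, q))

The Bayesian side of the equivalence is sequential prediction: a Bayesian mixture over models assigns a probability P(x_1..x_n) to the data, and -log2 P(x_1..x_n) is the code length an arithmetic coder achieves with that predictor, which is exactly what 90s-era context-model compressors like PPM and context-tree weighting exploited.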



In fact, it is nothing new.



