
Very interesting that someone is finally trying out muP in the real world. Do I understand the usage correctly:

muP is only used to avoid having to choose a learning rate for each model size? If so, I wonder how it compares to standard heuristics, like the one in the original scaling laws paper by OpenAI, and to tricks like rewinding a few steps after a loss explosion.

For some reason muP was not trusted for the largest training runs. Why is that?



