Very interesting that someone has finally tried out muP in the real world. Do I understand the usage correctly:
muP is only used to avoid choosing a learning rate separately for each model size? If so, I wonder how it compares to standard heuristics, like the one in the OG scaling laws paper by OpenAI, and to tricks like rewinding a few steps after a loss explosion.
For some reason muP was not trusted for the largest training runs? Why is that?