Very interesting that someone finally tries out muP in the real world. Do I unde...

Very interesting that someone finally tries out muP in the real world. Do I understand the usage correctly:

MuP is only used to get around choosing an lr for each size? Here I wonder how it compares to standard heuristics like the one in the OG scaling laws paper by OAI and tricks like back winding a few steps after loss explosion.

For some reason muP was not trusted with the largest trainings? Why is that?