If you mean that the LM head is just the transposed embedding matrix (tied weights), then this was already done in GPT-2.
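Roughly, it looks like this in PyTorch (a minimal toy sketch, not GPT-2's actual code; the model and sizes are just placeholders):

```python
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size=50257, d_model=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: lm_head.weight and embed.weight are the same tensor,
        # so logits are hidden @ embed.weight.T and only one matrix is trained.
        self.lm_head.weight = self.embed.weight

    def forward(self, token_ids):
        hidden = self.embed(token_ids)   # stand-in for the transformer blocks
        return self.lm_head(hidden)      # (batch, seq, vocab) logits
```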
Unfortunately, the only thing I found out about this is that bigger models benefit from a separate layer. But that was only mentioned somewhere on Discord, so there's no paper to read, and my personal hunch is that tied weights should still work for bigger models too. After all, GPT-3 was basically just a scaled-up GPT-2.
From my personal experiments, models learn better if you give them a harder task, and tied weights could be one of those things. Multi-token prediction could be another, and BitNet could also be considered one... (and dropout too).