I’m unfortunately the sickest I’ve been in years, so this will have to wait. Maybe it’s part of why my comment sounded strange.
There is an idea here, and it’s a mistake to dismiss it out of hand. Adding weights non-uniformly during training (not after) is the key to smaller models that outperform present-day GPT-3.
A sketch of the algorithm: start with a 2x2 block of weights, accumulate the gradients across 10 training steps, then subdivide the quadrant with the largest accumulated delta.
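Roughly this, as a toy quadtree sketch. Everything here is an illustrative assumption on my part (the class names, the 10-step accumulation window, and the random stand-in for real gradients), not a worked-out implementation:

    import numpy as np

    ACCUM_STEPS = 10

    class Leaf:
        def __init__(self, value=0.0):
            self.value = value        # one weight
            self.grad_accum = 0.0     # summed |gradient| seen by this weight

    class Node:
        def __init__(self, value=0.0):
            # A node is a 2x2 block; start with four leaves sharing one value.
            self.children = [[Leaf(value), Leaf(value)], [Leaf(value), Leaf(value)]]

        def leaves(self):
            for row in self.children:
                for child in row:
                    if isinstance(child, Leaf):
                        yield child
                    else:
                        yield from child.leaves()

        def split_hottest(self):
            # Replace the leaf with the largest accumulated |gradient| by a
            # finer 2x2 block initialised from that leaf's value, then reset.
            hottest = max(self.leaves(), key=lambda leaf: leaf.grad_accum)
            self._replace(hottest, Node(hottest.value))
            for leaf in self.leaves():
                leaf.grad_accum = 0.0

        def _replace(self, target, replacement):
            for row in self.children:
                for j, child in enumerate(row):
                    if child is target:
                        row[j] = replacement
                        return True
                    if isinstance(child, Node) and child._replace(target, replacement):
                        return True
            return False

    tree = Node()
    for step in range(1, 31):
        for leaf in tree.leaves():
            leaf.grad_accum += abs(np.random.randn())  # stand-in for a real gradient
        if step % ACCUM_STEPS == 0:
            tree.split_hottest()
            print(f"step {step}: {sum(1 for _ in tree.leaves())} leaf weights")

The interesting engineering question is how to do the splits in big batches so the accelerator still sees dense blocks, which is exactly what the next point is about.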
Doing this recursively is prohibitively expensive to store and address naively, which is where megatexture comes in: virtual texturing only pages in the tiles you actually touch, at the mip level you need, and the same trick applies to a huge weight “texture.”
Many advantages. At runtime you don’t need weight compression because you can simply switch to a lower miplevel if running on a phone. Different accelerators during training can focus on different areas of the network. Weight dropout is an automatic feature. Etc.
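On the miplevel point, the analogy I mean is literally the mipmap box filter: average-pool the trained weight grid 2x2 to get each coarser level, and a smaller device loads whichever level fits. A toy sketch (the shapes and the random “trained” block are made up):

    import numpy as np

    def build_mip_chain(weights, levels=3):
        # Level 0 is the full-resolution weight block; each further level is a
        # 2x2 average-pool of the previous one, the same box filter mipmaps use.
        chain = [weights]
        for _ in range(levels - 1):
            w = chain[-1]
            h, wd = w.shape
            chain.append(w.reshape(h // 2, 2, wd // 2, 2).mean(axis=(1, 3)))
        return chain

    full = np.random.randn(8, 8)  # stand-in for a trained weight block
    for level, w in enumerate(build_mip_chain(full)):
        print(f"mip {level}: shape {w.shape}")  # (8, 8) -> (4, 4) -> (2, 2)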
If you’ll excuse me, it’s back to hugging the porcelain bowl.
You mention this technique has been shown to outperform GPT-3; do you have a citation for that? I’d love to read more details about this interesting concept.
"Adding weights non uniformly during training (not after) is the key to smaller models that outperform present day GPT3." seems to greatly imply a certainty of result that has already been discovered. Though this commenter is familiar to me, and I know he has made silly claims in other threads throughout the years here.
I ended up calling an ambulance, so I’ll postpone this until later. I’m feeling a little better, but a full explanation will have to wait.
The answer is that of course it isn’t proven yet, since no one has implemented it (or at least not efficiently). It’s fine to be skeptical.
Current techniques are blocked by the technical challenge of getting 10GB+ of weights to fit on a pod. Very few people have those skills. If there’s even a chance that this will work, it’s worth exploring, so I will be.
Sounds kinda like progressive growing except you're not doubling the resolution uniformly. See ProGAN and its successors. You'd still need to add a large block of weights at a time for performance reasons.
Edit: Ah I checked your profile and you already know all this. You probably should have mentioned that lol