So, that sounds like a good reason to iterate over a bunch of random initializations, train them up, and look for those resulting networks that are smaller but just as accurate (assuming I'm understanding correctly and it's actually possible to isolate or take advantage of those subnetworks to end up with a smaller final model).
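Something like this rough PyTorch sketch is the loop I have in mind (the toy data, layer sizes, and 80% pruning ratio are all made up for illustration): train from a random init, prune the smallest-magnitude weights, rewind the survivors to their initial values, and retrain just that subnetwork.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real task: 2-class data from random features.
X = torch.randn(512, 20)
y = (X[:, 0] + X[:, 1] > 0).long()

def train(model, masks, steps=300):
    # Zero out pruned weights before training and after every update.
    with torch.no_grad():
        for p, m in zip(model.parameters(), masks):
            p.mul_(m)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
        with torch.no_grad():
            for p, m in zip(model.parameters(), masks):
                p.mul_(m)
    return (model(X).argmax(1) == y).float().mean().item()

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
init_state = copy.deepcopy(model.state_dict())   # remember the "ticket"
masks = [torch.ones_like(p) for p in model.parameters()]

acc_dense = train(model, masks)

# Prune the 80% smallest-magnitude weights (biases left alone for simplicity).
with torch.no_grad():
    for p, m in zip(model.parameters(), masks):
        if p.dim() > 1:
            k = int(0.8 * p.numel())
            thresh = p.abs().flatten().kthvalue(k).values
            m.copy_((p.abs() > thresh).float())

# Rewind the surviving weights to their original init and retrain the subnetwork.
model.load_state_dict(init_state)
acc_sparse = train(model, masks)
print(f"dense acc {acc_dense:.3f}  sparse (20% of weights) acc {acc_sparse:.3f}")
```

On a real task you'd repeat the prune-and-rewind step iteratively rather than cutting 80% at once, but even this toy version shows the basic search: lots of the signal is in which weights survive and what values they started from.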
Or... has anyone tried training a network to determine the initialization state of another network? And now I'm wondering whether those good starting initialization values only work well for a specific task or whether they span all tasks (or some subset of them), and maybe there's some inherent quality in how the values relate to each other that we could tease out...