2^12 wasn’t a constraint. The constraint was the clock speed of 3 MHz. But making the frame size 1500 bytes was nice because the frame transmission time comes out at about 4 ms, close to a clean power of two in microseconds (2^12 µs = 4096 µs).
You could lower or raise the frame size and have it take any number of microseconds you want. Perhaps ~4 ms was selected with human interaction times in mind: you can send a frame, process for a few ms, then send a frame back quickly enough for the feedback to appear instant.
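To make the arithmetic concrete, here is a rough sketch. It assumes one bit goes on the wire per clock cycle at 3 MHz (my assumption, not something established above), and the function names are made up for illustration. Under that assumption 1500 bytes takes 4000 µs, and a full 2^12 µs slot would hold 1536 bytes.

```python
CLOCK_HZ = 3_000_000  # 3 MHz clock; assumes one bit is sent per cycle


def frame_time_us(frame_bytes, clock_hz=CLOCK_HZ):
    """Microseconds needed to transmit frame_bytes at clock_hz bits/s."""
    return frame_bytes * 8 * 1_000_000 / clock_hz


def frame_bytes_for(target_us, clock_hz=CLOCK_HZ):
    """Largest whole number of bytes that fits in target_us microseconds."""
    return int(target_us * clock_hz // (8 * 1_000_000))


print(frame_time_us(1500))     # 4000.0
print(frame_bytes_for(2**12))  # 1536
```

Plug in a different frame size or target time and you get whatever transmission time you like, which is the point: the clock is the fixed part, the frame size is the knob.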
And if the second had been defined centuries ago to be longer or shorter, we'd have ended up with some other length of time that was a power of 2 close to 4 ms, resulting in a different MTU.
If that's indeed the reasoning, then it's amazing how arbitrary decisions made in the past end up deciding today's standards.