Ding, ding, ding, and to be fair, this is the problem with both methodologies. The only way to accurately perform this test is to spider a bunch of sites, save the contents to a locally hosted HTTPD, ensure all third-party JS calls are resolved locally, and test both browsers against the exact same sites as they were spidered at a point in time. You simply cannot account for changes to the markup, ad network beacons, ads, metrics code, or even network routing, all of which could influence the test one way or the other if it isn't run from an identical static cache in a controlled environment. If I were doing this test I'd skip the whole scripting-automation part and simply add a meta-refresh to every page in the cache to take the browser through the content sequentially, giving each page something like 10 seconds to load and render. Simple, simple, and far more accurate.
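A minimal sketch of that meta-refresh step, assuming the spidered pages already live under a local mirror/ directory served from the web root (the directory name, the 10-second dwell time, and the naive <head> injection are all placeholder choices):

```python
# inject_refresh.py - chain every cached page together with a meta-refresh
# so the browser walks the whole mirror unattended, ~10 seconds per page.
# Assumes the pages were already spidered into ./mirror by whatever crawler you use.
import pathlib

DWELL_SECONDS = 10
pages = sorted(pathlib.Path("mirror").rglob("*.html"))

for current, nxt in zip(pages, pages[1:] + pages[:1]):  # last page loops back to the first
    html = current.read_text(encoding="utf-8", errors="ignore")
    target = nxt.relative_to("mirror").as_posix()
    tag = f'<meta http-equiv="refresh" content="{DWELL_SECONDS};url=/{target}">'
    # naive injection right after <head>; good enough for a frozen test cache
    current.write_text(html.replace("<head>", "<head>" + tag, 1), encoding="utf-8")
```

Point each browser at the first cached page and it walks the entire mirror unattended, same pages, same order, every run.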
You don't need to go to such complicated lengths. Just perform enough tests (as in, a statistically large enough sample) and a distribution will form. That also captures the variability of real-world network effects.
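As a rough illustration of what "enough tests" buys you, suppose each run yields one battery-drain figure per browser (the numbers below are entirely made up); with a large enough sample the two distributions either separate or they don't:

```python
# compare_runs.py - summarise many repeated runs per browser instead of
# trusting any single run. All figures below are made up for illustration.
import math
import statistics

def summarise(samples):
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, (mean - 1.96 * sem, mean + 1.96 * sem)  # rough 95% CI

edge_drain = [7.1, 7.4, 6.9, 7.2, 7.0, 7.3]    # hypothetical battery drain, %/hour
chrome_drain = [8.0, 8.4, 7.9, 8.2, 8.1, 8.3]  # hypothetical battery drain, %/hour

for name, runs in (("Edge", edge_drain), ("Chrome", chrome_drain)):
    mean, (lo, hi) = summarise(runs)
    print(f"{name}: mean {mean:.2f} %/h, ~95% CI [{lo:.2f}, {hi:.2f}]")
```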
What about different ads served to different browsers? Someone running, say, Opera will have a different ad profile than a Chrome user even when starting completely blank cookie-wise.
It's hardly complicated. I've put such tests together in an afternoon. In fact, whatever is added in complexity is offset by the fact that fewer tests are necessary. Via this mechanism you also remove any questions about compression, use of HTTP/2, etc., which could otherwise skew the tests based on server-side choices about how data is served to either platform. Equal always equals better.
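For what it's worth, serving the frozen cache identically to every browser is only a few lines; a sketch assuming the same hypothetical mirror/ directory, with no compression or content negotiation in play:

```python
# serve_mirror.py - serve the frozen cache over plain HTTP with no compression
# and no content negotiation, so every browser receives byte-identical responses.
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

handler = partial(SimpleHTTPRequestHandler, directory="mirror")
ThreadingHTTPServer(("127.0.0.1", 8000), handler).serve_forever()
```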
But those metrics are important; if servers serve more optimized pages to Edge users for some reason, that's a freaking important fact to know.
This is about real-world data and real experiences, and how they affect actual users.
You can normalize the tests to the point where there is absolutely zero difference between the browsers, of that I'm sure, but that will not reflect any actual cases that real users experience.
Within those specifications, the objection about ad blocking becomes irrelevant. If the browser just works better, then users don't care and can simply enjoy more battery time.
The case for more normalized tests is to find out which browser is factually better designed/written.
Ads are not that much of a problem; they will even themselves out, and if for some reason MSFT Edge users receive fewer ads, or ads that are less resource intensive, that's also an important metric.
I don't see anything that would somehow create a bias in favor of a specific browser as far as ad networks go; if anything, the stigma/stereotyping of IE/Edge users would probably mean that ad networks are more incentivized to send the baity ads towards those browsers.
As for the network part, well, again, that's an important metric: if certain browsers perform better under adverse network conditions, that's an important factor to know. You do not want to give them the best-case scenario every time.
Giving a page a fixed number of seconds to load is also completely the wrong approach; you want to see how browsers behave when they can't load a page properly or when it takes more time than usual. Maybe some browsers expend more resources by resubmitting the entire request, maybe some browsers do not parse the DOM tree from scratch when some of the requests stall, maybe some browsers have less resource-intensive placeholders for DOM elements, maybe some browsers are better at adjusting the DOM preprocessor for network congestion than others.
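If you want to probe exactly that behaviour, you can simulate stalled or slow responses on demand; a rough sketch of a static server that randomly delays some requests (the mirror/ directory, port, stall rate, and delay range are all arbitrary choices):

```python
# stall_server.py - serve cached pages, but randomly stall a fraction of
# responses to observe how each browser copes with slow or hung requests.
# The mirror/ directory, port, stall rate and delay range are arbitrary choices.
import random
import time
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

class StallingHandler(SimpleHTTPRequestHandler):
    def do_GET(self):
        if random.random() < 0.2:              # stall roughly one request in five
            time.sleep(random.uniform(5, 15))  # simulate a slow or hung connection
        super().do_GET()

handler = partial(StallingHandler, directory="mirror")
ThreadingHTTPServer(("127.0.0.1", 8001), handler).serve_forever()
```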
So no, I can't really see how your approach would be any better. The approach that MSFT took was quite good: Netflix, Wikipedia, YouTube, Facebook, etc., with what seems to be realistic user behaviour.
What you want to do is put together a test that produces fair results for fairness's sake; that's not how you evaluate anything, because it would not yield any real-world data.