Really cool project, and a nice high-level overview of all the components. However, I still don't understand the impact measurement: how do you measure the impact of this against the baseline? I didn't get that part from the effectiveness section. Maybe I'm too much of a newbie, but you could A/B test this, right? Subject 50% of PRs to the automated tooling and 50% to manual scheduling, then compare compute cost and failures between the two?
That's what the shadow scheduler is measuring. If you run a superset of the AI-scheduled set, you can compute how well the AI is doing. Even if you don't run a superset, you can infer the results from subsequent test runs (on a tree with the changeset in question, plus a few more, applied); you just have to be careful not to blame later breakage on your changeset.
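To make the superset case concrete, here's a minimal sketch of how you might score an AI-selected subset against the full run it shadows. All names here (score_scheduler, the task IDs) are hypothetical illustrations, not the project's actual API:

```python
def score_scheduler(selected_tasks: set[str],
                    all_tasks: set[str],
                    failed_tasks: set[str]) -> dict[str, float]:
    """Compare an AI-selected subset against the full (superset) run."""
    # Regressions the AI's selection would actually have surfaced.
    caught = failed_tasks & selected_tasks
    recall = len(caught) / len(failed_tasks) if failed_tasks else 1.0
    # Fraction of compute the AI avoided by not scheduling everything.
    savings = 1 - len(selected_tasks) / len(all_tasks)
    return {"recall": recall, "compute_savings": savings}

# Example: the AI scheduled 200 of 1000 tasks, and 3 of the 4 real
# failures landed inside its selection -> recall 0.75, savings 0.80.
print(score_scheduler(
    selected_tasks={f"task-{i}" for i in range(200)},
    all_tasks={f"task-{i}" for i in range(1000)},
    failed_tasks={"task-3", "task-17", "task-150", "task-950"},
))
```

The inference-from-later-runs case is the same computation, except failed_tasks has to be reconstructed from runs that include your changeset plus a few more, which is exactly where the "don't blame later breakage on your changeset" caveat comes in.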