Similarly, back when I maintained Stackless Python and merged in changes from mainline Python, my benchmark was that the post-merge build should fail only the same set of tests that mainline Python failed at the merged revision.
More often than not, I would see failing tests even at the official Python release tags, which became par for the course.