Depending on the context the model ends up being used in, something that appears good may not be. Take the fellowship training example: these non-fellowship-trained radiologists are doing this task now, so it is absolutely reasonable to assess against them to test real-world performance.
It would be interesting to see whether the fellowship-trained radiologists actually did perform better in all circumstances (in some fields the better-trained radiologists end up not using their skills on as broad a range of patients, so their performance is actually worse on some subsets of data).
+1 was mostly to indicate whether you should upgrade or downgrade the reported result to make it comparable with other studies. I didn't mean to imply anything about whether it improves clinical relevance.