Since the abstract ends on a cliffhanger, here is a part of the discussion section that may help:
> In our case, the manual curation of a proportion of triples revealed that Sherpa was able to extract more triples categorized as correct or partially correct. However, when compared to the manually curated gold standard, the performance of all automated tools remains subpar.
I didn't see UMLS mentioned in the paper, but I've tried some of the human-curated biomedical knowledge graphs it draws on, and they were too error-ridden to be usable. I imagine different ones vary in accuracy.