You could weight insertions by how much perplexity they add (sum), deletions by how much perplexity they remove (-sum), and replacements by the size of the ppl difference between the old and new words (abs(sum)). Then report this as a 4-part score: the combined mean, followed by separate i/d/r scores. Lower is better.
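A rough sketch of how that could look, not a tested metric. It assumes token-level edits taken from Python's `difflib.SequenceMatcher`, and it uses a made-up per-token surprisal table (`TOY_SURPRISAL`) as a stand-in for a real LM; the names `surprisal` and `edit_scores` are hypothetical too. Reading the deletion score as the positive amount of surprisal removed, and returning 0.0 for an edit type that never occurs, are my own choices.

```python
import difflib
from statistics import mean

# Hypothetical stand-in for a language model: made-up per-token surprisal.
# A real setup would ask an LM for -log p(token | context) instead.
TOY_SURPRISAL = {
    "the": 1.0, "a": 1.2, "on": 2.0, "cat": 5.0,
    "sat": 6.0, "mat": 7.0, "rug": 7.5, "quixotic": 12.0,
}
DEFAULT_SURPRISAL = 10.0  # assumed cost for tokens missing from the table


def surprisal(token):
    return TOY_SURPRISAL.get(token, DEFAULT_SURPRISAL)


def _mean_or_zero(values):
    # Assumption: an edit type that never occurs scores 0.0.
    return mean(values) if values else 0.0


def edit_scores(source, target):
    """Return (combined mean, insert mean, delete mean, replace mean)."""
    matcher = difflib.SequenceMatcher(a=source, b=target, autojunk=False)
    costs = {"insert": [], "delete": [], "replace": []}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old = sum(surprisal(t) for t in source[i1:i2])
        new = sum(surprisal(t) for t in target[j1:j2])
        if tag == "insert":
            costs["insert"].append(new)             # surprisal added
        elif tag == "delete":
            costs["delete"].append(old)             # surprisal removed
        elif tag == "replace":
            costs["replace"].append(abs(new - old))  # size of the change
    combined = [c for group in costs.values() for c in group]
    return (
        _mean_or_zero(combined),
        _mean_or_zero(costs["insert"]),
        _mean_or_zero(costs["delete"]),
        _mean_or_zero(costs["replace"]),
    )


if __name__ == "__main__":
    src = "the cat sat on the mat".split()
    tgt = "the quixotic cat sat on a rug".split()
    print(edit_scores(src, tgt))
```

With this toy table, inserting "quixotic" costs far more than swapping "the mat" for "a rug", which is the behaviour the scoring is after. Note that the combined mean here averages over edit operations, not over tokens; averaging per token would be an equally defensible choice.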

The theory is that you don't want to add or remove confusing words, whereas common stop words are less of an issue.

I'm not sure how this interacts with a multi-word replacement, where the new words make sense together but each one on its own makes no sense to the LM.