That's not an accurate representation of the problem. It's more like:
Comment A: Muslims are evil.
Comment B: Christians are evil.
It's a terrible idea to treat Comments A and B differently. The same applies whether they're about religion, gender, race, nationality, or anything else. You have thoroughly failed when you have built discrimination into your content moderation system.
Set A: A representative sample of jokes and stereotypes about black people found on the internet
Set B: A representative sample of jokes and stereotypes about Scandinavians found on the internet
Why on earth would its prior for "stereotypical Scandinavian" being potentially hateful be the same as for "stereotypical Black person"?
(And that's before you get into a model likely being deep enough to also draw inferences from the prevalence and content of material about the existence and impact of hatred of black people and Scandinavians respectively...)
Isn't ChatGPT American? Weren't black people deemed inferior by law, under slavery and until the civil rights era? Aren't stereotypes about Scandinavians meant to be positive, as opposed to hateful stereotypes?
It seems that is the key difference the creators of the tool are taking into account.
The mere fact that you needed to use quotes should clue you in that the scale was fundamentally different between groups. It's not as if the Irish were literal slaves for centuries in America.