The problem isn't fine-tuning the model; the problem is that there isn't an objective definition of bias. Is there an a priori reason to believe that "I hate disabled people" and "I hate non-disabled people" are equally hateful, and should receive equal hate scores from an unbiased algorithm? Is hating disabled people better or worse than hating Jews? What about "Jews control Hollywood" vs "Disabled people control Hollywood"?
I don't think we as a society have an answer to that, so it's hardly fair to expect ChatGPT to provide one. What it currently does is produce similar-but-not-equal scores for sentences like those - maybe "I hate men" is 0.52 and "I hate women" is 0.73 - and if you flag anything higher than 0.4 then they both get caught, which seems about as unbiased as we're going to get.
You can easily force the model to be less biased. Just add a filter that flips the gender of words, evaluates the hate score for both the original and the flipped version, and averages the results.
Guaranteed to give the same score regardless of the gender mentioned.
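Something like this, as a rough sketch in Python - score_hate here is a stand-in for whatever classifier you're already calling, and the word map is obviously a toy that a real filter would have to extend (pronoun cases, names, and so on):

```python
import re

# Toy lexicon for the flip; a real version would need far more entries
# and some grammar handling ("him"/"her", "his"/"hers", etc.).
GENDER_SWAP = {
    "man": "woman", "woman": "man",
    "men": "women", "women": "men",
    "he": "she", "she": "he",
    "male": "female", "female": "male",
}

def flip_gender(text: str) -> str:
    """Swap gendered words, preserving simple capitalization."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        flipped = GENDER_SWAP[word.lower()]
        return flipped.capitalize() if word[0].isupper() else flipped
    pattern = r"\b(" + "|".join(GENDER_SWAP) + r")\b"
    return re.sub(pattern, swap, text, flags=re.IGNORECASE)

def symmetric_hate_score(text: str, score_hate) -> float:
    """Average the model's score over the original and gender-flipped text."""
    return (score_hate(text) + score_hate(flip_gender(text))) / 2

# By construction, "I hate women" and "I hate men" now map to the same
# pair of inputs, so symmetric_hate_score returns the same value for both.
```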
Clever idea, but I don't think this would work very well on real posts. Consider a model that rates "typical woman driver" as hateful, because that phrase appears in a lot of argument threads with lots of downvotes. Your approach would average its score with that of "typical man driver", which will presumably be very low, not because it's less hateful but because it just rarely shows up in the training corpus.