
> Usually people are concerned about maximising prediction accuracy, and never stop to think about what correlations the model is finding down below, and the human biases present in the data annotations.

Because maximizing prediction accuracy is inherently unbiased. Bias is when the predictions made are inaccurate to the detriment of a group of people. If you had a prediction algorithm that used time travel to tell you with 100.0% accuracy who would pay back their loans, there would be a racial disparity in the result, but the algorithm would not be the cause of it.

And you can't fix it there because that's not where the problem is.

Suppose you have a group of 800 white middle managers, 100 white sales clerks and 100 black sales clerks. Clearly the algorithm is going to have a racial disparity in outcome if the middle managers are at 500% of the cutoff and the sales clerks are right on the line, because it will accept all of the middle managers and half of the sales clerks, which means it will accept >94% of the white applicants and 50% of the black applicants.
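
A quick sketch of that arithmetic (the pool sizes and scores are the hypothetical numbers above; treating the borderline applicants as an even split is an assumption):

    CUTOFF = 1.0  # scores below are expressed as multiples of this cutoff

    def accepted(count, score):
        # Applicants right on the line are a coin flip; model that as
        # exactly half of the borderline pool being accepted.
        return count if score > CUTOFF else count // 2

    # 800 white middle managers at 500% of the cutoff,
    # 100 white and 100 black sales clerks right on the line.
    white = accepted(800, 5.0) + accepted(100, 1.0)  # 800 + 50 = 850
    black = accepted(100, 1.0)                       # 50

    print(f"white: {white / 900:.1%}")  # 94.4%
    print(f"black: {black / 100:.1%}")  # 50.0%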

But the source of the disparity is that black people are underrepresented as middle managers and overrepresented as sales clerks. The algorithm is just telling you that. It can't change it.

And inserting bias into the algorithm to "balance" the outcome doesn't actually do that; all you're doing is creating bias against white sales clerks, who had previously been on the same footing as black sales clerks. The white middle managers will be unaffected, because they're sufficiently far above the cutoff that the change doesn't reach them, even though they're the source of the imbalance.


You entirely miss the point! The point is that in supervised learning, for example, if you optimize prediction accuracy with respect to your human-generated examples, you will get a model that exactly reproduces the racist judgment of the humans who generated your training set.


> The point is that in supervised learning, for example, if you optimize prediction accuracy with respect to your human-generated examples, you will get a model that exactly reproduces the racist judgment of the humans who generated your training set.

In which case you aren't optimizing prediction accuracy. Prediction accuracy is measured by whether the predictions are true. If there is bias in the predictions that doesn't exist in the actual outcomes, then there is money to be made by eliminating it.

It seems like the strangest place to raise an objection, since here the profit motive is directly aligned with the desired behavior.


You need to think about how we measure truth and even what truth is.

In machine learning we tend to assume the annotations and labels are "true" and build a system towards that version of the "truth".

> Prediction accuracy is measured by whether the predictions are true.

The more I think about this sentence, the less sense it makes. Prediction accuracy can only be measured against records of something, and that record will be a distortion and simplification of reality.


> Prediction accuracy can only be measured against records of something, and that record will be a distortion and simplification of reality.

Prediction accuracy can be measured against what actually happens. If the algorithm says that 5% of people like Bob will default and you give loans to people like Bob and 7% of them default then the algorithm is off by 2%.
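
A minimal sketch of that comparison, using the hypothetical 5%/7% figures above:

    # The model's predicted default rate for a cohort of similar borrowers,
    # checked against what actually happened after the loans were made.
    predicted_rate = 0.05          # model: 5% of people like Bob default
    outcomes = [0] * 93 + [1] * 7  # realized: 7 defaults out of 100 loans

    observed_rate = sum(outcomes) / len(outcomes)
    print(f"off by {observed_rate - predicted_rate:+.1%}")  # +2.0%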


You are still assuming that everything recorded as being "like Bob" is the truth and captures reality clearly.

Moreover, you would need to give loans to everybody in order to check the accuracy of the algorithm. You can't just check a non-random subset and expect to get unbiased results.


> You are still assuming that everything recorded as being "like Bob" is the truth and captures reality clearly.

Nope, just finding correlations between "records say Bob has bit 24 set" and "Bob paid his loans." The data could say that Bob is a pink space alien from Andromeda and the algorithm can still do something useful. If that field is completely random, the algorithm will determine that it's independent of whether Bob will pay his loans and ignore it; but if it correlates with paying back loans, then it has predictive power. The fact that you're really measuring something other than what you thought you were doesn't change that.
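
A toy illustration of that point (the 80% repayment rate and 70% agreement rate are made up for the example): a field that is pure noise carries no signal, while a mislabeled-but-correlated field still predicts.

    import random
    random.seed(0)

    n = 100_000
    repaid = [random.random() < 0.8 for _ in range(n)]  # assume 80% repay

    # "Bit 24": pure noise, independent of repayment.
    noise = [random.random() < 0.5 for _ in range(n)]
    # A nonsense label ("pink space alien") that nonetheless agrees
    # with repayment 70% of the time.
    alien = [r if random.random() < 0.7 else not r for r in repaid]

    def repay_rate_given(flag):
        hits = [r for r, f in zip(repaid, flag) if f]
        return sum(hits) / len(hits)

    print(f"P(repaid | noise set): {repay_rate_given(noise):.2f}")  # ~0.80: no lift
    print(f"P(repaid | alien set): {repay_rate_given(alien):.2f}")  # ~0.90: predictive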

> Moreover, you would need to give loans to everybody in order to check the accuracy of the algorithm. You can't just check a non-random subset and expect to get unbiased results.

What you can do is give loans to a random subset of the people you otherwise wouldn't, to see what happens.
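
A minimal sketch of that idea, assuming a score cutoff and an exploration rate picked purely for illustration:

    import random

    CUTOFF = 600         # hypothetical credit-score threshold
    EXPLORE_RATE = 0.02  # fraction of would-be rejections approved anyway

    def decide(score):
        if score >= CUTOFF:
            return "approve"
        # Approve a small random slice of rejections; their realized
        # outcomes give unbiased feedback on applicants below the cutoff.
        if random.random() < EXPLORE_RATE:
            return "approve (exploration)"
        return "reject"

    print(decide(650))  # approve
    print(decide(550))  # usually "reject", occasionally exploration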

But even that isn't usually necessary, because in reality there isn't a huge cliff right at the point where you decide whether to give the loan, and different variables will land on opposite sides of the decision. There will be people you decide to give the loan to even though their income was on the low side, because their repayment history was very good. If more of those people than expected repay their loans, then you know that repayment history is a stronger predictor than expected and income is a weaker one; if fewer, the opposite.
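
A crude sketch of that feedback loop; the weights, rates, and learning rate here are all invented for illustration, and only the direction of the update matters:

    # Among marginal approvals (low income, strong repayment history),
    # compare predicted vs. realized repayment and nudge the weights.
    w_history, w_income = 0.6, 0.4  # hypothetical feature weights
    expected_repay = 0.90           # predicted for the marginal approvals
    realized_repay = 0.95           # observed for that same slice

    LEARNING_RATE = 0.5
    shift = LEARNING_RATE * (realized_repay - expected_repay)
    w_history += shift  # history predicted better than expected: weight up
    w_income -= shift   # income correspondingly weaker: weight down
    print(f"history={w_history:.3f}, income={w_income:.3f}")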


You have described the reason why many of us don't consider maximizing profit to be a desired behavior in all circumstances.


I think you misunderstand. In this case, for once, the profit-maximizing thing and the thing we want them to do are one and the same.

You can make a legitimate objection if the algorithm predicts that 20% of black people will default on their loans when in reality only 10% do. But if the algorithm is doing that, then it's losing the bank money: it's giving loans to other, less creditworthy people instead of those more creditworthy ones, or not making profit-generating loans at all even though it has money to lend. A purely profit-motivated investor is not interested in that happening.

But if it happens that disproportionately many of some group of people are in actual fact uncreditworthy, giving the uncreditworthy people credit anyway is crazy. It's the thing that caused the housing crisis. An excessive number of them will default, losing the lender's money and ruining their own credit even further. It hurts everybody.


That's silly though. You should use the ground truth of actual outcomes, not predict what useless humans would do. But even then, I bet the algorithm would be less racist than the humans, if you don't give it race as a feature.


Yes, and you can get similar effects with unsupervised learning if the data set is biased.


Then the problem is in the racist human(s), not the algorithm.


But the algorithm enshrines and possibly amplifies it. Or as the old saying goes:

> To err is human, to really foul things up requires a computer.


Humans have much more potential for racism in making predictions about the future than in judging things that have already happened. So while I agree that racism through biased input data is a problem, I think that even with that problem, machines should be substantially less racist in their judgement than the humans they're replacing, even if they're not perfect.


> maximizing prediction accuracy is inherently unbiased

This assumes that you're actually maximizing prediction accuracy, rather than taking the easiest route toward sufficiently high predictive power.

Not-so-hypothetical: you can invest $N and create a profitable model that (unfairly and inaccurately) discriminates directly based upon race, or you can invest $N*M and create a profitable model that does not discriminate on race (regardless of whether its results are racially equitable). Given the choice, a lot of people will choose the $N approach.


> regardless of whether its results are racially equitable

Some would call this discrimination, if not outright racism.

As others on this thread have pointed out, at some point the algorithm follows the underlying data, which may also be discriminatory (zip code, etc.).



