Comment by wongarsu 6 months ago

A big part of the difficulty of such an attempt is that we don't know the ground truth. A model is fair or unbiased if its performance is equally good for all groups: for example, if 90% of the fraud committed by Arabs is flagged as fraud, then 90% of the fraud committed by Danish people should be flagged as fraud too. The paper agrees on this.
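To make that criterion concrete, here is a minimal sketch (Python, with made-up group names and data, none of it from the actual project) of the check being described: compute the share of actual fraud that gets flagged, separately per group, and compare the rates.

```python
# Minimal sketch of the "equal performance across groups" criterion described above:
# the share of actual fraud cases that gets flagged should be roughly equal per group.
# Group names and data are purely illustrative.

def true_positive_rate(flagged, is_fraud):
    """Share of actual fraud cases that the model flagged."""
    flagged_fraud = [f for f, y in zip(flagged, is_fraud) if y]
    return sum(flagged_fraud) / len(flagged_fraud) if flagged_fraud else float("nan")

def tpr_by_group(records):
    """records: iterable of (group, model_flagged, actually_fraud) tuples."""
    per_group = {}
    for group, flagged, is_fraud in records:
        per_group.setdefault(group, ([], []))
        per_group[group][0].append(flagged)
        per_group[group][1].append(is_fraud)
    return {g: true_positive_rate(f, y) for g, (f, y) in per_group.items()}

records = [
    ("group_a", True, True), ("group_a", False, True), ("group_a", True, False),
    ("group_b", True, True), ("group_b", True, True), ("group_b", False, False),
]
# The fairness criterion asks for these per-group rates to be roughly equal.
print(tpr_by_group(records))  # {'group_a': 0.5, 'group_b': 1.0} -> unequal, i.e. biased
```

The catch is the `actually_fraud` column: that is exactly the ground truth we don't have.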

The issue is that we don't know how many Danes commit fraud, and we don't know how many Arabs commit fraud, because we don't trust the old process to be unbiased. So how are we supposed to judge whether the new model is unbiased? That seems fundamentally impossible without improving our ground truth in some way.

The project presented here instead tries to do some mental gymnastics to define a version of "fair" that doesn't require that better ground truth. They were able to evaluate the false-positive rate by investigating the flagged cases, but they were completely in the dark about the false-negative rate.
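Roughly why that asymmetry exists (the numbers below are hypothetical, just to show which quantities are observable at all):

```python
# Flagged applications get investigated, so their true outcomes become known.
investigated = {"true_positives": 40, "false_positives": 60}  # hypothetical counts

# So the share of flagged cases that turned out not to be fraud is measurable:
fp_share = investigated["false_positives"] / sum(investigated.values())
print(f"flagged cases that were not fraud: {fp_share:.0%}")

# Unflagged applications are never investigated, so missed fraud (false negatives)
# stays invisible, and the false-negative rate cannot be computed at all:
missed_fraud = None  # unknown without better ground truth
```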

In the end, the new model was just as biased, but in the other direction, and performance was simply worse:

> In addition to the reappearance of biases, the model’s performance in the pilot also deteriorated. Crucially, the model was meant to lead to fewer investigations and more rejections. What happened instead was mostly an increase in investigations, while the likelihood to find investigation worthy applications barely changed in comparison to the analogue process. In late November 2023, the city announced that it would shelve the pilot.