Information retrieval and recommender systems based on machine learning can be used to make decisions about people. Government agencies can use such systems to detect welfare fraud, insurers can use them to predict risks and set insurance premiums, and companies can use them to select the best people from a list of job applicants. Such systems can lead to more efficiency and could improve our society in many ways. However, such AI-driven decision-making also brings risks. This project focuses on the risk that such AI systems lead to illegal discrimination, for instance harming people of a certain ethnicity, or to other types of unfairness. A different type of unfairness could concern, for instance, a system that reinforces financial inequality in society.
Recent machine learning work has resulted in several competing approaches for measuring fairness. There is no consensus on the best way to measure fairness, and the measures often depend on the type of machine learning that is applied. Based on the application of existing measures to real-world data, we suspect that many proposed measures are not that helpful in practice.
In this project, you will study measures of fairness, answering questions such as the following. To what extent can legal non-discrimination norms be translated into fairness measures for machine learning? Can we measure fairness independently of the machine learning approach? Can we show which machine learning methods are the most appropriate to achieve non-discrimination and fairness?
The project primarily concerns machine learning for information retrieval and recommendation, but is interdisciplinary, as it is also informed by legal norms. The project will be supervised by Professor Hiemstra, professor of data science and federated search, and Professor Zuiderveen Borgesius, professor of ICT and law.
You hold a completed Master’s Degree or Research Master’s degree in computer science, data science, machine learning, artificial intelligence, or a related discipline.
You have good programming skills.
You have a good command of spoken and written English.
We encourage you to apply even if you think you do not meet all the requirements.
Was fairness in IR discussed by Cooper and Robertson in the 1970s?
Like many people, I love to do an “ego search” in Google¹ to see what comes up when I search for my name. When Latanya Sweeney did such an ego search about a decade ago, she was shocked to find advertisements for background checks with the headline “Latanya Sweeney, Arrested?”. Sweeney, a professor at Harvard, was never arrested. One of her colleagues suggested that the advertisement came up because of her “black name” (Latanya is a popular name among Americans of African descent) and because Google’s advertisement search algorithm is racist. Motivated by this incident, Sweeney (2013) investigated the Google results for more than 2,000 racially associated personal names, and showed that Google’s advertisements are indeed systematically racially biased. Sweeney’s work was pivotal in putting bias and fairness of algorithms on the global research agenda.
The harm that (search) algorithms may do is substantial, especially if the algorithms are opaque, and if clicks on the (racist, unfair) results are fed back into the algorithm, thereby creating a destructive feedback loop where clicks on unfair results further reinforce the system’s unfairness. Cathy O’Neil compared such algorithms to weapons of mass destruction, because their destruction scales to hundreds of millions of (Google) users. O’Neil (2016) wittily called her book, which is highly recommended, Weapons of Math Destruction.
Consider the following omniscient variant of the naive algorithm that ranks the articles by their true average relevance (i.e. the true fraction of users who want to read each article). How can this ranking be unfair? Let us assume that we have two groups of articles, G_right and G_left, with 10 items each (i.e. articles from politically right- and left-leaning sources). 51% of the users (right-leaning) want to read the articles in group G_right, but not the articles in group G_left. In reverse, the remaining 49% of the users (left-leaning) like only the articles in G_left. Ranking articles solely by their true average relevance puts items from G_right into positions 1-10 and the items from G_left in positions 11-20. This means the platform gives the articles in G_left vastly less exposure than those in G_right. We argue that this can be considered unfair since the two groups receive disproportionately different outcomes despite having similar merit (i.e. relevance). Here, a 2% difference in average relevance leads to a much larger difference in exposure between the groups.
This example clearly shows a problem with fairness since right-leaning users have all their preferred documents ranked before the documents that are preferred by left-leaning users. Documents from the minority group (left-leaning in the example) are never even shown on the first results page. The example furthermore suggests that the ranking is optimal given the “true relevance” of the items, but is it really? Let’s have a look at some well-known evaluation measures for the ranking presented in the example, and for a fairer ranking where we interleave right-leaning and left-leaning documents, starting with a right-leaning document.
relevance ranking (unfair)
interleaved ranking (fair)
Table 1, evaluation results for the example of Morik et al. (2020)
Table 1 shows the expected evaluation results if, as stated in the example, 51% of the users like the right-leaning documents and 49% of the users like the left-leaning documents. For instance, the expected reciprocal rank (RR) for the relevance ranking in the example is 0.51 times 1 (51% of the users are satisfied with the first result returned) plus 0.49 times 1/11 (49% of the users are dissatisfied with the first 10 results, but satisfied with the eleventh result). The table also shows the expected average precision (AP) and the normalized discounted cumulative gain (nDCG). So, if we are interested in the rank of the first relevant result (RR), then the example ranking is not only unfair, it is also of lower overall quality. If we are more interested in recall as measured by AP, then the relevance ranking indeed outperforms the interleaved ranking. Finally, in the case of nDCG, the results are practically equal (the relevance ranking outperforms the interleaved ranking in the third digit). nDCG is normally used in cases where we have graded relevance judgments. If we additionally assume that one of the right-leaning documents and one of the left-leaning documents is more relevant (relevance score 2) than the other relevant documents (relevance score 1), then the fair, interleaved ranking outperforms the unfair, relevance ranking: 0.78 vs. 0.76. So, depending on our evaluation measures, the ranking by the “true average relevance” may actually not give the best quality search engine (besides the clearly unfair results).
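The expected scores discussed above can be recomputed with a few lines of Python. This is only a sketch under stated assumptions: relevance is binary, nDCG uses the common 1 / log2(rank + 1) discount, and the document names (R1..R10, L1..L10) and function names are made up for illustration. The exact nDCG variant behind Table 1 is not stated, so the digits may differ slightly.

```python
from math import log2

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant item (0 if none is ranked)."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranking, relevant):
    """Mean of precision@k over the ranks k that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg(ranking, relevant):
    """Binary-gain nDCG with a 1 / log2(rank + 1) discount (an assumption)."""
    dcg = sum(1.0 / log2(rank + 1)
              for rank, doc in enumerate(ranking, start=1) if doc in relevant)
    ideal = sum(1.0 / log2(rank + 1) for rank in range(1, len(relevant) + 1))
    return dcg / ideal

def expected(metric, ranking, groups):
    """Average a metric over user groups, weighted by each group's share."""
    return sum(share * metric(ranking, relevant) for share, relevant in groups)

# Hypothetical document identifiers for the two groups of articles.
right = [f"R{i}" for i in range(1, 11)]
left = [f"L{i}" for i in range(1, 11)]
groups = [(0.51, set(right)), (0.49, set(left))]

relevance_ranking = right + left
# Interleave, starting with a right-leaning document: R1, L1, R2, L2, ...
interleaved_ranking = [doc for pair in zip(right, left) for doc in pair]

for name, ranking in [("relevance", relevance_ranking),
                      ("interleaved", interleaved_ranking)]:
    scores = [expected(metric, ranking, groups)
              for metric in (reciprocal_rank, average_precision, ndcg)]
    print(name, [round(score, 3) for score in scores])
```

For RR, the sketch reproduces the calculation in the text: 0.51 + 0.49/11 for the relevance ranking, against 0.51 + 0.49/2 for the interleaved ranking.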
Interestingly, rankings where two groups of users prefer different sets of documents were already discussed more than 44 years ago by Stephen Robertson when he introduced the probability ranking principle. Robertson (1977) attributed the principle to William Cooper. The paper’s appendix contains the following counter-example to the probability ranking principle, which Robertson also attributed to Cooper. The example follows the above example closely, but with different statistics for the two groups of users:
Cooper considers the problem of ranking the output of a system in response to a given request. Thus he is concerned with the class of users who put the same request to the system, and with a ranking of the documents in response to this one request which will optimize performance for this class of users. Consider, then, the following situation. The class of users (associated with this one request) consists of two sub-classes, U1 and U2; U1 has twice as many members as U2. Any user from U1 would be satisfied with any one of the documents D1–D9, but with no others. Any user from U2 would be satisfied with document D10, but with no others. Hence: any document from D1–D9, considered on its own, has a probability of 2/3 of satisfying the next user who puts this request to the system. D10 has a probability of 1/3 of satisfying him/her; all other documents have probability zero. The probability ranking principle therefore says that D1–D9 should be given joint rank 1, D10 rank 2, and all others rank 3. But this means that while U1 users are satisfied with the first document they receive, U2 users have to reject nine documents before they reach the one they want. One could readily improve on the probability ranking, by giving D1 (say) rank 1, D10 rank 2, and D2–D9 and all others rank 3. Then U1 users are still satisfied with the first document, but U2 users are now satisfied with the second. Thus the ranking specified by the probability-ranking principle is not optimal. Such is Cooper’s counter-example.
Let’s again look at the evaluation results for the rankings presented in the example: the relevance ranking and the improved ranking, which, as above, we label interleaved.
relevance ranking (unfair)
interleaved ranking (fair)
Table 2, evaluation results for Cooper’s example (Robertson 1977)
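The expected scores for Cooper’s two rankings can be recomputed in the same spirit. The sketch below assumes binary relevance, the 1 / log2(rank + 1) discount for nDCG, and an arbitrary order within the tied documents D1–D9; the zero-probability documents are left out, and all function and variable names are hypothetical.

```python
from math import log2

def metrics(ranks):
    """RR, AP and binary nDCG from the sorted 1-based ranks of one user's relevant documents."""
    rr = 1 / ranks[0]
    ap = sum((i + 1) / rank for i, rank in enumerate(ranks)) / len(ranks)
    dcg = sum(1 / log2(rank + 1) for rank in ranks)
    ideal = sum(1 / log2(rank + 1) for rank in range(1, len(ranks) + 1))
    return rr, ap, dcg / ideal

def expected_metrics(ranking, groups):
    """Weight each user group's (RR, AP, nDCG) by the group's share of users."""
    totals = [0.0, 0.0, 0.0]
    for share, wanted in groups:
        ranks = [rank for rank, doc in enumerate(ranking, start=1) if doc in wanted]
        for i, value in enumerate(metrics(ranks)):
            totals[i] += share * value
    return tuple(totals)

docs = [f"D{i}" for i in range(1, 11)]
# U1 (two thirds of the users) wants any of D1..D9; U2 (one third) wants only D10.
groups = [(2 / 3, set(docs[:9])), (1 / 3, {"D10"})]

prp_ranking = docs                              # D1..D9, then D10
improved_ranking = ["D1", "D10"] + docs[1:9]    # D1, D10, then D2..D9

prp_scores = expected_metrics(prp_ranking, groups)
improved_scores = expected_metrics(improved_ranking, groups)
print("probability ranking:", [round(score, 3) for score in prp_scores])
print("improved ranking:   ", [round(score, 3) for score in improved_scores])
```

Under these assumptions the expected RR is 2/3 + 1/3 × 1/10 = 0.7 for the probability ranking and 2/3 + 1/3 × 1/2 for the improved ranking.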
The example shows that the unfair ranker, which ranks all documents preferred by users from group U1 above those preferred by users from group U2, not only treats the minority group U2 unfairly, but also produces lower quality results on all three evaluation measures. But why would a search engine prefer this so-called relevance ranking? And why did Morik et al. (2020) call this ranking a ranking by the “true average relevance”?
To understand this, we have to dig a bit deeper into Robertson’s probability ranking principle. The principle states that under certain conditions, a ranking by the probability of relevance as done by Morik et al. (2020) will produce the best overall effectiveness of the system to its users that is obtainable on the basis of the data. Those conditions are the following:
1. The relevance of a document to a request does not depend on the other documents in the collection;
2. The proof relates only to a single request;
3. Relevance is a dichotomous variable.
Condition 1 is clearly violated in our examples. For instance, in the example with right-leaning and left-leaning users, knowing that a user likes one right-leaning document should drastically change the probability of relevance for the other documents. Condition 3 is violated if we use graded relevance and evaluation measures like (n)DCG. If our aim is to build a fair ranker, then we cannot blindly apply the probability ranking principle.²
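The violation of condition 1 can be made concrete with the numbers from the example of Morik et al. (2020). As an illustration only, take two right-leaning documents d_1 and d_2 and one left-leaning document d_3 (these symbols are introduced here, not in the original papers), and assume, as in the example, that every user likes exactly one of the two groups:

```latex
\begin{align*}
P(\mathrm{rel}(d_2)) &= 0.51, &
P(\mathrm{rel}(d_2) \mid \mathrm{rel}(d_1))
  &= \frac{P(\mathrm{rel}(d_1) \wedge \mathrm{rel}(d_2))}{P(\mathrm{rel}(d_1))}
   = \frac{0.51}{0.51} = 1, \\
P(\mathrm{rel}(d_3)) &= 0.49, &
P(\mathrm{rel}(d_3) \mid \mathrm{rel}(d_1))
  &= \frac{P(\mathrm{rel}(d_1) \wedge \mathrm{rel}(d_3))}{P(\mathrm{rel}(d_1))}
   = \frac{0}{0.51} = 0.
\end{align*}
```

Once we know that a user likes d_1, the relevance probabilities of the other documents jump from roughly one half to either 1 or 0, so relevance is far from independent of the other documents.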
My conclusion? We’ve known about the problem of unfair rankings for a long time. If the conditions for the probability ranking principle are not met, then we a) may not get the overall best quality ranking; and b) instead get a biased ranking that systematically and unfairly favours the majority group of users over the minority group.
Sadly, what happened to Latanya Sweeney may very well have been the following: Google optimizes its advertisement revenue using the click-through rate, i.e., Google uses a click-based relevance estimator that ranks advertisements by their probability of relevance under the conditions of the probability ranking principle.³ These conditions are not met. There are at least two groups of people: 1) a racist majority group that clicks background checks for “black names”, and 2) a minority group that clicks advertisements for connecting on social media. Even though both groups may be roughly equal in size, Google only showed the top advertisements of the majority group. Google thereby showed biased results that adversely impact the minority group, and furthermore probably did not even optimize for advertisement revenue.
The most important message here: The relevance of the results of a search algorithm (and therefore the search engine’s revenue) is not necessarily at odds with the fairness of the results. Cooper’s example shows that there are cases where improving the quality of the results (measured in RR, AP or nDCG) also improves the fairness of the results.⁴
1. I use DuckDuckGo for all my other searches.
2. I don’t want to be overly critical about a SIGIR best paper, but curiously, Morik et al. (2020) (incorrectly) cite Robertson’s probability ranking principle paper as follows: “Fortunately, it is easy to show (Robertson 1977) that sorting-based policies π(x) = argsort_{d∈D} R(d|x) (…) are optimal for virtually all [evaluation measures] commonly used in IR (e.g. DCG).”
3. Google is evil is another explanation.
4. Note that to get a truly fair ranking, we should frequently switch both groups when interleaving the documents, starting with the minority group with a probability proportional to the size of the group. This will somewhat negatively impact the expected search quality.
In this research, an approach for bias reduction that still maintains the usability of the classifier is proposed. The approach requires all preprocessing to be done first, including one-hot encoding and splitting the data into training and test sets. The approach then requires a banned feature: a feature that has, for example, been deemed morally irrelevant for the classification purpose. For the bias reduction, the proposal is to use the KS-score obtained from the two-sample KS-test to determine how much a feature contributes to classification and how much it contributes to the bias of the banned feature. That is, for every feature X in the dataset that is neither the label (L) nor the banned feature (B), the following must hold for X to be safe to use in the training dataset:
KS–score(X|L=1, X|L=0) > KS–score(X|B=1, X|B=0)
After all features are checked, the unsafe (or flagged) features need to be removed from both the training and the test set in order to make the classifier as fair as possible. The datasets used are the Titanic dataset, with passenger class as the banned feature, and a Financial survey, with race as the banned feature. The results show that the overall bias was reduced for both the Titanic dataset and the Financial survey. However, in terms of relative fairness, the Financial survey became less fair for one banned feature value (Race = White); all other values became fairer for both the Financial survey and the Titanic dataset.
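The feature check described above could be sketched as follows. This is a minimal illustration, not the implementation used in the research: it assumes a binary label and a binary (one-hot encoded) banned feature, computes the two-sample KS statistic by hand as the largest gap between two empirical distribution functions, and runs on made-up toy data. All function and feature names are hypothetical. Libraries such as SciPy also provide a two-sample KS test (scipy.stats.ks_2samp).

```python
from bisect import bisect_right

def ks_score(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )

def split_by(values, flags):
    """Split a feature column into the rows where the binary flag is 1 and 0."""
    ones = [v for v, f in zip(values, flags) if f == 1]
    zeros = [v for v, f in zip(values, flags) if f == 0]
    return ones, zeros

def flag_unsafe_features(features, label, banned):
    """Names of features that fail KS(X|L=1, X|L=0) > KS(X|B=1, X|B=0)."""
    unsafe = []
    for name, values in features.items():
        label_score = ks_score(*split_by(values, label))
        banned_score = ks_score(*split_by(values, banned))
        if not label_score > banned_score:
            unsafe.append(name)
    return unsafe

# Hypothetical toy data (not taken from the Titanic or Financial survey datasets).
label = [1, 1, 1, 1, 0, 0, 0, 0]
banned = [1, 1, 0, 0, 1, 1, 0, 0]
features = {
    "tracks_label": [9, 8, 7, 6, 1, 2, 3, 4],    # separates the label well
    "tracks_banned": [9, 8, 1, 2, 7, 6, 3, 4],   # separates the banned feature well
}

unsafe = flag_unsafe_features(features, label, banned)
# Flagged features are dropped; the same columns would be dropped from the test set.
safe_features = {name: values for name, values in features.items()
                 if name not in unsafe}
print("unsafe:", unsafe)
```

In this toy data, the second feature separates the banned groups better than it separates the labels, so it is the one that gets flagged and removed.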