Was fairness in IR discussed by Cooper and Robertson in the 1970s?
Like many people, I love to do an “ego search” in Google[1], to see what comes up when I search my name. When Latanya Sweeney did such an ego search about a decade ago, she was shocked to find advertisements for background checks with the headline “Latanya Sweeney, Arrested?”. Sweeney, a professor at Harvard, was never arrested. One of her colleagues suggested that the advertisement came up because of her “black name” – Latanya is a popular name among Americans of African descent – and that Google’s advertisement search algorithm is racist. Motivated by this incident, Sweeney (2013) investigated the Google results for more than 2,000 racially associated personal names, and showed that Google’s advertisements are indeed systematically racially biased. Sweeney’s work was pivotal in putting bias and fairness of algorithms on the global research agenda.
The harm that (search) algorithms may do is substantial, especially if the algorithms are opaque, and if clicks on the (racist, unfair) results are fed back into the algorithm, thereby creating a destructive feedback loop where clicks on unfair results further reinforce the system’s unfairness. Cathy O’Neil compared such algorithms to weapons of mass destruction, because their destruction scales to hundreds of millions of (Google) users. O’Neil (2016) wittily called her book, which is highly recommended, Weapons of Math Destruction.
Let’s discuss fairness following an example by Morik et al. (2020), whose paper received the best paper award at SIGIR 2020. They present the following motivating example:
Consider the following omniscient variant of the naive algorithm that ranks the articles by their true average relevance (i.e. the true fraction of users who want to read each article). How can this ranking be unfair? Let us assume that we have two groups of articles, G_right and G_left, with 10 items each (i.e. articles from politically right- and left-leaning sources). 51% of the users (right-leaning) want to read the articles in group G_right, but not the articles in group G_left. In reverse, the remaining 49% of the users (left-leaning) like only the articles in G_left. Ranking articles solely by their true average relevance puts items from G_right into positions 1-10 and the items from G_left in positions 11-20. This means the platform gives the articles in G_left vastly less exposure than those in G_right. We argue that this can be considered unfair since the two groups receive disproportionately different outcomes despite having similar merit (i.e. relevance). Here, a 2% difference in average relevance leads to a much larger difference in exposure between the groups.
(Morik et al. 2020)
This example clearly shows a problem with fairness since right-leaning users have all their preferred documents ranked before the documents that are preferred by left-leaning users. Documents from the minority group (left-leaning in the example) are never even shown on the first results page. The example furthermore suggests that the ranking is optimal given the “true relevance” of the items, but is it really? Let’s have a look at some well-known evaluation measures for the ranking presented in the example, and for a fairer ranking where we interleave right-leaning and left-leaning documents, starting with a right-leaning document.
Algorithm | RR | AP | nDCG |
relevance ranking (unfair) | 0.55 | 0.59 | 0.78 |
interleaved ranking (fair) | 0.76 | 0.45 | 0.78 |
Table 1 shows the expected evaluation results if, as stated in the example, 51% of the users like the right-leaning documents and 49% of the users like the left-leaning documents. For instance, the expected reciprocal rank (RR) for the relevance ranking in the example is 0.51 times 1 (51% of the users are satisfied with the first result returned) plus 0.49 times 1/11 (49% of the users are dissatisfied with the first 10 results, but satisfied with the eleventh result). The table also shows the expected average precision (AP) and the normalized discounted cumulative gain (nDCG). So, if we are interested in the rank of the first relevant result (RR), then the example ranking is not only unfair, it is also of lower overall quality. If we are more interested in recall as measured by AP, then the relevance ranking indeed outperforms the interleaved ranking. Finally, in the case of nDCG, the results are practically equal (the relevance ranking outperforms the interleaved ranking only in the third digit). nDCG is normally used in cases where we have graded relevance judgments. If we additionally assume that one of the right-leaning documents and one of the left-leaning documents is more relevant (relevance score 2) than the other relevant documents (relevance score 1), then the fair, interleaved ranking outperforms the unfair relevance ranking: 0.78 vs. 0.76. So, depending on our evaluation measure, the ranking by the “true average relevance” may not actually give the best-quality search engine, quite apart from its clearly unfair results.
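To make the arithmetic above concrete, here is a minimal Python sketch of the expected RR and nDCG for both rankings. It makes a few assumptions that are mine, not Morik et al.’s: the document identifiers (R1..R10, L1..L10) are made up, nDCG uses binary gains with a log2(rank + 1) discount and no rank cutoff, and the expectation is simply the share-weighted average over the two user groups. The sketch covers RR and nDCG only.

```python
import math

# Hypothetical document ids: 10 right-leaning and 10 left-leaning articles.
RIGHT = [f"R{i}" for i in range(1, 11)]
LEFT = [f"L{i}" for i in range(1, 11)]

# User groups as (share of users, set of articles they find relevant).
GROUPS = [(0.51, set(RIGHT)), (0.49, set(LEFT))]


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant document (0.0 if there is none)."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg(ranking, relevant):
    """Binary-gain nDCG over the full ranking (assumed variant, no cutoff)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranking, start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, len(relevant) + 1))
    return dcg / ideal


def expected(metric, ranking, groups=GROUPS):
    """Average a per-user metric over the groups, weighted by group share."""
    return sum(share * metric(ranking, relevant) for share, relevant in groups)


# All right-leaning articles first, then all left-leaning ones.
relevance_ranking = RIGHT + LEFT
# Alternate right and left, starting with a right-leaning article.
interleaved_ranking = [doc for pair in zip(RIGHT, LEFT) for doc in pair]

for name, ranking in [("relevance", relevance_ranking),
                      ("interleaved", interleaved_ranking)]:
    print(f"{name}: RR={expected(reciprocal_rank, ranking):.3f} "
          f"nDCG={expected(ndcg, ranking):.3f}")
```

Under these assumptions, the values agree with the RR and nDCG columns of the table once rounded to two digits, and the two nDCG values only differ in the third digit.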
Interestingly, rankings where two groups of users prefer different sets of documents were already discussed more than 44 years ago by Stephen Robertson when he introduced the probability ranking principle. Robertson (1977) attributed the principle to William Cooper. The paper’s appendix contains the following counter-example to the probability ranking principle, which Robertson also attributed to Cooper. The example follows the above example closely, but with different statistics for the two groups of users:
Cooper considers the problem of ranking the output of a system in response to a given request. Thus he is concerned with the class of users who put the same request to the system, and with a ranking of the documents in response to this one request which will optimize performance for this class of users. Consider, then, the following situation. The class of users (associated with this one request) consists of two sub-classes, U1 and U2; U1 has twice as many members as U2: Any user from U1 would be satisfied with any one of the documents D1–D9, but with no others. Any user U2 would be satisfied with document D10, but with no others. Hence: any document from D1–D9, considered on its own, has a probability of 2/3 of satisfying the next user who puts this request to the system. D10 has a probability of 1/3 of satisfying him/her; all other documents have probability zero. The probability ranking principle therefore says that D1–D9 should be given joint rank 1, D10 rank 2, and all others rank 3. But this means that while U1 users are satisfied with the first document they receive, U2 users have to reject nine documents before they reach the one they want. One could readily improve on the probability ranking, by giving D1 (say) rank 1, D10 rank 2, and D2–D9 and all others rank 3. Then U1 users are still satisfied with the first document, but U2 users are now satisfied with the second. Thus the ranking specified by the probability-ranking principle is not optimal. Such is Cooper’s counter-example.
(Robertson 1977)
Let’s again look at the evaluation results for the two rankings presented in the example: the relevance ranking and the improved ranking, which, as above, we label interleaved.
Algorithm | RR | AP | nDCG |
relevance ranking (unfair) | 0.70 | 0.70 | 0.76 |
interleaved ranking (fair) | 0.83 | 0.72 | 0.82 |
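For readers who want to check these numbers, here is a small Python sketch for Cooper’s example. The assumptions are mine: the joint rank-1 documents D1–D9 are listed in numerical order, the zero-probability documents are left out, AP averages the precision at each relevant rank over all relevant documents, and nDCG uses binary gains with a log2(rank + 1) discount.

```python
import math
from fractions import Fraction

# Cooper's setting: D1-D9 satisfy U1 users, D10 satisfies U2 users,
# and U1 has twice as many members as U2.
U1_DOCS = {f"D{i}" for i in range(1, 10)}
U2_DOCS = {"D10"}
GROUPS = [(Fraction(2, 3), U1_DOCS), (Fraction(1, 3), U2_DOCS)]


def hits(ranking, relevant):
    """1-based ranks at which a relevant document appears."""
    return [rank for rank, doc in enumerate(ranking, start=1) if doc in relevant]


def rr(ranking, relevant):
    """Reciprocal rank; assumes the ranking contains a relevant document."""
    return Fraction(1, hits(ranking, relevant)[0])


def ap(ranking, relevant):
    """Precision at each relevant rank, averaged over all relevant documents."""
    ranks = hits(ranking, relevant)
    return sum(Fraction(n, rank) for n, rank in enumerate(ranks, start=1)) / len(relevant)


def ndcg(ranking, relevant):
    """Binary-gain nDCG over the full ranking (assumed variant, no cutoff)."""
    dcg = sum(1 / math.log2(rank + 1) for rank in hits(ranking, relevant))
    ideal = sum(1 / math.log2(rank + 1) for rank in range(1, len(relevant) + 1))
    return dcg / ideal


def expected(metric, ranking):
    """Share-weighted average of a per-user metric over both user classes."""
    return float(sum(share * metric(ranking, relevant) for share, relevant in GROUPS))


# Probability ranking: D1..D9 first, then D10.
prp_ranking = [f"D{i}" for i in range(1, 11)]
# Cooper's improvement: D1, then D10, then D2..D9.
improved_ranking = ["D1", "D10"] + [f"D{i}" for i in range(2, 10)]

for name, ranking in [("relevance", prp_ranking), ("interleaved", improved_ranking)]:
    scores = ", ".join(f"{m.__name__}={expected(m, ranking):.3f}" for m in (rr, ap, ndcg))
    print(f"{name}: {scores}")
```

Under these assumptions the improved ranking scores higher on all three measures, in line with the table; small differences in the last digit can come from rounding.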
The example shows that the unfair ranker, which ranks all documents preferred by users from group U1 above those preferred by users from group U2, not only treats the minority group U2 unfairly but also produces lower-quality results on all three evaluation measures. But why would a search engine prefer this so-called relevance ranking? And why did Morik et al. (2020) call this ranking a ranking by the “true average relevance”?
To understand this, we have to dig a bit deeper into Robertson’s probability ranking principle. The principle states that under certain conditions, a ranking by the probability of relevance as done by Morik et al. (2020) will produce the best overall effectiveness of the system to its users that is obtainable on the basis of the data. Those conditions are the following:
- The relevance of a document to a request does not depend on the other documents in the collection;
- The proof relates only to a single request;
- Relevance is a dichotomous variable.
Condition 1 is clearly violated in our examples. For instance, in the example with right-leaning and left-leaning users, knowing that a user likes one right-leaning document should drastically change the probability of relevance for the other documents. Condition 3 is violated if we use graded relevance and evaluation measures like (n)DCG. If our aim is to build a fair ranker, then we cannot blindly apply the probability ranking principle.[2]
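A tiny Python sketch illustrates why Condition 1 fails in the news example. It assumes, as in the quote from Morik et al., that there are exactly two user types with shares 0.51 and 0.49; the function name and data layout are made up for illustration.

```python
# Two user types from the news example, with the shares stated in the quote.
users = [
    {"share": 0.51, "likes": "right"},
    {"share": 0.49, "likes": "left"},
]


def p_relevant(leaning, given_liked=None):
    """P(a document with this leaning is relevant to the next user),
    optionally conditioned on that user having liked another document
    with leaning `given_liked`."""
    pool = [u for u in users if given_liked is None or u["likes"] == given_liked]
    total = sum(u["share"] for u in pool)
    return sum(u["share"] for u in pool if u["likes"] == leaning) / total


print(p_relevant("right"))                       # marginal probability
print(p_relevant("right", given_liked="right"))  # after one right-leaning like
print(p_relevant("left", given_liked="right"))   # left-leaning, same user
```

The marginal probability of relevance for a right-leaning document is 0.51, but once we know that the user liked one right-leaning document it jumps to 1 for the other right-leaning documents and drops to 0 for the left-leaning ones. The relevance of a document therefore does depend on the other documents.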
My conclusion? We’ve known about the problem of unfair rankings for a long time. If the conditions for the probability ranking principle are not met, then we a) may not get the overall best quality ranking; and b) instead get a biased ranking that systematically and unfairly favours the majority group of users over the minority group.
Sadly, what happened to Latanya Sweeney may very well have been the following: Google optimizes its advertisement revenue using the click-through rate, i.e., Google uses a click-based relevance estimator that ranks advertisements by their probability of relevance under the conditions of the probability ranking principle.[3] These conditions are not met. There are at least two groups of people: 1) a racist majority group that clicks on background checks for “black names”, and 2) a minority group that clicks on advertisements for connecting on social media. Even though both groups may be roughly equal in size, Google only showed the top advertisements of the majority group. Google thereby showed biased results that adversely impact the minority group, and furthermore probably did not even optimize for advertisement revenue.
The most important message here: The relevance of the results of a search algorithm (and therefore the search engine’s revenue) is not necessarily at odds with the fairness of the results. Cooper’s example shows that there are cases where improving the quality of the results (measured in RR, AP or nDCG) also improves the fairness of the results.[4]
References
- Morik, M., Singh, A., Hong, J., Joachims, T. (2020). Controlling Fairness and Bias in Dynamic Learning-to-Rank. In: Proceedings of ACM SIGIR, 429-438.
- O’Neil, C. (2016). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Crown.
- Robertson, S.E. (1977). The probability ranking principle in IR. Journal of Documentation, 33(4).
- Sweeney, L. (2013). Discrimination in online ad delivery. Communications of the ACM, 56(5), 44-54. (arXiv:1301.6822)
1. I use DuckDuckGo for all my other searches.
2. I don’t want to be overly critical about a SIGIR best paper, but curiously, Morik et al. (2020) (incorrectly) cite Robertson’s probability ranking principle paper as follows: “Fortunately, it is easy to show (Robertson 1977) that sorting-based policies π(x) = argsort_{d∈D} R(d|x) (…) are optimal for virtually all [evaluation measures] commonly used in IR (e.g. DCG).”
3. “Google is evil” is another explanation.
4. Note that to get a truly fair ranking, we should frequently switch both groups when interleaving the documents, starting with the minority group with a probability proportional to the size of the group. This will somewhat negatively impact the expected search quality.