Information retrieval and recommender systems based on machine learning can be used to make decisions about people. Government agencies can use such systems to detect welfare fraud, insurers can use them to predict risks and to set insurance premiums, and companies can use them to select the best people from a list job applicants. Such systems can lead to more efficiency, and could improve our society in many ways. However, such AI-driven decision-making also brings risks. This project focuses on the risk that such AI systems lead to illegal discrimination, for instance harming people of a certain ethnicity, or other types of unfairness. A different type of unfairness could concern, for instance, a system that reinforces financial inequality in society. Recent machine learning work on measures of fairness has resulted in several competing approaches for measuring fairness. There is no consensus on what is the best way to measure fairness and the measures often depend on the type of machine learning that is applied. Based on the application of existing measures on real-world data, we suspect that many proposed measures are not that helpful in practice. In this project, you will study measures of fairness, answering questions such as the following. To what extent can legal non-discrimination norms be translated into fairness measures for machine learning? Can we measure fairness independently of the machine learning approach? Can we show which machine learning methods are the most appropriate to achieve non-discrimination and fairness? The project concerns primarily machine learning for information retrieval and recommendation, but is interdisciplinary, as it is also informed by legal norms. The project will be supervised by Professor Hiemstra, professor of data science and federated search, and Professor Zuiderveen Borgesius, professor of ICT and law.
You hold a completed Master’s Degree or Research Master’s degree in computer science, data science, machine learning, artificial intelligence, or a related discipline.
You have good programming skills.
You have good command of spoken and written English.
We encourage you to apply even if you think you do not meet all the requirements.
Now that electronic health records are commonly used, the availability of clinical texts is growing. This workshop discusses the automatic analysis of textual clinical health data to advance medical research and improve healthcare related services. We especially encourage presentations discussing possibilities to share clinical texts, models and tools for clinical natural language processing (NLP). In practice, privacy- and legal regulations prevent the free sharing and combination of electronic health records themselves, but de-identified texts, NLP tools and intermediate results may be shared. We hope that sharing will promote cooperation within the Dutch-speaking countries, as well as advance the research in Clinical NLP in those countries. Relevant topics include, but are not limited to:
Data sets with clinical texts
Open source tools for Clinical NLP
Information extraction from clinical text
Information retrieval for clinical text
Adapting standard NLP tools for clinical text
De-identification and ways to preserve privacy in clinical data
Using medical terminologies and ontologies
Annotation schemes and annotation methodology for clinical data
Evaluation methods for the clinical domain
Text-based clinical prediction models
Speech recognition for clinical text
We solicit short presentations (15 to 20 minutes) from researchers covering recent work, including work in progress and work that was recently published at journals and/or conferences in the field or made available via data and software sharing platforms like Zenodo or Github. Please email the title and abstract of your presentation to: email@example.com before 12 October 2021.
Analyzing the Diversity and Performance of BERT in Unified Mobile Search
by Negin Ghasemi, Mohammad Aliannejadi, and Djoerd Hiemstra
A unified mobile search framework aims to identify the mobile apps that can satisfy a user’s information need and route the user’s query to them. Previous work has shown that resource descriptions for mobile apps are sparse as they rely on the app’s previous queries. This problem puts certain apps in dominance and leaves out the resource-scarce apps from the top ranks. In this case, we need a ranker that goes beyond simple lexical matching. Therefore, our goal is to study the extent of a BERT-based ranker’s ability to improve the quality and diversity of app selection. To this end, we compare the results of the BERT-based ranker with other information retrieval models, focusing on the analysis of selected apps diversification. Our analysis shows that the BERT-based ranker selects more diverse apps while improving the quality of baseline results by selecting the relevant apps such as Facebook and Contacts for more personal queries and decreasing the bias towards the dominant resources such as the Google Search app.
The concept schedule for the new SIKS course “Advances in Information Retrieval” is out, featuring the best IR research in the Netherlands, if I may say so. More information at: http://www.siks.nl/IR-2021.php
Monday 4 October 2021
Welcome / coffee
10:00h – 11:45h.
(2 x 45 minutes)
Mohammad Aliannejadi and Antonis Krasakis (Univerity of Amsterdam) Conversational Search
12:00 – 12:45h.
Rolf Jagerman (Google)and Harrie
Oosterhuis (Radboud University) Unbiased Learning to Rank, Part 1
12:45 – 13:45h.
13:45 – 16:15h.
(3 x 40 minutes)
Harrie Oosterhuis (Radboud University) and Rolf Jagerman (Google) Unbiased Learning to Rank, Part 2
Medication annotation in medical reports using weak supervision
by Fien Ockers
By detecting textual references to medication in the daily reports written in different healthcare institutions, the resulting medication information can be used for research purposes like detecting common occurring adverse events or executing a comparative study into the effectiveness of different treatments. In this project, 4 different models, including a CRF model and three BERT-based models, are used to solve this medication detection task. They are not only trained on a smaller manually annotated train set but also on two extended train sets that are created using two weak supervision systems, Snorkel and Skweak. It is found that the CRF model and RobBERT are the best performing models, and that performance is structurally higher for models trained on the manually annotated train set than the extended train sets. However, model performance for the extended train sets does not fall behind far, showing the potential of using a weak supervision system. Future research could either focus on training a BERT-based tokenizer and model further on the medical domain or focus on expanding the labelling functions used in the weak supervision systems to improve recall or generalize to other medication-related entities such as dosages or modes of administration.
Earlier this year I cancelled my ACM membership. Here’s why:
Dear Vicki Hanson, dear ACM renewal,
A few months ago, my ACM membership expired. I am a senior member of ACM and I’ve been a member since the early 2000s. For several years now, I am having doubts about renewing my membership, because the ACM keeps its publications closed access in its Digital Library (or “clopen” according to Moshe Vardi’s infamous 2009 CACM article). For the past 25 years, I see SIGIR stagnating (even though web search became *the* killer application on the web!), while related fields like Machine Learning (ML) and Natural Language Processing (NLP) thrive – communities that embraced open access early and fully. When I started my career, the ML and NLP communities were similar in size or smaller than the (SIG-)IR community, but now their communities are much bigger, more diverse (including many researchers from low-income countries that cannot afford subscriptions), and their papers cited and downloaded more. Of course, I cannot make a scientific claim about the different developments of ML and NLP vs. (SIG-)IR, but the positive effect of open access on citations and downloads is well-researched and well-documented in many fields, including computer science.
Dear Vicki, your email below did not have the intended effect of pursuing me to renew. On the contrary, I felt that the “commitment to open the DL and make ACM’s publications freely available to all” you mentioned was misleading given ACM’s history and given the fact that ACM publicly denounced open access by signing the letter of the AAP last year.
I would like to change my membership to a SIGIR-only membership, because I really value the SIGIR officers’ commitment to the field. They are doing their work as volunteers, even taking vacation time from their jobs to work on making SIGIR and the SIGIR conferences a success. For years, the officers too, have been misled by the ACM, for instance by the openTOC (open table of contents) policy that puts the burden of open access publishing on the conference organizers and volunteers that change every year. This year, confused by ACM’s misleading statements, the SIGIR executive committee claimed at the SIGIR business meeting that “all ACM SIGIR publications are permanent open access on the DL”. Needless to say, this is not the case, and will not be the case for another 5 years, if I read your email right.
I followed the ACM discussions on open access quite closely. I believe ACM Open, ACM’s transition to an “author-subscription” fee, is problematic in at least two ways. First, it’s risky, because a small number of institutions, the tier 1 institutions, can blow up the deal. Institutions like UC Berkely and MIT know that they’re doing most of the work, and this business model gives them a very strong bargaining position. Second, and more importantly, the $700 to $1,700 article-processing fees for authors of non-participating institutions will hurt the researchers of institutions in low-income countries and the global south. This model (like the current closed model) is effectively excluding researchers from Africa, Middle and South America, South Asia, Eastern Europe, the Caribbean, etc. I know that ACM is not getting many subscriptions fees from institutions in these countries now. Therefore, ACM is not having many members from those countries. That’s one of the problems we need to fix.
ACM needs a volunteer-led, diamond open access digital library, where the author does not pay, the reader does not pay, and the entire mechanism is self-funded, running on the volunteer work by authors, reviewers, editors, technicians, admins, and on micro-donations by friend organizations such as universities and research centers. Such a DL fully aligns with ACM’s member-driven and volunteer-led activities. Sure, this means that ACM will have less income, but our colleagues at related professional societies and journals, such as the ACL Anthology and the Journal of Machine Learning Research, show that this is a viable business model for scientific publishing that in the end benefits the community, and the society’s members, the most. I will re-apply as a full ACM member once that happens.
Yours sincerely, Djoerd Hiemstra Radboud University
Programmatically generating annotations for de-identification of clinical data
by Ismail Güçlü
Clinical records may contain protected health information (PHI) which are privacy sensitive information. It is important to annotate and replace PHI in unstructured medical records, before being able to share the data for other research purposes. Machine learning models are quick to implement and can achieve competitive results (micro-averaged F1-scores Dutch radiology dataset: 0.88 and English i2b2 dataset: 0.87). However, to develop machine learning models, we need training data. In this project, we applied weak supervision to annotate and collect training data for de-identification of medical records. It is essential to automate this process as manual annotation is a laborious and repetitive task. We used the two human annotated datasets, where we ‘removed’ the gold annotations to weakly tag PHI instances in medical records, where we unified the output labels using two different aggregation models: aggregation at the token level (Snorkel) and sequential labelling (Skweak). The output is then used to train a discriminative end model where we achieve competitive results on the Dutch dataset (micro-averaged F1 score: 0.76) whereas performance on the English dataset is sub-optimal (micro-averaged F1-score: 0.49). The results indicate that on structured PHI tags we approach human annotated results, but more complicated entities still need more attention.
Was fairness in IR discussed by Cooper and Robertson in the 1970’s?
As many people, I love to do an “ego search” in Google1, to see what comes up when I search my name. When Latanya Sweeny did such an ego search about a decade ago, she was shocked to find advertisements for background checks with the headline “Latanya Sweeny, Arrested?”. Sweeny, professor at Harvard, was never arrested. One of her colleagues suggested that the advertisement came up because of her “black name” – Latanya is a popular name among Americans of African descent – and Google’s advertisement search algorithm is racist. Motivated by this incident, Sweeny (2013) investigated the Google results for more than 2,000 racially associated personal names, and showed that Google’s advertisement are indeed systematically racially biased. Sweeny’s work was pivotal in putting bias and fairness of algorithms on the global research agenda.
The harm that (search) algorithms may do is substantial, especially if the algorithms are opaque, and if clicks on the (racist, unfair) results are fed back into the algorithm, thereby creating a destructive feedback loop where clicks on unfair results further reinforce the system’s unfairness. Cathy O’Neil compared such algorithms to weapons of mass destruction, because their destruction scales to hundreds of millions of (Google) users. O’Neill (2016), wittingly called her book, which is highly recommended, Weapons of Math Destruction.
Consider the following omniscient variant of the naive algorithm that ranks the articles by their true average relevance (i.e. the true fraction of users who want to read each article). How can this ranking be unfair? Let us assume that we have two groups of articles, Gright and Gleft, with 10 items each (i.e. articles from politically right-and left-leaning sources). 51% of the users (right-leaning) want to read the articles in group Gright, but not the articles in group Gleft. In reverse, the remaining 49% of the users (left-leaning) like only the articles in Gleft. Ranking articles solely by their true average relevance puts items from Gright into positions 1-10 and the items from Gleft in positions 11-20. This means the platform gives the articles in Gleft vastly less exposure than those in Gright. We argue that this can be considered unfair since the two groups receive disproportionately different outcomes despite having similar merit (i.e. relevance). Here, a 2% difference in average relevance leads to a much larger difference in exposure between the groups.
This example clearly shows a problem with fairness since right-leaning users have all their preferred documents ranked before the documents that are preferred by left-leaning users. Documents from the minority group (left-leaning in the example) are never even shown on the first results page. The example furthermore suggests that the ranking is optimal given the “true relevance” of the items, but is it really? Let’s have a look at some well-known evaluation measures for the ranking presented in the example, and for a fairer ranking where we interleave right-leaning and left-leaning documents, starting with a right-leaning document.
relevance ranking (unfair)
interleaved ranking (fair)
Table 1, evaluation results for the example of Morik et al. (2020)
Table 1 shows the expected evaluation results if, as stated in the example, 51% of the users like the right-leaning documents and 49% of the users like the left-leaning documents. For instance, the expected reciprocal rank (RR) for the relevance ranking in the example is 0.51 times 1 (51% of the users are satisfied with the first result returned) plus 0.49 times 1/11 (49% of the users are dissatisfied with the first 10 results, but satisfied with the eleventh result). The table also shows expected average precision (AP) and the normalized discounted cumulative gain (nDCG). So, if we are interested in the rank of the first relevant result (RR), then the example ranking is not only unfair, it is also of lower overall quality. If we are more interested in recall as measured by AP, then the relevance ranking indeed outperforms the interleaved ranking. Finally, in case of nDCG, the results are practically equal (the relevance ranking outperforms the interleaved ranking in the third digit). NDCG is normally used in cases where we have graded relevance judgments. If we additionally assume that one of the right-leaning documents and one of the left-leaning is more relevant (relevance score 2) than the other relevant documents (relevance score 1), then the fair, interleaved ranking outperforms the unfair, relevance ranking: 0.78 vs. 0.76. So, depending on our evaluation measures, the ranking by the “true average relevance” may actually not give the best quality search engine (besides the clearly unfair results).
Interestingly, rankings where two groups of users prefer different sets of documents were already discussed more than 44 years ago by Stephen Robertson when he introduced the probability ranking principle. Robertson (1977) contributed the principle to William Cooper. The paper’s appendix contains the following counter-example to the probability ranking principle, which Robertson also contributed to Cooper. The example follows the above example closely, but with different statistics for the two groups of users:
Cooper considers the problem of ranking the output of a system in response to a given request. Thus he is concerned with the class of users who put the same request to the system, and with a ranking of the documents in response to this one request which will optimize performance for this class of users. Consider, then, the following situation. The class of users (associated with this one request) consists of two sub-classes, U1 and U2; U1 has twice as many members as U2: Any user from U1 would be satisfied with any one of the documents D1–D9, but with no others. Any user U2 would be satisfied with document D10, but with no others. Hence: any document from D1–D9, considered on its own, has a probability of 2/3 of satisfying the next user who puts this request to the system. D10 has a probability of 1/3 of satisfying him/her; all other documents have probability zero. The probability ranking principle therefore says that D1–D9 should be given joint rank 1, D10 rank 2, and all others rank 3. But this means that while U1 users are satisfied with the first document they receive, U2 users have to reject nine documents before they reach the one they want. One could readily improve on the probability ranking, by giving D1 (say) rank 1, D10 rank 2, and D2–D9 and all others rank 3. Then U1 users are still satisfied with the first document, but U2 users are now satisfied with the second. Thus the ranking specified by the probability-ranking principle is not optimal. Such is Cooper’s counter-example.
Let’s again look at the evaluation results for the rankings presented in the example, the relevance ranking and the improved ranking, which we indicate as above as interleaved.
relevance ranking (unfair)
interleaved ranking (fair)
Table 2, evaluation results for Cooper’s example (Robertson 1977)
The example shows that the unfair ranker, that ranks all documents preferred by users from group U1 above those preferred by users from group U2, not only treats the minority group U2 unfairly, it also produces lower quality results on all three evaluation measures. But, why would a search engine prefer this so-called relevance ranking? and why did Morik et al. (2020) call this ranking a ranking by the “true average relevance”?
To understand this, we have to dig a bit deeper into Robertson’s probability ranking principle. The principle states that under certain conditions, a ranking by the probability of relevance as done by Morik et al. (2020) will produce the best overall effectiveness of the system to its users that is obtainable on the basis of the data. Those conditions are the following:
The relevance of a document to a request does not depend on the other documents in the collection;
The proof relates only to a single request;
Relevance is a dichotomous variable.
Condition 1 is clearly violated in our examples. For instance in the example with right-leaning and left-leaning users, knowing that a user likes one right-leaning document should drastically change the probability of relevance for the other documents. Condition 3 is violated if we use graded relevance and evaluation measures like (n)DCG. If our aim is to build a fair ranker, then we cannot blindly apply the probability ranking principle.2
My conclusion? We’ve known about the problem of unfair rankings for a long time. If the conditions for the probability ranking principle are not met, then we a) may not get the overall best quality ranking; and b) instead get a biased ranking that systematically and unfairly favours the majority group of users over the minority group.
Sadly, what happened to Latanya Sweeny may very well have been the following: Google optimizes its advertisement revenue using the click-through rate, i.e., Google uses a click-based relevance estimator that ranks advertisements by their probability of relevance under the conditions of the probability of ranking principle.3 These conditions are not met. There are at least two groups of people: 1) A racist majority group that clicks background checks for “black names”, and 2) A minority group that clicks advertisements for connecting on social media. Even though both groups may be roughly equal in size, Google only showed the top advertisements of the majority group. Google thereby showed biased results that adversely impact the minority group, and furthermore probably did not even optimize for advertisement revenue.
The most important message here: The relevance of the results of a search algorithm (and therefore the search engine’s revenue) is not necessarily at odds with the fairness of the results. Cooper’s example shows that there are cases where improving the quality of the results (measured in RR, AP or nDCG) also improves the fairness of the results.4
1. I use DuckDuckGo for all my other searches. 2. I don’t want to be overly critical about a SIGIR best paper, but curiously, Morik et al. (2020) (incorrectly) cite Robertson’s probability ranking principle paper as follows: “Fortunately, it is easy to show (Robertson 1977) that sorting-based policies π(x) = argsortd∈DR(d|x) (…) are optimal for virtually all [evaluation measures] commonly used in IR (e.g. DCG).” 3. Google is evil is another explanation. 4. Note that to get a truly fair ranking, we should frequently switch both groups when interleaving the documents, starting with the minority group with a probability proportional to the size of the group. This will somewhat negatively impact the expected search quality.
On 31 March 2021, the Wednesday morning of ECIR 2021, the conference participants joined with seven panellists in a discussion on Open Access and Information Retrieval (IR), or more accurately, on the lack of open access publishing in IR. Discussion topics included the experience of researchers with open access in Africa; business models for open access, in particular how to run a sustainable open access conference like ECIR; open access plans at Springer, the BCS and the ACM; and finally, experience with open access publishing in related fields, notably in Computational Linguistics.