Programmatically generating annotations for de-identification of clinical data
by Ismail Güçlü
Clinical records may contain protected health information (PHI) which are privacy sensitive information. It is important to annotate and replace PHI in unstructured medical records, before being able to share the data for other research purposes. Machine learning models are quick to implement and can achieve competitive results (micro-averaged F1-scores Dutch radiology dataset: 0.88 and English i2b2 dataset: 0.87). However, to develop machine learning models, we need training data. In this project, we applied weak supervision to annotate and collect training data for de-identification of medical records. It is essential to automate this process as manual annotation is a laborious and repetitive task. We used the two human annotated datasets, where we ‘removed’ the gold annotations to weakly tag PHI instances in medical records, where we unified the output labels using two different aggregation models: aggregation at the token level (Snorkel) and sequential labelling (Skweak). The output is then used to train a discriminative end model where we achieve competitive results on the Dutch dataset (micro-averaged F1 score: 0.76) whereas performance on the English dataset is sub-optimal (micro-averaged F1-score: 0.49). The results indicate that on structured PHI tags we approach human annotated results, but more complicated entities still need more attention.
Was fairness in IR discussed by Cooper and Robertson in the 1970’s?
As many people, I love to do an “ego search” in Google1, to see what comes up when I search my name. When Latanya Sweeny did such an ego search about a decade ago, she was shocked to find advertisements for background checks with the headline “Latanya Sweeny, Arrested?”. Sweeny, professor at Harvard, was never arrested. One of her colleagues suggested that the advertisement came up because of her “black name” – Latanya is a popular name among Americans of African descent – and Google’s advertisement search algorithm is racist. Motivated by this incident, Sweeny (2013) investigated the Google results for more than 2,000 racially associated personal names, and showed that Google’s advertisement are indeed systematically racially biased. Sweeny’s work was pivotal in putting bias and fairness of algorithms on the global research agenda.
The harm that (search) algorithms may do is substantial, especially if the algorithms are opaque, and if clicks on the (racist, unfair) results are fed back into the algorithm, thereby creating a destructive feedback loop where clicks on unfair results further reinforce the system’s unfairness. Cathy O’Neil compared such algorithms to weapons of mass destruction, because their destruction scales to hundreds of millions of (Google) users. O’Neill (2016), wittingly called her book, which is highly recommended, Weapons of Math Destruction.
Consider the following omniscient variant of the naive algorithm that ranks the articles by their true average relevance (i.e. the true fraction of users who want to read each article). How can this ranking be unfair? Let us assume that we have two groups of articles, Gright and Gleft, with 10 items each (i.e. articles from politically right-and left-leaning sources). 51% of the users (right-leaning) want to read the articles in group Gright, but not the articles in group Gleft. In reverse, the remaining 49% of the users (left-leaning) like only the articles in Gleft. Ranking articles solely by their true average relevance puts items from Gright into positions 1-10 and the items from Gleft in positions 11-20. This means the platform gives the articles in Gleft vastly less exposure than those in Gright. We argue that this can be considered unfair since the two groups receive disproportionately different outcomes despite having similar merit (i.e. relevance). Here, a 2% difference in average relevance leads to a much larger difference in exposure between the groups.
This example clearly shows a problem with fairness since right-leaning users have all their preferred documents ranked before the documents that are preferred by left-leaning users. Documents from the minority group (left-leaning in the example) are never even shown on the first results page. The example furthermore suggests that the ranking is optimal given the “true relevance” of the items, but is it really? Let’s have a look at some well-known evaluation measures for the ranking presented in the example, and for a fairer ranking where we interleave right-leaning and left-leaning documents, starting with a right-leaning document.
relevance ranking (unfair)
interleaved ranking (fair)
Table 1, evaluation results for the example of Morik et al. (2020)
Table 1 shows the expected evaluation results if, as stated in the example, 51% of the users like the right-leaning documents and 49% of the users like the left-leaning documents. For instance, the expected reciprocal rank (RR) for the relevance ranking in the example is 0.51 times 1 (51% of the users are satisfied with the first result returned) plus 0.49 times 1/11 (49% of the users are dissatisfied with the first 10 results, but satisfied with the eleventh result). The table also shows expected average precision (AP) and the normalized discounted cumulative gain (nDCG). So, if we are interested in the rank of the first relevant result (RR), then the example ranking is not only unfair, it is also of lower overall quality. If we are more interested in recall as measured by AP, then the relevance ranking indeed outperforms the interleaved ranking. Finally, in case of nDCG, the results are practically equal (the relevance ranking outperforms the interleaved ranking in the third digit). NDCG is normally used in cases where we have graded relevance judgments. If we additionally assume that one of the right-leaning documents and one of the left-leaning is more relevant (relevance score 2) than the other relevant documents (relevance score 1), then the fair, interleaved ranking outperforms the unfair, relevance ranking: 0.78 vs. 0.76. So, depending on our evaluation measures, the ranking by the “true average relevance” may actually not give the best quality search engine (besides the clearly unfair results).
Interestingly, rankings where two groups of users prefer different sets of documents were already discussed more than 44 years ago by Stephen Robertson when he introduced the probability ranking principle. Robertson (1977) contributed the principle to William Cooper. The paper’s appendix contains the following counter-example to the probability ranking principle, which Robertson also contributed to Cooper. The example follows the above example closely, but with different statistics for the two groups of users:
Cooper considers the problem of ranking the output of a system in response to a given request. Thus he is concerned with the class of users who put the same request to the system, and with a ranking of the documents in response to this one request which will optimize performance for this class of users. Consider, then, the following situation. The class of users (associated with this one request) consists of two sub-classes, U1 and U2; U1 has twice as many members as U2: Any user from U1 would be satisfied with any one of the documents D1–D9, but with no others. Any user U2 would be satisfied with document D10, but with no others. Hence: any document from D1–D9, considered on its own, has a probability of 2/3 of satisfying the next user who puts this request to the system. D10 has a probability of 1/3 of satisfying him/her; all other documents have probability zero. The probability ranking principle therefore says that D1–D9 should be given joint rank 1, D10 rank 2, and all others rank 3. But this means that while U1 users are satisfied with the first document they receive, U2 users have to reject nine documents before they reach the one they want. One could readily improve on the probability ranking, by giving D1 (say) rank 1, D10 rank 2, and D2–D9 and all others rank 3. Then U1 users are still satisfied with the first document, but U2 users are now satisfied with the second. Thus the ranking specified by the probability-ranking principle is not optimal. Such is Cooper’s counter-example.
Let’s again look at the evaluation results for the rankings presented in the example, the relevance ranking and the improved ranking, which we indicate as above as interleaved.
relevance ranking (unfair)
interleaved ranking (fair)
Table 2, evaluation results for Cooper’s example (Robertson 1977)
The example shows that the unfair ranker, that ranks all documents preferred by users from group U1 above those preferred by users from group U2, not only treats the minority group U2 unfairly, it also produces lower quality results on all three evaluation measures. But, why would a search engine prefer this so-called relevance ranking? and why did Morik et al. (2020) call this ranking a ranking by the “true average relevance”?
To understand this, we have to dig a bit deeper into Robertson’s probability ranking principle. The principle states that under certain conditions, a ranking by the probability of relevance as done by Morik et al. (2020) will produce the best overall effectiveness of the system to its users that is obtainable on the basis of the data. Those conditions are the following:
The relevance of a document to a request does not depend on the other documents in the collection;
The proof relates only to a single request;
Relevance is a dichotomous variable.
Condition 1 is clearly violated in our examples. For instance in the example with right-leaning and left-leaning users, knowing that a user likes one right-leaning document should drastically change the probability of relevance for the other documents. Condition 3 is violated if we use graded relevance and evaluation measures like (n)DCG. If our aim is to build a fair ranker, then we cannot blindly apply the probability ranking principle.2
My conclusion? We’ve known about the problem of unfair rankings for a long time. If the conditions for the probability ranking principle are not met, then we a) may not get the overall best quality ranking; and b) instead get a biased ranking that systematically and unfairly favours the majority group of users over the minority group.
Sadly, what happened to Latanya Sweeny may very well have been the following: Google optimizes its advertisement revenue using the click-through rate, i.e., Google uses a click-based relevance estimator that ranks advertisements by their probability of relevance under the conditions of the probability of ranking principle.3 These conditions are not met. There are at least two groups of people: 1) A racist majority group that clicks background checks for “black names”, and 2) A minority group that clicks advertisements for connecting on social media. Even though both groups may be roughly equal in size, Google only showed the top advertisements of the majority group. Google thereby showed biased results that adversely impact the minority group, and furthermore probably did not even optimize for advertisement revenue.
The most important message here: The relevance of the results of a search algorithm (and therefore the search engine’s revenue) is not necessarily at odds with the fairness of the results. Cooper’s example shows that there are cases where improving the quality of the results (measured in RR, AP or nDCG) also improves the fairness of the results.4
1. I use DuckDuckGo for all my other searches. 2. I don’t want to be overly critical about a SIGIR best paper, but curiously, Morik et al. (2020) (incorrectly) cite Robertson’s probability ranking principle paper as follows: “Fortunately, it is easy to show (Robertson 1977) that sorting-based policies π(x) = argsortd∈DR(d|x) (…) are optimal for virtually all [evaluation measures] commonly used in IR (e.g. DCG).” 3. Google is evil is another explanation. 4. Note that to get a truly fair ranking, we should frequently switch both groups when interleaving the documents, starting with the minority group with a probability proportional to the size of the group. This will somewhat negatively impact the expected search quality.
On 31 March 2021, the Wednesday morning of ECIR 2021, the conference participants joined with seven panellists in a discussion on Open Access and Information Retrieval (IR), or more accurately, on the lack of open access publishing in IR. Discussion topics included the experience of researchers with open access in Africa; business models for open access, in particular how to run a sustainable open access conference like ECIR; open access plans at Springer, the BCS and the ACM; and finally, experience with open access publishing in related fields, notably in Computational Linguistics.
People use interactive systems, such as search engines, as the main tool to obtain information. To satisfy the information needs, such systems usually provide a list of items that are selected out of a large candidate set and then sorted in the decreasing order of their usefulness. The result lists are generated by a ranking algorithm, called ranker, which takes the request of user and candidate items as the input and decides the order of candidate items. The quality of these systems depends on the underlying rankers.
There are two main approaches to optimize the ranker in an interactive system: using data annotated by humans or using the interactive user feedback. The first approach has been widely studied in history, also called offline learning to rank, and is the industry standard. However, the annotated data may not well represent information needs of users and are not timely. Thus, the first approaches may lead to suboptimal rankers. The second approach optimizes rankers by using interactive feedback. This thesis considers the second approach, learning from the interactive feedback. The reasons are two-fold:
Everyday, millions of users interact with the interactive systems and generate a huge number of interactions, from which we can extract the information needs of users.
Learning from the interactive data have more potentials to assist in designing the online algorithms.
Specifically, this thesis considers the task of learning from the user click feedback. The main contribution of this thesis is proposing a safe online learning to re-rank algorithm, named BubbleRank, which addresses one main disadvantage of online learning, i.e., the safety issue, by combining the advantages of both offline and online learning to rank algorithms. The thesis also proposes three other online algorithms, each of which solves unique online ranker optimization problems. All the proposed algorithms are theoretically sound and empirically effective.
Most publications in Information Retrieval are available via subscriptions. These include the ECIR proceedings published by Springer on behalf of the BCS, and the SIGIR proceedings published by the ACM. There is a trend to gradually change this situation to open access publishing. At Springer this is done by giving authors the choice to pay for open access, and by international agreements like Springer’s Compact. At ACM, this is also done by giving authors the choice to pay, and by agreements between ACM and individual institutions.
The panel discusses the effects of this situation on inclusiveness of the field, in particular on how we can support researchers from low income countries. We discuss the experience of researchers with open access in Africa; We discuss business models for open access, in particular how to run a sustainable open access conference like ECIR; We discuss open access plans at Springer, the BCS and the ACM; Finally, we discuss experience with open access publishing in related fields, in particular in Computational Linguistics. The discussion panel consists of:
Hassina Aliane | CERIST, Algeria
Ralf Gerstner | Springer Heidelberg, Germany
Min-Yen Kan | National University of Singapore
Haiming Liu | University of Bedfordshire, United Kingdom
Joao Magalhaes | Universidade Nova de Lisboa, Portugal
Hussein Suleman | University of Cape Town, South Africa
Uncovering the Properties of Full Ranking on Fully Labeled Data
by Negin Ghasemi and Djoerd Hiemstra
Recently, various information retrieval models have been proposed based on pre-trained BERT models, achieving outstanding performance. The majority of such models have been tested on data collections with partial relevance labels, where various potentially relevant documents have not been exposed to the annotators. Therefore, evaluating BERT-based rankers may lead to biased and unfair evaluation results, simply because a relevant document has not been exposed to the annotators while creating the collection. In our work, we aim to better understand a BERT-based ranker’s strengths compared to a BERT-based re-ranker and the initial ranker. To this aim, we investigate BERT-based rankers performance on the Cranfield collection, which comes with full relevance judgment on all documents in the collection. Our results demonstrate the BERT-based full ranker’s effectiveness, as opposed to the BERT-based re-ranker and BM25. Also, analysis shows that there are documents that the BERT-based full-ranker finds that were not found by the initial ranker.
To be presented at the Conference of the European Chapter of the Association for Computational Linguistics EACL Student Workshop on 22 April 2021.
Slow, content-based, federated, explainable, and fair
Access to information on the world wide web is dominated by two monopolists, Google and Facebook, that decide most of the information we see. Their business models are based on “surveillance capitalism”, that is, profiting from getting to know as much as possible about individuals that use the platforms. The information about individuals is used to maximize their engagement thereby maximizing the number of targeted advertisements shown to these individuals. Google’s and Facebook’s financial success has influenced many other online businesses as well as a substantial part of the academic research agenda in machine learning and information retrieval, that increasingly focuses on training on huge datasets, literally building on the success of Google and Facebook by using their pre-trained models (e.g. BERT and ELMo). Large pre-trained models and algorithms that maximize engagement come with many societal problems: They have been shown to discriminate minority groups, to manipulate elections, to radicalize users, and even to enable genocide. Looking forward to 2021-2027, we aim to research the following technical alternatives that do not exhibit these problems: 1) slow, content-based, learning maximizing user satisfaction instead of fast, click-based learning maximizing user engagement; 2) federated information access and search instead of centralized access and search; 3) explainable, fair approaches instead of black-box, biased approaches.
The Dutch government has set the target that by 2020, 100% of scientific publications financed with public money must be open access. As iCIS, we are not even half way. In the Radboud Repository less than 50% of the publications by Data Science, Software Science, and Digital Security are listed as open access. The slides below make a case for a new Open Access Strategy at iCIS that involves:
Putting all iCIS publications on-line after a reasonable time (as permitted by Dutch copyright law), preferably in the Radboud Repository;
Encouraging so-called diamond open access publishing (where open access publications are paid by donations and volunteer work from authors, editors, peer reviewers, and web masters);
Discouraging closed access as well as so-called gold open access publishing (where authors pay expensive article processing charges);
Query autocompletions help users of search engines to speed up their searches by recommending completions of partially typed queries in a drop down box. These recommended query autocompletions are usually based on large logs of queries that were previously entered by the search engine’s users. Therefore, misinformation entered — either accidentally or purposely to manipulate the search engine — might end up in the search engine’s recommendations, potentially harming organizations, individuals, and groups of people. This paper proposes an alternative approach for generating query autocompletions by extracting anchor texts from a large web crawl, without the need to use query logs. Our evaluation shows that even though query log autocompletions perform better for shorter queries, anchor text autocompletions outperform query log autocompletions for queries of 2 words or more.
To be presented at the 2nd International Symposium on Open Search Technology (OSSYM 2020), 12-14 October 2020, CERN, Geneva, Switzerland.