Maurice Verbrugge graduates on the BERT Ranking Paradigm

The BERT Ranking Paradigm: Training Strategies Evaluated

by Maurice Verbrugge

This thesis researches the most recent paradigm in information retrieval, which applies the neural language representation model BERT to rank relevant passages from a corpus. The research focuses on a re-ranking scheme in which BM25 pre-ranks the corpus and BERT re-ranks the result, exploring better fine-tuning methodology for a pre-trained BERT. This goal is pursued in two parts: in the first, all methods rely on binary relevance labels, while the second applies methods that rely on multiple relevance labels instead. Part one researches training data enhancement and the application of inductive transfer learning methods. Part two researches single-class multi-label methods, multi-class multi-label methods, and label-based regression. In both parts, the methods were evaluated on the fully annotated Cranfield dataset.
This thesis presents various methods to enrich the fine-tuning data at different levels of the BM25-BERT ranking pipeline, and demonstrates that applying inductive transfer learning with the Next Sentence Prediction task improves over the baseline. It also demonstrates that a regression method achieves above-baseline performance, indicating the superiority of this method over rule-based filtering of classifier results.
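The BM25-BERT re-ranking pipeline that the thesis builds on can be sketched as follows. This is a minimal, self-contained illustration with a from-scratch BM25 and an injected stand-in for the BERT cross-encoder score; it is not the thesis's actual implementation.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each (tokenized) document against the query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()                      # document frequency per term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def rerank(query, docs, bert_score, depth=100):
    """Pre-rank with BM25, then re-rank the top `depth` passages with a
    BERT-style cross-encoder score (here an injected callable)."""
    scores = bm25_scores(query, docs)
    top = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:depth]
    return sorted(top, key=lambda i: bert_score(query, docs[i]), reverse=True)
```

In the thesis's setting, `bert_score` would be a fine-tuned BERT scoring the query-passage pair; here any callable can be plugged in.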

[download pdf]

Casper van Aarle graduates on Federated Regression Analysis

Federated Regression Analysis on Personal Data Stores: Improving the Personal Health Train

by Casper van Aarle

Due to regulations and increased privacy awareness, patients may be reluctant to share data with any institution. The Personal Health Train is an initiative to connect different data institutions for data analysis while each institution maintains full authority over its data. The Personal Health Train may connect not only larger institutions but also smaller, possibly on-device personal data stores, where data is safely and separately stored.
This thesis explores possible solutions in the literature that guarantee data-privacy and model-privacy, and it shows the practical feasibility when learning over a large number of personal data stores. We specifically regard the generation of linear regression and logistic regression models over personal data stores. We experiment with different design choices to optimise the convergence of our training architecture.
We discuss the PrivFL protocol, which takes into account both data-privacy and model-privacy when learning a regression model and is applicable to personal data stores. We further propose a standardisation protocol, Secure Scaling Operation, that guarantees data-privacy for patients; our experiments show that it improves convergence more than an adaptive gradient does.
We implement an architecture, FedLinReg-v2 and FedLogReg-v2, that can learn over personal data stores while preserving user privacy. While, in theory, convergence is not guaranteed, training over various datasets shows a difference of only 0 to 0.33% in loss on both training and test sets compared to centrally optimised models, and no parameter optimisation was necessary. The coefficients, however, may deviate from those of centrally trained models. We were able to train regression models over 150 personal data stores in minutes while preserving data-privacy. An even higher level of data-privacy causes a strong linear increase in computation time with the number of personal data stores included.
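The core idea of learning a regression model over personal data stores, where only model updates leave a store, can be sketched as follows. This is a plain federated gradient-averaging sketch for one-dimensional linear regression; it omits the encryption layer of PrivFL and is not the thesis's FedLinReg-v2 protocol.

```python
def local_gradient(w, b, xs, ys):
    """Gradient of the mean squared error on one store's private data.
    Only this gradient, never the raw (xs, ys), leaves the store."""
    n = len(xs)
    gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb

def federated_fit(stores, lr=0.05, rounds=500):
    """Server loop: collect per-store gradients, average them, and update
    the shared model; raw data never leaves the stores."""
    w, b = 0.0, 0.0
    for _ in range(rounds):
        grads = [local_gradient(w, b, xs, ys) for xs, ys in stores]
        gw = sum(g[0] for g in grads) / len(grads)
        gb = sum(g[1] for g in grads) / len(grads)
        w -= lr * gw
        b -= lr * gb
    return w, b
```

With data drawn from y = 2x + 1 split over two stores, this recovers slope and intercept close to the centrally trained solution, mirroring the small loss differences reported above.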

[download pdf]

Vacancy: PhD Candidate for Fairness and Non-discrimination in Machine Learning for Retrieval and Recommendation

Information retrieval and recommender systems based on machine learning can be used to make decisions about people. Government agencies can use such systems to detect welfare fraud, insurers can use them to predict risks and to set insurance premiums, and companies can use them to select the best people from a list of job applicants. Such systems can lead to more efficiency, and could improve our society in many ways.

However, such AI-driven decision-making also brings risks. This project focuses on the risk that such AI systems lead to illegal discrimination, for instance harming people of a certain ethnicity, or to other types of unfairness. A different type of unfairness could concern, for instance, a system that reinforces financial inequality in society. Recent machine learning work on measures of fairness has resulted in several competing approaches for measuring fairness. There is no consensus on the best way to measure fairness, and the measures often depend on the type of machine learning that is applied. Based on the application of existing measures to real-world data, we suspect that many proposed measures are not that helpful in practice.

In this project, you will study measures of fairness, answering questions such as the following. To what extent can legal non-discrimination norms be translated into fairness measures for machine learning? Can we measure fairness independently of the machine learning approach? Can we show which machine learning methods are the most appropriate to achieve non-discrimination and fairness? The project concerns primarily machine learning for information retrieval and recommendation, but is interdisciplinary, as it is also informed by legal norms. The project will be supervised by Professor Hiemstra, professor of data science and federated search, and Professor Zuiderveen Borgesius, professor of ICT and law.
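As an illustration of what a fairness measure looks like in code, here is one widely used measure, the demographic parity difference: the gap in positive-decision rates between groups. It is a generic textbook example, not necessarily one of the measures this project will study.

```python
def demographic_parity_diff(decisions, groups):
    """Gap in positive-decision rates between two groups.
    decisions: list of 0/1 outcomes; groups: parallel list of group labels.
    A value of 0 means both groups receive positive decisions equally often."""
    rates = {}
    for g in set(groups):
        outcomes = [d for d, gg in zip(decisions, groups) if gg == g]
        rates[g] = sum(outcomes) / len(outcomes)
    low, high = sorted(rates.values())
    return high - low
```

Even this simple measure already raises the project's questions: it ignores legitimate differences between groups, and whether it matches legal non-discrimination norms is exactly what needs study.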

Profile

  • You hold a completed Master’s Degree or Research Master’s degree in computer science, data science, machine learning, artificial intelligence, or a related discipline.
  • You have good programming skills.
  • You have good command of spoken and written English.
  • We encourage you to apply even if you think you do not meet all the requirements.

More information at: https://www.ru.nl/english/working-at/vacature/details-vacature/?recid=1171943

2nd Dutch meeting on Clinical NLP

Now that electronic health records are commonly used, the availability of clinical texts is growing. This workshop discusses the automatic analysis of textual clinical health data to advance medical research and improve healthcare-related services. We especially encourage presentations discussing possibilities to share clinical texts, models, and tools for clinical natural language processing (NLP). In practice, privacy and legal regulations prevent the free sharing and combination of electronic health records themselves, but de-identified texts, NLP tools, and intermediate results may be shared. We hope that sharing will promote cooperation within the Dutch-speaking countries, as well as advance research in Clinical NLP in those countries. Relevant topics include, but are not limited to:

  • Data sets with clinical texts
  • Open source tools for Clinical NLP
  • Information extraction from clinical text
  • Information retrieval for clinical text
  • Adapting standard NLP tools for clinical text
  • De-identification and ways to preserve privacy in clinical data
  • Using medical terminologies and ontologies
  • Annotation schemes and annotation methodology for clinical data
  • Evaluation methods for the clinical domain
  • Text-based clinical prediction models
  • Speech recognition for clinical text

We solicit short presentations (15 to 20 minutes) from researchers covering recent work, including work in progress and work that was recently published in journals and/or at conferences in the field, or made available via data and software sharing platforms like Zenodo or GitHub. Please email the title and abstract of your presentation before 12 October 2021.

More information at: https://clinical-nlp.cs.ru.nl

BERT for Target Apps Selection

Analyzing the Diversity and Performance of BERT in Unified Mobile Search

by Negin Ghasemi, Mohammad Aliannejadi, and Djoerd Hiemstra

A unified mobile search framework aims to identify the mobile apps that can satisfy a user’s information need and route the user’s query to them. Previous work has shown that resource descriptions for mobile apps are sparse, as they rely on the app’s previous queries. This problem puts certain apps in dominance and leaves the resource-scarce apps out of the top ranks. In this case, we need a ranker that goes beyond simple lexical matching. Therefore, our goal is to study the extent of a BERT-based ranker’s ability to improve the quality and diversity of app selection. To this end, we compare the results of the BERT-based ranker with other information retrieval models, focusing on the analysis of the diversification of the selected apps. Our analysis shows that the BERT-based ranker selects more diverse apps while improving the quality of the baseline results: it selects relevant apps such as Facebook and Contacts for more personal queries, and decreases the bias towards dominant resources such as the Google Search app.

[More info]

SIKS course Advances in Information Retrieval

The draft schedule for the new SIKS course “Advances in Information Retrieval” is out, featuring the best IR research in the Netherlands, if I may say so. More information at: http://www.siks.nl/IR-2021.php

Monday 4 October 2021

9:30h.

Welcome / coffee


10:00h – 11:45h.

Lecture 1
(2 x 45 minutes)

Mohammad Aliannejadi and Antonis Krasakis (University of Amsterdam)
Conversational Search

12:00 – 12:45h.

Lecture 2
(45 minutes)

Rolf Jagerman (Google) and Harrie Oosterhuis (Radboud University)
Unbiased Learning to Rank, Part 1

12:45 – 13:45h.

Lunch


13:45 – 16:15h.

Lecture 2
(3 x 40 minutes)

Harrie Oosterhuis (Radboud University) and Rolf Jagerman (Google)
Unbiased Learning to Rank, Part 2

16:30 – 18:00h.

Lecture 3
(2 x 40 minutes)

Christine Bauer (Utrecht University)
Multi-method evaluation

18:30h.

Dinner


Tuesday 5 October 2021

8:30h. – 10:15h.

Lecture 4
(2 x 45 minutes)

Faegheh Hasibi (Radboud University)
Knowledge graphs & semantic search

10:30 – 12:15h.

Lecture 5
(2 x 45 minutes)

Jaap Kamps (University of Amsterdam)
Neural Information Retrieval

12:15 – 13:30h.

Lunch


13:30 – 15:15h.

Lecture 6
(2 x 45 minutes)

David Maxwell (Delft University of Technology)
Interactive Information Retrieval

15:30h.

Closing


Fien Ockers graduates on medication annotation using weak supervision

Medication annotation in medical reports using weak supervision

by Fien Ockers

By detecting textual references to medication in the daily reports written in different healthcare institutions, the resulting medication information can be used for research purposes such as detecting commonly occurring adverse events or executing a comparative study into the effectiveness of different treatments. In this project, four different models, a CRF model and three BERT-based models, are used to solve this medication detection task. They are trained not only on a smaller, manually annotated train set but also on two extended train sets that were created using two weak supervision systems, Snorkel and Skweak. The CRF model and RobBERT are found to be the best performing models, and performance is structurally higher for models trained on the manually annotated train set than for those trained on the extended train sets. However, model performance for the extended train sets does not fall far behind, showing the potential of using a weak supervision system. Future research could focus either on further training a BERT-based tokenizer and model on the medical domain, or on expanding the labelling functions used in the weak supervision systems to improve recall or to generalize to other medication-related entities such as dosages or modes of administration.
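The labelling functions at the heart of weak supervision systems like Snorkel and Skweak can be illustrated with a toy example: several noisy heuristics each vote on a token's label, or abstain, and the votes are aggregated into a single weak label. The functions, lexicon, and majority-vote aggregation below are hypothetical simplifications, not the thesis's actual labelling functions or the Snorkel/Skweak APIs (which learn an aggregation model rather than taking a simple majority).

```python
import re

# Each labelling function votes MEDICATION, OTHER, or abstains.
ABSTAIN, OTHER, MEDICATION = -1, 0, 1

DRUG_LEXICON = {"paracetamol", "ibuprofen", "metformin"}  # toy lexicon

def lf_lexicon(token):
    """High-precision vote: token appears in a known drug lexicon."""
    return MEDICATION if token.lower() in DRUG_LEXICON else ABSTAIN

def lf_suffix(token):
    """Noisy heuristic: many drug names end in -ol, -ine, or -in."""
    return MEDICATION if re.search(r"(ol|ine|in)$", token.lower()) else ABSTAIN

def lf_numeric(token):
    """Pure numbers are dosage amounts or dates, not medication names."""
    return OTHER if token.isdigit() else ABSTAIN

def majority_vote(token, lfs):
    """Aggregate the non-abstaining votes; ties are broken arbitrarily."""
    votes = [v for v in (lf(token) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)
```

Tokens labelled this way over a large unannotated corpus form the "extended train sets" on which the CRF and BERT-based models can then be trained.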

Open ACM membership cancellation letter

Earlier this year I cancelled my ACM membership. Here’s why:

Dear Vicki Hanson, dear ACM renewal,

A few months ago, my ACM membership expired. I am a senior member of ACM and have been a member since the early 2000s. For several years now, I have had doubts about renewing my membership, because the ACM keeps its publications closed access in its Digital Library (or “clopen” according to Moshe Vardi’s infamous 2009 CACM article). For the past 25 years, I have seen SIGIR stagnate (even though web search became *the* killer application on the web!), while related fields like Machine Learning (ML) and Natural Language Processing (NLP) thrive – communities that embraced open access early and fully. When I started my career, the ML and NLP communities were similar in size to, or smaller than, the (SIG-)IR community, but now their communities are much bigger and more diverse (including many researchers from low-income countries that cannot afford subscriptions), and their papers are cited and downloaded more. Of course, I cannot make a scientific claim about the different developments of ML and NLP vs. (SIG-)IR, but the positive effect of open access on citations and downloads is well researched and well documented in many fields, including computer science.

Dear Vicki, your email below did not have the intended effect of persuading me to renew. On the contrary, I felt that the “commitment to open the DL and make ACM’s publications freely available to all” you mentioned was misleading, given ACM’s history and given the fact that ACM publicly denounced open access by signing the letter of the AAP last year.

I would like to change my membership to a SIGIR-only membership, because I really value the SIGIR officers’ commitment to the field. They do their work as volunteers, even taking vacation time from their jobs to work on making SIGIR and the SIGIR conferences a success. For years, the officers, too, have been misled by the ACM, for instance by the openTOC (open table of contents) policy that puts the burden of open access publishing on the conference organizers and volunteers, who change every year. This year, confused by ACM’s misleading statements, the SIGIR executive committee claimed at the SIGIR business meeting that “all ACM SIGIR publications are permanent open access on the DL”. Needless to say, this is not the case, and it will not be the case for another 5 years, if I read your email right.

I followed the ACM discussions on open access quite closely. I believe ACM Open, ACM’s transition to an “author-subscription” fee, is problematic in at least two ways. First, it is risky, because a small number of institutions, the tier 1 institutions, can blow up the deal. Institutions like UC Berkeley and MIT know that they are doing most of the work, and this business model gives them a very strong bargaining position. Second, and more importantly, the $700 to $1,700 article-processing fees for authors of non-participating institutions will hurt the researchers of institutions in low-income countries and the global south. This model (like the current closed model) effectively excludes researchers from Africa, Central and South America, South Asia, Eastern Europe, the Caribbean, etc. I know that ACM does not get many subscription fees from institutions in these countries now. Therefore, ACM does not have many members from those countries. That is one of the problems we need to fix.

ACM needs a volunteer-led, diamond open access digital library, where the author does not pay, the reader does not pay, and the entire mechanism is self-funded, running on the volunteer work by authors, reviewers, editors, technicians, admins, and on micro-donations by friend organizations such as universities and research centers. Such a DL fully aligns with ACM’s member-driven and volunteer-led activities. Sure, this means that ACM will have less income, but our colleagues at related professional societies and journals, such as the ACL Anthology and the Journal of Machine Learning Research, show that this is a viable business model for scientific publishing that in the end benefits the community, and the society’s members, the most. I will re-apply as a full ACM member once that happens.

Yours sincerely,
Djoerd Hiemstra
Radboud University

Ismail Güçlü graduates on programmatically generating annotations for clinical data

Programmatically generating annotations for de-identification of clinical data

by Ismail Güçlü

Clinical records may contain protected health information (PHI), which is privacy-sensitive information. It is important to annotate and replace PHI in unstructured medical records before the data can be shared for other research purposes. Machine learning models are quick to implement and can achieve competitive results (micro-averaged F1-scores: 0.88 on a Dutch radiology dataset and 0.87 on the English i2b2 dataset). However, to develop machine learning models, we need training data. In this project, we applied weak supervision to annotate and collect training data for de-identification of medical records. It is essential to automate this process, as manual annotation is a laborious and repetitive task. We used the two human-annotated datasets, with the gold annotations ‘removed’, to weakly tag PHI instances in medical records, and unified the output labels using two different aggregation models: aggregation at the token level (Snorkel) and sequential labelling (Skweak). The output is then used to train a discriminative end model, with which we achieve competitive results on the Dutch dataset (micro-averaged F1-score: 0.76), whereas performance on the English dataset is sub-optimal (micro-averaged F1-score: 0.49). The results indicate that on structured PHI tags we approach human-annotated results, but more complicated entities still need more attention.

[more information]