Dutch-Belgian Information Retrieval Workshop 2021½

The program for DIR 2021½ is out. DIR 2021½ will run on four consecutive Fridays as online Search Engine Amsterdam meetups. Register now!

Session 1, 4 February 2022

  • Keynote 1 by Maria Maistro (Uni. of Copenhagen): How can we measure reproducibility of IR experiments?

Session 2, 11 February 2022

  • Ali Vardasbi (University of Amsterdam): Mixture-Based Correction for Position and Trust Bias in Counterfactual Learning to Rank
  • Sepideh Mesbah (Randstad Groep): Using RobBERT and eXtreme Multi-Label Classification to Extract Implicit and Explicit Skills From Dutch Job Descriptions
  • Hideaki Joko (Radboud University): Conversational Entity Linking: Problem Definition and Datasets
  • Liesbeth Allein (KU Leuven): Time-aware evidence ranking for fact-checking
  • Mozhdeh Ariannezhad (University of Amsterdam): Understanding Multi-channel Customer Behavior in Retail

Session 3, 18 February 2022

  • Garett Allen (TU Delft): Supercalifragilisticexpialidocious: Why Using the “Right” Readability Formula in Children’s Web Search Matters
  • Carsten Schnober (WizeNoze): Neural Information Retrieval for Educational Resources
  • Olivier Jeunen (Amazon): Embarrassingly shallow auto-encoders for dynamic collaborative filtering
  • Zhe Roger (TU Delft): Leave No User Behind: Towards Improving the Utility of Recommender Systems for Non-mainstream Users
  • Harrie Oosterhuis (Radboud University): Computationally Efficient Optimization of Plackett-Luce Ranking Models for Relevance and Fairness

Session 4, 25 February 2022

  • Keynote 2 by Gabriella Kazai (Microsoft Research): IR Evaluation – An Industry Perspective

Web Analytics & Privacy workshop

On Thursday 23 December, the NoGA team organizes the first Web Analytics and Privacy workshop, with a demonstration of the open source analytics system Matomo in the morning and two excellent guest speakers in the afternoon: Frederik Zuiderveen Borgesius and Güneş Acar.

Frederik Zuiderveen Borgesius will talk about behavioural targeting, privacy, and the law, discussing the troubled relationship between contemporary advertising technology (adtech) systems, in particular systems of real-time bidding (RTB, also known as programmatic advertising) underpinning much behavioural targeting on the web and through mobile applications.

Güneş Acar will talk about browser fingerprinting and personal data exfiltration on the web, discussing the results of a study into data exfiltration by third-party scripts directly embedded on web pages. Specifically, Güneş will discuss three attacks: misuse of browsers’ internal login managers, social data exfiltration, and whole-DOM exfiltration.

More information at: https://nogadata.nl/wap2021.html

Maurice Verbrugge graduates on the BERT Ranking Paradigm

The BERT Ranking Paradigm: Training Strategies Evaluated

by Maurice Verbrugge

This thesis researches the most recent paradigm in information retrieval, which applies the neural language representation model BERT to rank relevant passages out of a corpus. The research focuses on a re-ranker scheme that uses BM25 to pre-rank the corpus followed by BERT-based ranking, exploring better fine-tuning methodology for a pre-trained BERT. This goal is pursued in two parts: in the first, all methods rely on binary relevance labels, while the second applies methods that rely on multiple relevance labels instead. Part one researches training data enhancement and inductive transfer learning. Part two researches the application of single class multi label methods, multi class multi label methods and label-based regression. In all parts, the methods were evaluated on the fully annotated Cranfield dataset.
This thesis demonstrates that applying inductive transfer learning with the Next Sentence Prediction task improves the baseline by presenting various methods to enrich the fine-tuning data for different levels of the BM25-BERT ranking pipeline. Also, this thesis demonstrates that application of a regression method results in above baseline performance. This indicates the superiority of this method over rule-based filtering of classifier results.
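
The two-stage scheme described in the abstract (BM25 pre-ranking followed by re-ranking of the top candidates) can be sketched as below. This is a minimal illustration and not the thesis code: the toy documents, the `top_k` cut-off and the BM25 parameters are assumptions, and `overlap_scorer` is a hypothetical placeholder standing in for a fine-tuned BERT relevance model.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rerank(query, docs, rerank_fn, top_k=2):
    """Pre-rank with BM25, then reorder only the top_k candidates with rerank_fn."""
    first_stage = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: first_stage[i], reverse=True)[:top_k]
    return sorted(candidates, key=lambda i: rerank_fn(query, docs[i]), reverse=True)


def overlap_scorer(query, doc):
    # Hypothetical placeholder for a fine-tuned BERT relevance model.
    return len(set(query) & set(doc)) / len(doc)


docs = [
    "bert ranks passages with a neural language model".split(),
    "bm25 is a lexical ranking function".split(),
    "cooking pasta requires boiling water".split(),
]
query = "neural ranking model".split()
ranking = rerank(query, docs, overlap_scorer, top_k=2)
print(ranking)  # indices of the re-ranked candidate documents
```

In a real pipeline, `rerank_fn` would call the neural model on each query-passage pair, which is why the expensive second stage only sees the BM25 top candidates.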

[download pdf]

Casper van Aarle graduates on Federated Regression Analysis

Federated Regression Analysis on Personal Data Stores: Improving the Personal Health Train

by Casper van Aarle

Due to regulations and increased privacy awareness, patients may be reluctant to share data with any institution. The Personal Health Train is an initiative to connect different data institutions for data analysis while maintaining full authority over their data. The Personal Health Train may not only connect larger institutions but also connect smaller, possibly on-device personal data stores, where data is safely and separately stored.
This thesis explores possible solutions in the literature that guarantee data-privacy and model-privacy, and it shows the practical feasibility when learning over a large number of personal data stores. We specifically regard the generation of linear regression and logistic regression models over personal data stores. We experiment with different design choices to optimise the convergence of our training architecture.
We discuss the PrivFL protocol, which takes into account both data-privacy and model-privacy when learning a regression model and is applicable to personal data stores. We further propose a standardisation protocol, Secure Scaling Operation, that guarantees data-privacy for patients. Experiments show that it improves convergence more than an adaptive gradient does.
We implement an architecture that can learn over personal data stores and which preserves user privacy in FedLinReg-v2 and FedLogReg-v2. While, in theory, no convergence is guaranteed, training over various datasets shows a difference of 0 to 0.33% in loss over both training and test sets compared to models that are centrally optimised. No parameter optimisation was necessary. The coefficients however may deviate from centrally trained models. We were able to train regression models while preserving data-privacy over 150 personal data stores in minutes. An even higher level of data-privacy will cause a strong linear increase in computation-time in relation to the number of personal data stores included.
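
To illustrate why standardisation over many data stores needs a protocol at all, here is a minimal sketch of computing a global mean and standard deviation from per-store summaries. It is not the Secure Scaling Operation from the thesis: the function names and the toy data are assumptions, and the summaries are shared in the clear, which is exactly what a privacy-preserving protocol would have to avoid (a store holding a single record reveals its value).

```python
import math


def local_summary(values):
    """Computed inside one personal data store; raw values stay local."""
    return len(values), sum(values), sum(v * v for v in values)


def global_mean_std(summaries):
    """Combine per-store (count, sum, sum of squares) into global statistics."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    # Guard against tiny negative values caused by floating-point rounding.
    variance = max(total_sq / n - mean * mean, 0.0)
    return mean, math.sqrt(variance)


def standardise(values, mean, std):
    """Applied locally by each store once the global statistics are known."""
    return [(v - mean) / std for v in values]


# Toy example: three stores holding one feature each (assumed data).
stores = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, std = global_mean_std([local_summary(s) for s in stores])
scaled = [standardise(s, mean, std) for s in stores]
```

With features on a common scale, gradient-based training of a regression model tends to converge faster, which is the effect the abstract reports for its secure variant.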

[download pdf]

Vacancy: PhD Candidate for Fairness and Non-discrimination in Machine Learning for Retrieval and Recommendation

Information retrieval and recommender systems based on machine learning can be used to make decisions about people. Government agencies can use such systems to detect welfare fraud, insurers can use them to predict risks and to set insurance premiums, and companies can use them to select the best people from a list of job applicants. Such systems can lead to more efficiency, and could improve our society in many ways. However, such AI-driven decision-making also brings risks. This project focuses on the risk that such AI systems lead to illegal discrimination, for instance harming people of a certain ethnicity, or other types of unfairness. A different type of unfairness could concern, for instance, a system that reinforces financial inequality in society.
Recent machine learning work on measures of fairness has resulted in several competing approaches for measuring fairness. There is no consensus on what is the best way to measure fairness and the measures often depend on the type of machine learning that is applied. Based on the application of existing measures on real-world data, we suspect that many proposed measures are not that helpful in practice. In this project, you will study measures of fairness, answering questions such as the following. To what extent can legal non-discrimination norms be translated into fairness measures for machine learning? Can we measure fairness independently of the machine learning approach? Can we show which machine learning methods are the most appropriate to achieve non-discrimination and fairness?
The project concerns primarily machine learning for information retrieval and recommendation, but is interdisciplinary, as it is also informed by legal norms. The project will be supervised by Professor Hiemstra, professor of data science and federated search, and Professor Zuiderveen Borgesius, professor of ICT and law.

Profile

  • You hold a completed Master’s Degree or Research Master’s degree in computer science, data science, machine learning, artificial intelligence, or a related discipline.
  • You have good programming skills.
  • You have good command of spoken and written English.
  • We encourage you to apply even if you think you do not meet all the requirements.

More information at: https://www.ru.nl/english/working-at/vacature/details-vacature/?recid=1171943

2nd Dutch meeting on Clinical NLP

Now that electronic health records are commonly used, the availability of clinical texts is growing. This workshop discusses the automatic analysis of textual clinical health data to advance medical research and improve healthcare related services. We especially encourage presentations discussing possibilities to share clinical texts, models and tools for clinical natural language processing (NLP). In practice, privacy and legal regulations prevent the free sharing and combination of electronic health records themselves, but de-identified texts, NLP tools and intermediate results may be shared. We hope that sharing will promote cooperation within the Dutch-speaking countries, as well as advance the research in Clinical NLP in those countries. Relevant topics include, but are not limited to:

  • Data sets with clinical texts
  • Open source tools for Clinical NLP
  • Information extraction from clinical text
  • Information retrieval for clinical text
  • Adapting standard NLP tools for clinical text
  • De-identification and ways to preserve privacy in clinical data
  • Using medical terminologies and ontologies
  • Annotation schemes and annotation methodology for clinical data
  • Evaluation methods for the clinical domain
  • Text-based clinical prediction models
  • Speech recognition for clinical text

We solicit short presentations (15 to 20 minutes) from researchers covering recent work, including work in progress and work that was recently published in journals and/or at conferences in the field or made available via data and software sharing platforms like Zenodo or GitHub. Please email the title and abstract of your presentation before 12 October 2021.

More information at: https://clinical-nlp.cs.ru.nl

BERT for Target Apps Selection

Analyzing the Diversity and Performance of BERT in Unified Mobile Search

by Negin Ghasemi, Mohammad Aliannejadi, and Djoerd Hiemstra

A unified mobile search framework aims to identify the mobile apps that can satisfy a user’s information need and route the user’s query to them. Previous work has shown that resource descriptions for mobile apps are sparse as they rely on the app’s previous queries. This problem makes certain apps dominant and leaves the resource-scarce apps out of the top ranks. In this case, we need a ranker that goes beyond simple lexical matching. Therefore, our goal is to study the extent of a BERT-based ranker’s ability to improve the quality and diversity of app selection. To this end, we compare the results of the BERT-based ranker with other information retrieval models, focusing on the analysis of the diversity of the selected apps. Our analysis shows that the BERT-based ranker selects more diverse apps while improving the quality of baseline results by selecting the relevant apps such as Facebook and Contacts for more personal queries and decreasing the bias towards the dominant resources such as the Google Search app.

[More info]

SIKS course Advances in Information Retrieval

The draft schedule for the new SIKS course “Advances in Information Retrieval” is out, featuring the best IR research in the Netherlands, if I may say so. More information at: http://www.siks.nl/IR-2021.php

Monday 4 October 2021

9:30h.

Welcome / coffee


10:00h – 11:45h.

Lecture 1
(2 x 45 minutes)

Mohammad Aliannejadi and Antonis Krasakis (University of Amsterdam)
Conversational Search

12:00 – 12:45h.

Lecture 2
(45 minutes)

Rolf Jagerman (Google) and Harrie Oosterhuis (Radboud University)
Unbiased Learning to Rank, Part 1

12:45 – 13:45h.

Lunch


13:45 – 16:15h.

Lecture 2
(3 x 40 minutes)

Harrie Oosterhuis (Radboud University) and Rolf Jagerman (Google)
Unbiased Learning to Rank, Part 2

16:30 – 18:00h.

Lecture 3
(2 x 40 minutes)

Christine Bauer (Utrecht University)
Multi-method evaluation

18:30h.

Dinner


Tuesday 5 October 2021

8:30h. – 10:15h.

Lecture 4
(2 x 45 minutes)

Faegheh Hasibi (Radboud University)
Knowledge graphs & semantic search

10:30 – 12:15h.

Lecture 5
(2 x 45 minutes)

Jaap Kamps (University of Amsterdam)
Neural Information Retrieval

12:15 – 13:30h.

Lunch


13:30 – 15:15h.

Lecture 6
(2 x 45 minutes)

David Maxwell (Delft University of Technology)
Interactive Information Retrieval

15:30h.

Closing


Fien Ockers graduates on medication annotation using weak supervision

Medication annotation in medical reports using weak supervision

by Fien Ockers

By detecting textual references to medication in the daily reports written in different healthcare institutions, the resulting medication information can be used for research purposes like detecting commonly occurring adverse events or executing a comparative study into the effectiveness of different treatments. In this project, four different models, a CRF model and three BERT-based models, are used to solve this medication detection task. They are not only trained on a smaller manually annotated train set but also on two extended train sets that are created using two weak supervision systems, Snorkel and Skweak. It is found that the CRF model and RobBERT are the best performing models, and that performance is consistently higher for models trained on the manually annotated train set than for models trained on the extended train sets. However, model performance for the extended train sets does not fall far behind, showing the potential of using a weak supervision system. Future research could either focus on training a BERT-based tokenizer and model further on the medical domain or focus on expanding the labelling functions used in the weak supervision systems to improve recall or generalize to other medication-related entities such as dosages or modes of administration.
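
The idea of labelling functions mentioned in the abstract can be sketched in plain Python as follows. This is an illustration only and does not use the Snorkel or Skweak APIs: the lexicon, the suffix list and the example sentence are hypothetical, and the simple majority vote stands in for the more elaborate aggregation models those systems provide.

```python
from collections import Counter

MED, NOT_MED, ABSTAIN = 1, 0, -1

# Hypothetical lexicon and suffix list, for illustration only.
KNOWN_MEDICATION = {"paracetamol", "ibuprofen", "metformin"}
DRUG_SUFFIXES = ("cillin", "pril", "statin")


def lf_lexicon(token):
    """Vote MED when the token appears in a medication lexicon."""
    return MED if token.lower() in KNOWN_MEDICATION else ABSTAIN


def lf_suffix(token):
    """Vote MED when the token ends in a typical drug-name suffix."""
    return MED if token.lower().endswith(DRUG_SUFFIXES) else ABSTAIN


def lf_number(token):
    """Vote NOT_MED for bare numbers such as dosages."""
    return NOT_MED if token.isdigit() else ABSTAIN


LABELLING_FUNCTIONS = [lf_lexicon, lf_suffix, lf_number]


def weak_label(token):
    """Majority vote over non-abstaining labelling functions; ties abstain."""
    votes = [lf(token) for lf in LABELLING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


tokens = "Patient received 500 mg paracetamol and amoxicillin".split()
labels = [weak_label(t) for t in tokens]
```

Tokens on which every function abstains stay unlabelled, which is one reason why adding labelling functions, as suggested for future research, can improve recall.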