The BERT Ranking Paradigm: Training Strategies Evaluated
by Maurice Verbrugge
This thesis researches the most recent paradigm in information retrieval, which applies the neural language representation model BERT to rank relevant passages out of a corpus. The research focuses on a re-ranker scheme that uses BM25 to pre-rank the corpus followed by BERT-based ranking, exploring better fine-tuning methodology for a pre-trained BERT. This goal is pursued in two parts, in the first, all methods rely on binary relevance labels, while the second part applies methods that rely on multiple relevance labels instead. Part one researches methods that apply training data enhancement and the application of inductive transfer learning methods. Part two researches the application of single class multi label methods, multi class multi label methods and label-based regression. In all parts, the methods were evaluated on the fully annotated Cranfield dataset.
This thesis demonstrates that applying inductive transfer learning with the Next Sentence Prediction task improves the baseline by presenting various methods to enrich the fine-tuning data for different levels of the BM25-BERT ranking pipeline. Also, this thesis demonstrates that application of a regression method results in above baseline performance. This indicates the superiority of this method over rule-based filtering of classifier results.
Federated Regression Analysis on Personal Data Stores: Improving the Personal Health Train
by Casper van Aarle
Due to regulations and increased privacy awareness, patients may be reticent in sharing data with any institution. The Personal Health Train is an initiative to connect different data institutions for data analysis while maintaining full authority over their data. The Personal Health Train may not only connect larger institutions but also connect smaller, possibly on-device personal data stores, where data is safely and separately stored.
This thesis explores possible solutions in the literature that guarantee data-privacy and model-privacy, and it shows the practical feasibility when learning over a large number of personal data stores. We specifically regard the generation of linear regression and logistic regression models over personal data stores. We experiment with different design choices to optimise the convergence of our training architecture.
We discuss the PrivFL protocol* which takes into account both data-privacy and model-privacy when learning a regression model and is applicable to personal data stores. We further propose a standardisation protocol, Secure Scaling Operation, that guarantees data-privacy for patients and experiments
concluded that it improves convergence better than an adaptive gradient.
We implement an architecture that can learn over personal data stores and which preserves user privacy in FedLinReg-v2 and FedLogReg-v2. While, in theory, no convergence is guaranteed, training over various datasets shows a difference of 0 to 0.33% in loss differences over both training and test sets compared to models that are centrally optimised. No parameter optimisation was necessary. The coefficients however may deviate from centrally trained models. We were able to train regression models while preserving data-privacy over 150 personal data stores in minutes. An even higher level of data-privacy will cause a strong linear increase in computation-time in relation to the amount of personal data stores included.
Information retrieval and recommender systems based on machine learning can be used to make decisions about people. Government agencies can use such systems to detect welfare fraud, insurers can use them to predict risks and to set insurance premiums, and companies can use them to select the best people from a list job applicants. Such systems can lead to more efficiency, and could improve our society in many ways. However, such AI-driven decision-making also brings risks. This project focuses on the risk that such AI systems lead to illegal discrimination, for instance harming people of a certain ethnicity, or other types of unfairness. A different type of unfairness could concern, for instance, a system that reinforces financial inequality in society. Recent machine learning work on measures of fairness has resulted in several competing approaches for measuring fairness. There is no consensus on what is the best way to measure fairness and the measures often depend on the type of machine learning that is applied. Based on the application of existing measures on real-world data, we suspect that many proposed measures are not that helpful in practice. In this project, you will study measures of fairness, answering questions such as the following. To what extent can legal non-discrimination norms be translated into fairness measures for machine learning? Can we measure fairness independently of the machine learning approach? Can we show which machine learning methods are the most appropriate to achieve non-discrimination and fairness? The project concerns primarily machine learning for information retrieval and recommendation, but is interdisciplinary, as it is also informed by legal norms. The project will be supervised by Professor Hiemstra, professor of data science and federated search, and Professor Zuiderveen Borgesius, professor of ICT and law.
- You hold a completed Master’s Degree or Research Master’s degree in computer science, data science, machine learning, artificial intelligence, or a related discipline.
- You have good programming skills.
- You have good command of spoken and written English.
- We encourage you to apply even if you think you do not meet all the requirements.
More information at: https://www.ru.nl/english/working-at/vacature/details-vacature/?recid=1171943
Medication annotation in medical reports using weak
by Fien Ockers
By detecting textual references to medication in the daily reports written in different healthcare institutions, the resulting medication information can be used for research purposes like detecting common occurring adverse events or executing a comparative study into the effectiveness of different treatments. In this project, 4 different models, including a CRF model and three BERT-based models, are used to solve this medication detection task. They are not only trained on a smaller manually annotated train set but also on two extended train sets that are created using two weak supervision systems, Snorkel and Skweak. It is found that the CRF model and RobBERT are the best performing models, and that performance is structurally higher for models trained on the manually annotated train set than the extended train sets. However, model performance for the extended train sets does not fall behind far, showing the potential of using a weak supervision system. Future research could either focus on training a BERT-based tokenizer and model further on the medical domain or focus on expanding the labelling functions used in the weak supervision systems to improve recall or generalize to other medication-related entities such as dosages or modes of administration.
Programmatically generating annotations for de-identification
of clinical data
by Ismail Güçlü
Clinical records may contain protected health information (PHI) which are privacy sensitive information. It is important to annotate and replace PHI in unstructured medical records, before being able to share the data for other research purposes. Machine learning models are quick to implement and can achieve competitive results (micro-averaged F1-scores Dutch radiology dataset: 0.88 and English i2b2 dataset: 0.87). However, to develop machine learning models, we need training data. In this project, we applied weak supervision to annotate and collect training data for de-identification of medical records. It is essential to automate this process as manual annotation is a laborious and repetitive task. We used the two human annotated datasets, where we ‘removed’ the gold annotations to weakly tag PHI instances in medical records, where we unified the output labels using two different aggregation models: aggregation at the token level (Snorkel) and sequential labelling (Skweak). The output is then used to train a discriminative end model where we achieve competitive results on the Dutch dataset (micro-averaged F1 score: 0.76) whereas performance on the English dataset is sub-optimal (micro-averaged F1-score: 0.49). The results indicate that on structured PHI tags we approach human annotated results, but more complicated entities still need more attention.
Optimizing Ranking Systems Online as Bandits
by Chang Li
People use interactive systems, such as search engines, as the main tool to obtain information. To satisfy the information needs, such systems usually provide a list of items that are selected out of a large candidate set and then sorted in the decreasing order of their usefulness. The result lists are generated by a ranking algorithm, called ranker, which takes the request of user and candidate items as the input and decides the order of candidate items. The quality of these systems depends on the underlying rankers.
There are two main approaches to optimize the ranker in an interactive system: using data annotated by humans or using the interactive user feedback. The first approach has been widely studied in history, also called offline learning to rank, and is the industry standard. However, the annotated data may not well represent information needs of users and are not timely. Thus, the first approaches may lead to suboptimal rankers. The second approach optimizes rankers by using interactive feedback. This thesis considers the second approach, learning from the interactive feedback. The reasons are two-fold:
- Everyday, millions of users interact with the interactive systems and generate a huge number of interactions, from which we can extract the information needs of users.
- Learning from the interactive data have more potentials to assist in designing the online algorithms.
Specifically, this thesis considers the task of learning from the user click feedback. The main contribution of this thesis is proposing a safe online learning to re-rank algorithm, named BubbleRank, which addresses one main disadvantage of online learning, i.e., the safety issue, by combining the advantages of both offline and online learning to rank algorithms. The thesis also proposes three other online algorithms, each of which solves unique online ranker optimization problems. All the proposed algorithms are theoretically sound and empirically effective.
Image by @firstname.lastname@example.org
for a MSc thesis project on:
Generating synthetic clinical data for shared Machine Learning tasks
Goal: We want to develop methods for researchers to work on shared tasks for which we cannot share the real data because of privacy concerns, in particular clinical data. The envisioned approach is to share synthetic data that is programmatically generated using large-scale language representations like GPT-2 that are fine-tuned to the real data using proper anonymization safe-guards. Additionally, we will research programmatically generating annotations for this data to support shared machine learning and natural language processing tasks using for instance the approaches from Snorkel.
This way researchers and practitioners from different institutions can cooperate on a classification, pseudonimization or tagging task, by working on the synthetic data, possibly using a competitive “Kaggle” approach. Some research questions we want to tackle are:
- Can we generate convincing data? (and how to measure this?)
- Does it prevent private data leakage?
- Can we generate correct annotations of the data?
- How much manual labour is needed, if any?
- Can the synthetic data be used to train AI, and do the trained models work on the real data?
This is a project in cooperation with RUMC, Nedap and Leiden University.
by Jan Trienes, Dolf Trieschnigg, Christin Seifert, and Djoerd Hiemstra
Unstructured information in electronic health records provide an invaluable resource for medical research. To protect the confidentiality of patients and to conform to privacy regulations, de-identification methods automatically remove personally identifying information from these medical records. However, due to the unavailability of labeled data, most existing research is constrained to English medical text and little is known about the generalizability of de-identification methods across languages and domains. In this study, we construct a varied dataset consisting of the medical records of 1260 patients by sampling data from 9 institutes and three domains of Dutch healthcare. We test the generalizability of three de-identification methods across languages and domains. Our experiments show that an existing rule-based method specifically developed for the Dutch language fails to generalize to this new data. Furthermore, a state-of-the-art neural architecture performs strongly across languages and domains, even with limited training data. Compared to feature-based and rule-based methods the neural method requires significantly less configuration effort and domain knowledge. We make all code and pre-trained de-identification models available to the research community, allowing practitioners to apply them to their datasets and to enable future benchmarks.
To be presented at the ACM WSDM Health Search and Data Mining Workshop HSDM 2020 on 3 February 2020 in Houston, USA.
[download preprint] [download from arXiv]
Source code is available as deidentify. We aimed to make it easy for others to apply the pre-trained models to new data, so we bundled the code as Python package which can be installed with pip.
Our paper received the Best paper award!
by Somtochukwu Enendu, Johannes Scholtes, Jeroen Smeets, Djoerd Hiemstra, and Mariet Theune
This paper describes the use of sequence labeling methods in predicting the semantic labels of extracted text regions of heterogeneous electronic documents, by utilizing features related to each semantic label. In this study, we construct a novel dataset consisting of real world documents from multiple domains. We test the performance of the methods on the dataset and offer a novel investigation into the influence of textual features on performance across multiple domains. The results of the experiments show that the neural network method slightly outperforms the Conditional Random Field method with limited training data available. Regarding generalizability, our experiments show that the inclusion of textual features aids performance improvements.
Presented at The Conference on Natural Language Processing (“Konferenz zur Verarbeitung natürlicher Sprache”, KONVENS) on 9-11 October in Nürnberg, Germany
Predicting Semantic Labels of Text Regions in Heterogeneous Document Images
by Somtochukwu Enendu
This MSc thesis describes the use of sequence labeling methods in predicting the semantic labels of extracted text regions of heterogeneous electronic documents, by utilizing features related to each semantic label. In this study, we construct a novel dataset consisting of real world documents from multiple domains. We test the performance of the methods on the dataset and offer a novel investigation into the influence of textual features on performance across multiple domains. The results of the experiments show that the Conditional Random Field method is robust, outperforming the neural network when limited training data is available. Regarding generalizability, our experiments show that the inclusion of textual features does not guarantee performance improvements.