Optimizing Ranking Systems Online as Bandits
by Chang Li
People use interactive systems, such as search engines, as their main tool to obtain information. To satisfy these information needs, such systems usually provide a list of items that are selected from a large candidate set and sorted in decreasing order of usefulness. The result lists are generated by a ranking algorithm, called a ranker, which takes the user's request and the candidate items as input and decides the order of the candidate items. The quality of these systems depends on the underlying rankers.
There are two main approaches to optimizing the ranker in an interactive system: using data annotated by humans or using interactive user feedback. The first approach, also called offline learning to rank, has been widely studied and is the industry standard. However, the annotated data may not represent users' information needs well and may be outdated; thus, this approach can lead to suboptimal rankers. The second approach optimizes rankers using interactive feedback. This thesis considers the second approach, learning from interactive feedback. The reasons are two-fold:
- Every day, millions of users interact with such systems and generate a huge number of interactions, from which we can extract users' information needs.
- Learning from interactive data has more potential to assist in designing online algorithms.
Specifically, this thesis considers the task of learning from user click feedback. The main contribution of this thesis is a safe online learning to re-rank algorithm, named BubbleRank, which addresses one main disadvantage of online learning, namely the safety issue, by combining the advantages of both offline and online learning to rank algorithms. The thesis also proposes three other online algorithms, each of which solves a distinct online ranker optimization problem. All the proposed algorithms are theoretically sound and empirically effective.
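To give a flavour of click-driven re-ranking, here is a minimal sketch of the bubble-sort-style idea behind safe online re-ranking: randomly swap one adjacent pair in the list, show it to the user, and keep the swap only if the clicks suggest it improved the list. This is an illustrative toy, not the actual BubbleRank algorithm; the function names and the deterministic click model are assumptions for the sketch.

```python
import random

def bubble_rank_step(ranking, click_model, rng):
    """One round of click-driven re-ranking (toy sketch).

    Randomly swaps one adjacent pair in the displayed list and keeps
    the swap only if the promoted item was clicked while the demoted
    item was not -- click evidence that the swap helped.
    """
    i = rng.randrange(len(ranking) - 1)
    displayed = list(ranking)
    displayed[i], displayed[i + 1] = displayed[i + 1], displayed[i]
    clicks = click_model(displayed)
    if clicks[i] and not clicks[i + 1]:
        return displayed   # evidence the swap improved the list
    return ranking         # otherwise stay with the safe ranking
```

Because a swap is only kept when the clicks favour it, the list never degrades much in a single round, which is the intuition behind the safety property.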
Uncovering the Properties of Full Ranking on Fully Labeled Data
by Negin Ghasemi and Djoerd Hiemstra
Recently, various information retrieval models have been proposed based on pre-trained BERT models, achieving outstanding performance. The majority of such models have been tested on data collections with partial relevance labels, where various potentially relevant documents have not been exposed to the annotators. Therefore, evaluating BERT-based rankers may lead to biased and unfair evaluation results, simply because a relevant document was not exposed to the annotators while creating the collection. In our work, we aim to better understand a BERT-based ranker's strengths compared to a BERT-based re-ranker and the initial ranker. To this aim, we investigate BERT-based rankers' performance on the Cranfield collection, which comes with full relevance judgments on all documents in the collection. Our results demonstrate the BERT-based full ranker's effectiveness, as opposed to the BERT-based re-ranker and BM25. Our analysis also shows that the BERT-based full ranker finds relevant documents that were not found by the initial ranker.
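The structural difference between the two setups can be sketched in a few lines: a full ranker scores every document, while a re-ranker only re-orders the top-k of an initial ranker, so a relevant document outside that top-k can never surface. The function names and toy scores below are illustrative assumptions, not the models from the paper.

```python
def full_rank(docs, score):
    """Score and sort every document in the collection."""
    return sorted(docs, key=score, reverse=True)

def re_rank(docs, initial_score, score, k=100):
    """Re-order only the top-k of the initial ranker; documents
    outside the initial top-k are lost regardless of their score."""
    top_k = sorted(docs, key=initial_score, reverse=True)[:k]
    return sorted(top_k, key=score, reverse=True)
```

With a toy collection where a relevant document gets a low initial (e.g. BM25-like) score, the re-ranker cannot recover it while the full ranker can, which mirrors the effect reported in the abstract.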
To be presented at the Student Workshop of the Conference of the European Chapter of the Association for Computational Linguistics (EACL) on April 19-20, 2021.
The Dutch government has set the target that by 2020, 100% of scientific publications financed with public money must be open access. As iCIS, we are not even halfway. In the Radboud Repository, less than 50% of the publications by Data Science, Software Science, and Digital Security are listed as open access. The slides below make a case for a new Open Access Strategy at iCIS that involves:
- Putting all iCIS publications on-line after a reasonable time (as permitted by Dutch copyright law), preferably in the Radboud Repository;
- Encouraging so-called diamond open access publishing (where open access publications are paid by donations and volunteer work from authors, editors, peer reviewers, and web masters);
- Discouraging closed access as well as so-called gold open access publishing (where authors pay expensive article processing charges);
- Complementing the iCIS Research Data Management policy and protocol.
Presented at the iCIS strategy day on 20 October 2020.
Update: iCIS may participate in the You Share, We Care project.
by Djoerd Hiemstra
Query autocompletions help users of search engines to speed up their searches by recommending completions of partially typed queries in a drop down box. These recommended query autocompletions are usually based on large logs of queries that were previously entered by the search engine’s users. Therefore, misinformation entered — either accidentally or purposely to manipulate the search engine — might end up in the search engine’s recommendations, potentially harming organizations, individuals, and groups of people. This paper proposes an alternative approach for generating query autocompletions by extracting anchor texts from a large web crawl, without the need to use query logs. Our evaluation shows that even though query log autocompletions perform better for shorter queries, anchor text autocompletions outperform query log autocompletions for queries of 2 words or more.
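The core of the proposed approach can be sketched as a simple pipeline: aggregate anchor texts harvested from a web crawl into frequency counts, then complete a partially typed query with the most frequent anchor texts that extend it. This is a minimal pure-Python sketch under assumed function names; the paper's actual extraction and ranking from the web crawl is more involved.

```python
from collections import Counter

def build_index(anchor_texts):
    """Aggregate normalized anchor texts into frequency counts."""
    return Counter(text.strip().lower() for text in anchor_texts)

def autocomplete(index, prefix, k=3):
    """Return the k most frequent anchor texts extending the prefix,
    breaking frequency ties alphabetically."""
    prefix = prefix.strip().lower()
    matches = [(t, c) for t, c in index.items() if t.startswith(prefix)]
    matches.sort(key=lambda tc: (-tc[1], tc[0]))
    return [t for t, _ in matches[:k]]
```

Because the suggestions come from anchor texts rather than a query log, no user-entered (and possibly manipulated) queries are needed.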
To be presented at the 2nd International Symposium on Open Search Technology (OSSYM 2020), 12-14 October 2020, CERN, Geneva, Switzerland.
by Djoerd Hiemstra, Marie-Francine Moens, Raffaele Perego, and Fabrizio Sebastiani
Almost all of the important literature on Information Retrieval (IR) is published in subscription-based journals and digital libraries. We argue that the lack of open access publishing in IR is seriously hampering progress and inclusiveness of the field. We propose that the IR community starts working on a road map for transitioning the IR literature to a fully, “diamond”, open access model.
Published in SIGIR Forum 54(1).
for an MSc thesis project on:
Generating synthetic clinical data for shared Machine Learning tasks
Goal: We want to develop methods for researchers to work on shared tasks for which we cannot share the real data because of privacy concerns, in particular clinical data. The envisioned approach is to share synthetic data that is programmatically generated using large-scale language representations like GPT-2 that are fine-tuned to the real data using proper anonymization safeguards. Additionally, we will research programmatically generating annotations for this data to support shared machine learning and natural language processing tasks, using, for instance, approaches from Snorkel.
This way, researchers and practitioners from different institutions can cooperate on a classification, pseudonymization, or tagging task by working on the synthetic data, possibly using a competitive "Kaggle" approach. Some research questions we want to tackle are:
- Can we generate convincing data? (and how to measure this?)
- Does it prevent private data leakage?
- Can we generate correct annotations of the data?
- How much manual labour is needed, if any?
- Can the synthetic data be used to train AI, and do the trained models work on the real data?
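The Snorkel-style programmatic annotation mentioned above can be sketched with labeling functions: small heuristics that each vote on a label or abstain, combined by (in the simplest case) majority vote. The labeling functions, labels, and toy clinical snippets below are hypothetical illustrations, not part of the actual project.

```python
from collections import Counter

ABSTAIN = None

# Hypothetical labeling functions for a toy clinical tagging task:
# each votes "MEDICATION" or "OTHER" on a snippet, or abstains.
def lf_dosage(text):
    return "MEDICATION" if "mg" in text.lower() else ABSTAIN

def lf_drug_suffix(text):
    words = text.lower().split()
    return "MEDICATION" if any(w.endswith("cillin") for w in words) else ABSTAIN

def lf_appointment(text):
    return "OTHER" if "appointment" in text.lower() else ABSTAIN

def weak_label(text, lfs):
    """Majority vote over the labeling functions that did not abstain."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN
```

Snorkel itself replaces the majority vote with a learned generative model over the labeling functions, but the sketch shows how annotations can be generated programmatically without exposing manually labeled clinical data.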
This is a project in cooperation with RUMC, Nedap and Leiden University.
ECIR 2020 was the very first online Information Retrieval conference, and it was an amazing online experience: great papers, excellent organization! See you all next year, hopefully in real life, in Lucca, Italy. The new website is online at: https://ecir2021.eu.
Welcome to the Data Science group, Negin Ghasemi! Negin will work on Transfer Learning for Federated Search.
The Data Science section of the Radboud University seeks to appoint an Assistant Professor and an Associate Professor in Machine Learning for Data Science. Deadline: 31 March.
To strengthen and expand the Data Science section's research, we seek to appoint an Assistant Professor and an Associate Professor in Machine Learning for Data Science. These positions will also be pivotal for supporting our Bachelor's programme and our Data Science Master's specialisations, in particular for Master's courses that attract many students. The main goal of Machine Learning for Data Science is to develop machine learning approaches and techniques of broader applicability outside a specific application domain. It involves the study, development, and application of machine learning techniques to tackle real-life problems involving challenging learning tasks and/or types of data.