BERT meets Cranfield

Uncovering the Properties of Full Ranking on Fully Labeled Data

by Negin Ghasemi and Djoerd Hiemstra

Recently, various information retrieval models based on pre-trained BERT models have been proposed, achieving outstanding performance. The majority of such models have been tested on collections with partial relevance labels, where many potentially relevant documents were never shown to the annotators. Evaluating BERT-based rankers on such collections may therefore produce biased and unfair results, simply because a relevant document was not exposed to the annotators when the collection was created. In our work, we aim to better understand a BERT-based ranker’s strengths compared to a BERT-based re-ranker and the initial ranker. To this end, we investigate the performance of BERT-based rankers on the Cranfield collection, which comes with full relevance judgments for all documents in the collection. Our results demonstrate the effectiveness of the BERT-based full ranker, as opposed to the BERT-based re-ranker and BM25. Our analysis also shows that the BERT-based full ranker finds documents that were not found by the initial ranker.

To be presented at the Student Research Workshop of the Conference of the European Chapter of the Association for Computational Linguistics (EACL) on 22 April 2021.

[download pdf]
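The difference between re-ranking and full ranking that the abstract highlights can be sketched as follows. This is a minimal, self-contained illustration, not the paper’s code: the toy corpus and query stand in for Cranfield, and `neural_score` is a hypothetical placeholder for a BERT cross-encoder.

```python
import math
from collections import Counter

# Toy corpus; in the paper this is the Cranfield collection.
docs = {
    "d1": "aerodynamic flow over a wing",
    "d2": "boundary layer flow experiments",
    "d3": "heat transfer in supersonic flow",
    "d4": "wing lift at high speed",
}

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Minimal BM25 over whitespace-tokenized documents."""
    tokenized = {d: text.split() for d, text in docs.items()}
    N = len(tokenized)
    avgdl = sum(len(t) for t in tokenized.values()) / N
    df = Counter()
    for toks in tokenized.values():
        df.update(set(toks))
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        s = 0.0
        for term in query.split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(toks) / avgdl))
        scores[d] = s
    return scores

def neural_score(query, text):
    # Hypothetical stand-in for a BERT cross-encoder score:
    # simply the fraction of query terms present in the document.
    q = query.split()
    return sum(t in text.split() for t in q) / len(q)

query = "wing flow"
bm25 = bm25_scores(query, docs)

# Re-ranking: only BM25's top-k candidates are rescored by the neural model.
k = 2
candidates = sorted(bm25, key=bm25.get, reverse=True)[:k]
reranked = sorted(candidates, key=lambda d: neural_score(query, docs[d]),
                  reverse=True)

# Full ranking: the neural model scores every document in the collection,
# so relevant documents missed by BM25's top-k can still surface.
full = sorted(docs, key=lambda d: neural_score(query, docs[d]), reverse=True)
print(reranked, full)
```

Because the re-ranker can only reorder BM25’s candidates, a relevant document outside the top-k is lost; the full ranker has no such limit, which is exactly the behaviour the paper analyses on Cranfield.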

A research agenda

Slow, content-based, federated, explainable, and fair

Access to information on the world wide web is dominated by monopolists (Google and Facebook) that decide most of the information we see. Their business models are based on “surveillance capitalism”, that is, profiting from learning as much as possible about the individuals who use their platforms. This information about individuals is used to maximize their engagement, thereby maximizing the number of targeted advertisements shown to them. Google’s and Facebook’s financial success has influenced many other online businesses as well as a substantial part of the academic research agenda in machine learning and information retrieval, which increasingly focuses on training on huge datasets, literally building on the success of Google and Facebook by using their pre-trained models (e.g. BERT and ELMo). Large pre-trained models and algorithms that maximize engagement come with many societal problems: they have been shown to discriminate against minority groups, to manipulate elections, to radicalize users, and even to enable genocide. Looking forward to 2021-2027, we aim to research the following technical alternatives that do not exhibit these problems:

  1. slow, content-based learning that maximizes user satisfaction, instead of fast, click-based learning that maximizes user engagement;
  2. federated information access and search, instead of centralized access and search;
  3. explainable, fair approaches, instead of black-box, biased approaches.

An Open Access Strategy for iCIS

The Dutch government has set the target that by 2020, 100% of scientific publications financed with public money must be open access. As iCIS, we are not even halfway there. In the Radboud Repository, less than 50% of the publications by Data Science, Software Science, and Digital Security are listed as open access. The slides below make a case for a new Open Access Strategy at iCIS that involves:

  1. Putting all iCIS publications on-line after a reasonable time (as permitted by Dutch copyright law), preferably in the Radboud Repository;
  2. Encouraging so-called diamond open access publishing (where open access publications are paid by donations and volunteer work from authors, editors, peer reviewers, and web masters);
  3. Discouraging closed access as well as so-called gold open access publishing (where authors pay expensive article processing charges);
  4. Complementing the iCIS Research Data Management policy and protocol.

Presented at the iCIS strategy day on 20 October 2020.

[download slides]

Update: iCIS may participate in the You Share, We Care project.

Reducing Misinformation in Query Autocompletions

by Djoerd Hiemstra

Query autocompletions help users of search engines to speed up their searches by recommending completions of partially typed queries in a drop-down box. These recommended query autocompletions are usually based on large logs of queries that were previously entered by the search engine’s users. Therefore, misinformation that was entered — either accidentally or purposely to manipulate the search engine — might end up in the search engine’s recommendations, potentially harming organizations, individuals, and groups of people. This paper proposes an alternative approach for generating query autocompletions by extracting anchor texts from a large web crawl, without the need to use query logs. Our evaluation shows that even though query log autocompletions perform better for shorter queries, anchor text autocompletions outperform query log autocompletions for queries of two words or more.

To be presented at the 2nd International Symposium on Open Search Technology (OSSYM 2020), 12-14 October 2020, CERN, Geneva, Switzerland.

[download pdf]
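The anchor-text approach can be sketched with a simple prefix index. This is an assumed illustration, not the paper’s implementation: the anchor texts below are made up, and a real system would harvest them from a large web crawl.

```python
from collections import Counter, defaultdict

# Hypothetical anchor texts harvested from a web crawl
# (in the paper these replace query-log entries).
anchor_texts = [
    "climate change", "climate change effects", "climate policy",
    "climate change", "climate models", "open search",
]

def build_completion_index(anchors, max_prefix_len=20):
    """Map each prefix of an anchor text to a counter of full anchor texts."""
    index = defaultdict(Counter)
    for text in anchors:
        for i in range(1, min(len(text), max_prefix_len) + 1):
            index[text[:i]][text] += 1
    return index

def autocomplete(index, prefix, n=3):
    """Return the n most frequent anchor texts starting with the prefix."""
    return [text for text, _ in index[prefix].most_common(n)]

index = build_completion_index(anchor_texts)
print(autocomplete(index, "climate ch"))
# The most frequent matching anchor text ("climate change") is ranked first.
```

Ranking completions by anchor-text frequency mirrors how query-log systems rank by query frequency, but the candidate strings come from page authors rather than from (possibly manipulated) user queries.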

Transitioning the Information Retrieval Literature to a Fully Open Access Model

by Djoerd Hiemstra, Marie-Francine Moens, Raffaele Perego, and Fabrizio Sebastiani

Almost all of the important literature on Information Retrieval (IR) is published in subscription-based journals and digital libraries. We argue that the lack of open access publishing in IR is seriously hampering progress and inclusiveness of the field. We propose that the IR community starts working on a road map for transitioning the IR literature to a fully, “diamond”, open access model.

Published in SIGIR Forum 54(1).

[download preprint]

WANTED: MSc students Data Science or AI

for a MSc thesis project on:

Generating synthetic clinical data for shared Machine Learning tasks

Goal: We want to develop methods for researchers to work on shared tasks for which we cannot share the real data because of privacy concerns, in particular clinical data. The envisioned approach is to share synthetic data that is programmatically generated using large-scale language representations like GPT-2, fine-tuned on the real data with proper anonymization safeguards. Additionally, we will research programmatically generating annotations for this data to support shared machine learning and natural language processing tasks, using for instance the approaches from Snorkel.

This way, researchers and practitioners from different institutions can cooperate on a classification, pseudonymization, or tagging task by working on the synthetic data, possibly in a competitive “Kaggle”-style setup. Some research questions we want to tackle are:

  1. Can we generate convincing data? (and how to measure this?)
  2. Does it prevent private data leakage?
  3. Can we generate correct annotations of the data?
  4. How much manual labour is needed, if any?
  5. Can the synthetic data be used to train AI, and do the trained models work on the real data?

This is a project in cooperation with RUMC, Nedap and Leiden University.
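The Snorkel-style programmatic annotation mentioned above can be sketched in a few lines. This is an assumed example, not the project’s code: the labeling functions and example notes are hypothetical, and Snorkel itself combines labeling functions with a learned generative model rather than the plain majority vote used here.

```python
from collections import Counter

# Labels a labeling function may emit.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hand-written heuristics ("labeling functions") that vote on each note.
def lf_mentions_fever(note):
    return POSITIVE if "fever" in note else ABSTAIN

def lf_mentions_no_complaints(note):
    return NEGATIVE if "no complaints" in note else ABSTAIN

def lf_mentions_infection(note):
    return POSITIVE if "infection" in note else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_fever, lf_mentions_no_complaints,
                      lf_mentions_infection]

def annotate(note):
    """Majority vote over non-abstaining labeling functions (ABSTAIN on a tie)."""
    votes = Counter(lf(note) for lf in LABELING_FUNCTIONS)
    votes.pop(ABSTAIN, None)
    if not votes:
        return ABSTAIN
    (top, top_n), *rest = votes.most_common()
    if rest and rest[0][1] == top_n:
        return ABSTAIN
    return top

# Hypothetical synthetic clinical notes, as would be produced by a
# fine-tuned generative model.
notes = [
    "patient presents with fever and a suspected infection",
    "routine check-up, no complaints",
]
labels = [annotate(n) for n in notes]
print(labels)  # [1, 0]
```

Because the annotations come from the labeling functions rather than from the real records, the same functions can label freshly generated synthetic data at any scale, which is what makes a shared task on non-shareable data feasible.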

Professor positions in Machine Learning for Data Science

The Data Science section of Radboud University seeks to appoint an Assistant Professor and an Associate Professor in Machine Learning for Data Science. Deadline: 31 March.

To strengthen and expand the Data Science section’s research, we seek to appoint an Assistant Professor and an Associate Professor in Machine Learning for Data Science. These positions will also be pivotal in supporting our Bachelor’s programme and our Data Science Master’s specialisations, in particular the Master’s courses that attract many students. The main goal of Machine Learning for Data Science is to develop machine learning approaches and techniques that are broadly applicable beyond a specific application domain. It involves the study, development, and application of machine learning techniques to tackle real-life problems with challenging learning tasks and/or types of data.

[More information]

NoGA: No Google Analytics

We will start a project to investigate alternatives for web analytics of Big Tech companies (Google) as part of the SIDN fonds program Je data de baas (“Mastering your data”). There are 10 projects in the program.

In this project, the Data Science group works together with the Marketing and ICT departments of the University to find out whether it is a serious option for Radboud University (and other medium to large-scale organizations, such as municipalities and hospitals) to take web analytics into their own hands, instead of outsourcing it to Google.

More information at: https://nogadata.nl