On 31 March 2021, the Wednesday morning of ECIR 2021, the conference participants joined with seven panellists in a discussion on Open Access and Information Retrieval (IR), or more accurately, on the lack of open access publishing in IR. Discussion topics included the experience of researchers with open access in Africa; business models for open access, in particular how to run a sustainable open access conference like ECIR; open access plans at Springer, the BCS and the ACM; and finally, experience with open access publishing in related fields, notably in Computational Linguistics.
Appeared in BCS-IRSG Informer Spring 2021.
Optimizing Ranking Systems Online as Bandits
by Chang Li
People use interactive systems, such as search engines, as the main tool to obtain information. To satisfy the information needs, such systems usually provide a list of items that are selected out of a large candidate set and then sorted in the decreasing order of their usefulness. The result lists are generated by a ranking algorithm, called ranker, which takes the request of user and candidate items as the input and decides the order of candidate items. The quality of these systems depends on the underlying rankers.
There are two main approaches to optimize the ranker in an interactive system: using data annotated by humans or using the interactive user feedback. The first approach has been widely studied in history, also called offline learning to rank, and is the industry standard. However, the annotated data may not well represent information needs of users and are not timely. Thus, the first approaches may lead to suboptimal rankers. The second approach optimizes rankers by using interactive feedback. This thesis considers the second approach, learning from the interactive feedback. The reasons are two-fold:
- Everyday, millions of users interact with the interactive systems and generate a huge number of interactions, from which we can extract the information needs of users.
- Learning from the interactive data have more potentials to assist in designing the online algorithms.
Specifically, this thesis considers the task of learning from the user click feedback. The main contribution of this thesis is proposing a safe online learning to re-rank algorithm, named BubbleRank, which addresses one main disadvantage of online learning, i.e., the safety issue, by combining the advantages of both offline and online learning to rank algorithms. The thesis also proposes three other online algorithms, each of which solves unique online ranker optimization problems. All the proposed algorithms are theoretically sound and empirically effective.
Discussion Panel at ECIR 2021
Most publications in Information Retrieval are available via subscriptions. These include the ECIR proceedings published by Springer on behalf of the BCS, and the SIGIR proceedings published by the ACM. There is a trend to gradually change this situation to open access publishing. At Springer this is done by giving authors the choice to pay for open access, and by international agreements like Springer’s Compact. At ACM, this is also done by giving authors the choice to pay, and by agreements between ACM and individual institutions.
The panel discusses the effects of this situation on inclusiveness of the field, in particular on how we can support researchers from low income countries. We discuss the experience of researchers with open access in Africa; We discuss business models for open access, in particular how to run a sustainable open access conference like ECIR; We discuss open access plans at Springer, the BCS and the ACM; Finally, we discuss experience with open access publishing in related fields, in particular in Computational Linguistics. The discussion panel consists of:
- Hassina Aliane | CERIST, Algeria
- Ralf Gerstner | Springer Heidelberg, Germany
- Min-Yen Kan | National University of Singapore
- Haiming Liu | University of Bedfordshire, United Kingdom
- Joao Magalhaes | Universidade Nova de Lisboa, Portugal
- Hussein Suleman | University of Cape Town, South Africa
- Min Zhang | Tsinghua University, China
The panel takes place online on Wednesday 31 March at 9:00 UTC+2. More information at: https://www.ecir2021.eu/open-access-and-ir-panel/
Uncovering the Properties of Full Ranking on Fully Labeled Data
by Negin Ghasemi and Djoerd Hiemstra
Recently, various information retrieval models have been proposed based on pre-trained BERT models, achieving outstanding performance. The majority of such models have been tested on data collections with partial relevance labels, where various potentially relevant documents have not been exposed to the annotators. Therefore, evaluating BERT-based rankers may lead to biased and unfair evaluation results, simply because a relevant document has not been exposed to the annotators while creating the collection. In our work, we aim to better understand a BERT-based ranker’s strengths compared to a BERT-based re-ranker and the initial ranker. To this aim, we investigate BERT-based rankers performance on the Cranfield collection, which comes with full relevance judgment on all documents in the collection. Our results demonstrate the BERT-based full ranker’s effectiveness, as opposed to the BERT-based re-ranker and BM25. Also, analysis shows that there are documents that the BERT-based full-ranker finds that were not found by the initial ranker.
To be presented at the Conference of the European Chapter of the Association for Computational Linguistics EACL Student Workshop on 22 April 2021.
Slow, content-based, federated, explainable, and fair
Access to information on the world wide web is dominated by two monopolists, Google and Facebook, that decide most of the information we see. Their business models are based on “surveillance capitalism”, that is, profiting from getting to know as much as possible about individuals that use the platforms. The information about individuals is used to maximize their engagement thereby maximizing the number of targeted advertisements shown to these individuals. Google’s and Facebook’s financial success has influenced many other online businesses as well as a substantial part of the academic research agenda in machine learning and information retrieval, that increasingly focuses on training on huge datasets, literally building on the success of Google and Facebook by using their pre-trained models (e.g. BERT and ELMo). Large pre-trained models and algorithms that maximize engagement come with many societal problems: They have been shown to discriminate minority groups, to manipulate elections, to radicalize users, and even to enable genocide. Looking forward to 2021-2027, we aim to research the following technical alternatives that do not exhibit these problems: 1) slow, content-based, learning maximizing user satisfaction instead of fast, click-based learning maximizing user engagement; 2) federated information access and search instead of centralized access and search; 3) explainable, fair approaches instead of black-box, biased approaches.
The Dutch government has set the target that by 2020, 100% of scientific publications financed with public money must be open access. As iCIS, we are not even half way. In the Radboud Repository less than 50% of the publications by Data Science, Software Science, and Digital Security are listed as open access. The slides below make a case for a new Open Access Strategy at iCIS that involves:
- Putting all iCIS publications on-line after a reasonable time (as permitted by Dutch copyright law), preferably in the Radboud Repository;
- Encouraging so-called diamond open access publishing (where open access publications are paid by donations and volunteer work from authors, editors, peer reviewers, and web masters);
- Discouraging closed access as well as so-called gold open access publishing (where authors pay expensive article processing charges);
- Complementing the iCIS Research Data Management policy and protocol.
Presented at the iCIS strategy day on 20 October 2020.
Update: iCIS may participate in the You Share, We Care project.
by Djoerd Hiemstra
Query autocompletions help users of search engines to speed up their searches by recommending completions of partially typed queries in a drop down box. These recommended query autocompletions are usually based on large logs of queries that were previously entered by the search engine’s users. Therefore, misinformation entered — either accidentally or purposely to manipulate the search engine — might end up in the search engine’s recommendations, potentially harming organizations, individuals, and groups of people. This paper proposes an alternative approach for generating query autocompletions by extracting anchor texts from a large web crawl, without the need to use query logs. Our evaluation shows that even though query log autocompletions perform better for shorter queries, anchor text autocompletions outperform query log autocompletions for queries of 2 words or more.
To be presented at the 2nd International Symposium on Open Search Technology (OSSYM 2020), 12-14 October 2020, CERN, Geneva, Switzerland.
by Djoerd Hiemstra, Marie-Francine Moens, Raffaele Perego, and Fabrizio Sebastiani
Almost all of the important literature on Information Retrieval (IR) is published in subscription-based journals and digital libraries. We argue that the lack of open access publishing in IR is seriously hampering progress and inclusiveness of the field. We propose that the IR community starts working on a road map for transitioning the IR literature to a fully, “diamond”, open access model.
Published in SIGIR Forum 54(1).
for a MSc thesis project on:
Generating synthetic clinical data for shared Machine Learning tasks
Goal: We want to develop methods for researchers to work on shared tasks for which we cannot share the real data because of privacy concerns, in particular clinical data. The envisioned approach is to share synthetic data that is programmatically generated using large-scale language representations like GPT-2 that are fine-tuned to the real data using proper anonymization safe-guards. Additionally, we will research programmatically generating annotations for this data to support shared machine learning and natural language processing tasks using for instance the approaches from Snorkel.
This way researchers and practitioners from different institutions can cooperate on a classification, pseudonimization or tagging task, by working on the synthetic data, possibly using a competitive “Kaggle” approach. Some research questions we want to tackle are:
- Can we generate convincing data? (and how to measure this?)
- Does it prevent private data leakage?
- Can we generate correct annotations of the data?
- How much manual labour is needed, if any?
- Can the synthetic data be used to train AI, and do the trained models work on the real data?
This is a project in cooperation with RUMC, Nedap and Leiden University.