An open source implementation of web clustering algorithms for selective search

by Gijs Hendriksen, Djoerd Hiemstra, and Arjen de Vries

In distributed search, a document collection is partitioned across several shards, which can be queried independently to speed up query processing. Selective search builds upon this infrastructure, but reduces the required resources further by only querying a small number of the index shards. A resource selection algorithm is used to predict which shards are relevant for a given query. To ensure that this works effectively, the shards are usually created using a topic-driven clustering algorithm, so that different documents that are relevant for the same query are more likely to be assigned to the same shard. To make the topic-driven clustering algorithms usable by the general public, and make it easier for researchers or search engine developers to implement and experiment with selective search systems, we release an open source implementation of SB2 K-means, including the extensions QKLD and QInit. Our implementation will be published as a Python package on PyPI.
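The interplay between topical shards and resource selection described above can be sketched in a few lines of Python. This is a toy illustration only, not the released package and not SB2 K-means: the shard assignment is hard-coded as a stand-in for topic-driven clustering, and the function names (`build_shards`, `select_shards`, `search`) and the term-frequency shard score are assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_shards(docs, assignment, num_shards):
    """Group documents into shards and keep per-shard term counts."""
    shards = [{"docs": [], "terms": Counter()} for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard = shards[assignment[doc_id]]
        shard["docs"].append(doc_id)
        shard["terms"].update(tokenize(text))
    return shards


def select_shards(query, shards, k=1):
    """Rank shards by the relative frequency of the query terms; return the top k."""
    def score(shard):
        total = sum(shard["terms"].values()) or 1
        return sum(shard["terms"][t] / total for t in tokenize(query))

    ranked = sorted(range(len(shards)), key=lambda i: score(shards[i]), reverse=True)
    return ranked[:k]


def search(query, docs, shards, k=1):
    """Query only the selected shards; match documents containing any query term."""
    terms = set(tokenize(query))
    hits = []
    for i in select_shards(query, shards, k):
        for doc_id in shards[i]["docs"]:
            if terms & set(tokenize(docs[doc_id])):
                hits.append(doc_id)
    return hits


docs = {
    "d1": "football match results and league scores",
    "d2": "football transfer news",
    "d3": "python package release on pypi",
    "d4": "python clustering library",
}
# Hypothetical assignment: a clustering algorithm would produce this mapping.
assignment = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}
shards = build_shards(docs, assignment, num_shards=2)
```

Because documents on the same topic share a shard, a query such as "python clustering" only needs to visit one of the two shards in this toy collection.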

To be presented at the 6th International Open Search Symposium #OSSYM24 on 9-11 October 2024 in Munich, Germany.

[download pdf] [git repo]

Zaheer Babar defends PhD thesis on radiology report generation systems

Evaluating the impact of Radiology Reports Structure on AI-Powered Radiology Report Generation Systems

by Zaheer Babar

Radiology reports play an essential role in diagnosing and monitoring various diseases and conditions, from pneumonia to lung cancer and bone conditions. The ability to convey findings clearly and comprehensively is paramount, and producing well-structured, clear, and clinically well-focused radiology reports is essential for high-quality patient diagnosis and care. High-quality patient diagnosis and care can be achieved using a computer-aided radiology report system, which assists radiologists in producing well-structured, clear, and clinically well-focused radiology reports. Deep learning has made significant strides in image caption generation, but it has remained a highly challenging task in the medical domain.
One main challenge is understanding and linking complicated medical observations detected in given images with accurate natural language descriptions. Radiologists follow a standard way of writing these reports, describing a fixed set of diseases and conditions and indicating whether each is normal or abnormal. As a result, medical reports usually overlap with each other due to the common content of anatomy. This standardized way of reporting makes it challenging for the machine learning model to capture the prominent problems and abnormalities indicated in radiology reports. This impact can be felt across various aspects of the task, ranging from the utilization of validation metrics to the performance of the model and the use of different components within it. In this thesis, we study this impact on different levels and demonstrate that our research will lead to reliable progress in automatic radiology report generation.

[more information]

Sensitivity of Automated SQL Grading in Computer Science Courses

by Benard Wanjiru, Patrick van Bommel, and Djoerd Hiemstra

Previous research has primarily relied on fixed procedures when implementing partial grading systems. As a result, the sensitivity of such systems in terms of error analysis becomes inflexible as well. In this paper, we employ a software correctness model that allows for a dynamic and flexible approach for adjusting the sensitivity of a grading system based on the user’s needs and goals. We show how partial grading can be used to award fair grades and also categorize students into groups based on their strengths and weaknesses observed in their answers. Furthermore, we show how the sensitivity of a grading system can be varied to allow such grouping. To illustrate this, we analysed more than 2000 answers for 6 SQL programming assignments. An implication of this study is that instructors can carry out more effective partial grading of SQL queries as well as adjust learning material based on the needs of a particular group of students. They can address the observed limitations, thereby bridging the gap between high-performing students and those that require additional attention.
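As a rough illustration of partial grading with adjustable sensitivity, the sketch below compares the result set of a student query with that of a reference query and turns their overlap into a grade. It is not the software correctness model used in the paper: the Jaccard overlap, the `strictness` exponent, the function names, and the toy `students` table are all assumptions made for this example.

```python
import sqlite3


def run(conn, sql):
    """Return the result rows as a set, or None if the query fails."""
    try:
        return set(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def partial_grade(conn, reference_sql, student_sql, strictness=1.0):
    """Grade in [0, 1]; a higher strictness penalises partial answers more."""
    expected = run(conn, reference_sql)  # the reference query is assumed valid
    actual = run(conn, student_sql)
    if actual is None:
        return 0.0  # the student query does not execute
    union = expected | actual
    if not union:
        return 1.0  # both queries correctly return no rows
    overlap = len(expected & actual) / len(union)  # Jaccard similarity
    return overlap ** strictness


# Toy in-memory database (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, grade INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Ann", 8), ("Bob", 5), ("Cy", 9)],
)

reference = "SELECT name FROM students WHERE grade >= 8"
off_by_one = "SELECT name FROM students WHERE grade > 8"
lenient = partial_grade(conn, reference, off_by_one, strictness=1.0)
strict = partial_grade(conn, reference, off_by_one, strictness=2.0)
```

Raising `strictness` lowers the grade for the same off-by-one mistake, which is one simple way an instructor could tune how sharply answers are separated into groups.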

To be presented at the third International Conference on Innovations in Computing Research (ICR) on August 12–14, 2024 in Athens, Greece.

[download pdf]

Team OpenWebSearch at CLEF 2024

LongEval

by Daria Alexander, Maik Fröbe, Gijs Hendriksen, Ferdinand Schlatt, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, and Arjen de Vries

We describe the OpenWebSearch group’s participation in the CLEF 2024 LongEval IR track. Our submitted runs explore how historical data can be transferred into future retrieval systems. To this end, we incorporate relevance information from past click logs into the query reformulation process via keyqueries and into the indexing process via a reverted index. Ultimately, we incorporate both into learning-to-rank pipelines to ensure that retrieval is also possible for novel queries. Our evaluation shows that keyqueries substantially outperform other approaches for queries with historical click data available.
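To give an idea of how click logs can feed into indexing, the sketch below builds a small reverted index, mapping each document to the past queries that led to a click on it, and appends the most frequent of those queries to the document text before indexing. This is a simplified illustration, not the team's pipeline: the click log format, the function names, and the expansion step are assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_reverted_index(click_log):
    """Map each document to a count of the past queries that led to a click on it."""
    reverted = defaultdict(Counter)
    for query, doc_id in click_log:
        reverted[doc_id][query] += 1
    return reverted


def expand_document(doc_id, text, reverted, top_n=2):
    """Append the most frequently clicked past queries to the document text."""
    past_queries = reverted.get(doc_id, Counter())
    queries = [query for query, _ in past_queries.most_common(top_n)]
    return " ".join([text] + queries)


# Hypothetical click log of (query, clicked document) pairs.
click_log = [
    ("cheap flights", "d1"),
    ("cheap flights", "d1"),
    ("airline tickets", "d1"),
    ("train times", "d2"),
]
reverted = build_reverted_index(click_log)
expanded = expand_document("d1", "Book your trip online", reverted, top_n=1)
```

Documents without click history are left unchanged, which is why a fallback such as a learning-to-rank pipeline is still needed for unseen queries.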

To be presented at CLEF 2024: Conference and Labs of the Evaluation Forum on 9-12 September in Grenoble, France.

[download pdf]

Announcing IRRJ

Today at SIGIR 2024, the Information Retrieval Research Journal (IRRJ) will be informally announced:

  • Open Access,
  • No article processing charges,
  • Papers in all areas of Information Retrieval (IR),
  • First issue planned end of 2024,
  • Submissions open in September,
  • To enlarge IR with researchers from low-income countries!

Editorial board:

  • Djoerd Hiemstra (Radboud University, the Netherlands)
  • Vanessa Murdock (Amazon, USA)
  • Johanne Trippas (RMIT, Australia)
  • Makoto Kato (University of Tsukuba, Japan)
  • Ismail Sengor Altingovde (Middle East Technical University, Turkiye)
  • Monica Paramita (University of Sheffield, UK)
  • Negin Rahimi (University of Massachusetts, Amherst, USA)
  • Ben He (University of Chinese Academy of Sciences, China)
  • Shangsong Liang (Mohamed bin Zayed University of Artificial Intelligence, UAE)
  • Haiming Liu (University of Southampton, UK)
  • Debarshi Kumar Sanyal (Indian Association for the Cultivation of Science, India)
  • Daniela Godoy (National Council for Scientific and Technological Research, Argentina)
  • Barbara Poblete (DCC University, Chile)

Advisory board:

  • Paul Kantor (Emeritus, Rutgers University, USA)
  • Stephen Robertson (formerly Microsoft Research, UK)

More information follows soon!

Nirmal Roy defends PhD thesis on the effects of interfaces on search

Exploring the effects of interactive interfaces on user search behaviour

by Nirmal Roy

Interactive information retrieval (IIR) is a user-centered approach to information seeking and retrieval. In this paradigm, the search process is not confined to a single query and a static set of results. Instead, it emphasises the active involvement of users in refining their information needs, iteratively modifying queries, and exploring retrieved content. IIR research studies how to facilitate a more tailored and practical search experience, adapting to the evolving requirements and preferences of users. In this thesis, we focus on four distinct yet interrelated areas in the domain of IIR to gain a better understanding of the interaction between the user and the information retrieval system.

[Read more]

Tom Rust graduates on Learned Sparse Retrieval

by Tom Rust

Machine learning algorithms are achieving better results each day and are gaining popularity. The top-performing models are usually deep learning models. These models can absorb vast amounts of training data, improving prediction results. Unfortunately, these models consume a large amount of energy, which is something that not everyone is aware of. In information retrieval, large language models are used to provide extra context to queries and documents. Since information retrieval systems typically have large datasets, a suitable deep learning model must be chosen to find a balance between accuracy and energy usage. Learned sparse retrieval models are an example of these deep learning models. These models work by expanding all documents to create the optimal document representation that allows this document to be found correctly. This step is done before creating the inverted index, allowing for conventional ranking methods such as BM25. With this research, we compare different learned sparse retrieval models in terms of accuracy, speed, size and energy usage. We also compare them with a full-text index. We see that on MS MARCO, the learned sparse retrievers outperform the full-text index on all popular evaluation benchmarks. However, the learned sparse retrievers can consume up to 100 times more energy whilst creating the index, which then has a higher query latency and uses more disk space. For WT10g we see that the full-text index gives us the highest accuracies whilst also being more energy efficient, using less disk space and having a lower query latency.
We conclude that learned sparse retrieval has the potential to improve accuracy on certain datasets, but a trade-off is necessary between the improved accuracy and the cost of increased storage, latency, and energy consumption.
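The expand-then-index idea can be sketched as follows. The expansion terms here are hand-written stand-ins for what a learned sparse retrieval model would predict, and the function names are assumptions made for this example; the scoring function is a standard BM25 formulation over a plain inverted index.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index (term -> {doc_id: tf}) and record document lengths."""
    index = defaultdict(dict)
    lengths = {}
    for doc_id, tokens in docs.items():
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25(query, index, lengths, k1=1.2, b=0.75):
    """Return the ids of matching documents, ranked by BM25 score."""
    n = len(lengths)
    avgdl = sum(lengths.values()) / n
    scores = defaultdict(float)
    for term in query:
        postings = index.get(term, {})
        idf = math.log(1 + (n - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avgdl)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "stock markets fell sharply".split(),
}
# Hypothetical expansion terms; a learned model would predict these.
expansions = {"d1": ["feline", "pet"], "d2": ["finance", "economy"]}
expanded_docs = {doc_id: tokens + expansions[doc_id] for doc_id, tokens in docs.items()}

plain_index, plain_lengths = build_index(docs)
expanded_index, expanded_lengths = build_index(expanded_docs)
```

The query term "feline" matches nothing in the plain index but finds d1 in the expanded one, at the cost of longer documents and a larger index, which mirrors the trade-off discussed above.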

Proceedings of WOWS 2024

The Proceedings of the first Workshop on Open Web Search (WOWS), which took place on 28 March 2024 in Glasgow, UK, are now published in the CEUR Workshop Series as Volume 3689.

WOWS 2024 had two calls for contributions. The first call targeted scientific contributions on cooperative search engine development. This included cooperative crawling of the web and cooperative deployment and evaluation of search engines. We specifically highlighted the potential of enabling public and commercial organizations to use an indexed web crawl as a resource to create innovative search engines tailored to specific user groups, instead of relying on one search engine provider. The second call aimed at gaining practical experience with joint, cooperative evaluation of search engine prototypes and their components using the Information Retrieval Experiment Platform TIREx. The workshop involved a keynote by Negar Arabzadeh from the University of Waterloo, 8 paper presentations (5 full papers and 3 short papers accepted out of 13 submissions), and a breakout session with participant discussions. WOWS received funding from the European Union’s Horizon Europe research and innovation program under grant agreement No 101070014. We would like to thank the Program Committee members for helpful reviews and suggestions to improve the contributions to the workshop. Special thanks go to Christine Plote, Managing Director of the Open Search Foundation, for the WOWS 2024 website.

https://ceur-ws.org/Vol-3689/

[download pdf]

Semere Bitew defends PhD thesis on Language Models for Education

Language Model Adaptation with Applications in AI for Education

by Semere Kiros Bitew

The overall theme of my dissertation is adapting language models, mainly for applications in AI in education, to automatically create educational content. It addresses the challenges in formulating test and exercise questions in educational settings, which traditionally require significant training, experience, time, and resources. This is particularly critical in high-stakes environments like certifications and tests, where questions cannot be reused. In particular, the primary research is focused on two educational tasks: distractor generation and gap-filling exercise generation. The distractor generation task refers to generating plausible but incorrect answers for multiple-choice questions, while gap-filling exercise generation refers to inducing well-chosen gaps to generate grammar exercises from existing texts. These tasks, although extensively researched, present unexplored avenues that recent advancements in language models can address. As a secondary objective, I explore the adaptation of coreference resolution to new languages. Coreference resolution is a key NLP task that involves clustering mentions in a text that refer to the same real-world entities, a process vital for understanding and generating coherent language.

Read more