DB guest lecture by Hannes Mühleisen

We are proud to announce that Hannes Mühleisen will give a guest lecture on Tuesday 10 December at 15:30h. in EOS N 01.630 for the course Information Modelling and Databases. Hannes Mühleisen is professor of Data Engineering at Radboud University, the creator of DuckDB and co-founder and CEO of DuckDB Labs. Students of the course use DuckDB to practice their SQL skills.

Analytical Query Processing and the DuckDB System

by Hannes Mühleisen

DBMSs have historically been created to support transactional (OLTP) workloads. However, a second use case, analytical processing (OLAP), quickly appeared. These workloads are characterised by complex, relatively long-running queries that process significant portions of the stored dataset, for example aggregations over entire tables or joins between several large tables. It is rather impossible for an OLTP-focused DBMS to perform well in OLAP scenarios, which is why specialised systems have been developed. In this lecture, I will introduce analytical query processing, give an overview of the state of the art in research and industry, and describe our own analytical DBMS, DuckDB.
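To make the contrast between the two workload types concrete, here is a minimal sketch that is not part of the lecture material. It uses Python's built-in sqlite3 module purely so that it runs without extra packages; the table and column names are hypothetical, and the plain SQL shown should also be accepted by DuckDB.

```python
import sqlite3

# In-memory database with two hypothetical tables (names are illustrative only).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'NL'), (2, 'NL'), (3, 'DE');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 20.0), (4, 3, 7.5);
""")

# OLTP-style access: look up a single row by its key.
oltp_row = con.execute("SELECT amount FROM orders WHERE id = ?", (3,)).fetchone()

# OLAP-style access: join two tables and aggregate over all of their rows.
olap_rows = con.execute("""
    SELECT c.country, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.country
    ORDER BY c.country
""").fetchall()

print(oltp_row)   # one value from one row
print(olap_rows)  # one summary row per country
```

The first query touches one row, while the second has to read every row of both tables; on large tables that difference is what motivates specialised analytical systems.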

Introducing Zoekeend

We made a little tool for running information retrieval experiments using DuckDB which we appropriately called Zoekeend (Dutch for “search duck”). Zoekeend will be presented at DuckCon #6 in Amsterdam on 31 January 2025.

I will present several reproduced experiments, such as ranking using (small) language models, imports of indexes in the common index file format (CIFF), and the CIFF tokenizer based on tokenizers of large language models, all elegantly defined as SQL queries. I will further present ongoing work on new types of indexes for search engines, such as the score-fitted index, the constant length index and the term-grouped index, all of which would be extremely cumbersome to implement in existing search engines like Lucene, but can be easily defined as SQL queries in DuckDB. Zoekeend will greatly simplify information retrieval experimentation. Zoekeend is open source and available from: https://gitlab.science.ru.nl/informagus/zoekeend/
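As an illustration of what "ranking defined as an SQL query" can look like, here is a minimal sketch under stated assumptions. It is not Zoekeend's code or schema: the tables `terms` and `postings` are hypothetical, the scoring is a plain TF-IDF sum chosen for brevity, and Python's built-in sqlite3 module stands in for DuckDB so that the example runs without extra packages.

```python
import math
import sqlite3
from collections import Counter

# Tiny hypothetical collection: docid -> text.
docs = {1: "duck pond", 2: "search engine index", 3: "duck search engine"}

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE terms (termid INTEGER PRIMARY KEY, term TEXT, idf REAL)")
con.execute("CREATE TABLE postings (termid INTEGER, docid INTEGER, tf INTEGER)")

# Build a small inverted index in Python and store it in the two tables.
doc_tf = {docid: Counter(text.split()) for docid, text in docs.items()}
vocab = sorted({term for tf in doc_tf.values() for term in tf})
for termid, term in enumerate(vocab):
    df = sum(1 for tf in doc_tf.values() if term in tf)
    idf = math.log(len(docs) / df)  # precomputed so the query needs no log()
    con.execute("INSERT INTO terms VALUES (?, ?, ?)", (termid, term, idf))
    for docid, tf in doc_tf.items():
        if term in tf:
            con.execute("INSERT INTO postings VALUES (?, ?, ?)", (termid, docid, tf[term]))

def search(query):
    """Rank documents by summed tf * idf over the query terms, in one SQL query."""
    q_terms = query.split()
    placeholders = ", ".join("?" for _ in q_terms)
    sql = f"""
        SELECT p.docid, SUM(p.tf * t.idf) AS score
        FROM postings AS p
        JOIN terms AS t ON t.termid = p.termid
        WHERE t.term IN ({placeholders})
        GROUP BY p.docid
        ORDER BY score DESC, p.docid ASC
    """
    return con.execute(sql, q_terms).fetchall()

ranking = search("duck search")
print(ranking)
```

Because the index is just two relational tables, a different index layout or scoring function amounts to a different table definition or a different query, rather than changes to a search engine's internals.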

Alisa Rieger defends PhD thesis on responsible opinion formation

Striving for responsible opinion formation in web search on debated topics

by Alisa Rieger

Web search plays an important role in the contemporary information landscape, shaping individual and collective knowledge by providing fast and effortless access to vast amounts of resources. We rely on web search engines for various information needs, some of which can carry serious consequences. This is particularly evident when searching for information on debated topics, which can shape opinions and practical decisions. Debated topics are characterized by diverse and often opposing perspectives linked to different values and interests. Ideally, individuals would diligently engage with different perspectives to become well-informed and form opinions responsibly. However, engaging with information on debated topics can be cognitively demanding and subject to emotionally charged and biased behavior. When resorting to web search to find information on debated topics, searchers may be confronted with further obstacles. For instance, search engines are known to apply opaque ranking criteria, may not provide sufficient viewpoint diversity, and might foster over-reliance.

In this dissertation, we present different user studies aimed at better understanding the challenges of web search on debated topics and identifying measures to help searchers overcome these challenges. We first explored whether and how factors inherent to the searcher and search interface affect search behavior. Then, we investigated the risks and benefits of interventions to guide search behavior as well as empower searchers, aiming at supporting unbiased and diligent search interactions without restricting searcher autonomy. Our findings underscore the unique characteristics of web search on debated topics and provide a foundation for designing, tailoring, and evaluating interventions to support searchers. Considering the overall insights gained through our user studies, it becomes clear that the most pivotal challenges of web search on debated topics arise from the complex searcher-system interplay. Rather than turning to simple fixes, there is a need to acknowledge the complexity of the issue and commit to comprehensive investigations and solutions to avoid inadvertently exacerbating risks. Laying the groundwork for future investigations, we provide an extensive review of interdisciplinary literature with a detailed account of challenges and research opportunities.

With this dissertation, we raise awareness for the pressing socio-technical issues related to digital media and opinion formation and aspire to encourage interdisciplinary research teams, practitioners, and policymakers to join forces in establishing web search environments that foster individual and societal well-being.

[more information]

Welcome to Databases

Welcome to Information Modelling & Databases, Part B, Databases! We will resume Tuesday 5 November with a lecture at 15:30h. in EOS N 01.630.

The Databases part contains mandatory, individual quizzes, for which the following honour code applies:

  • You do not share the solutions;
  • The solutions to the quizzes should be your own work;
  • You do not post the quizzes, nor the solutions anywhere online;
  • You do not use instruction-tuned large language models like GitHub Copilot or ChatGPT;
  • You are allowed, and encouraged, to discuss the quizzes, and to ask your fellow students clarifying questions. Please use the Brightspace Discussion Forum to reach out to me, the teaching assistants and your fellow students.

New this year are the optional SQL Mastery Assignments for students that want to go the extra mile. Students that successfully submit solutions to the SQL Mastery Assignments get free travel and participation to DuckCon #6 in Amsterdam on 31 January 2025!!

Also this year, we will experiment with a new automatic grader called Socoles that will automatically give feedback on open questions that require SQL solutions. Socoles is developed by Benard Wanjiru. Socoles helps us grade the assignments for more than 300 students in the course. Of course, you will get human feedback too, during the tutorials on Friday mornings.

Wishing you a fruitful Part B!
Best wishes,  Djoerd Hiemstra

An open source implementation of web clustering algorithms for selective search

by Gijs Hendriksen, Djoerd Hiemstra, and Arjen de Vries

In distributed search, a document collection is partitioned across several shards, which can be queried independently to speed up query processing. Selective search builds upon this infrastructure, but reduces the required resources further by only querying a small number of the index shards. A resource selection algorithm is used to predict which shards are relevant for a given query. To ensure that this works effectively, the shards are usually created using a topic-driven clustering algorithm, so that different documents that are relevant for the same query are more likely to be assigned to the same shard. To make the topic-driven clustering algorithms usable by the general public, and make it easier for researchers or search engine developers to implement and experiment with selective search systems, we release an open source implementation of SB2 K-means, including the extensions QKLD and QInit. Our implementation will be published as a Python package on PyPI.
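The selective search pipeline described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the released package: the shards are hand-made and merely stand in for the output of a topic-driven clustering algorithm (it does not implement SB2 K-means, QKLD or QInit), and both the resource selection score and the per-shard ranking are deliberately simplistic, hypothetical choices.

```python
from collections import Counter

# Hypothetical shards, standing in for the output of topic-driven clustering.
shards = {
    "animals": ["duck swims in pond", "duck eats bread", "cat chases duck"],
    "databases": ["sql query joins tables", "index speeds up query", "duck database runs sql"],
    "sports": ["team wins match", "player scores goal"],
}

# Per-shard term statistics, used only for resource selection.
shard_stats = {
    name: Counter(word for doc in docs for word in doc.split())
    for name, docs in shards.items()
}

def select_shards(query, k=1):
    """Toy resource selection: score a shard by the share of its tokens that are query terms."""
    terms = query.split()
    scores = {
        name: sum(stats[t] for t in terms) / sum(stats.values())
        for name, stats in shard_stats.items()
    }
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:k]

def search_shard(name, query):
    """Toy retrieval inside one shard: score a document by its number of distinct query terms."""
    terms = set(query.split())
    hits = []
    for doc in shards[name]:
        overlap = len(terms & set(doc.split()))
        if overlap:
            hits.append((overlap, doc))
    return hits

def selective_search(query, k=1):
    """Query only the k selected shards and merge their results."""
    hits = []
    for name in select_shards(query, k):
        hits.extend(search_shard(name, query))
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))

results = selective_search("sql query", k=1)
print(select_shards("sql query", k=1))
print(results)
```

With k=1 only the "databases" shard is searched; the other shards are never touched, which is where the resource savings come from. How well this works in practice depends on whether the clustering puts documents relevant to the same query into the same shard.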

To be presented at the 6th International Open Search Symposium #OSSYM24 on 9-11 October 2024 in Munich, Germany.

[download pdf] [git repo]

Zaheer Babar defends PhD thesis on radiology report generation systems

Evaluating the impact of Radiology Reports Structure on AI-Powered Radiology Report Generation Systems

by Zaheer Babar

Radiology reports play an essential role in diagnosing and monitoring various diseases and conditions, from pneumonia to lung cancer and bone conditions. The ability to convey findings clearly and comprehensively is paramount, and producing well-structured, clear, and clinically well-focused radiology reports is essential for high-quality patient diagnosis and care. High-quality patient diagnosis and care can be achieved using a computer-aided radiology report system, which assists radiologists in producing well-structured, clear, and clinically well-focused radiology reports. Deep learning has made significant strides in image caption generation, but it has remained a highly challenging task in the medical domain.
One main challenge is understanding and linking complicated medical observations detected in given images with accurate natural language descriptions. Radiologists follow a standard way of writing these reports, describing a fixed set of diseases and conditions and indicating whether each is normal or abnormal. As a result, medical reports usually overlap with each other due to the common content of anatomy. This standardized way of reporting makes it challenging for the machine learning model to capture the prominent problems and abnormalities indicated in radiology reports. This impact can be felt across various aspects of the task, ranging from the utilization of validation metrics to the performance of the model and the use of different components within it. In this thesis, we study this impact on different levels and demonstrate that our research will lead to reliable progress in automatic radiology report generation.

[more information]

Sensitivity of Automated SQL Grading in Computer Science Courses

by Benard Wanjiru, Patrick van Bommel, and Djoerd Hiemstra

Previous research has primarily relied on fixed procedures when implementing partial grading systems. As a result, the sensitivity of such systems in terms of error analysis becomes inflexible as well. In this paper, we employ a software correctness model that allows for a dynamic and flexible approach for adjusting the sensitivity of a grading system based on the user’s needs and goals. We show how partial grading can be used to award fair grades and also categorize students into groups based on their strengths and weaknesses observed in their answers. Furthermore, we show how the sensitivity of a grading system can be varied to allow such grouping. To illustrate this, we analysed more than 2000 answers for 6 SQL programming assignments. An implication of this study is that instructors can carry out more effective partial grading of SQL queries as well as adjust learning material based on the needs of a particular group of students. They can address the observed limitations, thereby bridging the gap between high-performing students and those that require additional attention.

To be presented at the third International Conference on Innovations in Computing Research (ICR) on August 12–14, 2024 in Athens, Greece.

[download pdf]

Team OpenWebSearch at CLEF 2024

LongEval

by Daria Alexander, Maik Fröbe, Gijs Hendriksen, Ferdinand Schlatt, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, and Arjen de Vries

We describe the OpenWebSearch group’s participation in the CLEF 2024 LongEval IR track. Our submitted runs explore how historical data can be transferred into future retrieval systems. To this end, we incorporate relevance information from past click logs into the query reformulation process via keyqueries and into the indexing process via a reverted index. Ultimately, we incorporate both into learning-to-rank pipelines to ensure that retrieval is also possible for novel queries that were not seen before. Our evaluation shows that keyqueries substantially outperform other approaches for queries with historical click data available.

To be presented at CLEF 2024: Conference and Labs of the Evaluation Forum on 9-12 September in Grenoble, France.

[download pdf]

Announcing IRRJ

Today at SIGIR 2024, the Information Retrieval Research Journal (IRRJ) will be informally announced:

  • Open Access,
  • No article processing charges,
  • Papers in all areas of Information Retrieval (IR),
  • First issue planned end of 2024,
  • Submissions open in September,
  • To enlarge IR with researchers from low-income countries!

Editorial board:

  • Djoerd Hiemstra (Radboud University, the Netherlands)
  • Vanessa Murdock (Amazon, USA)
  • Johanne Trippas (RMIT, Australia)
  • Makoto Kato (University of Tsukuba, Japan)
  • Ismail Sengor Altingovde (Middle East Technical University, Turkiye)
  • Monica Paramita (University of Sheffield, UK)
  • Negin Rahimi (University of Massachusetts, Amherst, USA)
  • Ben He (University of Chinese Academy of Sciences, China)
  • Shangsong Liang (Mohamed bin Zayed University of Artificial Intelligence, UAE)
  • Haiming Liu (University of Southampton, UK)
  • Debarshi Kumar Sanyal (Indian Association for the Cultivation of Science, India)
  • Daniela Godoy (National Council for Scientific and Technological Research, Argentina)
  • Barbara Poblete (DCC University, Chile)

Advisory board:

  • Paul Kantor (Emeritus, Rutgers University, USA)
  • Stephen Robertson (formerly Microsoft Research, UK)

More information follows soon!

Nirmal Roy defends PhD thesis on the effects of interfaces on search

Exploring the effects of interactive interfaces on user search behaviour

by Nirmal Roy

Interactive information retrieval (IIR) is a user-centered approach to information seeking and retrieval. In this paradigm, the search process is not confined to a single query and a static set of results. Instead, it emphasises the active involvement of users in refining their information needs, iteratively modifying queries, and exploring retrieved content. IIR studies research how to facilitate a more tailored and practical search experience, adapting to the evolving requirements and preferences of users. In this thesis, we focus on four distinct yet interrelated areas in the domain of IIR to have a better understanding of the interaction between the user and the information retrieval system.
