Team OpenWebSearch at CLEF 2024

LongEval

by Daria Alexander, Maik Fröbe, Gijs Hendriksen, Ferdinand Schlatt, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, and Arjen de Vries

We describe the OpenWebSearch group’s participation in the CLEF 2024 LongEval IR track. Our submitted runs explore how relevance information from historical click logs can be transferred to future retrieval systems. We incorporate this information into the query reformulation process via keyqueries and into the indexing process via a reverted index, and ultimately combine both in learning-to-rank pipelines to ensure that retrieval also works for novel queries that were not seen before. Our evaluation shows that keyqueries substantially outperform other approaches for queries with historical click data available.
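The reverted-index idea can be illustrated with a minimal sketch. Everything here is illustrative and not the paper’s actual pipeline: the toy click log, the term-overlap scoring (a stand-in for a proper retrieval model), and all names are assumptions. The core intuition is that each document is represented by the queries that historically led to clicks on it, so a new query can match documents through past queries.

```python
from collections import defaultdict

# Hypothetical click log: (query, clicked document id) pairs.
click_log = [
    ("cheap flights paris", "d1"),
    ("flights to paris", "d1"),
    ("paris hotels", "d2"),
    ("budget hotels paris", "d2"),
]

# Reverted index: each document is represented by the terms of the
# queries that led to clicks on it.
reverted_index = defaultdict(list)
for query, doc in click_log:
    reverted_index[doc].extend(query.split())

def score(query, doc):
    """Term-overlap score between a new query and a document's
    historical query representation (a toy stand-in for BM25 etc.)."""
    terms = reverted_index[doc]
    return sum(terms.count(t) for t in query.split())

# A previously unseen query still finds d1 via its past queries.
ranking = sorted(reverted_index, key=lambda d: score("paris flights", d),
                 reverse=True)
```

In the submitted runs, such historical evidence is one feature among others in a learning-to-rank pipeline, which is what keeps retrieval possible for queries without any click history.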

To be presented at CLEF 2024: Conference and Labs of the Evaluation Forum on 9-12 September in Grenoble, France.

[download pdf]

Tom Rust graduates on Learned Sparse Retrieval

by Tom Rust

Machine learning algorithms achieve better results every day and are gaining popularity. The top-performing models are usually deep learning models, which can absorb vast amounts of training data to improve their predictions. Unfortunately, these models also consume large amounts of energy, something not everyone is aware of. In information retrieval, large language models are used to provide extra context to queries and documents. Since information retrieval systems typically deal with large datasets, a suitable deep learning model must be chosen to balance accuracy against energy usage. Learned sparse retrieval models are one example of such deep learning models. They expand every document to create an optimal document representation that allows the document to be found correctly. This expansion happens before the inverted index is created, so conventional ranking methods such as BM25 can still be used. In this research, we compare different learned sparse retrieval models in terms of accuracy, speed, index size, and energy usage, and we also compare them with a full-text index. On MS MARCO, the learned sparse retrievers outperform the full-text index on all popular evaluation benchmarks. However, creating the index with learned sparse retrievers can consume up to 100 times more energy, and the resulting index has a higher query latency and uses more disk space. On WT10g, the full-text index achieves the highest accuracy while also being more energy efficient, using less disk space, and having a lower query latency.
We conclude that learned sparse retrieval has the potential to improve accuracy on certain datasets, but this comes at the cost of increased storage, latency, and energy consumption.
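The expand-then-index approach described above can be sketched in a few lines. This is a toy illustration, not any specific model from the thesis: the term weights are made up, standing in for what a learned sparse model (such as SPLADE-style expansion) would produce. The point is that scoring reduces to a sparse dot product over an ordinary inverted index.

```python
# Documents expanded into sparse term-weight vectors; note the expansion
# term "sun" that does not occur literally in the document. All weights
# are invented for illustration.
doc_vectors = {
    "d1": {"solar": 1.8, "energy": 1.2, "sun": 0.9},
    "d2": {"wind": 1.5, "energy": 1.1, "turbine": 0.7},
}

# Build a conventional inverted index: term -> list of (doc id, weight).
inverted_index = {}
for doc, vec in doc_vectors.items():
    for term, weight in vec.items():
        inverted_index.setdefault(term, []).append((doc, weight))

def retrieve(query_weights):
    """Score documents by the dot product of query and document weights,
    traversing only the postings of the query's nonzero terms."""
    scores = {}
    for term, q_w in query_weights.items():
        for doc, d_w in inverted_index.get(term, []):
            scores[doc] = scores.get(doc, 0.0) + q_w * d_w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

results = retrieve({"sun": 1.0, "energy": 0.5})
```

Because the learned weights live in a standard inverted index, the same efficient query processing used for BM25 applies; the extra cost reported in the thesis comes from running the expansion model at indexing time and from the denser postings it produces.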

Proceedings of WOWS 2024

The Proceedings of the first Workshop on Open Web Search (WOWS), which took place on 28 March 2024 in Glasgow, UK, are now published in the CEUR Workshop Series as Volume 3689.

WOWS 2024 had two calls for contributions. The first call targeted scientific contributions on cooperative search engine development, including cooperative crawling of the web and cooperative deployment and evaluation of search engines. We specifically highlighted the potential of enabling public and commercial organizations to use an indexed web crawl as a resource to create innovative search engines tailored to specific user groups, instead of relying on a single search engine provider. The second call aimed at gaining practical experience with joint, cooperative evaluation of search engine prototypes and their components using the Information Retrieval Experiment Platform TIREx. The workshop featured a keynote by Negar Arabzadeh from the University of Waterloo, 8 paper presentations (5 full papers and 3 short papers accepted out of 13 submissions), and a breakout session with participant discussions. WOWS received funding from the European Union’s Horizon Europe research and innovation program under grant agreement No 101070014. We would like to thank the Program Committee members for their helpful reviews and suggestions to improve the contributions to the workshop. Special thanks go to Christine Plote, Managing Director of the Open Search Foundation, for the WOWS 2024 website.

https://ceur-ws.org/Vol-3689/

[download pdf]

The Open Web Index

Crawling and Indexing the Web for Public Use

by Gijs Hendriksen, Michael Dinzinger, Sheikh Mastura Farzana, Noor Afshan Fathima, Maik Fröbe, Sebastian Schmidt, Saber Zerhoudi, Michael Granitzer, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, and Benno Stein

Only a few search engines index the Web at scale. Third parties who want to develop downstream applications based on web search fully depend on the terms and conditions of these few vendors. The public availability of the large-scale Common Crawl does not alleviate the situation, as it is often cheaper to crawl and index a smaller collection focused on a downstream application scenario than to build and maintain an index for a general collection the size of the Common Crawl. Our goal is to improve this situation by developing the Open Web Index. The Open Web Index is a publicly funded basic infrastructure from which downstream applications will be able to select and compile custom indexes in a simple and transparent way. Our goal is to establish the Open Web Index along with associated data products as a new open web information intermediary. In this paper, we present our first prototype for the Open Web Index and our plans for future developments. In addition to the conceptual and technical background, we discuss how the information retrieval community can benefit from and contribute to the Open Web Index – for example, by providing resources, by providing pre-processing components and pipelines, or by creating new kinds of vertical search engines and test collections.

To be presented at the European Conference on Information Retrieval (ECIR 2024) in Glasgow on 24-28 March.

[download pdf]

Weighted AUReC

Handling Skew in Shard Map Quality Estimation for Selective Search

by Gijs Hendriksen, Djoerd Hiemstra, and Arjen de Vries

In selective search, a document collection is partitioned into a collection of topical index shards. To efficiently estimate the topical coherence (or quality) of a shard map, the AUReC (Area Under Recall Curve) measure was introduced. AUReC assumes that shards are of similar sizes, an assumption that is violated in practice, even for unsupervised approaches. The problem might be amplified if supervised labelling approaches with skewed class distributions are used. To estimate the quality of such unbalanced shard maps, we introduce a weighted adaptation of the AUReC measure, and empirically evaluate its effectiveness using the ClueWeb09B and Gov2 datasets. We show that it closely matches the evaluations of the original AUReC when shards are similar in size, but better captures the differences in performance when shard sizes are skewed.
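The flavor of the weighting can be shown with a rough sketch. This is not the paper’s exact formula: the shard ordering, the trapezoidal area, and the example numbers are all assumptions made for illustration. The key idea is that the recall curve’s x-axis advances by each shard’s relative size rather than by a uniform step per shard, so a large shard no longer counts the same as a tiny one.

```python
def weighted_aurec(rel_per_shard, shard_sizes):
    """Illustrative AUReC-style score: visit shards from most to fewest
    relevant documents, trace the recall curve, and return the area
    under it. The weighted variant steps the x-axis by relative shard
    size instead of by 1/number_of_shards."""
    total_rel = sum(rel_per_shard)
    total_size = sum(shard_sizes)
    order = sorted(range(len(rel_per_shard)),
                   key=lambda i: rel_per_shard[i], reverse=True)
    area, recall = 0.0, 0.0
    for i in order:
        step = shard_sizes[i] / total_size        # weighted x-axis step
        recall_after = recall + rel_per_shard[i] / total_rel
        area += step * (recall + recall_after) / 2  # trapezoidal rule
        recall = recall_after
    return area

# Uniform shard sizes behave like an unweighted measure; concentrating
# the relevant documents in one oversized shard changes the score.
uniform = weighted_aurec([3, 1, 0], [1, 1, 1])
skewed = weighted_aurec([3, 1, 0], [10, 1, 1])
```

With equal shard sizes the weighted steps reduce to uniform ones, which mirrors the reported finding that the weighted measure closely matches the original AUReC when shards are similar in size.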

To be presented at the European Conference on Information Retrieval (ECIR) in Glasgow on 24-28 March.

[download pdf]

Inaugural lecture on 1 March

Invitation

On 1 March 2024 at 15:45h, I will give my inaugural lecture: “Zoekmachines: Samen en duurzaam vooruit” (in Dutch; “Search engines: Forward together, sustainably”). Everyone is invited. Please register at: https://www.ru.nl/rede/hiemstra

In the lecture, I will share an ancient wisdom about working together; I will discuss my plan to teach students of all backgrounds their shared history; and I will reveal my dream of providing unrestricted access to all human information by working together. The lecture will feature cars, iPhone chargers, the Space Shuttle, and references to exciting recent research.

[download pdf]

WOWS2024: Workshop on Open Web Search

Co-located with ECIR 2024 in Glasgow on 28 March 2024

The First International Workshop on Open Web Search (WOWS) aims to promote and discuss ideas and approaches to open up the web search ecosystem so that small research groups and young startups can leverage the web to foster an open and diverse search market. To this end, the workshop has two calls that support collaborative and open web search engines:

  1. for scientific contributions, and
  2. for open-source implementations

The first call aims for scientific contributions to building collaborative search engines, including collaborative crawling, collaborative search engine deployment, collaborative search engine evaluation, and collaborative use of the web as a resource for researchers and innovators. The second call aims to gather open-source prototypes and gain practical experience with collaborative, cooperative evaluation of search engines and their components using the Information Retrieval Experiment Platform TIREx.

Important Dates

  • January 24, 2024 (optional): Early bird submission of software and papers. Early bird submitters receive early notifications, and accepted contributions get a free WOWS T-shirt
  • February 14, 2024: Deadline for submission of software and papers
  • March 13, 2024: Peer review notification
  • March 20, 2024: Camera-ready paper submission
  • March 28, 2024: Workshop (co-located with ECIR 2024 in Glasgow)

More information at: https://opensearchfoundation.org/wows2024/

Challenges of index exchange for search engine interoperability

by Djoerd Hiemstra, Gijs Hendriksen, Chris Kamphuis, and Arjen de Vries

We discuss tokenization challenges that arise when sharing inverted file indexes to support interoperability between search engines, in particular: how to tokenize queries such that the tokens are consistent with the tokens in the shared index? We discuss various solutions and present preliminary experimental results that show when the problem occurs and how it can be mitigated by standardizing on a simple, generic tokenizer for all shared indexes.
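The interoperability problem can be demonstrated with a small sketch. The two tokenizers below are illustrative stand-ins (not the ones evaluated in the paper): when the shared index and the query are tokenized differently, matching tokens simply never line up, and standardizing both sides on one simple, generic tokenizer restores consistent matching.

```python
import re

# Tokenizer A: naive whitespace splitting, keeps punctuation and case.
def whitespace_tokenize(text):
    return text.split()

# Tokenizer B: a simple, generic tokenizer that lowercases and keeps
# only alphanumeric runs (the kind of standard the paper argues for).
def generic_tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

doc = "Open-source search, at CERN."

# Shared index built by one engine with tokenizer A ...
index = set(whitespace_tokenize(doc))
# ... queried by another engine that uses tokenizer B: no token matches.
query = "cern search"
hits_mismatched = [t for t in generic_tokenize(query) if t in index]

# Rebuilding the shared index with the same generic tokenizer makes the
# query tokens consistent with the index tokens again.
index_std = set(generic_tokenize(doc))
hits_consistent = [t for t in generic_tokenize(query) if t in index_std]
```

The mismatched case returns no hits even though the document is clearly relevant, which is exactly why a shared index is only as interoperable as the tokenizer conventions that accompany it.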

To be presented at the 5th International Open Search Symposium #OSSYM2023 at CERN, Geneva, Switzerland on 4-6 October 2023.

[download pdf]