The Open Web Index

Crawling and Indexing the Web for Public Use

by Gijs Hendriksen, Michael Dinzinger, Sheikh Mastura Farzana, Noor Afshan Fathima, Maik Fröbe, Sebastian Schmidt, Saber Zerhoudi, Michael Granitzer, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, and Benno Stein

Only a few search engines index the Web at scale. Third parties who want to develop downstream applications based on web search depend entirely on the terms and conditions of these few vendors. The public availability of the large-scale Common Crawl does not alleviate the situation, as it is often cheaper to crawl and index only a smaller collection focused on a downstream application scenario than to build and maintain an index for a general collection the size of the Common Crawl. We aim to improve this situation by developing the Open Web Index, a publicly funded basic infrastructure from which downstream applications will be able to select and compile custom indexes in a simple and transparent way. Our goal is to establish the Open Web Index, along with associated data products, as a new open web information intermediary. In this paper, we present our first prototype for the Open Web Index and our plans for future developments. In addition to the conceptual and technical background, we discuss how the information retrieval community can benefit from and contribute to the Open Web Index, for example by providing resources, by providing pre-processing components and pipelines, or by creating new kinds of vertical search engines and test collections.

To be presented at the European Conference on Information Retrieval (ECIR 2024) in Glasgow on 24-28 March.

[download pdf]

Weighted AUReC

Handling Skew in Shard Map Quality Estimation for Selective Search

by Gijs Hendriksen, Djoerd Hiemstra, and Arjen de Vries

In selective search, a document collection is partitioned into a collection of topical index shards. To efficiently estimate the topical coherence (or quality) of a shard map, the AUReC (Area Under Recall Curve) measure was introduced. AUReC makes the assumption that shards are of similar sizes, one that is violated in practice, even for unsupervised approaches. The problem might be amplified if supervised labelling approaches with skewed class distributions are used. To estimate the quality of such unbalanced shard maps, we introduce a weighted adaptation of the AUReC measure, and empirically evaluate its effectiveness using the ClueWeb09B and Gov2 datasets. We show that it closely matches the evaluations of the original AUReC when shards are similar in size, but captures better the differences in performance when shard sizes are skewed.
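The intuition behind a size-weighted recall curve can be sketched as follows. This is a rough illustration, not the paper's exact formulation: the function name `aurec`, the ordering of shards by relevant-document density, and the trapezoidal area computation are all assumptions made for this example. For one query, it takes the number of (pseudo-)relevant documents that fall in each shard and, optionally, the shard sizes; without sizes, every shard gets the same width on the x-axis.

```python
from typing import Optional, Sequence


def aurec(rel_counts: Sequence[int],
          shard_sizes: Optional[Sequence[int]] = None) -> float:
    """Area under the recall curve for a single query (illustrative sketch).

    rel_counts[i]  -- number of relevant documents located in shard i
    shard_sizes[i] -- number of documents in shard i (must be positive);
                      if omitted, all shards are treated as equally large.
    """
    n = len(rel_counts)
    total_rel = sum(rel_counts)
    if n == 0 or total_rel == 0:
        return 0.0
    if shard_sizes is None:
        shard_sizes = [1] * n  # unweighted case: each shard has width 1/n
    total_size = sum(shard_sizes)

    # Assumption: visit shards from highest to lowest density of relevant
    # documents. With equal sizes this is simply "most relevant documents first".
    order = sorted(range(n),
                   key=lambda i: rel_counts[i] / shard_sizes[i],
                   reverse=True)

    area = 0.0
    recall = 0.0
    for i in order:
        width = shard_sizes[i] / total_size   # share of the x-axis for shard i
        gain = rel_counts[i] / total_rel      # recall gained by searching shard i
        area += width * (recall + gain / 2)   # trapezoid under this segment
        recall += gain
    return area


# One huge shard holds all relevant documents; three small shards hold none.
counts = [10, 0, 0, 0]
sizes = [700, 100, 100, 100]
unweighted = aurec(counts)
weighted = aurec(counts, sizes)
print(unweighted, weighted)
```

In this toy case the unweighted score is 0.875, while the size-weighted score drops to roughly 0.65, because finding all relevant documents required searching 70% of the collection rather than "one shard out of four".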

To be presented at the European Conference on Information Retrieval (ECIR 2024) in Glasgow on 24-28 March.

[download pdf]

WOWS2024: Workshop on Open Web Search

Co-located with ECIR 2024 in Glasgow on 28 March 2024

The First International Workshop on Open Web Search (WOWS) aims to promote and discuss ideas and approaches to open up the web search ecosystem so that small research groups and young startups can leverage the web to foster an open and diverse search market. Therefore, the workshop has two calls that support collaborative and open web search engines:

  1. for scientific contributions, and
  2. for open-source implementations

The first call aims for scientific contributions to building collaborative search engines, including collaborative crawling, collaborative search engine deployment, collaborative search engine evaluation, and collaborative use of the web as a resource for researchers and innovators. The second call aims to gather open-source prototypes and gain practical experience with collaborative, cooperative evaluation of search engines and their components using the TIREx Information Retrieval Evaluation Platform.

Important Dates

  • January 24, 2024 (optional): Early-bird submissions of software and papers. Submitters receive early notification; accepted contributions get a free WOWS T-shirt
  • February 14, 2024: Submission deadline for software and papers
  • March 13, 2024: Peer review notification
  • March 20, 2024: Camera-ready papers submission
  • March 28, 2024: Workshop (co-located with ECIR 2024 in Glasgow)

More information at: https://opensearchfoundation.org/wows2024/

Challenges of index exchange for search engine interoperability

by Djoerd Hiemstra, Gijs Hendriksen, Chris Kamphuis, and Arjen de Vries

We discuss tokenization challenges that arise when sharing inverted file indexes to support interoperability between search engines, in particular: How to tokenize queries such that the tokens are consistent with the tokens in the shared index? We discuss various solutions and present preliminary experimental results that show when the problem occurs and how it can be mitigated by standardizing on a simple, generic tokenizer for all shared indexes.
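The mismatch can be illustrated with a toy inverted index. This is a minimal sketch under stated assumptions, not the experimental setup of the paper: the two tokenizers, the helper names `build_index` and `search`, and the example documents are all hypothetical. The index is built with a simple, generic tokenizer (lowercase, split on non-alphanumeric characters), and the same query is then tokenized once with a different tokenizer and once with the one used for indexing.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Set

Tokenizer = Callable[[str], List[str]]


def simple_tokenize(text: str) -> List[str]:
    """Generic tokenizer: lowercase, then keep runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def whitespace_tokenize(text: str) -> List[str]:
    """A different tokenizer: split on whitespace, keep case and punctuation."""
    return text.split()


def build_index(docs: Dict[int, str], tokenize: Tokenizer) -> Dict[str, Set[int]]:
    """Build a toy inverted index mapping each token to the set of document ids."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str, tokenize: Tokenizer) -> Set[int]:
    """Return the documents that contain every query token (conjunctive match)."""
    postings = [index.get(token, set()) for token in tokenize(query)]
    return set.intersection(*postings) if postings else set()


# Hypothetical example documents.
docs = {1: "Open-Source search engines", 2: "Web search at CERN"}
index = build_index(docs, simple_tokenize)  # the "shared" index

mismatched = search(index, "Open-Source", whitespace_tokenize)
consistent = search(index, "Open-Source", simple_tokenize)
print(mismatched, consistent)
```

With the whitespace tokenizer the query token `Open-Source` does not occur in the index, so nothing is found; with the indexing tokenizer the query becomes `open` and `source` and document 1 is retrieved.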

To be presented at the 5th International Open Search Symposium #OSSYM2023 at CERN, Geneva, Switzerland on 4-6 October 2023

[download pdf]

Impact and development of an Open Web Index for open web search

by Michael Granitzer, Stefan Voigt, Noor Afshan Fathima, Martin Golasowski, Christian Guetl, Tobias Hecking, Gijs Hendriksen, Djoerd Hiemstra, Jan Martinovič, Jelena Mitrović, Izidor Mlakar, Stavros Moiras, Alexander Nussbaumer, Per Öster, Martin Potthast, Marjana Senčar Srdič, Sharikadze Megi, Kateřina Slaninová, Benno Stein, Arjen P. de Vries, Vít Vondrák, Andreas Wagner, Saber Zerhoudi

Web search is a crucial technology for the digital economy. However, it is dominated by a few gatekeepers focused on commercial success, so web publishers have to optimize their content for these gatekeepers, resulting in a closed ecosystem of search engines as well as the risk of publishers sacrificing quality. To encourage an open search ecosystem and offer users genuine choice among alternative search engines, we propose the development of an Open Web Index (OWI). We outline six core principles for developing and maintaining an open index, based on open data principles, legal compliance, and collaborative technology development. The combination of an open index with what we call declarative search engines will facilitate the development of vertical search engines and innovative web data products (including, e.g., large language models), enabling a fair and open information space. This framework underpins the EU-funded project OpenWebSearch.EU, marking the first step towards realizing an Open Web Index.

Published in the Journal of the Association for Information Science and Technology (JASIST)

[download pdf]

Fausto de Lang graduates on tokenization for information retrieval

An empirical study of the effect of vocabulary size for various tokenization strategies in passage retrieval performance.

by Fausto de Lang

Many interactions between the fields of lexical retrieval and large language models remain underexplored. In particular, there is little research into the use of advanced language model tokenizers in combination with classical information retrieval mechanisms. This research looks into the effect of vocabulary size for various tokenization strategies in passage retrieval performance. It also provides an overview of the impact of the WordPiece, Byte-Pair Encoding and Unigram tokenization techniques on the MSMARCO passage retrieval task. These techniques are explored both in re-trained tokenizers and in tokenizers trained from scratch. Based on three metrics, this research finds that WordPiece is the best-performing tokenization technique on the MSMARCO passage retrieval task. It also finds that a training vocabulary size of around 10,000 tokens is best with regard to Recall, while around 320,000 tokens yields the optimal Mean Reciprocal Rank and Normalized Discounted Cumulative Gain scores. Most importantly, the optimum at a relatively small vocabulary size suggests that shorter subwords can benefit the indexing and searching process (up to a certain point). This is a meaningful result, since it means that many applications in which (re-)trained tokenizers are used in an information retrieval capacity might be improved by tweaking the vocabulary size during training. This research has mainly focused on building a bridge between (re-)trainable tokenizers and information retrieval software, while reporting on interesting tunable parameters. Finally, this research recommends that researchers build their own tokenizer from scratch, since it forces one to look at the configuration of the underlying processing steps.
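For readers unfamiliar with the three metrics named above, the following sketch shows how they are commonly computed for a single query with binary relevance. It is not taken from the thesis: the function names, the cutoff values, and the example ranking are assumptions chosen for illustration, and the thesis may use different cutoffs or graded relevance. Averaging `reciprocal_rank` over all queries gives the Mean Reciprocal Rank.

```python
import math
from typing import Sequence, Set


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """1 / rank of the first relevant document in the top k, or 0 if there is none."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 1000) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Normalized Discounted Cumulative Gain with binary gains (1 if relevant)."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    # Ideal ranking: all relevant documents (at most k) placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical ranking for one query; only passage "d1" is relevant.
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}
print(reciprocal_rank(ranked, relevant),
      recall_at_k(ranked, relevant),
      ndcg_at_k(ranked, relevant))
```

Recall only asks whether the relevant passage was retrieved at all, whereas the other two metrics reward placing it near the top of the ranking, so it is plausible for different settings to be optimal for different metrics.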

Defended on 27 June 2023

Git repository at: gitlab.com/tokenization/Lucene

UNFair: Search Engine Manipulation, Undetectable by Amortized Inequity

by Tim de Jonge and Djoerd Hiemstra

Modern society increasingly relies on Information Retrieval systems to answer various information needs. Since this impacts society in many ways, there has been a great deal of work to ensure the fairness of these systems, and to prevent societal harms. There is a prevalent risk of failing to model the entire system, where nefarious actors can produce harm outside the scope of fairness metrics. We demonstrate the practical possibility of this risk through UNFair, a ranking system that achieves performance and measured fairness competitive with the current state of the art, while simultaneously being manipulative in setup. UNFair demonstrates how adhering to a fairness metric, Amortized Equity, can be insufficient to prevent Search Engine Manipulation. This possibility of manipulation bypassing a fairness metric discourages imposing a fairness metric ahead of time, and motivates instead a more holistic approach to fairness assessments.

To be presented at the ACM Conference on Fairness, Accountability, and Transparency (FAccT 2023) on 12-15 June in Chicago, USA.

[download pdf]

Cross-Market Product-Related Question Answering

by Negin Ghasemi, Mohammad Aliannejadi, Hamed Bonab, Evangelos Kanoulas, Arjen de Vries, James Allan, and Djoerd Hiemstra

Online shops such as Amazon, eBay, and Etsy continue to expand their presence in multiple countries, creating new resource-scarce marketplaces with thousands of items. We consider a marketplace to be resource-scarce when only limited user-generated data is available about the products (e.g., ratings, reviews, and product-related questions). In such a marketplace, an information retrieval system is less likely to help users find answers to their questions about the products. As a result, questions posted online may go unanswered for extended periods. This study investigates the impact of using available data in a resource-rich marketplace to answer new questions in a resource-scarce marketplace, a new problem we call cross-market question answering. To study this problem’s potential impact, we collect and annotate a new dataset, XMarket-QA, from Amazon’s UK (resource-scarce) and US (resource-rich) local marketplaces. We conduct a data analysis to understand the scope of the cross-market question-answering task. This analysis shows a temporal gap of almost one year between the first question answered in the UK marketplace and the US marketplace. Also, it shows that the first question about a product is posted in the UK marketplace only when 28 questions, on average, have already been answered about the same product in the US marketplace. Human annotations demonstrate that, on average, 65% of the questions in the UK marketplace can be answered within the US marketplace, supporting the concept of cross-market question answering. Inspired by these findings, we develop a new method, CMJim, which utilizes product similarities across marketplaces in the training phase for retrieving answers from the resource-rich marketplace that can be used to answer a question in the resource-scarce marketplace. Our evaluations show CMJim’s significant improvement compared to competitive baselines.

To be presented at the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2023) on July 23-27 in Taipei, Taiwan.

[download pdf]