Open Search Symposium 2022

10-12 October 2022 at CERN

The Open Search Symposium series (#OSSYM) provides a forum to discuss and advance the ideas and concepts of Open Internet search in Europe. This year’s #OSSYM2022 takes place at CERN and online from 10-12 October 2022. Programme highlights include a keynote on Monday by Tomáš “Word2Vec” Mikolov; a track on Tuesday on alternative search engines with Raphael Auphan (the CEO of Qwant), Isabel Claus (founder of a B-to-B search engine), and Joseph Cullhead (of a Swedish nonprofit organization with a low-budget search engine); and a panel discussion on Wednesday about the ethics of search.

[Register now via CERN]

Open Web Search project kicked off

Today, we kick off our new EU project. In the project, we develop a new architecture for search engines in which many parts of the system are decentralized. The key idea is to separate index construction from the search engines themselves: the most expensive step, creating index shards, can be carried out on large clusters, while the search engine itself can be operated locally.
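The separation of index construction from searching can be illustrated with a toy sketch. This is not the project's architecture or code: the function names `build_shard` and `search`, the JSON shard format, and the whitespace tokenization are all assumptions made for illustration. One function stands in for the expensive step that would run on a cluster, the other for a search engine that only loads a pre-computed shard.

```python
import json
from collections import defaultdict


def build_shard(docs):
    """Stand-in for the expensive step: build an inverted index shard.

    `docs` maps a document id to its text. The shard is serialized to JSON
    so that it can be shipped to, and loaded by, a separate search process.
    (Hypothetical format, chosen only for this sketch.)
    """
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Naive whitespace tokenization; each term is indexed once per document.
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)
    return json.dumps(index, sort_keys=True)


def search(shard_json, query):
    """Stand-in for the local search engine: load a shard and answer a query.

    Returns the ids of documents that contain every query term.
    """
    index = json.loads(shard_json)
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


if __name__ == "__main__":
    docs = {
        "d1": "open web search",
        "d2": "open source software",
        "d3": "web crawl",
    }
    shard = build_shard(docs)        # would happen once, on a large cluster
    print(search(shard, "open web")) # would happen many times, locally
```

In this sketch the searcher never sees the raw documents, only the shard, which is the property that lets the two steps run on different machines.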

We also envision an Open-Web-Search Engine Hub, where companies and individuals can share their specifications of search engines and pre-computed, regularly updated search indices. We think of this as a search engine mash-up that would enable a new future of human-centric search without privacy concerns.

More information at:

BERT for Target Apps Selection

Analyzing the Diversity and Performance of BERT in Unified Mobile Search

by Negin Ghasemi, Mohammad Aliannejadi, and Djoerd Hiemstra

A unified mobile search framework aims to identify the mobile apps that can satisfy a user’s information need and route the user’s query to them. Previous work has shown that resource descriptions for mobile apps are sparse, as they rely on the app’s previous queries. This problem makes certain apps dominant and leaves resource-scarce apps out of the top ranks. In this case, we need a ranker that goes beyond simple lexical matching. Therefore, our goal is to study the extent of a BERT-based ranker’s ability to improve the quality and diversity of app selection. To this end, we compare the results of the BERT-based ranker with other information retrieval models, focusing on the diversity of the selected apps. Our analysis shows that the BERT-based ranker selects more diverse apps while improving the quality of baseline results, by selecting relevant apps such as Facebook and Contacts for more personal queries and by decreasing the bias towards dominant resources such as the Google Search app.

[More info]

A research agenda

Slow, content-based, federated, explainable, and fair

Access to information on the world wide web is dominated by monopolists (Google and Facebook) that decide most of the information we see. Their business models are based on “surveillance capitalism”, that is, profiting from getting to know as much as possible about the individuals who use the platforms. The information about individuals is used to maximize their engagement, thereby maximizing the number of targeted advertisements shown to them. Google’s and Facebook’s financial success has influenced many other online businesses as well as a substantial part of the academic research agenda in machine learning and information retrieval, which increasingly focuses on training on huge datasets, literally building on the success of Google and Facebook by using their pre-trained models (e.g. BERT and ELMo). Large pre-trained models and algorithms that maximize engagement come with many societal problems: they have been shown to discriminate against minority groups, to manipulate elections, to radicalize users, and even to enable genocide. Looking forward to 2021-2027, we aim to research the following technical alternatives that do not exhibit these problems: 1) slow, content-based learning that maximizes user satisfaction instead of fast, click-based learning that maximizes user engagement; 2) federated information access and search instead of centralized access and search; 3) explainable, fair approaches instead of black-box, biased approaches.

Reducing Misinformation in Query Autocompletions

by Djoerd Hiemstra

Query autocompletions help users of search engines to speed up their searches by recommending completions of partially typed queries in a drop-down box. These recommended query autocompletions are usually based on large logs of queries that were previously entered by the search engine’s users. Therefore, misinformation entered (either accidentally or purposely to manipulate the search engine) might end up in the search engine’s recommendations, potentially harming organizations, individuals, and groups of people. This paper proposes an alternative approach for generating query autocompletions by extracting anchor texts from a large web crawl, without the need to use query logs. Our evaluation shows that even though query log autocompletions perform better for shorter queries, anchor text autocompletions outperform query log autocompletions for queries of 2 words or more.
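As a rough illustration of the general idea of completing queries from anchor texts rather than query logs, here is a minimal sketch. It is not the paper's method: the frequency-based ranking, the prefix matching, and the names `build_completions` and `autocomplete` are assumptions for this example, and the anchor texts are made up.

```python
from collections import Counter


def build_completions(anchor_texts):
    """Count normalized anchor texts (lowercased, whitespace collapsed).

    The count is used as a simple popularity score; a real system would
    likely need filtering and a better scoring function.
    """
    return Counter(" ".join(text.lower().split()) for text in anchor_texts)


def autocomplete(counts, prefix, k=3):
    """Return up to k anchor texts that start with the typed prefix.

    Candidates are ordered by descending count, then alphabetically.
    """
    prefix = " ".join(prefix.lower().split())
    matches = [(text, n) for text, n in counts.items() if text.startswith(prefix)]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [text for text, _ in matches[:k]]


if __name__ == "__main__":
    # Hypothetical anchor texts, as if extracted from links in a web crawl.
    anchors = [
        "Open Search",
        "open search",
        "open source",
        "Open Street Map",
        "search engines",
    ]
    counts = build_completions(anchors)
    print(autocomplete(counts, "open s", k=2))  # ['open search', 'open source']
```

Because the candidates come from text that page authors wrote on links, no user query ever has to be logged to produce them.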

To be presented at the 2nd International Symposium on Open Search Technology (OSSYM 2020), 12-14 October 2020, CERN, Geneva, Switzerland.

[download pdf] [slides]

PhD candidate vacancy: Transfer Learning for Federated Search

We are looking for a PhD candidate to join the Data Science group at Radboud University for an exciting new project on transfer learning for language modelling with an application to federated search. Transfer learning learns general purpose language models from huge datasets, such as web crawls, and then trains the models further on smaller datasets for a specific task. Transfer learning in NLP has successfully used pre-trained word embeddings for several tasks. Although the success of word embeddings on search tasks has been limited, recently pre-trained general purpose language representations such as BERT and ELMo have been successful on several search tasks, including question answering tasks and conversational search tasks. Resource descriptions in federated search consist of samples of the full data that are sparser than full resource representations. This raises the question of how to infer vocabulary that is missing from the sampled data. A promising approach comes from transfer learning from pre-trained language representations. An open question is how to effectively and efficiently apply those pre-trained representations and how to adapt them to the domain of federated search. In this project, you will use pre-trained language models, and further train those models for a (federated) search task. You will evaluate the quality of those models as part of international evaluation conferences like the Text Retrieval Conference (TREC) and the Conference and Labs of the Evaluation Forum (CLEF).

[more information]

Whom to Follow on Mastodon?

Recommending Users: Whom to Follow on Federated Social Networks

by Jan Trienes, Andrés Torres Cano, and Djoerd Hiemstra

To foster an active and engaged community, social networks employ recommendation algorithms that filter large amounts of content and provide a user with personalized views of the network. Popular social networks such as Facebook and Twitter generate follow recommendations by listing profiles a user may be interested in connecting with. Federated social networks aim to resolve issues associated with the popular social networks, such as large-scale user surveillance and the misuse of user data to manipulate elections, by decentralizing authority and promoting privacy. Due to their recent emergence, recommender systems do not yet exist for federated social networks. To make these networks more attractive and promote community building, we investigate how recommendation algorithms can be applied to decentralized social networks. We present an offline and online evaluation of two recommendation strategies: a collaborative filtering recommender based on BM25 and a topology-based recommender using personalized PageRank. Our experiments on a large unbiased sample of the federated social network Mastodon show that collaborative filtering approaches outperform a topology-based approach, whereas both approaches significantly outperform a random recommender. A subsequent live user experiment on Mastodon using balanced interleaving shows that the collaborative filtering recommender performs on par with the topology-based recommender.
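To give a feel for the topology-based strategy, the following is a minimal sketch of personalized PageRank on a follow graph, computed by power iteration. It is not the implementation evaluated in the paper: the damping factor of 0.85, the fixed iteration count, the treatment of accounts that follow nobody, and the names `personalized_pagerank` and `recommend` are all assumptions for this example.

```python
def personalized_pagerank(follows, source, damping=0.85, iterations=50):
    """Power iteration for a random walk that always restarts at `source`.

    `follows` maps each account to the list of accounts it follows.
    Accounts that follow nobody send their rank back to `source`
    (an assumption of this sketch), so the total rank stays 1.
    """
    nodes = set(follows) | {v for targets in follows.values() for v in targets}
    nodes.add(source)
    rank = {node: 0.0 for node in nodes}
    rank[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in nodes}
        nxt[source] += 1.0 - damping  # restart probability
        for node in nodes:
            targets = follows.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                nxt[source] += damping * rank[node]
        rank = nxt
    return rank


def recommend(follows, source, k=2):
    """Suggest up to k accounts that `source` does not follow yet."""
    rank = personalized_pagerank(follows, source)
    known = set(follows.get(source, [])) | {source}
    candidates = [node for node in rank if node not in known]
    candidates.sort(key=lambda node: (-rank[node], node))
    return candidates[:k]


if __name__ == "__main__":
    # Hypothetical follow graph.
    follows = {
        "alice": ["bob", "carol"],
        "bob": ["dave"],
        "carol": ["dave", "erin"],
        "dave": ["alice"],
        "erin": ["alice"],
    }
    print(recommend(follows, "alice"))  # ['dave', 'erin']
```

In the example, dave ranks above erin because both accounts that alice follows point to dave, while only one points to erin.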

This paper will be presented at the 17th Dutch-Belgian Information Retrieval workshop in Leiden on 23 November 2018.

[download pdf]