We are looking for a PhD candidate to join the Data Science group at Radboud University for an exciting new project on transfer learning for language modelling with an application to federated search. Transfer learning learns general-purpose language models from huge datasets, such as web crawls, and then trains the models further on smaller datasets for a specific task. Transfer learning in NLP has successfully used pre-trained word embeddings for several tasks. Although the success of word embeddings on search tasks has been limited, pre-trained general-purpose language representations such as BERT and ELMo have recently been successful on several search tasks, including question answering and conversational search. Resource descriptions in federated search consist of samples of the full data that are sparser than full resource representations. This raises the question of how to infer vocabulary that is missing from the sampled data. A promising approach comes from transfer learning from pre-trained language representations. An open question is how to apply those pre-trained representations effectively and efficiently, and how to adapt them to the domain of federated search. In this project, you will use pre-trained language models and further train those models for a (federated) search task. You will evaluate the quality of those models as part of international evaluation conferences like the Text Retrieval Conference (TREC) and the Conference and Labs of the Evaluation Forum (CLEF).
Recommending Users: Whom to Follow on Federated Social Networks
by Jan Trienes, Andrés Torres Cano, and Djoerd Hiemstra
To foster an active and engaged community, social networks employ recommendation algorithms that filter large amounts of content and provide a user with personalized views of the network. Popular social networks such as Facebook and Twitter generate follow recommendations by listing profiles a user may be interested in connecting with. Federated social networks aim to resolve issues associated with the popular social networks – such as large-scale user surveillance and the misuse of user data to manipulate elections – by decentralizing authority and promoting privacy. Due to their recent emergence, recommender systems do not yet exist for federated social networks. To make these networks more attractive and promote community building, we investigate how recommendation algorithms can be applied to decentralized social networks. We present an offline and online evaluation of two recommendation strategies: a collaborative filtering recommender based on BM25 and a topology-based recommender using personalized PageRank. Our experiments on a large unbiased sample of the federated social network Mastodon show that collaborative filtering approaches outperform a topology-based approach, whereas both approaches significantly outperform a random recommender. A subsequent live user experiment on Mastodon using balanced interleaving shows that the collaborative filtering recommender performs on par with the topology-based recommender.
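To make the topology-based strategy concrete, here is a minimal sketch of personalized PageRank computed by power iteration on a small follow graph. It is an illustration only, not the authors' implementation: the function names, the damping factor of 0.85, the iteration count, and the toy graph are all assumptions.

```python
def personalized_pagerank(follows, source, damping=0.85, iterations=50):
    """Power iteration for PageRank with all teleport mass on `source`.

    `follows` maps each user to the collection of users they follow
    (hypothetical data layout, chosen for this sketch).
    """
    nodes = set(follows)
    for targets in follows.values():
        nodes.update(targets)

    # Start with all probability mass on the user we recommend for.
    rank = {node: 0.0 for node in nodes}
    rank[source] = 1.0

    for _ in range(iterations):
        nxt = {node: 0.0 for node in nodes}
        for node in nodes:
            targets = follows.get(node, ())
            if targets:
                # Spread the damped mass evenly over followed accounts.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    nxt[target] += share
            else:
                # Dangling node: send its damped mass back to the source.
                nxt[source] += damping * rank[node]
        # Teleport step: the random walk restarts at the source user.
        nxt[source] += 1.0 - damping
        rank = nxt
    return rank


def recommend(follows, source, k=3):
    """Return the k highest-scoring accounts that `source` does not follow yet."""
    scores = personalized_pagerank(follows, source)
    known = set(follows.get(source, ())) | {source}
    candidates = [(score, node) for node, score in scores.items() if node not in known]
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [node for _, node in candidates[:k]]


# Toy follow graph (made up for illustration).
toy_follows = {
    "alice": ["bob"],
    "bob": ["carol", "dave"],
    "carol": ["dave"],
    "dave": ["alice"],
    "erin": ["alice"],
}
print(recommend(toy_follows, "alice", k=2))
```

Because the walk always restarts at the source user, accounts that are close in the follow graph score highest, which is what makes the ranking personal rather than global.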
This paper will be presented at the 17th Dutch-Belgian Information Retrieval workshop in Leiden on 23 November 2018
by Dong Nguyen, Thomas Demeester, Dolf Trieschnigg, and Djoerd Hiemstra
A publicly available dataset for federated search reflecting a real web environment has long been absent, making it difficult for researchers to test the validity of their federated search algorithms for the web setting. We present several experiments and analyses on resource selection on the web using a recently released test collection containing the results from more than a hundred real search engines, ranging from large general web search engines such as Google, Bing and Yahoo to small domain-specific engines.
First, we experiment with estimating the size of uncooperative search engines on the web using query based sampling and propose a new method using the ClueWeb09 dataset. We find the size estimates to be highly effective in resource selection. Second, we show that an optimized federated search system based on smaller web search engines can be an alternative to a system using large web search engines. Third, we provide an empirical comparison of several popular resource selection methods and find that these methods are not readily suitable for resource selection on the web. Challenges include the sparse resource descriptions and extremely skewed sizes of collections.
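As a point of reference for size estimation from samples, the sketch below uses the classic capture-recapture (Lincoln-Petersen) estimator on two document samples obtained from the same engine. This is a well-known baseline and is swapped in here for illustration; it is not the new ClueWeb09-based method proposed in the paper, and the function name and the synthetic document ids are assumptions.

```python
def capture_recapture_estimate(sample_a, sample_b):
    """Lincoln-Petersen estimate of collection size from two samples.

    Each sample is an iterable of document identifiers obtained by
    sending probe queries to the same engine (hypothetical inputs).
    """
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)
    if overlap == 0:
        # Without shared documents the estimator is undefined.
        raise ValueError("samples do not overlap; draw larger samples")
    # If the samples are independent, overlap / |b| approximates |a| / N.
    return len(a) * len(b) / overlap


# Synthetic example: two samples of 100 ids sharing 20 documents.
first_sample = range(0, 100)
second_sample = range(80, 180)
print(capture_recapture_estimate(first_sample, second_sample))
```

The estimator assumes both samples are drawn independently and uniformly, which query-based samples from a real engine generally are not, so estimates of this kind tend to be biased.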
Predicting relevance based on assessor disagreement: analysis and practical applications for search evaluation
by Thomas Demeester, Robin Aly, Djoerd Hiemstra, Dong Nguyen, and Chris Develder
Evaluation of search engines relies on assessments of search results for selected test queries, from which we would ideally like to draw conclusions in terms of relevance of the results for general (e.g., future, unknown) users. In practice, however, most evaluation scenarios only allow us to conclusively determine the relevance towards the particular assessor that provided the judgments. A factor that cannot be ignored when extending conclusions made from assessors towards users is the possible disagreement on relevance, assuming that a single gold truth label does not exist. This paper presents and analyzes the predicted relevance model (PRM), which allows predicting a particular result’s relevance for a random user, based on an observed assessment and knowledge of the average disagreement between assessors. With the PRM, existing evaluation metrics designed to measure binary assessor relevance can be transformed into more robust and effectively graded measures that evaluate relevance towards a random user. It also leads to a principled way of quantifying multiple graded or categorical relevance levels for use as gains in established graded relevance measures, such as normalized discounted cumulative gain, which nowadays often use heuristic and data-independent gain values. Given a set of test topics with graded relevance judgments, the PRM allows evaluating systems on different scenarios, such as their capability of retrieving top results, or how well they are able to filter out non-relevant ones. Its use in actual evaluation scenarios is illustrated on several information retrieval test collections.
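The idea of deriving gain values from assessor disagreement can be sketched as follows. The code is loosely inspired by the abstract and is not the paper's model: it assumes a set of double-judged results, treats a second opinion at or above a chosen grade as "relevant", and uses the resulting fractions as gains in nDCG. All function names, the threshold, and the toy judgments are assumptions.

```python
import math
from collections import defaultdict


def estimate_gains(double_judgments, relevant_from=1):
    """Map each observed grade to the fraction of second opinions that were relevant.

    `double_judgments` is a list of (grade_a, grade_b) pairs for results
    judged by two assessors (hypothetical input format).
    """
    counts = defaultdict(lambda: [0, 0])  # grade -> [relevant others, total others]
    for first, second in double_judgments:
        # Count both directions: either assessor could be the observed one.
        for observed, other in ((first, second), (second, first)):
            counts[observed][0] += int(other >= relevant_from)
            counts[observed][1] += 1
    return {grade: rel / total for grade, (rel, total) in counts.items()}


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_grades, gain_of, k=10):
    """nDCG@k, using the ideal reordering of the given list as the reference."""
    gains = [gain_of[grade] for grade in ranked_grades]
    best = dcg(sorted(gains, reverse=True)[:k])
    return dcg(gains[:k]) / best if best > 0 else 0.0


# Toy double judgments on a 0-2 scale (made up for illustration).
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 2), (0, 0)]
gains = estimate_gains(pairs)
print(gains)
print(ndcg([0, 1, 2], gains))
```

Compared with fixed heuristic gains, the values here come from the judgments themselves, so a grade that assessors often disagree on contributes less than its nominal level would suggest.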
To be published in Information Retrieval Journal by Springer
Presenting the New Test Collection for Federated Web Search
by Thomas Demeester (Ghent University), Dolf Trieschnigg, Ke Zhou (Yahoo!), Dong Nguyen, and Djoerd Hiemstra
This paper presents FedWeb Greatest Hits, a large new test collection for research in web information retrieval. As a combination and extension of the datasets used in the TREC Federated Web Search Track, this collection opens up new research possibilities on federated web search challenges, as well as on various other problems.
The paper will be presented at the 24th International World Wide Web Conference (WWW 2015) in Florence, Italy on 18-22 May 2015.
To obtain the dataset go to: http://fedwebgh.intec.ugent.be.
by Sebastiaan Vercammen
Distributed search introduces problems with resources that require time to process queries and produce results, and users waiting to get an answer to their query. The system could wait a maximum amount of time for every resource to produce its results or start displaying results the very moment they are retrieved by the distributed search engine. This thesis introduces a number of alternative display strategies and describes a method to research their effectiveness in providing the most relevant results, as quickly and as high in the combined results as possible, while maintaining a user-friendly search experience. It then continues by describing the performed research and its results. For each experiment, test participants are asked a number of questions, to describe their experience operating the search engine using the specific display strategy. Also recorded are statistics concerning test participants’ clicks. These metrics are combined with the answers to the user questions and also used for determining the best display strategy. Observations were made of aspects that seemed to have influenced the experiment, such as the red color of the notifications used for one of the display strategies. The precise influence of these aspects should be further studied, by using A/B testing, as proposed in section 7.2. Finally, the conclusion is drawn that the Screen fill with “endless” scrolling display strategy (section 3.3.4) performed best when taking the test participants’ answers into account.
Federated Aggregated Search
by Andrés Marenco Zúñiga
The traditional search engine paradigm has changed from retrieving simple text documents to selecting a broader combination of diverse document types (e.g. images, videos, maps) that could satisfy the user’s information need. Each type of document, stored in specialized databases known as ‘verticals’ and found in either local or federated locations, is nowadays integrated into ‘aggregated search engines’. Because each vertical covers a specific domain, when a query enters the system only the verticals most likely to contain the desired information should be selected. To perform this selection, a text representation of each vertical is created by directly sampling a set of documents from the vertical’s search engine. However, the vertical representation is often not descriptive enough. Factors such as the heterogeneous nature of the documents or the lack of cooperation of the vertical can negatively affect the generation of the representation. Thus, we focus on the problem of creating an aggregated search engine which integrates federated collections in an uncooperative environment. With the help of Wikipedia as a complementary external source of information, we investigate the use of three techniques found in the literature that aim to enrich the vertical representation: a) using only Wikipedia articles as representation; b) using a combination of Wikipedia articles and the sample obtained from the vertical; and c) expanding the contents of each sampled document. We found that applying latent Dirichlet allocation to model the hidden topics of documents sampled directly from each vertical makes it possible to identify Wikipedia articles with the same theme coverage as the vertical. We then demonstrate that using only Wikipedia articles to represent some particular verticals improves the selection task.
As a second point, we explored the use of the modelled topics together with Wikipedia categories to boost the score of the verticals that could be associated with the query string. Although in this case our results are inconclusive, the experiments suggest that by applying query classification and then matching obtained categories with the verticals' categories it is possible to increase the effectiveness of the vertical selection task.
We organize a workshop on Heterogeneous Information Access hosted by the 8th International Conference on Web Search and Data Mining on 6 February 2015 in Shanghai, China
Invited speakers: Mounia Lalmas (Yahoo) and Milad Shokouhi (Microsoft Research)
Information access is becoming increasingly heterogeneous. Especially when the user's information need is exploratory, returning a set of diverse results from different resources could benefit the user. For example, when a user is planning a trip to China on the Web, retrieving and presenting results from vertical search engines like travel, flight information, map and Q&A sites could satisfy the user's rich and diverse information need. This heterogeneous search aggregation paradigm is useful in many contexts and brings many new challenges.
Aggregated search and composite retrieval are two instances of this new heterogeneous information access paradigm. They are applied on the Web with heterogeneous vertical search engines. This paradigm can be useful in many other scenarios: a user aims to re-find comprehensive information about his query in his personal search (emails, slides); or a user searches and gathers different nuggets of information (e.g. an entity) from a set of RDF Web datasets (e.g., DBpedia, IMDB, etc.); or a user searches a set of different files (e.g., images, documents) in a peer-to-peer online file sharing system.
This is an emerging area as different services provided are becoming more heterogeneous and complex. Therefore, there are a number of directions that might be interesting for the research and industrial community. How to select the most relevant resources and present them concisely in order to best satisfy the user? How to model the complex user behaviour in this search scenario? How can we evaluate the performance of these systems? Those are a few key interesting research questions to study for heterogeneous information access.
The workshop topics of interest are within the context of heterogeneous information access. They include but are not limited to:
- User modeling for Heterogeneous Information Access, Personalization
- Metrics, measurements, and test collections
- Optimization: Resource and vertical selection, Result presentation and diversification
- Applications: Aggregated/Federated search, Composite retrieval, Structured/Semantic search, P2P search
The workshop includes invited talks by leading researchers in the field from both industry and academia, presentations by contributed submissions as well as organized and open discussion on heterogeneous information access.
More information at: http://hia-workshop.com/.
Thanks everyone for submitting runs to one of the TREC Federated Web Search tasks. We had roughly the same number of participants as last year; not bad, although our goal was to grow. Interestingly, our automatic submission system received an amazing 917 runs.
We discussed the future of the FedWeb track, and we decided that we will not propose a FedWeb 2015 track as coordinators. We were unable to secure funding. Combined with the fact that we created the FedWeb collection for three years in a row (although the first time independently of TREC), we believe it is best to properly finish the TREC track this year, but not to run again next year.
Thomas Demeester, Dong Nguyen, Dolf Trieschnigg, Ke Zhou, and Djoerd Hiemstra