Inoculating Relevance Feedback Against Poison Pills

by Mostafa Dehghani, Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra, and Maarten Marx

Relevance Feedback (RF) is a common approach for enriching queries, given a set of explicitly or implicitly judged documents, to improve retrieval performance. Although it has been shown that relevance feedback improves overall retrieval performance on average, for some topics employing some relevant documents may decrease the average precision of the initial run. This is mostly because the feedback document is only partially relevant and contains off-topic terms; adding these terms to the query as expansion terms hurts retrieval performance. Relevant documents that hurt retrieval performance after feedback are called “poison pills”. In this paper, we discuss the effect of poison pills on relevance feedback and present significant words language models (SWLM) as an approach for estimating feedback models that tackles this problem.

To be presented at the 15th Dutch-Belgian Information Retrieval Workshop, DIR 2016 on 25 November in Delft.

[download pdf]

Dutch-Belgian Information Retrieval workshop in Delft

The Dutch-Belgian Information Retrieval workshop DIR 2016 will be held in Delft on 25 November. The preliminary workshop program contains 2 keynotes, 12 oral presentations and 7 poster presentations. Max Wilson from the University of Nottingham will provide a Human-Computer Interaction perspective on Information Retrieval. Carlos Castillo from Eurecat will talk about the detection of algorithmic discrimination.

DIR 2016

Register at http://dir2016.nl.

Data Science Platform Netherlands


The Data Science Platform Netherlands (DSPN) is the national platform for ICT research within the Data Science domain. Data Science is the collection and analysis of so-called ‘Big Data’ according to academic methodology. DSPN unites all Dutch academic research institutions where Data Science is carried out from an ICT perspective, specifically the computer science or applied mathematics perspectives. The objectives of DSPN are to:

  • Highlight the importance of ICT research in Big Data and Data Science, especially in national discussions about research and education.
  • Exchange and disseminate information about Data Science research and education.
  • Build and maintain a network of ICT researchers active in the field of Data Science.

DSPN is launched as part of the ICT Research Platform Netherlands (IPN) to give a voice to the Data Science initiatives of the Dutch ICT research organisations. For more information, see the website at: http://www.datascienceplatform.org/.

#WhoAmI in 160 Characters?

Classifying Social Identities Based on Twitter

by Anna Priante, Djoerd Hiemstra, Tijs van den Broek, Aaqib Saeed, Michel Ehrenhard, and Ariana Need

We combine social theory and NLP methods to classify English-speaking Twitter users’ online social identity in profile descriptions. We conduct two text classification experiments. In Experiment 1 we use a 5-category online social identity classification based on identity and self-categorization theories. While we are able to automatically classify two identity categories (Relational and Occupational), automatic classification of the other three identities (Political, Ethnic/religious and Stigmatized) is challenging. In Experiment 2 we test a merger of such identities based on theoretical arguments. We find that by combining these identities we can improve the predictive performance of the classifiers in the experiment. Our study shows how social theory can be used to guide NLP methods, and how such methods provide input to revisit traditional social theory that is strongly consolidated in offline settings.

To be presented at the EMNLP Workshop on Natural Language Processing and Computational Social Science (NLP+CSS) on November 5 in Austin, Texas, USA.

[download pdf]

Download the code book and classifier source code from github.

Resource Selection for Federated Search on the Web

by Dong Nguyen, Thomas Demeester, Dolf Trieschnigg, and Djoerd Hiemstra

A publicly available dataset for federated search reflecting a real web environment has long been absent, making it difficult for researchers to test the validity of their federated search algorithms for the web setting. We present several experiments and analyses on resource selection on the web using a recently released test collection containing the results from more than a hundred real search engines, ranging from large general web search engines such as Google, Bing and Yahoo to small domain-specific engines.
First, we experiment with estimating the size of uncooperative search engines on the web using query based sampling and propose a new method using the ClueWeb09 dataset. We find the size estimates to be highly effective in resource selection. Second, we show that an optimized federated search system based on smaller web search engines can be an alternative to a system using large web search engines. Third, we provide an empirical comparison of several popular resource selection methods and find that these methods are not readily suitable for resource selection on the web. Challenges include the sparse resource descriptions and extremely skewed sizes of collections.

[download pdf]

CLEF keynote slides

The slides of the CLEF keynote can be downloaded below.

A case for search specialization and search delegation

Evaluation conferences like CLEF, TREC and NTCIR are important for the field, and remain important because there is no “one-size-fits-all” for search engines. Different domains need different ranking approaches: for instance, Web search benefits from analyzing the link graph; Twitter search benefits from retweets and likes; restaurant search benefits from geo-location and reviews; advertisement search needs bids and click-through data, etc. Researching many domains will teach us more about the need for and the value of specialized search engines, and about approaches that can quickly learn rankings for new domains using, for instance, learning-to-rank and clever feature selection.
A search engine that provides results from multiple domains therefore does better to delegate its queries to specialized search engines. This brings up unique research questions on how best to select a specialized search engine. The TREC Federated Web Search track, which ran in 2013 and 2014, studied these questions in two tasks. The resource selection task studied how to select the top specialized search engines for a query, given the query but before seeing its results. The vertical selection task studied how to select the top domains from a predefined set of domains such as news, video, Q&A, etc.
I will present the lessons that we learned from running the Federated Web Search track, focusing on successful approaches to resource selection and vertical selection. I will conclude the talk by discussing our steps to take this work to full practice by running the University of Twente's search engine as a federation of more than 30 smaller search engines, including local databases with news, courses, publications, as well as results from social media like Twitter and YouTube. The engine that runs U. Twente search is called Searsia and is available as open source software at: http://searsia.org.

[download slides]

SIKS/CBS Data Camp & Advanced Course on Managing Big Data

On December 6 and 7, 2016, the Netherlands School for Information and Knowledge Systems (SIKS) and Statistics Netherlands (CBS) organize a two-day tutorial on the management of Big Data, the Data Camp, hosted at the University of Twente.
The Data Camp's objective is to use big data sets to produce valuable and innovative answers to research questions with societal relevance. SIKS PhD students and CBS data analysts will learn about big data technologies and create, in small groups, feasibility studies for a research question of their choice.
Participants get access to predefined CBS research questions and massive datasets, including a large collection of Dutch Tweets, traffic data from Dutch highways, and AIS data from ships. Participants will get access to the Twente Hadoop cluster, a 56-node cluster with almost 1 petabyte of storage space. The tutorial focuses on hands-on experience. The Data Camp participants will work in small, mixed teams in an informal setting, which stimulates intense contact with technologies and research questions. Experienced data scientists will support the teams with short lectures and hands-on support. Short lectures will introduce technologies to manage and visualize big data that were first adopted by Google and are now used by many companies that manage large datasets. The tutorial teaches how to process terabytes of data on large clusters of commodity machines using new programming styles like MapReduce and Spark. The tutorial will be given in English and is part of the educational program for SIKS PhD students.

Also see the SIKS announcement.

Luhn Revisited: Significant Words Language Models

by Mostafa Dehghani, Hosein Azarbonyad, Jaap Kamps, Djoerd Hiemstra, and Maarten Marx

Users tend to articulate their complex information needs in only a few keywords, making underspecified statements of request the main bottleneck for retrieval effectiveness. Taking advantage of feedback information is one of the best ways to enrich the query representation, but can also lead to loss of query focus and harm performance – in particular when the initial query retrieves only little relevant information – when overfitting to accidental features of the particular observed feedback documents. Inspired by the early work of Hans Peter Luhn, we propose significant words language models of feedback documents that capture all, and only, the significant shared terms from feedback documents. We adjust the weights of common terms that are already well explained by the document collection as well as the weight of rare terms that are only explained by specific feedback documents, which eventually results in having only the significant terms left in the feedback model.

Establishing a set of 'Significant Words'

Our main contributions are the following. First, we present significant words language models as effective models that capture the essential terms and their probabilities. Second, we apply the resulting models to the relevance feedback task, and observe better performance than state-of-the-art methods. Third, we see that the estimation method is remarkably robust, making the models insensitive to noisy non-relevant terms in feedback documents. Our general observation is that the significant words language models more accurately capture relevance by excluding general terms and feedback-document-specific terms.
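The idea of keeping only terms that are neither explained by the collection nor tied to a single feedback document can be illustrated with a small mixture-model sketch. This is not the estimation procedure from the paper: the three-component mixture with fixed weights, the way the document-specific model is built, the function name `estimate_significant_words`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def estimate_significant_words(docs, collection_prob,
                               lam_general=0.4, lam_specific=0.3, iters=30):
    """Illustrative EM sketch (assumed setup, not the paper's exact method).

    docs: list of token lists (the feedback documents).
    collection_prob: dict mapping term -> p(term | collection).
    Returns a dict mapping term -> p(term | significant-words model).
    """
    counts = [Counter(d) for d in docs]
    vocab = set().union(*counts)
    lam_sig = 1.0 - lam_general - lam_specific
    eps = 1e-9  # floor for terms missing from the collection statistics

    def doc_prob(c, t):
        return c[t] / sum(c.values())

    # Assumed "specific" model: terms likely in this document but not in the others.
    specific = []
    for i, c in enumerate(counts):
        raw = {}
        for t in c:
            absent_elsewhere = 1.0
            for j, other in enumerate(counts):
                if j != i:
                    absent_elsewhere *= 1.0 - doc_prob(other, t)
            raw[t] = doc_prob(c, t) * absent_elsewhere
        z = sum(raw.values()) or 1.0
        specific.append({t: v / z for t, v in raw.items()})

    # Start from a uniform significant-words model and refine it with EM.
    sig = {t: 1.0 / len(vocab) for t in vocab}
    for _ in range(iters):
        new = dict.fromkeys(vocab, 0.0)
        for c, spec in zip(counts, specific):
            for t, freq in c.items():
                general = lam_general * collection_prob.get(t, eps)
                local = lam_specific * spec.get(t, 0.0)
                shared = lam_sig * sig[t]
                # E-step: share of this occurrence credited to the significant model.
                new[t] += freq * shared / (general + local + shared)
        # M-step: renormalise the expected counts into a distribution.
        z = sum(new.values())
        sig = {t: v / z for t, v in new.items()}
    return sig

feedback_docs = [
    ["poison", "pills", "feedback", "the", "the", "delft"],
    ["poison", "pills", "feedback", "the", "of", "twitter"],
    ["poison", "pills", "feedback", "the", "a", "hadoop"],
]
collection = {"the": 0.3, "of": 0.2, "a": 0.2}
model = estimate_significant_words(feedback_docs, collection)
top_terms = sorted(model, key=model.get, reverse=True)[:3]
```

On this toy input, the terms shared by all three documents end up with the most probability mass, while the common word "the" and the terms that occur in only one document are pushed down.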

To be presented at the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016) on October 24-28, 2016 in Indianapolis, United States.

[download pdf]

Evaluation and analysis of term scoring methods for term extraction

by Suzan Verberne, Maya Sappelli, Djoerd Hiemstra, and Wessel Kraaij

We evaluate five term scoring methods for automatic term extraction on four different types of text collections: personal document collections, news articles, scientific articles and medical discharge summaries. Each collection has its own use case: author profiling, boolean query term suggestion, personalized query suggestion and patient query expansion. The methods for term scoring that have been proposed in the literature were designed with a specific goal in mind. However, it is as yet unclear how these methods perform on collections with characteristics different from those they were designed for, and which method is the most suitable for a given (new) collection. In a series of experiments, we evaluate, compare and analyse the output of six term scoring methods for the collections at hand. We found that the most important factors in the success of a term scoring method are the size of the collection and the importance of multi-word terms in the domain. Larger collections lead to better terms; all methods are hindered by small collection sizes (below 1000 words). The most flexible method for the extraction of single-word and multi-word terms is pointwise Kullback-Leibler divergence for informativeness and phraseness. Overall, we have shown that extracting relevant terms using unsupervised term scoring methods is possible in diverse use cases, and that the methods are applicable in more contexts than their original design purpose.
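As a rough illustration of the pointwise Kullback-Leibler scoring mentioned above, the sketch below scores bigrams from a foreground collection. Informativeness compares a bigram's foreground probability with its background probability; phraseness compares it with the probability expected if its two words were independent. The function name `score_bigrams`, the add-one smoothing of the background model, the unweighted sum of the two scores, and the toy texts are assumptions for illustration, not details taken from the article.

```python
import math
from collections import Counter

def score_bigrams(foreground, background):
    """Score foreground bigrams by pointwise KL informativeness + phraseness.

    foreground, background: lists of tokens.
    Returns a dict mapping (w1, w2) -> score.
    """
    fg_uni = Counter(foreground)
    fg_bi = Counter(zip(foreground, foreground[1:]))
    bg_bi = Counter(zip(background, background[1:]))
    n_uni = sum(fg_uni.values())
    n_fg = sum(fg_bi.values())
    n_bg = sum(bg_bi.values())
    # Number of distinct bigrams, used for add-one smoothing (an assumption).
    n_types = len(set(fg_bi) | set(bg_bi))

    scores = {}
    for (w1, w2), count in fg_bi.items():
        p_fg = count / n_fg
        p_bg = (bg_bi[(w1, w2)] + 1) / (n_bg + n_types)
        p_indep = (fg_uni[w1] / n_uni) * (fg_uni[w2] / n_uni)
        # Pointwise KL contribution of this bigram: p * log(p / q).
        informativeness = p_fg * math.log(p_fg / p_bg)
        phraseness = p_fg * math.log(p_fg / p_indep)
        scores[(w1, w2)] = informativeness + phraseness
    return scores

fg = ("language model for term extraction with language model "
      "scoring and language model smoothing").split()
bg = ("the cat sat on the mat and the dog sat on the rug "
      "with the cat").split()
scores = score_bigrams(fg, bg)
best = max(scores, key=scores.get)
```

Here `best` is the repeated bigram ("language", "model"), which is both frequent in the foreground relative to the background and more frequent than its individual words would predict.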

To appear in Information Retrieval.

[download pdf]