Evaluation Contrasted with Effectiveness
by Claudia Hauff, Leif Azzopardi, Djoerd Hiemstra, and Franciska de Jong
Query performance predictors are commonly evaluated by reporting correlation coefficients to denote how well the methods perform at predicting the retrieval performance of a set of queries. Despite the amount of research dedicated to this area, one aspect remains neglected: how strong does the correlation need to be in order to realize an improvement in retrieval effectiveness in an operational setting? We address this issue in the context of two settings: Selective Query Expansion and Meta-Search. In an empirical study, we control the quality of a predictor in order to examine how the strength of the correlation achieved, affects the effectiveness of an adaptive retrieval system. The results of this study show that many existing predictors fail to achieve a correlation strong enough to reliably improve the retrieval effectiveness in the Selective Query Expansion as well as the Meta-Search setting.
by Claudia Hauff, Djoerd Hiemstra, Leif Azzopardi, and Franciska de Jong
Ranking a set retrieval systems according to their retrieval effectiveness without relying on relevance judgments was first explored by Soboroff et al. (Soboroff, I., Nicholas, C., Cahan, P. Ranking retrieval systems without relevance judgments. In: SIGIR 2001, pp. 66-73) Over the years, a number of alternative approaches have been proposed, all of which have been evaluated on early TREC test collections. In this work, we perform a wider analysis of system ranking estimation methods on sixteen TREC data sets which cover more tasks and corpora than previously. Our analysis reveals that the performance of system ranking estimation approaches varies across topics. This observation motivates the hypothesis that the performance of such methods can be improved by selecting the “right” subset of topics from a topic set. We show that using topic subsets improves the performance of automatic system ranking methods by 26% on average, with a maximum of 60%. We also observe that the commonly experienced problem of underestimating the performance of the best systems is data set dependent and not inherent to system ranking estimation. These findings support the case for automatic system evaluation and motivate further research.
by Claudia Hauff and Djoerd Hiemstra
The University of Twente participated in three tasks of TREC 2009: the adhoc task, the diversity task and the relevance feedback task. All experiments are performed on the English part of ClueWeb09. In this draft paper, we describe our approach to tuning our retrieval system in absence of training data in Section 3. We describe the use of categories and a query log for diversifying search results in Section 4. Section 5 describes preliminary results for the relevance feedback task.
by Claudia Hauff, Djoerd Hiemstra, Franciska de Jong, and Leif Azzopardi
Ranking a number of retrieval systems according to their retrieval effectiveness without relying on costly relevance judgments was first explored by Soboroff et al. . Over the years, a number of alternative approaches have been proposed. We perform a comprehensive analysis of system ranking estimation approaches on a wide variety of TREC test collections and topics sets. Our analysis reveals that the performance of such approaches is highly dependent upon the topic or topic subset, used for estimation. We hypothesize that the performance of system ranking estimation approaches can be improved by selecting the “right” subset of topics and show that using topic subsets improves the performance by 32% on average, with a maximum improvement of up to 70% in some cases.
The paper will be presented at the 18th ACM Conference on Information and Knowledge Management (CIKM 2009) in Hong Kong, China
Both Pavel Serdyukov and Claudia Hauff have joint papers accepted for the 32nd Annual ACM SIGIR Conference in Boston, USA.
Placing Flickr Photos on a Map
by Pavel Serdyukov, Vanessa Murdock (Yahoo!), and Roelof van Zwol (Yahoo!)
In this paper we investigate generic methods for placing photos uploaded to Flickr on the World map. As primary input for our methods we use the textual annotations provided by the users to predict the single most probable location where the image was taken. Central to our approach is a language model based entirely on the annotations provided by users. We define extensions to improve over the language model using tag-based smoothing and cell-based smoothing, and leveraging spatial ambiguity. Further we demonstrate how to incorporate GeoNames, a large external database of locations. For varying levels of granularity, we are able to place images on a map with at least twice the precision of the state-of-the-art reported in the literature.
Efficiency trade-offs in two-tier web search systems
by Ricardo Baeza-Yates (Yahoo!), Vanessa Murdock (Yahoo!), and Claudia Hauff
Search engines rely on searching multiple partitioned corpora to return results to users in a reasonable amount of time. In this paper we analyze the standard two-tier architecture for Web search with the difference that the corpus to be searched for a given query is predicted in advance. We show that any predictor better than random yields time savings, but this decrease in the processing time yields an increase in the infrastructure cost. We provide an analysis and investigate this trade-off in the context of two different scenarios on real-world data. We demonstrate that in general the decrease in answer time is justified by a small increase in infrastructure cost.
See: list of accepted papers at SIGIR'09.
by Claudia Hauff, Leif Azzopardi, and Djoerd Hiemstra
In this paper, we examine a number of newly applied methods for combining pre-retrieval query performance predictors in order to obtain a better prediction of the query's performance. However, in order to adequately and appropriately compare such techniques, we critically examine the current evaluation methodology and show how using linear correlation coefficients (i) do not provide an intuitive measure indicative of a method's quality, (ii) can provide a misleading indication of performance, and (iii) overstate the performance of combined methods. To address this, we extend the current evaluation methodology to include cross validation, report a more intuitive and descriptive statistic, and apply statistical testing to determine significant differences. During the course of a comprehensive empirical study over several TREC collections, we evaluate nineteen pre-retrieval predictors and three combination methods.
The paper will be presented at the 31st European Conference on Information Retrieval (ECIR), April 6-9, 2009 in Toulouse, France.
by Claudia Hauff, Djoerd Hiemstra, and Franciska de Jong
The focus of research on query performance prediction is to predict the effectiveness of a query given a search system and a collection of documents. If the performance of queries can be estimated in advance of, or during the retrieval stage, specific measures can be taken to improve the overall performance of the system. In particular, pre-retrieval predictors predict the query performance before the retrieval step and are thus independent of the ranked list of results; such predictors base their predictions solely on query terms, the collection statistics and possibly external sources such as WordNet. In this paper, 22 pre-retrieval predictors are categorized and assessed on three different TREC test collections.