Shard Ranking and Cutoff Estimation for Topically Partitioned Collections

by Anagha Kulkarni, Almer Tigelaar, Djoerd Hiemstra, and Jamie Callan

Large document collections can be partitioned into topical shards to facilitate distributed search. In a low-resource search environment only a few of the shards can be searched in parallel. Such a search environment faces two intertwined challenges. First, determining which shards to consult for a given query: shard ranking. Second, how many shards to consult from the ranking: cutoff estimation. In this paper we present a family of three algorithms that address both of these problems. As a basis we employ a commonly used data structure, the central sample index (CSI), to represent the shard contents. Running a query against the CSI yields a flat document ranking that each of our algorithms transforms into a tree structure. A bottom up traversal of the tree is used to infer a ranking of shards and also to estimate a stopping point in this ranking that yields cost-effective selective distributed search. As compared to a state-of-the-art shard ranking approach the proposed algorithms provide substantially higher search efficiency while providing comparable search effectiveness.

to be presented at the The 21st ACM International Conference on Information and Knowledge Management, CIKM 2012.

[download pdf]

Federated Search in the Wild

The Combined Power of over a Hundred Search Engines

by Dong Nguyen, Thomas Demeester (Ghent University), Dolf Trieschnigg, and Djoerd Hiemstra

Federated search has the potential of improving web search: the user becomes less dependent on a single search provider and parts of the deep web become available through a unified interface, leading to a wider variety in the retrieved search results. However, a publicly available dataset for federated search reflecting an actual web environment has been absent. As a result, it has been difficult to assess whether proposed systems are suitable for the web setting. We introduce a new test collection containing the results from more than a hundred actual search engines, ranging from large general web search engines such as Google and Bing to small domain-specific engines. We discuss the design and analyze the effect of several sampling methods. For a set of test queries, we collected relevance judgments for the top 10 results of each search engine. The dataset is publicly available and is useful for researchers interested in resource selection for web search collections, result merging and size estimation of uncooperative resources.

to be presented at the The 21st ACM International Conference on Information and Knowledge Management, CIKM 2012.

[download pdf]

Query log analysis for Treinplanner

An analysis of free-text queries for a multi-field web form

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

Treinplanner.info We report how users interact with an experimental system that transforms single- field textual input into a multi- field query for an existing travel planner system. The experimental system was made publicly available and we collected over 30,000 queries from almost 12,000 users. From the free-text query log, we examined how users formulated structured information needs into free-text queries. The query log analysis shows that there is great variety in query formulation, over 400 query templates were found that occurred at least 4 times. Furthermore, with over 100 respondents to our questionnaire, we provide both quantitative and qualitative evidence indicating that end-users significantly prefer a single field interface over a multi-field interface when performing structured search.

The paper will be presented at the fourth Information Interaction in Context Symposium, IIiX 2012 on August 21-24, 2012 in Nijmegen, the Netherlands.

[download pdf]

Wrapper induction for search results

Ranking XPaths for extracting search result records

by Dolf Trieschnigg, Kien Tjin-Kam-Jet and Djoerd Hiemstra

Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine which combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on a new search result page. In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85% of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application which are available as open source software.

[download pdf]

Search Result Finder XPaths

Download Search Result Finder Firefox plugin.

Peer-to-Peer Information Retrieval: An Overview

by Almer Tigelaar, Djoerd Hiemstra, Dolf Trieschnigg

Peer-to-peer technology is widely used for file sharing. In the past decade a number of prototype peer-to-peer information retrieval systems have been developed. Unfortunately, none of these have seen widespread real-world adoption and thus, in contrast with file sharing, information retrieval is still dominated by centralised solutions. In this paper we provide an overview of the key challenges for peer-to-peer information retrieval and the work done so far. We want to stimulate and inspire further research to overcome these challenges. This will open the door to the development and large-scale deployment of real-world peer-to-peer information retrieval systems that rival existing centralised client-server solutions in terms of scalability, performance, user satisfaction and freedom.

The paper will appear in ACM Transactions on Information Systems.

[download pdf]

Free-Text Search over Complex Web Forms

by Kien Tjin-Kam-Jet, Dolf Trieschnigg, and Djoerd Hiemstra

This paper investigates the problem of using free-text queries as an alternative means for searching 'behind' web forms. We introduce a novel specification language for specifying free-text interfaces, and report the results of a user study where we evaluated our prototype in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at completing their search tasks.

The paper will be presented at the Information Retrieval Facility Conference IRFC 2011 on 6 June in Vienna, Austria

[download preprint]

Search Result Caching in P2P Information Retrieval Networks

by Almer Tigelaar, Djoerd Hiemstra, and Dolf Trieschnigg

See Almer's post: For peer-to-peer web search engines it is important to quickly process queries and return search results. How to keep the perceived latency low is an open challenge. In this paper we explore the solution potential of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that a small bounded cache offers performance comparable to an unbounded cache. Furthermore, we explore partially centralised and fully distributed scenarios, and find that in the most realistic distributed case caching can reduce the query load by thirty-three percent. With optimisations this can be boosted to nearly seventy percent.

The paper will be presented at the Information Retrieval Facility Conference IRFC 2011 on 6 June in Vienna, Austria

[download preprint]

Eelco Eerenberg graduates on economic models for distributed search

Towards Distributed Information Retrieval based on Economic Models

by Eelco Eerenberg

The aim of this research is to build a successful distributed information retrieval system based on an economic model, allowing servers to open up their part of the deep web. This research consists of three parts: 1) selecting suitable economic models, 2) simulating these models, and 3) performing a real-world test. We found the models of Vickrey auction and bond redistribution to be the most suitable ones. These models behaved well in our simulation and both outperformed a naive comparison model. The Vickrey auction model performed best in a scenario that mostly resembles the Internet. On average 69% of all models with a strong correlation between the economic outcomes and the performance of information retrieval (Kendall’s-τ > 0.6) is a Vickrey auction model. In the real-world test we show that users appreciate both the use and administration of an information retrieval system based on an economic model. Furthermore, if we apply a perfect categorization, the economic model outperforms the comparison engine with a 66% increase in performance.

more information