Collection Selection with Highly Discriminative Keys

by Sander Bockting and Djoerd Hiemstra

The centralized web search paradigm introduces several problems, such as the large data traffic required for crawling, difficulty keeping the index fresh, and the difficulty of indexing everything. In this study, we look at collection selection using highly discriminative keys and query-driven indexing as part of a distributed web search system. The approach is evaluated on different splits of the TREC WT10g corpus. Experimental results show that the approach outperforms a Dirichlet smoothing language modeling approach for collection selection, if we assume that web servers index their local content.
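For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of collection selection by Dirichlet-smoothed query likelihood. It is not the code used in the paper: each collection is simply treated as one large document, the function names `dirichlet_score` and `rank_collections` are hypothetical, and the smoothing parameter `mu=2500.0` is an assumed value chosen for illustration.

```python
import math
from collections import Counter


def dirichlet_score(query_terms, collection_tf, collection_len, background_prob, mu=2500.0):
    """Log query likelihood of one collection under Dirichlet smoothing.

    p(t | C) = (tf(t, C) + mu * p(t | G)) / (|C| + mu)
    where G is the background model over all collections.
    """
    score = 0.0
    for term in query_terms:
        p_bg = background_prob.get(term, 0.0)
        p = (collection_tf.get(term, 0) + mu * p_bg) / (collection_len + mu)
        if p == 0.0:
            # Term occurs in no collection at all: the likelihood is zero.
            return float("-inf")
        score += math.log(p)
    return score


def rank_collections(query_terms, collections, mu=2500.0):
    """Rank collections (name -> list of tokens) for a query, best first."""
    tfs = {name: Counter(tokens) for name, tokens in collections.items()}
    lens = {name: len(tokens) for name, tokens in collections.items()}
    total = sum(lens.values())

    # Background model: relative term frequency over all collections together.
    global_tf = Counter()
    for tf in tfs.values():
        global_tf.update(tf)
    background = {term: count / total for term, count in global_tf.items()}

    scored = [
        (name, dirichlet_score(query_terms, tfs[name], lens[name], background, mu))
        for name in collections
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy = {
        "server_a": ["web", "search", "index"],
        "server_b": ["cooking", "recipes", "index"],
    }
    for name, score in rank_collections(["web", "search"], toy):
        print(name, score)
```

In a distributed setting, a broker would forward the query only to the top-ranked collections from such a ranking.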

The paper will be presented at the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval in Boston, USA.
