An open source implementation of web clustering algorithms for selective search

by Gijs Hendriksen, Djoerd Hiemstra, and Arjen de Vries

In distributed search, a document collection is partitioned across several shards, which can be queried independently to speed up query processing. Selective search builds on this infrastructure, but further reduces the required resources by querying only a small number of the index shards. A resource selection algorithm predicts which shards are relevant for a given query. For this to work effectively, the shards are usually created with a topic-driven clustering algorithm, so that documents relevant to the same query are more likely to be assigned to the same shard. To make these topic-driven clustering algorithms usable by the general public, and to make it easier for researchers or search engine developers to implement and experiment with selective search systems, we release an open source implementation of SB2 K-means, including the extensions QKLD and QInit. Our implementation will be published as a Python package on PyPI.
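
To make the pipeline described above concrete, here is a minimal Python sketch of the two steps: topic-driven sharding, then resource selection. It is an illustration under stated assumptions, not the released package or its API. It uses plain k-means-style clustering with cosine similarity on term counts as a stand-in for SB2 K-means (QKLD and QInit are not modelled), and all function names (`cluster_into_shards`, `select_shards`) and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use a proper analyzer.
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_into_shards(docs, seeds, iterations=5):
    # Stand-in for topic-driven clustering (NOT SB2 K-means):
    # `seeds` are indices of the documents used as initial centroids.
    vectors = [Counter(tokenize(d)) for d in docs]
    centroids = [Counter(vectors[i]) for i in seeds]
    assignment = [0] * len(docs)
    for _ in range(iterations):
        # Assign every document to its most similar centroid.
        assignment = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        # Recompute each centroid as the sum of its members' term counts
        # (cosine similarity is scale-invariant, so no averaging is needed).
        for k in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centroids[k] = sum(members, Counter())
    return assignment, centroids


def select_shards(query, centroids, n=1):
    # Toy resource selection: rank shards by query-centroid similarity
    # and return the indices of the n best shards.
    q = Counter(tokenize(query))
    ranked = sorted(
        range(len(centroids)),
        key=lambda k: cosine(q, centroids[k]),
        reverse=True,
    )
    return ranked[:n]


docs = [
    "python code compiler programming",
    "rust compiler programming language",
    "football match goal league",
    "tennis match league tournament",
]
assignment, centroids = cluster_into_shards(docs, seeds=[0, 2])
print(assignment)
print(select_shards("rust compiler", centroids, n=1))
```

With this toy collection, the two programming documents land in one shard and the two sports documents in the other, so a query only needs to be sent to the single shard whose centroid it resembles most.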

To be presented at the 6th International Open Search Symposium #OSSYM24 on 9-11 October 2024 in Munich, Germany.

[download pdf] [git repo]