Collection Selection with Highly Discriminative Keys

by Sander Bockting and Djoerd Hiemstra

The centralized web search paradigm introduces several problems, such as large data traffic requirements for crawling, index freshness problems and problems to index everything. In this study, we look at collection selection using highly discriminative keys and query-driven indexing as part of a distributed web search system. The approach is evaluated on diff erent splits of the TREC WT10g corpus. Experimental results show that the approach outperforms a Dirichlet smoothing language modeling approach for collection selection, if we assume that web servers index their local content.

The paper will be presented at the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval in Boston, USA.

[download pdf]

Kien Tjin-Kam-Jet graduates on result merging for distributed information retrieval

Centralized Web search has difficulties with crawling and indexing the Visible Web. The Invisible Web is estimated to contain much more content, and this content is even more difficult to crawl. Metasearch, a form of distributed search, is a possible solution. However, a major problem is how to merge the results from several search engines into a single result list. We train two types of Support Vector Machines (SVMs): a regression model and preference classification model. Round Robin (RR) is used as our merging baseline. We varied the number of search engines being merged, the selection policy, and the document collection size of the engines. Our findings show that RR is the fastest method and that, in a few cases, it performs as well as regression-SVM. Both SVM methods are much slower and, judging by performance, regression-SVM is the best of all three methods. The choice of which method to use depends strongly on the usage scenario. In most cases, we recommend using regression-SVM.

[download pdf]

Federated search in Windows 7

Windows 7 along with the desktop search, introduces Federated Search in which the scope of the search goes beyond your PC. You can now search for items in remote repositories from your PC. It is based on OpenSearch and the RSS format. Since it is based on open standards, it becomes very simple to create custom 'search connectors' for your own remote repositories. For example, you can search Flickr or Twitter from within explorer.

Read more

OpenSearch: share your search results

OpenSearch is a collection of simple XML formats for sharing search results, that was originally developed by A9, a company founded by A9 acts as a search mediator: You pick your favorite search engines, and A9 sends your queries to these engines, aggregates the results, and done, you have your own personal view of the web!

Many search engines provide some kind of OpenSearch or RSS-like search these days, for instance, here's an ego search on Yahoo. But, OpenSearch is just as useful on a much smaller scale, for instance for searching these pages for information on SIKS (the Dutch School for Information and Knowledge Systems).

Distributed Search and Keyword Auctions

After the burst of the dot-com bubble in the autumn of 2001, the World Wide Web has gone through some remarkable changes in its organizational structure. Consumers of data and content are increasingly taking the role of producers of data and content, thereby threatening traditional publishers. A well known example is the Wikipedia encyclopedia, which is written entirely by its (non-professional) users on a voluntary basis, while still rivaling a traditional publisher like Britannica on-line in both size and quality. Similarly, in SourceForge, communities of open source software developers collaboratively create new software thereby rivaling software vendors like Microsoft; Blogging turned the internet consumers of news into news providers; Kazaa and related peer-to-peer platforms like BitTorrent and E-mule turned anyone who downloads a file automatically into contributors of files; Flickr turned users into contributors of visual content, but also into indexers of that content by social tagging, etc. Communities of users operate by trusting each other as co-developers and contributors, without the need for strict rules. There is however one major internet application for which communities only play a minor role. One of the web's most important applications — if not the most important application — is search. Internet search is almost exclusively run by three companies that dominate the search market: Google, Yahoo, and Microsoft. In contrast to traditional centralized search, where a centralized body like Google or Yahoo is in full control, a community-run search engine would consist of many small search engines that collaboratively provide the search service. This report motivates the need for large-scale distributed approaches to information retrieval, and proposes solutions based on keyword auctions.

[download pdf]

Distributed search: who bids on “britney spears”?

NWO We will start a new NWO research project on the use of keyword auctions for distributed information retrieval. The project's aim is to distribute internet search functionality in such a way that communities of users and/or federations of small search systems provide search services in a collaborative way. Instead of getting all data to a centralized point and process queries centrally, as is done by today's search systems, the project will distribute queries over many small autonomous search systems and process them locally. Distributed information retrieval is a well researched sub area of information retrieval, but it has not resulted in practical solutions for large scale search problems because of high administration costs of setting up large numbers of installations and because it turns out to be hard in practice to direct queries to the appropriate local search systems. In this project we will research a radical new approach to distribute search: distributed information retrieval by means of keyword auctions.

Keyword auctions like Google's AdWords give advertisers the opportunity to provide targeted advertisements by bidding on specific keywords, for instance by bidding on today's hottest query britney spears. Analogous to these keyword auctions, local search systems will bid for keywords at a central broker. They “pay” by serving queries for the broker. The broker will send queries to those local search systems that optimize the overall effectiveness of the system, i.e., local search systems that are willing to serve many queries, but also are able to provide high quality results. The project will approach the problem from three different angles: 1) modeling the local search system, including models for automatic bidding and multi-word keywords; 2) modeling the search broker's optimization using the bids, the quality of the answers, and click-through rates; 3) integration of structured data typically available behind web forms of local search systems with text search. The approaches will be evaluated using prototype systems and simulations on benchmark test collections.

See: NWO news (in Dutch)