Eelco Eerenberg graduates on economic models for distributed search

Towards Distributed Information Retrieval based on Economic Models

by Eelco Eerenberg

The aim of this research is to build a successful distributed information retrieval system based on an economic model, allowing servers to open up their part of the deep web. This research consists of three parts: 1) selecting suitable economic models, 2) simulating these models, and 3) performing a real-world test. We found the models of Vickrey auction and bond redistribution to be the most suitable ones. These models behaved well in our simulation and both outperformed a naive comparison model. The Vickrey auction model performed best in the scenario that most closely resembles the Internet. On average, 69% of all models with a strong correlation between the economic outcomes and the performance of information retrieval (Kendall's τ > 0.6) are Vickrey auction models. In the real-world test we show that users appreciate both the use and administration of an information retrieval system based on an economic model. Furthermore, if we apply a perfect categorization, the economic model outperforms the comparison engine with a 66% increase in performance.
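As an illustration of the auction type named above, here is a minimal sketch of a sealed-bid Vickrey (second-price) auction: the highest bidder wins but pays the second-highest bid. The function name `vickrey_auction`, the server names, and the idea that servers bid for the right to answer a query are assumptions made for this example; the abstract does not describe how the thesis sets up bids or payments.

```python
from typing import Dict, Tuple


def vickrey_auction(bids: Dict[str, float]) -> Tuple[str, float]:
    """Return (winner, price) for a sealed-bid second-price auction.

    `bids` maps a bidder name (hypothetical: a search server) to its bid.
    The winner is the highest bidder; the price is the second-highest bid.
    With a single bidder the price is 0.0 (an assumption; no reserve price).
    """
    if not bids:
        raise ValueError("at least one bid is required")

    # Sort bidders from highest to lowest bid; ties keep insertion order.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)

    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price


if __name__ == "__main__":
    # Three hypothetical servers bid to answer a query.
    print(vickrey_auction({"server_a": 5.0, "server_b": 3.0, "server_c": 4.0}))
```

Because the winner's payment does not depend on its own bid, bidding one's true valuation is a dominant strategy in this auction type, which is one reason it is attractive for settings with many independent servers.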

more information

Collection Selection with Highly Discriminative Keys

by Sander Bockting and Djoerd Hiemstra

The centralized web search paradigm introduces several problems, such as large data traffic requirements for crawling, index freshness problems, and the difficulty of indexing everything. In this study, we look at collection selection using highly discriminative keys and query-driven indexing as part of a distributed web search system. The approach is evaluated on different splits of the TREC WT10g corpus. Experimental results show that the approach outperforms a Dirichlet smoothing language modeling approach for collection selection, if we assume that web servers index their local content.
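For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of collection selection with a Dirichlet-smoothed language model. It scores each collection by the smoothed log-likelihood of the query and ranks collections by that score. This is not the highly discriminative keys approach of the paper; the function names, the toy collections, and the value of the smoothing parameter `mu` are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def dirichlet_score(query_terms: List[str], collection_tf: Counter,
                    background_tf: Counter, mu: float = 2000.0) -> float:
    """Log-likelihood of the query under a Dirichlet-smoothed collection model."""
    coll_len = sum(collection_tf.values())
    bg_len = sum(background_tf.values())
    if bg_len == 0:
        return 0.0  # nothing indexed anywhere

    score = 0.0
    for term in query_terms:
        p_bg = background_tf.get(term, 0) / bg_len
        if p_bg == 0.0:
            continue  # term occurs in no collection; it cannot discriminate
        # P(t | C) = (tf(t, C) + mu * P(t | background)) / (|C| + mu)
        p = (collection_tf.get(term, 0) + mu * p_bg) / (coll_len + mu)
        score += math.log(p)
    return score


def rank_collections(query_terms: List[str], collections: Dict[str, Counter],
                     mu: float = 2000.0) -> List[str]:
    """Return collection names ordered from best to worst match."""
    # The background model is the sum of all collection term counts.
    background: Counter = Counter()
    for tf in collections.values():
        background.update(tf)

    scores = {name: dirichlet_score(query_terms, tf, background, mu)
              for name, tf in collections.items()}
    return sorted(scores, key=lambda name: scores[name], reverse=True)


if __name__ == "__main__":
    toy = {
        "sports": Counter({"football": 5, "goal": 3}),
        "cooking": Counter({"recipe": 6, "goal": 1}),
    }
    print(rank_collections(["football"], toy, mu=10.0))
```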

The paper will be presented at the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval in Boston, USA.

[download pdf]

Kien Tjin-Kam-Jet graduates on result merging for distributed information retrieval

Centralized Web search has difficulties with crawling and indexing the Visible Web. The Invisible Web is estimated to contain much more content, and this content is even more difficult to crawl. Metasearch, a form of distributed search, is a possible solution. However, a major problem is how to merge the results from several search engines into a single result list. We train two types of Support Vector Machines (SVMs): a regression model and a preference classification model. Round Robin (RR) is used as our merging baseline. We varied the number of search engines being merged, the selection policy, and the document collection size of the engines. Our findings show that RR is the fastest method and that, in a few cases, it performs as well as regression-SVM. Both SVM methods are much slower, but judging by retrieval performance, regression-SVM is the best of the three methods. The choice of which method to use depends strongly on the usage scenario. In most cases, we recommend using regression-SVM.
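The Round Robin baseline can be sketched as follows: take the top result from each engine in turn, then the second result from each, and so on. Skipping documents that were already merged is an assumption of this sketch, as are the function name and the use of plain strings as document identifiers; the abstract does not say how the thesis handles duplicates.

```python
from typing import List, Sequence


def round_robin_merge(result_lists: Sequence[Sequence[str]]) -> List[str]:
    """Interleave ranked result lists, one rank position at a time."""
    merged: List[str] = []
    seen = set()

    # Walk rank positions 0, 1, 2, ... up to the longest list.
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results):
                doc = results[rank]
                if doc not in seen:  # assumption: drop duplicates
                    seen.add(doc)
                    merged.append(doc)
    return merged


if __name__ == "__main__":
    engines = [["a", "b", "c"], ["b", "d"], ["e"]]
    print(round_robin_merge(engines))
```

Round Robin needs no scores and no training data, which is consistent with it being the fastest of the three methods compared.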

[download pdf]

Federated search in Windows 7

Windows 7, along with desktop search, introduces Federated Search, in which the scope of the search goes beyond your PC. You can now search for items in remote repositories from your PC. It is based on OpenSearch and the RSS format. Since it is based on open standards, it becomes very simple to create custom 'search connectors' for your own remote repositories. For example, you can search Flickr or Twitter from within Explorer.

Read more

OpenSearch: share your search results

OpenSearch is a collection of simple XML formats for sharing search results, originally developed by A9, a company founded by Amazon.com. A9 acts as a search mediator: you pick your favorite search engines, A9 sends your queries to these engines and aggregates the results, and done, you have your own personal view of the web!

Many search engines provide some kind of OpenSearch or RSS-like search these days, for instance, here's an ego search on Yahoo. But OpenSearch is just as useful on a much smaller scale, for instance for searching these pages for information on SIKS (the Dutch School for Information and Knowledge Systems).
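To give an idea of what such a response looks like to a client, here is a minimal sketch that parses an RSS 2.0 search result carrying the OpenSearch 1.1 `totalResults` element. The sample XML, its titles and URLs, and the function name `parse_opensearch_rss` are made up for this example; a real engine's response will contain more fields.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Namespace of the OpenSearch 1.1 response elements.
OPENSEARCH_NS = {"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}

# Hypothetical response, hard-coded so the example runs without a network.
SAMPLE_RESPONSE = """<rss version="2.0"
     xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <channel>
    <title>Example search: SIKS</title>
    <opensearch:totalResults>2</opensearch:totalResults>
    <item><title>First hit</title><link>http://example.org/1</link></item>
    <item><title>Second hit</title><link>http://example.org/2</link></item>
  </channel>
</rss>"""


def parse_opensearch_rss(xml_text: str) -> Tuple[Optional[int], List[Tuple[str, str]]]:
    """Return (total result count, [(title, link), ...]) from an RSS response."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        return None, []

    total_text = channel.findtext("opensearch:totalResults",
                                  namespaces=OPENSEARCH_NS)
    total = int(total_text) if total_text is not None else None

    hits = [(item.findtext("title", default=""), item.findtext("link", default=""))
            for item in channel.findall("item")]
    return total, hits


if __name__ == "__main__":
    print(parse_opensearch_rss(SAMPLE_RESPONSE))
```

Because every engine returns the same simple structure, a mediator only needs one parser like this to aggregate results from many sources.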

Distributed Search and Keyword Auctions

Since the burst of the dot-com bubble in the autumn of 2001, the World Wide Web has gone through some remarkable changes in its organizational structure. Consumers of data and content are increasingly taking the role of producers of data and content, thereby threatening traditional publishers. A well-known example is the Wikipedia encyclopedia, which is written entirely by its (non-professional) users on a voluntary basis, while still rivaling a traditional publisher like Britannica on-line in both size and quality. Similarly, in SourceForge, communities of open source software developers collaboratively create new software, thereby rivaling software vendors like Microsoft; blogging turned the internet consumers of news into news providers; Kazaa and related peer-to-peer platforms like BitTorrent and eMule automatically turned anyone who downloads a file into a contributor of files; Flickr turned users into contributors of visual content, but also into indexers of that content by social tagging, etc. Communities of users operate by trusting each other as co-developers and contributors, without the need for strict rules. There is, however, one major internet application for which communities only play a minor role. One of the web's most important applications, if not the most important application, is search. Internet search is almost exclusively run by three companies that dominate the search market: Google, Yahoo, and Microsoft. In contrast to traditional centralized search, where a centralized body like Google or Yahoo is in full control, a community-run search engine would consist of many small search engines that collaboratively provide the search service. This report motivates the need for large-scale distributed approaches to information retrieval, and proposes solutions based on keyword auctions.

[download pdf]