Susan Dumais won the Salton award

Sue Dumais won the Salton award, and gave a terrific keynote talk at the SIGIR Conference in Boston entitled “An Interdisciplinary Perspective on Information Retrieval”. Susan was awarded for “nearly thirty years of significant, sustained, and continuing contributions to research, for exceptional mentorship, and for leadership in bridging the fields of information retrieval and human computer interaction. Her contributions to both the theoretical development and practical implementations of Latent Semantic Indexing, question-answering, desktop search, combining search and navigation, and incorporating the user and their context, have all substantially advanced and enriched the field of Information Retrieval.”

More info at ACM SIGIR.

Pavel Serdyukov defends PhD thesis on Expert Search

by Pavel Serdyukov

The automatic search for knowledgeable people within the scope of an organization is a key function that makes modern enterprise search systems commercially successful and socially in demand. A number of effective approaches to expert finding have recently been proposed in academic publications. Although most of them use reasonably defined measures of personal expertise, they often limit themselves to rather unrealistic and sometimes oversimplified principles. In this thesis, we explore several ways to go beyond the state-of-the-art assumptions used in research on expert finding and propose several novel solutions for this and related tasks. First, we describe measures of expertise that do not assume the independent occurrence of terms and persons in a document, which makes them perform better than measures based on the independence of all entities in a document. One of these measures makes persons central to the process of term generation in a document. Another assumes that the position of a person’s mention in a document, relative to the positions of query terms, indicates the person’s relation to the document’s relevant content. Second, we find ways to use more than just the direct expertise evidence for a person, which is concentrated within the document space of the person’s current employer and within those organizational documents that explicitly mention the person. We successfully utilize the predictive potential of additional indirect expertise evidence that is publicly available on the Web and in organizational documents implicitly related to a person. Finally, besides the expert finding methods we propose, we also demonstrate solutions for tasks from related domains. In one case, we use several algorithms of multi-step relevance propagation to search for typed entities in Wikipedia. In another case, we suggest generic methods for placing photos uploaded to Flickr on the world map, using language models of locations built entirely on the annotations provided by users, with a few task-specific extensions.
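
The position-based measure lends itself to a simple illustration: a candidate accumulates more credit the closer their mentions sit to query-term occurrences. Below is a minimal sketch of that intuition, assuming a toy tokenized corpus, a 'PERSON:' tagging convention, and a Gaussian proximity kernel; none of these are the thesis's actual formulation.

```python
import math
from collections import defaultdict

def proximity_expert_scores(documents, query_terms, sigma=50.0):
    """Score candidate experts by how close their mentions appear to
    query-term occurrences. Toy sketch: documents are token lists in
    which person mentions are tagged as 'PERSON:name'; the kernel and
    the tagging scheme are illustrative assumptions."""
    scores = defaultdict(float)
    query_terms = set(query_terms)
    for tokens in documents:
        term_positions = [i for i, t in enumerate(tokens) if t in query_terms]
        if not term_positions:
            continue  # document contains no query term, contributes nothing
        for i, token in enumerate(tokens):
            if token.startswith('PERSON:'):
                person = token[len('PERSON:'):]
                # Each (mention, query-term) pair contributes more when
                # the mention is closer to the query term.
                for j in term_positions:
                    scores[person] += math.exp(-((i - j) ** 2) / (2 * sigma ** 2))
    return dict(scores)

# Toy usage: 'alice' is mentioned right next to the query terms.
docs = [['PERSON:alice', 'works', 'on', 'expert', 'search'],
        ['budget', 'meeting', 'notes', 'PERSON:bob']]
print(proximity_expert_scores(docs, {'expert', 'search'}))
```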

[download pdf]

Robin Aly presents at SIGIR Doctoral Consortium

Modeling Uncertainty in Video Retrieval: A Retrieval Model for Uncertain Semantic Representations of Videos

by Robin Aly

The need for content-based multimedia retrieval increases rapidly because of ever faster growing collection sizes. However, retrieval systems often do not perform well enough for real-life applications. A promising approach is to detect semantic primitives at indexing time. Currently investigated primitives are the uttering of words and the occurrence of so-called semantic concepts, such as “Outdoor” and “Person”. We refer to a concrete instantiation of these primitives as the representation of the video document. Most detector programs emit scores reflecting the likelihood of each primitive. However, detection is far from perfect, and much uncertainty about the real representation remains. Some retrieval algorithms ignore this uncertainty, which clearly hurts precision and recall. Other methods use the scores as anonymous features and learn their relationship to relevance. This has the disadvantage of requiring vast amounts of training data, and the learning has to be redone after every detector change.

The main contribution of our work is a formal retrieval model for treating this uncertainty. We conceptually consider the retrieval problem as two steps: (1) the determination of the posterior probability distribution over all representations given the scores (using existing methods), and (2) the derivation of a ranking status value (RSV) for each representation. We then take the expected RSV, weighted by each representation’s posterior probability, as the effective RSV of the shot for ranking. We claim that our approach has the following advantages: (a) step (2) is achieved more easily than with the machine learning alternative, and (b) it benefits from all detector improvements.
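
To make the two-step model concrete, here is a minimal sketch of the expectation in step (2), with a hypothetical posterior and RSV function; the actual model is described in the paper.

```python
from itertools import product

def expected_rsv(representations, posterior, rsv):
    """Step (2) as an expectation: weight each representation's ranking
    status value by its posterior probability given the detector scores.
    Minimal sketch; enumerating all representations explicitly is only
    feasible for a handful of concepts."""
    return sum(posterior[r] * rsv(r) for r in representations)

# Toy usage: two binary concepts ("Outdoor", "Person") give four
# possible representations of a shot.
reps = list(product([0, 1], repeat=2))
posterior = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
rsv = lambda r: r[0] + 2 * r[1]  # hypothetical per-representation RSV
print(expected_rsv(reps, posterior, rsv))  # 0.2*2 + 0.3*1 + 0.4*3 = 1.9
```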

[more information]

Collection Selection with Highly Discriminative Keys

by Sander Bockting and Djoerd Hiemstra

The centralized web search paradigm introduces several problems, such as the large data traffic required for crawling, index freshness problems, and the difficulty of indexing everything. In this study, we look at collection selection using highly discriminative keys and query-driven indexing as part of a distributed web search system. The approach is evaluated on different splits of the TREC WT10g corpus. Experimental results show that the approach outperforms a Dirichlet smoothing language modeling approach for collection selection, if we assume that web servers index their local content.
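
For context, the Dirichlet smoothing baseline treats each collection as one large document and ranks collections by smoothed query log-likelihood. Below is a minimal sketch of that baseline, assuming a smoothing parameter mu and a background model built from all collections combined; the details of the paper's setup will differ.

```python
import math
from collections import Counter

def dirichlet_collection_scores(query, collections, mu=2500.0):
    """Rank collections by Dirichlet-smoothed query log-likelihood.
    collections: dict mapping collection id -> list of terms.
    Sketch only: mu and the treatment of each collection as one big
    document are assumptions."""
    # Background model over all collections combined.
    background = Counter()
    for terms in collections.values():
        background.update(terms)
    total = sum(background.values())

    scores = {}
    for cid, terms in collections.items():
        tf = Counter(terms)
        n = len(terms)
        score = 0.0
        for t in query:
            p_bg = background[t] / total if total else 0.0
            # Dirichlet smoothing: P(t|C) = (tf + mu * P(t|G)) / (|C| + mu)
            p = (tf[t] + mu * p_bg) / (n + mu)
            score += math.log(p) if p > 0 else float('-inf')
        scores[cid] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage: rank two tiny "collections" for a one-word query.
print(dirichlet_collection_scores(['search'],
      {'c1': ['web', 'search', 'engine'], 'c2': ['cooking', 'recipes']}))
```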

The paper will be presented at the 7th Workshop on Large-Scale Distributed Systems for Information Retrieval in Boston, USA.

[download pdf]

2nd SIKS/Twente Seminar on Searching and Ranking

On June 24, 2009 at the University of Twente

http://www.cs.utwente.nl/ssr2009/

The goal of the one-day seminar is to bring together researchers from companies and academia working on enterprise search problems. Speakers at the seminar are David Hawking from Funnelback Internet and Enterprise Search and the Australian National University, who will talk about Practical Methods for Evaluating Enterprise Search; Iadh Ounis from the University of Glasgow, who will present Voting Techniques for Expert Search; and Maarten de Rijke from the University of Amsterdam, who will talk about Expert Profiling Out In the Wild.

Parallel and Distributed Databases at Euro-Par 2009

by Djoerd Hiemstra, Alfons Kemper, Manuel Prieto, and Alex Szalay

Euro-Par is an annual series of international conferences dedicated to the promotion and advancement of parallel and distributed computing, covering all aspects of hardware, software, algorithms, and applications. The Euro-Par 2009 conference will take place in Delft, the Netherlands, from August 25 to August 28, 2009.

Sessions at Euro-Par cover several topics. Euro-Par Topic 5 addresses data management issues in parallel and distributed computing. Advances in data management (storage, access, querying, retrieval, mining) are inherent to current and future information systems. Today, accessing large volumes of information is a reality: data-intensive applications enable huge user communities to transparently access multiple pre-existing autonomous, distributed and heterogeneous resources (data, documents, images, services, etc.). Data management solutions need efficient techniques for exploiting and mining large datasets available in clusters, peer-to-peer, and Grid architectures. Parallel and distributed file systems, databases, data warehouses, and digital libraries are key elements for achieving scalable, efficient systems that cost-effectively manage and extract data from huge amounts of highly distributed and heterogeneous digital data repositories.

Each paper submitted to Euro-Par’s topic Parallel and Distributed Databases was reviewed by at least three reviewers. Of the 11 papers submitted to the topic this year, 3 were accepted, an acceptance rate of 27%. The three accepted papers discuss diverse issues: database transactions, efficient and reliable structured peer-to-peer systems, and selective replicated declustering.

In their paper Unifying Memory and Database Transactions, Ricardo Dias and João Lourenço present a simple but powerful idea: combining software transactional memory with database transactions. The paper proposes to provide unified memory and database transactions by integrating database transaction control with a software framework for transactional memory. Experimental results show that the overhead of unified transactions is low, and the approach is likely to lower the burden on the application developer.

The paper by Hao Gong, Guangyu Shi, Jian Chen, and Lingyuan Fan, A DHT Key-Value Storage System with Carrier Grade Performance, aims to achieve the reliability and efficiency needed in peer-to-peer systems to support telecom services. The proposed design adopts a two-layer distributed hash table that embeds location information into peer IDs (a toy sketch of this idea follows below); provides one-hop routing by enhancing each peer with an additional one-hop routing table, with super-peers in charge of updating and synchronizing this routing information; and replicates subscriber data on multiple peers.

Finally, Kerim Oktay, Ata Turk, and Cevdet Aykanat present a paper on Selective Replicated Declustering for Arbitrary Queries. The authors present a new algorithm for selective replicated declustering for arbitrary queries. The algorithm uses the available query information to decide on the assignment of data to different disks and on which data to replicate, respecting space constraints. The paper also describes how to apply the proposed algorithm recursively to obtain a multi-way replicated declustering. Experiments show that the algorithm outperforms existing replicated declustering schemes, especially for low replication constraints.
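
The idea of embedding location information into peer IDs, from the second paper above, can be pictured with a small sketch. The bit layout, prefix width, and helper names below are illustrative assumptions, not the paper's actual scheme.

```python
import hashlib

def make_peer_id(region_code, node_address, region_bits=8, id_bits=64):
    """Embed a location prefix into a DHT peer ID. Peers in the same
    region share the high-order bits, so intra-region lookups can be
    routed without leaving the region. Sizes are assumptions."""
    digest = hashlib.sha1(node_address.encode()).digest()
    node_part = int.from_bytes(digest, 'big') % (1 << (id_bits - region_bits))
    return (region_code << (id_bits - region_bits)) | node_part

def region_of(peer_id, region_bits=8, id_bits=64):
    """Recover the region prefix from a peer ID."""
    return peer_id >> (id_bits - region_bits)

# Toy usage: two peers in region 3 share the same high-order prefix.
p1 = make_peer_id(3, '10.0.0.1:4000')
p2 = make_peer_id(3, '10.0.0.2:4000')
assert region_of(p1) == region_of(p2) == 3
```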

[download pdf]

More info at the Euro-Par 2009 conference site.

Towards an Information Retrieval Theory of Everything

I present three well-known probabilistic models of information retrieval in tutorial style: the binary independence probabilistic model, the language modeling approach, and Google’s PageRank. Although all three models are based on probability theory, they are very different in nature. Each model seems well suited for solving certain information retrieval problems, but not so useful for solving others. So, essentially, each model solves part of a bigger puzzle, and a unified view of these models might be a first step towards an Information Retrieval Theory of Everything.
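
Of the three, PageRank is the easiest to state compactly: it is the stationary distribution of a random surfer who follows a link with probability d and teleports to a uniformly random page otherwise. A minimal power-iteration sketch on a toy link graph (ignoring dangling nodes):

```python
def pagerank(links, d=0.85, iterations=50):
    """Compute PageRank by power iteration.
    links: dict mapping page -> list of pages it links to.
    Minimal sketch: fixed iteration count, no dangling-node handling."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Teleportation mass is spread uniformly over all pages.
        new_rank = {p: (1.0 - d) / n for p in pages}
        for p, outlinks in links.items():
            if outlinks:
                # A page passes a d-fraction of its rank to its outlinks.
                share = d * rank[p] / len(outlinks)
                for q in outlinks:
                    new_rank[q] = new_rank.get(q, 0.0) + share
        rank = new_rank
    return rank

# Toy web of three pages.
print(pagerank({'a': ['b'], 'b': ['a', 'c'], 'c': ['a']}))
```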

The paper is published in the newsletter of the NVTI, the “Nederlandse Vereniging voor Theoretische Informatica” (Dutch Association for Theoretical Computer Science). A more extensive overview of information retrieval theory, covering eight models, is given in: Djoerd Hiemstra, Information Retrieval Models. In: Ayse Goker and John Davies (eds.), Information Retrieval: Searching in the 21st Century, Wiley, 2009.

[download pdf]