Pavel Serdyukov defends PhD thesis on Expert Search

by Pavel Serdyukov

The automatic search for knowledgeable people in the scope of an organization is a key function which makes modern enterprise search systems commercially successful and socially demanded. A number of effective approaches to expert finding were recently proposed in academic publications. Although, most of them use reasonably defined measures of personal expertise, they often limit themselves to rather unrealistic and sometimes oversimplified principles. In this thesis, we explore several ways to go beyond state-of-the-art assumptions used in research on expert finding and propose several novel solutions for this and related tasks. First, we describe measures of expertise that do not assume independent occurrence of terms and persons in a document what makes them perform better than the measures based on independence of all entities in a document. One of these measures makes persons central to the process of terms generation in a document. Another one assumes that the position of the person’s mention in a document with respect to the positions of query terms indicates the relation of the person to the document’s relevant content. Second, we find the ways to use not only direct expertise evidence for a person concentrated within the document space of the person’s current employer and only within those organizational documents that mention the person. We successfully utilize the predicting potential of additional indirect expertise evidence publicly available on the Web and in the organizational documents implicitly related to a person. Finally, besides the expert finding methods we proposed, we also demonstrate solutions for tasks from related domains. In one case, we use several algorithms of multi-step relevance propagation to search for typed entities in Wikipedia. In another case, we suggest generic methods for placing photos uploaded to Flickr on the world map using language models of locations built entirely on the annotations provided by users with a few task specific extensions.

[download pdf]

Henning Rode defends Ph.D. thesis on Entity Ranking

From Document to Entity Retrieval: Improving Precision and Performance of Focused Text Search

by Henning Rode

Text retrieval is an active area of research since decades. Finding the best index terms, the development of statistical models for the estimation of relevance, using relevance feedback, and the challenge to keep retrieval tasks efficient with ever growing text collections had been important issues over the entire period. Especially in the last decade, we have also seen a diversification of retrieval tasks. Instead of searching for entire documents only, passage or XML retrieval systems allow to formulate a more focused search for finer grained text units. Question answering systems even try to pinpoint the part of a sentence that contains the answer to a user question, and expert search systems return a list of persons with expertise on the topic. The sketched situation forms the starting point of this thesis, which presents a number of task specific search solutions and tries to set them into more generic frameworks. In particular, we take a look at the three areas (1) context adaptivity of search, (2) efficient XML retrieval, and (3) entity ranking.

In the first case, we show how different types of context information can be incorporated in the retrieval of documents. When users are searching for information, the search task is typically part of a wider working process. This search context, however, is often not reflected by the few search keywords stated to the retrieval system, though it can contain valuable information for query refinement. We address with this work two research questions related to the aim of developing context aware retrieval systems. Firstly, we show how already available information about the user's context can be employed effectively to gain highly precise search results. Secondly, we investigate how such meta-data about the search context can be gathered. The proposed “query profiles” have a central role in the query refinement process. They automatically detect necessary context information and help the user to explicitly express context dependent search constraints. The effectiveness of the approach is tested with experiments on selected dimensions of the user's context.

When documents are not regarded as a simple sequence of words, but their content is structured in a machine readable form, it is straightforward to develop retrieval systems that make use of the additional structure information. Structured retrieval first asks for the design of a suitable language that enables the user to express queries on content and structure. We investigate here existing query languages, whether and how they support the basic needs of structured querying. However, our main focus lies on the efficiency of structured retrieval systems. Conventional inverted indices for document retrieval systems are not suitable for maintaining structure indices. We identify base operations involved in the execution of structured queries and show how they can be supported by new indices and algorithms on a database system. Efficient query processing has to be concerned with the optimization of query plans as well. We investigate low level query plans of physical database operators for simple query patterns, and demonstrate the benefits of higher level query optimization for complex queries.

New search tasks and interfaces for the presentation of search results, like faceted search applications, question answering, expert search, and automatic timeline construction, come with the need to rank entities, such as persons, organizations or dates, instead of documents or text passages. Modern language processing tools are able to automatically detect and categorize named entities in large text collections. In order to estimate their relevance to a given search topic, we develop retrieval models for entities which are based on the relevance of texts that mention the entity. A graph-based relevance propagation framework is introduced for this purpose that enables to derive the relevance of entities. Several options for the modeling of entity containment graphs and different relevance propagation approaches are tested, demonstrating usefulness of the graph-based ranking framework.

Download Henning's thesis from EPrints.

Jun Wang defends Ph.D. thesis on Collaborative Filtering

Relevance Models for Collaborative Filtering

by Jung Wang

Collaborative filtering is the common technique of predicting the interests of a user by collecting preference information from many users. Although it is generally regarded as a key information retrieval technique, its relation to the existing information retrieval theory is unclear. This thesis shows how the development of collaborative filtering can gain many benefits from information retrieval theories and models. It brings the notion of relevance into collaborative filtering and develops several relevance models for collaborative filtering. Besides dealing with user profiles that are obtained by explicitly asking users to rate information items, the relevance models can also cope with the situations where user profiles are implicitly supplied by observing user interactions with a system. Experimental results complement the theoretical insights with improved recommendation accuracy for both item relevance ranking and user rating prediction. Furthermore, the approaches are more than just analogy: our derivations of the unified relevance model show that popular user-based and item-based approaches represent only a partial view of the problem, whereas a unified view that brings these partial views together gives better insights into their relative importance and how retrieval can benefit from their combination.

[download pdf]

Vojkan Mihajlovic defends Ph.D. thesis on structured information retrieval

Score Region Algebra: A flexible framework for structured information retrieval

by Vojkan Mihajlovic

The scope of the research presented in this thesis is the retrieval of relevant information from structured documents. The thesis describes a framework for information retrieval in documents that have some form of annotation used for describing logical and semantical document structure, such as XML and SGML. The development of the structured information retrieval framework follows the ideas from both database and information retrieval worlds. It uses a three-level database architecture and implements relevance scoring mechanisms inherited from information retrieval models.

To develop the structured retrieval framework, the problem of structured information retrieval is analyzed and elementary requirements for structured retrieval systems are specified. These requirements are: (1) entity selection – the selection of different entities in structured documents, such as elements, terms, attributes, image and video references, which are parts of the user query; (2) entity relevance score computation – the computation of relevance scores for different structured elements with respect to the content they contain; (3) relevance score combination – the combination of relevance scores from (different) elements in a document structure, resulting in a common element relevance score; (4) relevance score propagation – the propagation of scores from different elements to common ancestor or descendant elements following the query. These four requirements are supported when developing a database logical algebra in harmony with the retrieval models used for ranking. In the specification of the logical algebra we face a challenge of a transparent instantiation of retrieval models, i.e., the specification of different retrieval models without affecting the algebra operators.

Download Vojkan's thesis from EPrints.