Kien Tjin-Kam-Jet defends PhD thesis on Distributed Deep Web Search

Distributed Deep Web Search

by Kien Tjin-Kam-Jet

The World Wide Web contains billions of documents (and counting); hence, it is likely that some document will contain the answer or content you are searching for. While major search engines like Bing and Google often manage to return relevant results to your query, there are plenty of situations in which they are less capable of doing so. Specifically, there is a noticeable shortcoming in situations that involve the retrieval of data from the deep web. Deep web data is difficult to crawl and index for today's web search engines, largely because the data must be accessed via complex web forms. However, deep web data can be highly relevant to the information need of the end-user. This thesis gives an overview of the problems, solutions, and paradigms for deep web search. Moreover, it proposes a new paradigm to overcome the apparent limitations in the current state of deep web search, and makes the following scientific contributions:

  1. A more specific classification scheme for deep web search systems, to better illustrate the differences and variation between these systems.
  2. Virtual surfacing, a new, and in our opinion better, deep web search paradigm which tries to combine the benefits of the two already existing paradigms, surfacing and virtual integration, and which also raises new research opportunities.
  3. A stack decoding approach which combines rules and statistical usage information for interpreting the end-user's free-text query, and to subsequently derive filled-out web forms based on that interpretation.
  4. A practical comparison of the developed approach against a well-established text-processing toolkit.
  5. Empirical evidence that, for a single site, end-users would rather use the proposed free-text search interface instead of a complex web form.

Analysis of data obtained from user studies shows that the stack decoding approach works as well as, or better than, today’s top-performing alternatives.
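For illustration, the stack decoding idea behind contribution 3 can be sketched as a best-first search over partial form interpretations. The form fields, rule-based matchers and usage weights below are all invented for this sketch; the thesis's actual decoder is considerably richer:

```python
import heapq
import itertools

# Hypothetical web-form fields, each with a rule-based matcher and a
# usage-statistics weight (fields, rules and weights all invented).
FIELDS = {
    "from": (lambda t: t.isalpha(), 0.4),
    "to":   (lambda t: t.isalpha(), 0.4),
    "date": (lambda t: t in {"today", "tomorrow"}, 0.9),
}

def stack_decode(tokens):
    """Best-first stack decoding: pop the highest-scoring partial
    interpretation, extend it by assigning the next token to an unused
    form field, and keep the best complete interpretation."""
    tie = itertools.count()                 # tie-breaker for heap entries
    heap = [(-1.0, 0, next(tie), {})]       # (negated score, position, _, form)
    best = None
    while heap:
        neg_score, i, _, form = heapq.heappop(heap)
        if i == len(tokens):                # every token interpreted
            if best is None or -neg_score > best[0]:
                best = (-neg_score, form)
            continue
        for field, (matches, weight) in FIELDS.items():
            if field not in form and matches(tokens[i]):
                heapq.heappush(heap, (neg_score * weight, i + 1, next(tie),
                                      {**form, field: tokens[i]}))
    return best

score, form = stack_decode(["amsterdam", "berlin", "tomorrow"])
```

The rules constrain which field a token may fill, while the weights rank competing interpretations, so the decoder returns a filled-out form rather than a flat keyword match.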

[download pdf]

Adele Lu Jia defends her PhD thesis on incentives in p2p networks

Adele Lu Jia successfully defended her PhD thesis at Delft University of Technology,

Online Networks as Societies: User Behaviors and Contribution Incentives

by Adele Lu Jia

Online networks like Facebook and BitTorrent have become popular and powerful infrastructures for users to communicate, to interact, and to share social lives with each other. These networks often rely on the cooperation and the contribution of their users. Nevertheless, users in online networks are often found to be selfish, lazy, or even malicious, rather than cooperative, and therefore need to be incentivized for contributions. To date, great effort has been put into designing effective contribution incentive policies, which range from barter schemes to monetary schemes. In this thesis, we conduct an analysis of user behaviors and contribution incentives in online networks. We approach online networks as both computer systems and societies, hoping that this approach will, on the one hand, motivate computer scientists to think about the similarities between their artificial computer systems and the natural world, and on the other hand, help people outside the field understand online networks more easily.

To summarize, in this thesis we provide theoretical and practical insights into the correlation between user behaviors and contribution incentives in online networks. We demonstrate user behaviors and their consequences at both the system and the individual level, we analyze barter schemes and their limitations in incentivizing users to contribute, we evaluate monetary schemes and their risks in causing the collapse of the entire system, and we examine user interactions and their implications for inferring user relationships. Above all, unlike the offline human society that has evolved for thousands of years, online networks emerged only two decades ago and are still in a primitive state. Yet with their ever-improving technologies we have already obtained many exciting results. This points the way to a promising future for the study of online networks, not only in analyzing online behaviors, but also in cross-referencing them with offline societies.

[more info]

Frans van der Sluis defends his PhD thesis on Information Experience

When Complexity becomes Interesting: An Inquiry into the Information eXperience

by Frans van der Sluis

To date, most research in information retrieval and related fields has been concerned primarily with efficiency and effectiveness of either the information system or the interaction of the user with the information system. At the same time, understanding the experience of a user during information interaction is recognized as a grand challenge for the development of information systems. There is a widely shared intuition that the value of the retrieved information is dependent on more than system characteristics such as the topical overlap between a query and a document. As it is not obvious how to embrace this intuition, this challenge has mostly been ignored. This dissertation embarked upon the challenge of describing and developing an operational model of the Information eXperience (IX) – the experience during the interaction with information. This task was decomposed into three sub-challenges:

  1. Transform the fuzzy concept of the IX into a formalized one.
  2. Develop a model of textual complexity that enables an information system to influence a user's IX.
  3. Identify and influence the causes of the experience of interest in text.
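As a toy illustration of sub-challenge 2, textual complexity can be approximated in the spirit of classic readability formulas: longer sentences and longer words score as more complex. The features and weights below are invented and are much simpler than the dissertation's model:

```python
def complexity(text):
    """Score a text as more complex when its sentences and words are
    longer, in the spirit of classic readability formulas. The two
    weights are invented, not the dissertation's model."""
    for mark in "!?":
        text = text.replace(mark, ".")
    sentences = [s for s in text.split(".") if s.strip()]
    words = text.split()
    avg_sentence_len = len(words) / len(sentences)
    avg_word_len = sum(len(w.strip(".,;:")) for w in words) / len(words)
    return 0.4 * avg_sentence_len + 1.0 * avg_word_len
```

A system with such a model can rank or rewrite candidate texts to steer the user's experience toward a desired complexity level.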

Debasis Ganguly successfully defends PhD thesis on Topical Relevance Models

Today, Debasis Ganguly successfully defended his PhD thesis at Dublin City University.

Topical Relevance Models

by Debasis Ganguly

An inherent characteristic of information retrieval (IR) is that the query expressing a user's information need is often multi-faceted, that is, it encapsulates more than one specific potential sub-information need. This multi-facetedness of queries manifests itself as a topic distribution in the retrieved set of documents, where each document can be considered as a mixture of topics, one or more of which may correspond to the sub-information needs expressed in the query. In some specific domains of IR, such as patent prior art search, where the queries are full patent articles and the objective is to (in)validate the claims contained therein, the queries themselves are multi-topical in addition to the retrieved set of documents. The overall objective of the research described in this thesis is to investigate techniques that recognize and exploit this multi-topical characteristic of the retrieved documents and the queries, both in retrieval and in relevance feedback.

Robert Neumayer defends thesis on distributed entity search

Today, Robert Neumayer defended his Ph.D. thesis Semantic and Distributed Entity Search in the Web of Data at the Norwegian University of Science and Technology (NTNU) in Trondheim, Norway.

by Robert Neumayer

Both the growth and the ubiquitous character of the Internet have had a profound effect on how we access and consume data and information. More recently, the Semantic Web, an extension of the current Web, has become increasingly relevant due to its widespread adoption.
The Web of Data (WoD) is an extension of the current web, where not only documents are interlinked by means of hyperlinks but also data in terms of predicates. Specifically, it describes objects, entities or “things” in terms of their attributes and their relationships, using RDF data (the term is often used interchangeably with Linked Data). Given its growth, there is a strong need for making this wealth of knowledge accessible by keyword search (the de-facto standard paradigm for accessing information online).
The overall goal of this thesis is to provide new techniques for accessing this data, i.e., to unlock its full potential for end users. We therefore address the following four main issues: a) how can the Web of Data be searched by means of keyword search?, b) what sets apart search in the WoD from traditional web search?, c) how can these elements be used in a theoretically sound and effective way?, and d) how can the techniques be adapted to a distributed environment? To this end, we develop techniques for effectively searching WoD sources. We build upon and formalise existing entity modelling approaches within a generative language modelling framework, and compare them experimentally using standard test collections. We show that these models outperform the current state-of-the-art in terms of retrieval effectiveness; however, this comes at the cost of abandoning a large part of the semantics behind the data. We propose a novel entity model capable of preserving the semantics associated with entities, without sacrificing retrieval effectiveness. We further show how these approaches can be applied in the distributed context, both with low (federated search) and high (peer-to-peer or P2P) numbers of independent repositories, collections, or nodes.
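A generic query-likelihood sketch of the kind of generative entity language model the thesis builds on: each entity is a bag of terms from its attribute values, scored with Dirichlet smoothing. The entity descriptions and the smoothing parameter are invented for illustration; this is not the thesis's exact formalisation:

```python
from math import log

# Toy RDF-style entity descriptions: each entity is represented as a
# bag of terms drawn from its attribute values (all data invented).
entities = {
    "Trondheim": "city norway university fjord".split(),
    "NTNU": "university science technology norway trondheim".split(),
}
collection = [t for terms in entities.values() for t in terms]

def lm_score(query, terms, mu=10.0):
    """Query likelihood with Dirichlet smoothing: mix the entity's own
    term distribution with the collection-wide distribution."""
    score = 0.0
    for q in query.lower().split():
        p_coll = collection.count(q) / len(collection)
        p = (terms.count(q) + mu * p_coll) / (len(terms) + mu)
        score += log(p) if p > 0 else float("-inf")
    return score

def rank_entities(query):
    return sorted(entities, key=lambda e: lm_score(query, entities[e]),
                  reverse=True)
```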
The main contributions are as follows:

  • We develop a hybrid approach to search in the Web of Data, using elements from traditional information retrieval and structured retrieval alike.
  • We formalise our approaches in a language model setting.
  • Our extensions are successfully evaluated with respect to their applicability in different distributed environments such as federated search and P2P.
  • We discuss and analyse our empirical evaluation, providing insights into the entity search problem.

Almer Tigelaar defends PhD thesis on P2P Search

Peer-to-peer information retrieval

by Almer Tigelaar

The Internet has become an integral part of our daily lives. However, the essential task of finding information is dominated by a handful of large centralised search engines. In this thesis we study an alternative to this approach. Instead of using large data centres, we propose using the machines that we all use every day (our desktop, laptop and tablet computers) to build a peer-to-peer web search engine. We provide a definition of the associated research field: peer-to-peer information retrieval. We examine what separates it from related fields, give an overview of the work done so far and provide an economic perspective on peer-to-peer search. Furthermore, we introduce our own architecture for peer-to-peer search systems, inspired by BitTorrent.

Distributing the task of providing search results for queries introduces the problem of query routing: a query needs to be sent to a peer that can provide relevant search results. We investigate how the content of peers can be represented so that queries can be directed to the best ones in terms of relevance. While cooperative peers can provide their own representation, the content of uncooperative peers can be accessed only through a search interface, and thus they cannot actively provide a description of themselves. We look into representing these uncooperative peers by probing their search interface to construct a representation.

Finally, the capacity of the machines in peer-to-peer networks differs considerably, making it challenging to provide search results quickly. To address this, we present an approach where copies of search results for previous queries are retained at peers and used to serve future requests, and we show that participation can be incentivised using reputations. There are still problems to be solved before a real-world peer-to-peer web search engine can be built.
This thesis provides a starting point for this ambitious goal and also provides a solid basis for reasoning about peer-to-peer information retrieval systems in general.
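The query routing problem can be illustrated with a minimal resource-selection sketch: each cooperative peer publishes a term histogram of its collection, and a query is routed to the peers whose histograms best cover its terms. The peers, data and scoring formula are invented for illustration and are not the thesis's own method:

```python
from collections import Counter

# Toy peer representations: each cooperative peer publishes a term
# histogram of its local collection (peer names and data invented).
peers = {
    "peer-a": Counter("distributed search engine peer".split()),
    "peer-b": Counter("cooking recipes pasta sauce".split()),
    "peer-c": Counter("peer to peer file sharing".split()),
}

def route(query, k=1):
    """Send the query to the k peers whose term histograms best cover
    it, using a smoothed product of relative term frequencies."""
    terms = query.lower().split()
    def score(hist):
        total = sum(hist.values())
        prod = 1.0
        for t in terms:
            prod *= (hist[t] + 0.5) / (total + 1.0)
        return prod
    return sorted(peers, key=lambda p: score(peers[p]), reverse=True)[:k]
```

For uncooperative peers the histogram would instead be estimated by probing the peer's search interface with sample queries.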

[download pdf]

Rianne Kaptein defends PhD thesis on Focused Retrieval

Effective Focused Retrieval by Exploiting Query Context and Document Structure

by Rianne Kaptein

The classic IR (Information Retrieval) model of the search process consists of three elements: query, documents and search results. A user looking to fulfill an information need formulates a query usually consisting of a small set of keywords summarizing the information need. The goal of an IR system is to retrieve documents containing information which might be useful or relevant to the user. Throughout the search process there is a loss of focus, because keyword queries entered by users often do not suitably summarize their complex information needs, and IR systems do not sufficiently interpret the contents of documents, leading to result lists containing irrelevant and redundant information. The main research question of this thesis is how to exploit query context and document structure to provide more focused retrieval.

[more info]

Dolf Trieschnigg defends PhD thesis on Biomedical IR

Proof of Concept: Concept-based Biomedical Information Retrieval

by Dolf Trieschnigg

In this thesis we investigate the possibility to integrate domain-specific knowledge into biomedical information retrieval (IR). Recent decades have shown a fast growing interest in biomedical research, reflected by an exponential growth in scientific literature. Biomedical IR is concerned with the disclosure of these vast amounts of written knowledge. Biomedical IR is not only important for end-users, such as biologists, biochemists, and bioinformaticians searching directly for relevant literature but also plays an important role in more sophisticated knowledge discovery. An important problem for biomedical IR is dealing with the complex and inconsistent terminology encountered in biomedical publications. Multiple synonymous terms can be used for single biomedical concepts, such as genes and diseases. Conversely, single terms can be ambiguous, and may refer to multiple concepts. Dealing with the terminology problem requires domain knowledge stored in terminological resources: controlled indexing vocabularies and thesauri. The integration of this knowledge in modern word-based information retrieval is, however, far from trivial. This thesis investigates the problem of handling biomedical terminology based on three research themes.

The first research theme deals with robust word-based retrieval. Effective retrieval models commonly use a word-based representation for retrieval. As so many spelling variations are present in biomedical text, the way in which these word-based representations are obtained affects retrieval effectiveness. We investigated the effect of choices in document preprocessing heuristics on retrieval effectiveness. This investigation included stop-word removal, stemming, different approaches to breakpoint identification and normalisation, and character n-gramming. In particular, breakpoint identification and normalisation (that is, determining word parts in biomedical compounds) showed a strong effect on retrieval performance. A combination of effective preprocessing heuristics was identified and used to obtain word-based representations from text for the remainder of this thesis.

The second research theme deals with concept-based retrieval. We investigated two representation vocabularies for concept-based indexing, one based on the Medical Subject Headings thesaurus, the other based on the Unified Medical Language System metathesaurus extended with a number of gene and protein dictionaries.

We investigated the following five topics.

  1. How documents are represented in a concept-based representation.
  2. To what extent such a document representation can be obtained automatically.
  3. To what extent a text-based query can be automatically mapped onto a concept-based representation and how this affects retrieval performance.
  4. To what extent a concept-based representation is effective in representing information needs.
  5. How the relationship between text and concepts can be used to determine the relatedness of concepts.

We compared different classification systems to obtain concept-based document and query representations automatically. We proposed two classification methods based on statistical language models, one based on K-Nearest Neighbours (KNN) and one based on Concept Language Models (CLM).
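The KNN idea can be illustrated with a generic sketch: score candidate concepts by the similarity-weighted votes of the k nearest labelled documents. The toy training data below is invented, and plain cosine similarity stands in for the thesis's language-model-based variant:

```python
from collections import defaultdict
from math import sqrt

# Toy labelled collection: word-count vectors with manually assigned
# concepts (all data invented for illustration).
train = [
    ({"gene": 2, "protein": 1}, {"Genetics"}),
    ({"gene": 1, "mutation": 2}, {"Genetics", "Mutation"}),
    ({"heart": 2, "disease": 1}, {"Cardiology"}),
]

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    if not dot:
        return 0.0
    norm = lambda v: sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

def knn_concepts(doc, k=2):
    """Rank candidate concepts by the similarity-weighted votes of the
    k nearest labelled documents."""
    neighbours = sorted(train, key=lambda ex: cosine(doc, ex[0]),
                        reverse=True)[:k]
    votes = defaultdict(float)
    for vec, concepts in neighbours:
        for c in concepts:
            votes[c] += cosine(doc, vec)
    return sorted(votes, key=votes.get, reverse=True)
```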

For a selection of classification systems we carried out a document classification experiment in which we investigated to what extent automatic classification could reproduce manual classification. The proposed KNN system performed well in comparison to the out-of-the-box systems. Manual analysis indicated the improved exhaustiveness of automatic classification over manual classification. Retrieval based on concepts alone was demonstrated to be significantly less effective than word-based retrieval. This deteriorated performance could be explained by errors in the classification process, limitations of the concept vocabularies and the limited exhaustiveness of the concept-based document representations. Retrieval based on a combination of word-based and automatically obtained concept-based query representations did significantly improve over word-only retrieval. In an artificial setting, we compared the optimal retrieval performance that could be obtained with word-based and concept-based representations. Contrary to our intuition, on average a single word-based query performed better than a single concept-based representation, even when the best concept term precisely represented part of the information need.

We investigated to what extent the relatedness between pairs of concepts as indicated by human judgements could be automatically reproduced. Results on a small test set indicated that a method based on comparing concept language models performed particularly well in comparison to systems based on taxonomy structure, information content and (document) association.

In the third and last research theme of this thesis we propose a framework for concept-based retrieval. We approached the integration of domain knowledge in monolingual information retrieval as a cross-lingual information retrieval (CLIR) problem. Two languages were identified in this monolingual setting: a word-based representation language based on free text, and a concept-based representation language based on a terminological resource. Similar to what is common in traditional CLIR, queries and documents are translated into the same representation language and matched. The cross-lingual perspective gives us the opportunity to adopt a large set of established CLIR methods and techniques for this domain. In analogy to established CLIR practice, we investigated translation models based on a parallel corpus containing documents in multiple representations and translation models based on a thesaurus. Surprisingly, even the integration of very basic translation models showed improvements in retrieval effectiveness over word-only retrieval. A translation model based on pseudo-feedback translation was shown to perform particularly well. We proposed three extensions to a basic cross-lingual retrieval model which, similar to previous approaches in established CLIR, improved retrieval effectiveness by combining multiple translation models. Experimental results indicate that, even when using very basic translation models, monolingual biomedical IR can benefit from a cross-lingual approach to integrate domain knowledge.
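The cross-lingual view can be illustrated with the most basic dictionary-style translation model: a word query is mapped onto a weighted concept query by summing word-to-concept translation probabilities, as one might estimate them from a parallel corpus. All identifiers and probabilities below are invented:

```python
# Hypothetical word-to-concept translation probabilities, e.g. as
# estimated from a parallel corpus (identifiers and numbers invented).
p_concept_given_word = {
    "cancer": {"C-Neoplasms": 0.8, "C-Cell": 0.2},
    "lung":   {"C-Lung": 0.9, "C-Respiration": 0.1},
}

def translate_query(words):
    """Map a word-based query onto a weighted concept-based query by
    summing translation probabilities, as in dictionary-based CLIR."""
    concept_query = {}
    for w in words:
        for concept, p in p_concept_given_word.get(w.lower(), {}).items():
            concept_query[concept] = concept_query.get(concept, 0.0) + p
    return concept_query
```

The resulting weighted concept query can then be matched against concept-based document representations, exactly as a translated query is matched in traditional CLIR.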

[download pdf]

Robin Aly defends PhD thesis on uncertainty in concept-based multimedia retrieval

by Robin Aly

This thesis considers concept-based multimedia retrieval, where documents are represented by the occurrence of concepts (also referred to as semantic concepts or high-level features). A concept can be thought of as a kind of label, which is attached to (parts of) the multimedia documents in which it occurs. Since concept-based document representations are user, language and modality independent, using them for retrieval has great potential for improving search performance. As collections quickly grow in both number and size, manually labeling concept occurrences becomes infeasible, and so-called concept detectors are used to decide automatically upon the occurrence of concepts in the documents.

The following fundamental problems in concept-based retrieval are identified and addressed in this thesis. First, the concept detectors frequently make mistakes while detecting concepts. Second, it is difficult for users to formulate their queries since they are unfamiliar with the concept vocabulary, and setting weights for each concept requires knowledge of the collection. Third, for supporting retrieval of longer video segments, single concept occurrences are not sufficient to differentiate relevant from non-relevant documents, and some notion of the importance of a concept in a segment is needed. Finally, since current detection techniques still perform poorly, it is important to be able to predict what search performance retrieval engines would yield if detection performance improved.

The main contribution of this thesis is the uncertain document representation ranking framework (URR). Based on the Nobel Prize-winning Portfolio Selection Theory, the URR framework considers the distribution over all possible concept-based document representations of a document given the observed confidence scores of concept detectors. For a given score function, documents are ranked by the expected score plus an additional term of the variance of the score, which represents the risk attitude of the system.
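In the spirit of mean-variance portfolio selection, such a ranking can be sketched as follows: each document carries a probability distribution over possible concept representations, and documents are ranked by the expected score plus a multiple of the score's variance (negative here, i.e. risk-averse). The data and risk parameter are invented for illustration, not the thesis's full URR framework:

```python
def urr_rank(docs, score_fn, b=-0.5):
    """Rank documents by E[score] + b * Var[score] over their possible
    representations (b < 0 encodes a risk-averse system)."""
    def mean_variance(reps):
        # reps: list of (probability, representation) pairs
        exp = sum(p * score_fn(r) for p, r in reps)
        var = sum(p * (score_fn(r) - exp) ** 2 for p, r in reps)
        return exp + b * var
    return sorted(docs, key=lambda d: mean_variance(docs[d]), reverse=True)

# Two documents with the same expected score; the risk-averse system
# prefers the one whose representation is certain.
docs = {
    "certain":   [(1.0, {"car": 0.5})],
    "uncertain": [(0.5, {"car": 1.0}), (0.5, {})],
}
ranking = urr_rank(docs, lambda rep: rep.get("car", 0.0))
```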

User-friendly concept selection is achieved by re-using an annotated development collection. Each video shot of the development collection is transformed into a textual description which yields a collection of textual descriptions. This collection is then searched for a textual query which does not require the user's knowledge of the concept vocabulary. The ranking of the textual descriptions and the knowledge of the concept occurrences in the development collection allows a selection of useful concepts together with their weights.

The URR framework and the proposed concept selection method are used to derive a shot and a video segment retrieval framework. For shot retrieval, the probabilistic ranking framework for unobservable events is proposed. The framework re-uses the well-known probability of relevance score function from text retrieval. Because of the representation uncertainty, documents are ranked by their expected retrieval score given the confidence scores from the concept detectors.

For video segment retrieval, the uncertain concept language model is proposed for retrieving news items — a particular video segment type. A news item is modeled as a series of shots and represented by the frequency of each selected concept. Using the parallel between concept frequencies and term frequencies, a concept language model score function is derived from the language modelling framework. The concept language model score function is then used according to the URR framework and documents are ranked by the expected concept language score plus an additional term of the score's variance.

The Monte Carlo Simulation method is used to predict the behavior of current retrieval models under improved concept detector performance. First, a probabilistic model of concept detector output is defined as two Gaussian distributions, one for the shots in which the concept occurs and one for the shots in which it does not. Randomly generating concept detector scores for a collection with known concept occurrences and executing a search on the generated output estimates the expected search performance given the model's parameters. By modifying the model parameters, the detector performance can be improved and the future search performance can be predicted.
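The simulation can be sketched directly: sample detector scores from the two Gaussians, rank the shots by sampled score, and measure average precision. All parameters below (means, standard deviation, collection sizes) are invented for illustration:

```python
import random

def simulate_ap(mu_pos, mu_neg, sigma=1.0, n_pos=50, n_neg=950, seed=7):
    """Model one concept detector as two Gaussians: scores for shots
    that contain the concept and for shots that do not. Sample a score
    per shot, rank by score, and return average precision."""
    rng = random.Random(seed)
    shots = ([(rng.gauss(mu_pos, sigma), True) for _ in range(n_pos)] +
             [(rng.gauss(mu_neg, sigma), False) for _ in range(n_neg)])
    shots.sort(key=lambda s: s[0], reverse=True)
    hits, ap = 0, 0.0
    for rank, (_, relevant) in enumerate(shots, start=1):
        if relevant:
            hits += 1
            ap += hits / rank
    return ap / n_pos

# Widening the gap between the two Gaussians models an improved
# detector; the simulated average precision should go up.
weak, strong = simulate_ap(0.5, 0.0), simulate_ap(3.0, 0.0)
```

Sweeping the separation between the two means thus maps detector quality onto predicted search performance.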

Experiments on several collections of the TRECVid evaluation benchmark showed that the URR framework often significantly improves search performance compared to several state-of-the-art baselines. The simulation of concept detectors shows that today's video shot retrieval models will achieve acceptable performance once detector performance reaches around 0.60 mean average precision. The simulation of video segment retrieval suggests that this task is easier and will be applicable to real-life applications sooner.

[download pdf]

Claudia Hauff defends PhD thesis on performance prediction

Predicting the Effectiveness of Queries and Retrieval Systems

by Claudia Hauff

The thesis considers users' attempts to express their information needs through queries, or search requests, and tries to predict whether those requests will be of high or low quality. Intuitively, a query's quality is determined by the outcome of the query, that is, whether the retrieved search results meet the user's expectations. The second type of prediction methods under investigation are those which attempt to predict the quality of search systems themselves. Given a number of search systems to consider, these methods estimate how well or how poorly the systems will perform in comparison to each other.

The motivation for this research effort stems primarily from the enormous benefits originating from successfully predicting the quality of a query or a system. Accurate predictions enable the employment of adaptive retrieval components which would have a considerable positive effect on the user experience. Furthermore, if we could achieve sufficiently accurate predictions of the quality of retrieval systems, the cost of evaluation would be significantly reduced.

In a first step, pre-retrieval predictors are investigated, which predict a query's effectiveness before the retrieval step and are thus independent of the ranked list of results. Such predictors base their predictions solely on query terms, collection statistics and possibly external sources such as WordNet or Wikipedia. A total of twenty-two prediction algorithms are categorized and their quality is assessed on three different TREC test collections, including two large Web collections. A number of newly applied methods for combining various predictors are examined to obtain a better prediction of a query's effectiveness. In order to adequately and appropriately compare such techniques the current evaluation methodology is critically examined. It is shown that the standard evaluation measure, namely the linear correlation coefficient, can provide a misleading indication of performance. To address this issue, the current evaluation methodology is extended to include cross validation and statistical testing to determine significant differences.
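A classic example of such a pre-retrieval predictor is the average inverse document frequency of the query terms, which needs only collection statistics. The document frequencies below are invented for illustration:

```python
from math import log

# Toy collection statistics: N documents and per-term document
# frequencies (numbers invented for illustration).
N = 1000
df = {"the": 900, "retrieval": 40, "xenon": 2}

def avg_idf(query):
    """Average inverse document frequency of the query terms: rare,
    specific terms predict an easier query than common ones."""
    terms = query.lower().split()
    return sum(log(N / df.get(t, 1)) for t in terms) / len(terms)
```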

Building on the analysis of pre-retrieval predictors, post-retrieval approaches are then investigated, which estimate a query's effectiveness on the basis of the retrieved results. The thesis focuses in particular on the Clarity Score approach and provides an analysis of its sensitivity towards different variables such as the collection, the query set and the retrieval approach. Adaptations to Clarity Score are introduced which improve the estimation accuracy of the original algorithm on most evaluated test collections.

The utility of query effectiveness prediction methods is commonly evaluated by reporting correlation coefficients, such as Kendall's Tau and the linear correlation coefficient, which denote how well the methods perform at predicting the retrieval effectiveness of a set of queries. Despite the significant amount of research dedicated to this important stage in the retrieval process, the following question has remained unexplored: what is the relationship of the current evaluation methodology for query effectiveness prediction and the change in effectiveness of retrieval systems that employ a predictor? We investigate this question with a large scale study for which predictors of arbitrary accuracy are generated in order to examine how the strength of their observed Kendall's Tau coefficient affects the retrieval effectiveness in two adaptive system settings: selective query expansion and meta-search. It is shown that the accuracy of currently existing query effectiveness prediction methods is not yet high enough to lead to consistent positive changes in retrieval performance in these particular settings.
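Kendall's Tau itself is simple to state: the normalised surplus of concordant over discordant pairs when comparing predicted with actual query effectiveness. A minimal sketch:

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall's Tau rank correlation between two score lists over the
    same queries: (concordant - discordant) / total pairs."""
    pairs = list(combinations(range(len(a)), 2))
    concordant = sum(1 for i, j in pairs
                     if (a[i] - a[j]) * (b[i] - b[j]) > 0)
    discordant = sum(1 for i, j in pairs
                     if (a[i] - a[j]) * (b[i] - b[j]) < 0)
    return (concordant - discordant) / len(pairs)

tau = kendall_tau([0.2, 0.5, 0.9], [0.1, 0.4, 0.8])
```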

The last part of the thesis is concerned with the task of estimating the ranking of retrieval systems according to their retrieval effectiveness without relying on costly relevance judgments. Five different system ranking estimation approaches are evaluated on a wide range of data sets which cover a variety of retrieval tasks and a variety of test collections. The issue that has long prevented this line of automatic evaluation from being used in practice is the severe mis-ranking of the best systems. In the experiments reported in this work, however, we show that this is not an inherent problem of system ranking estimation approaches; rather, it is data set dependent. Under certain conditions it is indeed possible to automatically identify the best systems correctly. Furthermore, our analysis reveals that the estimated ranking of systems is not equally accurate for all topics of a topic set, which motivates the investigation of relying on topic subsets to improve the accuracy of the estimate. A study to this effect indicates the validity of the approach.

[download pdf]