Dueling bandits for online ranker evaluation
by Masrour Zoghi
In every domain where a service or a product is provided, an important question is that of evaluation: given a set of possible choices for deployment, what is the best one? An important example, considered in this work, is that of ranker evaluation from the field of information retrieval (IR). The goal of IR is to satisfy the information need of a user in response to a query issued by them, where this information need is typically satisfied by a document (or a small set of documents) contained in what is often a much larger collection. This goal is often attained by ranking the documents according to their usefulness for the issued query using an algorithm called a ranker: a procedure that takes as input a query and a set of documents and specifies how the documents should be ordered.
This thesis is concerned with ranker evaluation. The goal of ranker evaluation is to determine the quality of the rankers under consideration so that we can choose the best option: given a finite set of possible rankers, which one of them leads to the highest level of user satisfaction? There are two main methods for carrying this out: absolute metrics and relative comparisons. This thesis is concerned with the second, relative form of ranker evaluation because it is more efficient at distinguishing between rankers of different quality: for instance, interleaved comparisons take a fraction of the time required by A/B testing, but they produce the same outcome. More precisely, the problem of online ranker evaluation from relative feedback can be described as follows: given a finite set of rankers, choose the best using only pairwise comparisons between the rankers under consideration, while minimizing the number of comparisons involving sub-optimal rankers. This problem is an instance of what is referred to in the literature as the dueling bandit problem.
The main contribution of this thesis is devising a dueling bandit algorithm, called Copeland Confidence Bounds (CCB), that solves this problem under practically general assumptions and providing theoretical guarantees for its proper functioning. In addition to that, the thesis contains a number of other algorithms that are better suited for dueling bandit problems with particular properties.
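The Copeland winner that CCB seeks can be illustrated with a small sketch: given an estimated pairwise preference matrix, a ranker's Copeland score counts the opponents it beats with probability above one half. The matrix below is made up, and CCB itself additionally maintains confidence bounds on these estimates, which this sketch omits.

```python
# Hypothetical pairwise preference matrix: P[i][j] is the estimated
# probability that ranker i wins an interleaved comparison against
# ranker j (values are made up for illustration).
P = [
    [0.5, 0.6, 0.7],
    [0.4, 0.5, 0.8],
    [0.3, 0.2, 0.5],
]

def copeland_scores(P):
    """Copeland score: number of opponents a ranker beats (P > 0.5)."""
    n = len(P)
    return [sum(1 for j in range(n) if j != i and P[i][j] > 0.5)
            for i in range(n)]

scores = copeland_scores(P)         # [2, 1, 0]
winner = scores.index(max(scores))  # ranker 0 is the Copeland winner
```

A dueling bandit algorithm must identify this winner while observing only noisy outcomes of individual duels, which is why CCB tracks upper and lower confidence bounds on each entry of the matrix rather than the true probabilities.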
by Mohammadreza Khelghati
Data is one of the keys to success. Whether you are a fraud detection officer in a tax office, a data journalist, or a business analyst, your primary concern is to access all the data relevant to your topics of interest. In such an information-thirsty environment, access to every source of information is valuable. This emphasizes the role of the web as one of the biggest and most important sources of data. In accessing web data through either general search engines or direct querying of deep web sources, the laborious work of querying, navigating results, downloading, storing, and tracking data changes is a burden on the shoulders of users. (Semi-)automatic harvesters play a crucial role in reducing this labor-intensive work. However, they lack a number of functionalities that we discuss and address in this work.
In this thesis, we investigate the path towards a focused web harvesting approach that can automatically and efficiently query websites, navigate through results, download data, store it, and track data changes over time. Such an approach can also help users access a complete collection of data relevant to their topics of interest and monitor it over time. To realize such a harvester, we focus on the following obstacles. First, we try to find methods that can achieve the best coverage in harvesting data for a topic. Although using a fully automatic general harvester facilitates access to web data, it is not a complete solution for achieving thorough data coverage on a given topic. Some search engines, in both the surface web and the deep web, restrict the number of requests from a user or limit the number of results returned to them. We suggest an efficient approach that can bypass these limitations and achieve complete data coverage.
Second, we investigate reducing the cost of harvesting a website, in terms of the number of submitted requests, by estimating its actual size. Harvesting tasks continue until they hit the query-submission limits imposed by search engines or exhaust all allocated resources. To prevent this undesirable situation, we need to know the size of the targeted source. For a website that hides the true size of its residing data, we suggest an accurate method to estimate that size.
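One classical way to estimate the size of a source that can only be sampled through queries is capture-recapture: draw two samples and infer the total size from their overlap. The Lincoln-Petersen sketch below assumes uniform random sampling with made-up data, an idealization that real search interfaces violate; the thesis's own method targets the harder, realistic setting.

```python
import random

def estimate_size(sample_a, sample_b):
    """Lincoln-Petersen capture-recapture estimate of collection size."""
    overlap = len(set(sample_a) & set(sample_b))
    if overlap == 0:
        return None  # samples too small to support an estimate
    return len(sample_a) * len(sample_b) // overlap

# Simulated hidden collection of 10,000 documents; two random
# "query result" samples of 500 documents each.
random.seed(42)
population = range(10_000)
a = random.sample(population, 500)
b = random.sample(population, 500)
print(estimate_size(a, b))  # an estimate close to 10,000
```

The estimator follows from the expectation that, for uniform samples, the fraction of the second sample already seen in the first approximates the ratio of the first sample to the whole collection.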
As the third challenge, we focus on monitoring data changes over time in web data repositories. This information is helpful for providing the most up-to-date answers to the information needs of users. The fast-evolving web adds extra challenges to keeping a data collection up to date. Considering the costly process of harvesting, it is important to find methods that enable efficient re-harvesting.
Lastly, we combine our harvesting experience with studies in the literature to suggest a general framework for designing and developing a web harvester. It is important to know how to configure harvesters so that they can be applied to different websites, domains, and settings.
These steps bring further improvements to the data coverage and monitoring functionalities of web harvesters, and can help users such as journalists, business analysts, organizations, and governments reach the data they need without requiring extreme software and hardware facilities. With this thesis, we hope to have contributed to the goal of focused web harvesting and monitoring topics over time.
Grouping by association: Using associative networks for document categorization
by Niels Bloom
In this thesis we describe a method of using associative networks for automatic document grouping. Associative networks are networks of concepts in which each concept is linked to concepts that are semantically similar to it. By activating concepts in the network based on the text of a document and spreading this activation to related concepts, we can determine what concepts are related to the document, even if the document itself does not contain words linked directly to those concepts. Based on this information, we can group documents by the concepts they refer to.
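The spreading-activation step can be sketched as follows. Starting from concepts activated by a document's text, activation flows along weighted association links, so concepts never mentioned literally still receive activation. The network, weights, and decay factor below are illustrative assumptions, not values from the thesis.

```python
# Toy associative network: each concept maps to semantically related
# concepts with an association strength (all names are illustrative).
network = {
    "python": {"programming": 0.8, "snake": 0.4},
    "programming": {"software": 0.7},
    "snake": {"reptile": 0.9},
}

def spread_activation(seeds, network, decay=0.5, steps=2):
    """Propagate activation from seed concepts to their neighbours."""
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        next_frontier = {}
        for concept, level in frontier.items():
            for neighbour, weight in network.get(concept, {}).items():
                gain = level * weight * decay  # attenuate with each hop
                activation[neighbour] = activation.get(neighbour, 0.0) + gain
                next_frontier[neighbour] = next_frontier.get(neighbour, 0.0) + gain
        frontier = next_frontier
    return activation

# A document mentioning only "python" also activates "software" and
# "reptile", concepts the document itself never contains.
result = spread_activation({"python": 1.0}, network)
```

The resulting activation levels can then serve as the basis for grouping documents by the concepts they refer to, as the abstract describes.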
In the first part of the thesis we describe the method itself, as well as the details of various algorithms used in the implementation. We additionally discuss the theory upon which the method is based and compare it to various related methods. In the second part of the thesis we present several improvements to the method. We evaluate techniques to create associative networks from easily accessible knowledge sources, as well as different methods for the training of the associative network. Additionally, we evaluate techniques to improve the extraction of concepts from documents, we compare methods of spreading activation from concept to concept, and we present a novel technique by which the extracted concepts can be used to categorize documents. We also extend the method of associative networks to enable the application to multilingual document libraries and compare the method to other state-of-the-art methods. Finally, we present a practical application of associative networks, as implemented in a corporate environment in the form of the Pagelink Knowledge Centre. We demonstrate the practical usability of our work, and discuss the various advantages and disadvantages that the method of associative networks offers.
Sergio Duarte Torres' PhD defense last Friday, February 14th, resulted in an exceptional PhD degree cum laude. His PhD thesis, “Information Retrieval for Children: Search Behavior and Solutions”, was written at the Database Group as part of the European project PuppyIR, a joint project with, amongst others, Human Media Interaction. Sergio's research shows an extraordinary diversity and heterogeneity, touching many areas of computer science, including Information Retrieval, Big Data analysis, and Machine Learning. Sergio sought cooperation with leading search engine companies in the field, Yahoo and Yandex, and did a three-month internship at Yahoo Research in Barcelona. Sergio's work has been well received. His paper on vertical selection for search for children was nominated for the Best Student Paper Award at the joint ACM/IEEE conference on Digital Libraries in Indianapolis, USA. His work has been accepted by two important journals in the field: the ACM Transactions on the Web and the Journal of the American Society for Information Science and Technology. Particularly worth mentioning is the user study with children aged 8 to 10 that Sergio conducted to evaluate the child-friendly search approaches he developed. We are proud of Sergio Duarte Torres' achievements. He will be an excellent ambassador of the University of Twente.
Today, Ilya Markov successfully defended his PhD thesis at the Università della Svizzera italiana in Lugano, Switzerland.
Uncertainty in Distributed Information Retrieval
by Ilya Markov
Large amounts of available digital information call for distributed processing and management solutions. Distributed Information Retrieval (DIR), also known as Federated Search, provides techniques for performing retrieval over such distributed data. In particular, it studies approaches to aggregating multiple searchable sources of information within a single interface.
DIR provides an efficient and low-cost solution to a distributed retrieval problem. As opposed to a centralized retrieval system, which acquires, stores and processes all available information locally, DIR delegates the search task to distributed sources. This way, DIR lowers the storage and processing costs and provides a user with up-to-date information even if this information is not crawlable (i.e. cannot be reached using hyperlinks).
DIR is usually based on a brokered architecture, according to which distributed retrieval is managed by a single broker. The broker-based DIR can be divided into five steps: resource discovery, resource description, resource selection, score normalization and results presentation. Among these steps, resource description, resource selection and score normalization are actively studied within DIR research, while the resource discovery step is addressed by the database community and results presentation is studied within aggregated search.
Despite the large volume of research on resource selection and score normalization, no unified framework for the developed techniques exists, which makes it difficult to apply and compare available methods. The first goal of this dissertation is to summarize, analyze, and evaluate existing resource selection and score normalization techniques within a unified framework. This should improve the understanding of available methods, reveal their underlying assumptions and limitations, and describe their properties. This, in turn, will help to improve existing resource selection and score normalization techniques and to apply the right method in the right setting.
The second and main contribution of this dissertation lies in stating and addressing the problem of uncertainty in DIR. In Information Retrieval (IR) this problem has long been recognized, and numerous techniques have been proposed to deal with uncertainty in various IR tasks. This dissertation raises the question of uncertainty in DIR, outlines the sources of uncertainty in the different DIR phases, and proposes methods for measuring and reducing this uncertainty.
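To make the score normalization step concrete: before merging result lists, the broker must map raw scores from heterogeneous sources onto a comparable scale. Min-max rescaling, sketched below with made-up scores, is one classical baseline among the techniques the dissertation surveys.

```python
def min_max_normalize(scores):
    """Rescale one source's document scores to the interval [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # degenerate list: no spread to rescale
    return [(s - lo) / (hi - lo) for s in scores]

# Two sources whose raw score ranges are incomparable (values made up):
source_a = [12.4, 9.1, 3.3]   # e.g. BM25-style scores
source_b = [0.92, 0.40, 0.07] # e.g. probabilities
merged = sorted(
    [(s, "A") for s in min_max_normalize(source_a)] +
    [(s, "B") for s in min_max_normalize(source_b)],
    reverse=True,
)
```

After normalization, the broker can interleave results from both sources by normalized score, which would be meaningless on the raw scales.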
Distributed Deep Web Search
by Kien Tjin-Kam-Jet
The World Wide Web contains billions of documents (and counting); hence, it is likely that some document will contain the answer or content you are searching for. While major search engines like Bing and Google often manage to return relevant results to your query, there are plenty of situations in which they are less capable of doing so. Specifically, there is a noticeable shortcoming in situations that involve the retrieval of data from the deep web. Deep web data is difficult for today's web search engines to crawl and index, largely because the data must be accessed via complex web forms. However, deep web data can be highly relevant to the information need of the end user. This thesis surveys the problems, solutions, and paradigms for deep web search. Moreover, it proposes a new paradigm to overcome the apparent limitations of the current state of deep web search, and makes the following scientific contributions:
- A more specific classification scheme for deep web search systems, to better illustrate the differences and variation between these systems.
- Virtual surfacing, a new and, in our opinion, better deep web search paradigm that combines the benefits of the two existing paradigms, surfacing and virtual integration, and raises new research opportunities.
- A stack decoding approach that combines rules and statistical usage information to interpret the end user's free-text query and subsequently derive filled-out web forms from that interpretation.
- A practical comparison of the developed approach against a well-established text-processing toolkit.
- Empirical evidence that, for a single site, end-users would rather use the proposed free-text search interface instead of a complex web form.
Analysis of data obtained from user studies shows that the stack decoding approach works as well as, or better than, today’s top-performing alternatives.
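To make the free-text query interpretation concrete, the toy interpreter below maps a query onto filled-out form fields using hand-written rules only. The form (a travel planner with `from`, `to`, and `date` fields) and all rules are hypothetical illustrations; the stack decoding approach described above additionally weighs statistical usage information to rank competing interpretations.

```python
import re

def interpret(query):
    """Map a free-text query onto fields of a hypothetical travel form."""
    fields = {}
    # Rule 1: "<origin> to <destination>" fills the route fields.
    m = re.search(r"(\w+)\s+to\s+(\w+)", query)
    if m:
        fields["from"], fields["to"] = m.group(1), m.group(2)
    # Rule 2: a date expression fills the date field.
    m = re.search(r"\b(today|tomorrow|\d{4}-\d{2}-\d{2})\b", query)
    if m:
        fields["date"] = m.group(1)
    return fields

interpret("amsterdam to berlin tomorrow")
# {'from': 'amsterdam', 'to': 'berlin', 'date': 'tomorrow'}
```

A real deployment would generate several candidate interpretations per query and score them, but even this sketch shows how a single free-text box can stand in for a complex web form.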
Adele Lu Jia successfully defended her PhD thesis at Delft University of Technology,
Online Networks as Societies: User Behaviors and Contribution Incentives
by Adele Lu Jia
Online networks like Facebook and BitTorrent have become popular and powerful infrastructures for users to communicate, to interact, and to share social lives with each other. These networks often rely on the cooperation and the contribution of their users. Nevertheless, users in online networks are often found to be selfish, lazy, or even malicious, rather than cooperative, and therefore need to be incentivized to contribute. To date, great effort has been put into designing effective contribution incentive policies, which range from barter schemes to monetary schemes. In this thesis, we conduct an analysis of user behaviors and contribution incentives in online networks. We approach online networks as both computer systems and societies, hoping that this approach will, on the one hand, motivate computer scientists to think about the similarities between their artificial computer systems and the natural world, and, on the other hand, help people outside the field understand online networks more easily.
To summarize, in this thesis we provide theoretical and practical insights into the correlation between user behaviors and contribution incentives in online networks. We demonstrate user behaviors and their consequences at both the system and the individual level, we analyze barter schemes and their limitations in incentivizing users to contribute, we evaluate monetary schemes and their risks of causing the collapse of the entire system, and we examine user interactions and their implications for inferring user relationships. Above all, unlike the offline human society that has evolved for thousands of years, online networks emerged only two decades ago and are still in a primitive state. Yet with their ever-improving technologies we have already obtained many exciting results. This points the way to a promising future for the study of online networks, not only in analyzing online behaviors, but also in cross-referencing them with offline societies.
When Complexity becomes Interesting: An Inquiry into the Information eXperience
by Frans van der Sluis
To date, most research in information retrieval and related fields has been concerned primarily with the efficiency and effectiveness of either the information system or the user's interaction with it. At the same time, understanding the experience of a user during information interaction is recognized as a grand challenge for the development of information systems. There is a widely shared intuition that the value of the retrieved information depends on more than system characteristics such as the topical overlap between a query and a document. As it is not obvious how to embrace this intuition, this challenge has mostly been ignored. This dissertation embarked upon the challenge of describing and developing an operational model of the Information eXperience (IX), the experience during the interaction with information. This task was decomposed into three sub-challenges:
- Transform the fuzzy concept of the IX into a formalized one.
- Develop a model of textual complexity that enables an information system to influence a user's IX.
- Identify and influence the causes of the experience of interest in text.
Today, Debasis Ganguly successfully defended his PhD thesis at Dublin City University.
Topical Relevance Models
by Debasis Ganguly
An inherent characteristic of information retrieval (IR) is that the query expressing a user's information need is often multi-faceted; that is, it encapsulates more than one specific potential sub-information need. This multi-facetedness of queries manifests itself as a topic distribution in the retrieved set of documents, where each document can be considered a mixture of topics, one or more of which may correspond to the sub-information needs expressed in the query. In some specific domains of IR, such as patent prior art search, where the queries are full patent articles and the objective is to (in)validate the claims contained therein, the queries themselves are multi-topical in addition to the retrieved set of documents. The overall objective of the research described in this thesis is to investigate techniques to recognize and exploit this multi-topical characteristic of the retrieved documents and the queries, both in retrieval and in relevance feedback.
Today, Robert Neumayer defended his Ph.D. thesis Semantic and Distributed Entity Search in the Web of Data at the Norwegian University of Science and Technology (NTNU) in Trondheim, Norway.
by Robert Neumayer
Both the growth and the ubiquitous character of the Internet have had a profound effect on how we access and consume data and information. More recently, the Semantic Web, an extension of the current Web, has become increasingly relevant due to its widespread adoption.
The Web of Data (WoD) is an extension of the current web in which not only documents are interlinked by means of hyperlinks, but also data, in terms of predicates. Specifically, it describes objects, entities, or “things” in terms of their attributes and their relationships, using RDF data (and the term is often used equivalently to Linked Data). Given its growth, there is a strong need to make this wealth of knowledge accessible through keyword search (the de facto standard paradigm for accessing information online).
The overall goal of this thesis is to provide new techniques for accessing this data, i.e., to bring its full potential to end users. We therefore address the following four main issues: a) how can the Web of Data be searched by means of keyword search?, b) what sets search in the WoD apart from traditional web search?, c) how can these elements be used in a theoretically sound and effective way?, and d) how can the techniques be adapted to a distributed environment? To this end, we develop techniques for effectively searching WoD sources. We build upon and formalise existing entity modelling approaches within a generative language modelling framework, and compare them experimentally using standard test collections. We show that these models outperform the current state of the art in terms of retrieval effectiveness; however, this comes at the cost of abandoning a large part of the semantics behind the data. We propose a novel entity model capable of preserving the semantics associated with entities without sacrificing retrieval effectiveness. We further show how these approaches can be applied in the distributed context, both with low (federated search) and high (peer-to-peer, or P2P) numbers of independent repositories, collections, or nodes.
The main contributions are as follows:
- We develop a hybrid approach to search in the Web of Data, using elements from traditional information retrieval and structured retrieval alike.
- We formalise our approaches in a language model setting.
- Our extensions are successfully evaluated with respect to their applicability in different distributed environments such as federated search and P2P.
- We discuss and analyse our empirical evaluation and provide insights into the entity search problem.
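The generative entity modelling mentioned above can be sketched minimally: flatten each entity's RDF description into a unigram language model and rank entities by query likelihood. All data below is made up, and the flattening deliberately discards predicate structure, which is exactly the kind of semantic loss the thesis's novel entity model is designed to avoid.

```python
from collections import Counter

# Toy RDF-style entity descriptions as predicate-object pairs
# (illustrative data, not from the thesis's test collections).
entities = {
    "Amsterdam": [("type", "city"), ("country", "netherlands"),
                  ("famous_for", "canals")],
    "Trondheim": [("type", "city"), ("country", "norway")],
}

def term_model(triples):
    """Flatten an entity's object values into a unigram language model."""
    counts = Counter(obj for _pred, obj in triples)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def query_likelihood(query_terms, model, epsilon=1e-6):
    """Score an entity as the product of its query-term probabilities."""
    score = 1.0
    for term in query_terms:
        score *= model.get(term, epsilon)  # epsilon stands in for smoothing
    return score

ranked = sorted(
    entities,
    key=lambda e: query_likelihood(["city", "norway"], term_model(entities[e])),
    reverse=True,
)  # Trondheim ranks first for the query "city norway"
```

A fielded variant of this model would keep per-predicate term distributions instead of one flat bag, which is the direction a semantics-preserving entity model takes.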