Optimizing Ranking Systems Online as Bandits
by Chang Li
People use interactive systems, such as search engines, as the main tool to obtain information. To satisfy the information needs, such systems usually provide a list of items that are selected out of a large candidate set and then sorted in the decreasing order of their usefulness. The result lists are generated by a ranking algorithm, called ranker, which takes the request of user and candidate items as the input and decides the order of candidate items. The quality of these systems depends on the underlying rankers.
There are two main approaches to optimize the ranker in an interactive system: using data annotated by humans or using the interactive user feedback. The first approach has been widely studied in history, also called offline learning to rank, and is the industry standard. However, the annotated data may not well represent information needs of users and are not timely. Thus, the first approaches may lead to suboptimal rankers. The second approach optimizes rankers by using interactive feedback. This thesis considers the second approach, learning from the interactive feedback. The reasons are two-fold:
- Everyday, millions of users interact with the interactive systems and generate a huge number of interactions, from which we can extract the information needs of users.
- Learning from the interactive data have more potentials to assist in designing the online algorithms.
Specifically, this thesis considers the task of learning from the user click feedback. The main contribution of this thesis is proposing a safe online learning to re-rank algorithm, named BubbleRank, which addresses one main disadvantage of online learning, i.e., the safety issue, by combining the advantages of both offline and online learning to rank algorithms. The thesis also proposes three other online algorithms, each of which solves unique online ranker optimization problems. All the proposed algorithms are theoretically sound and empirically effective.
Image by @firstname.lastname@example.org
The Blind Man and the Elephant: Measuring Economic Impacts of DDoS Attacks
Internet has become an important part of our everyday life. We use services like Netflix, Skype, online banking and scopus etc. daily. We even use Internet for filing our taxes and communicating with municipality. This dependency on network-based technologies also provides an opportunity to malicious actors in our society to remotely attack IT infrastructure. One such cyberattack that may lead to unavailability of network resources is known as distributed denial of service (DDoS) attack. A DDoS attack leverages many computers to launch a coordinated Denial of Service attack against one or more targets.
These attacks cause damages to victim businesses. According to reports published by several consultancies and security companies these attacks lead to millions of dollars in losses every year. One might ponder: are the damages caused by temporary unavailability of network services really this large? One of the points of criticism for these reports has been that they often base their findings on victim surveys and expert opinions. Now, as cost accounting/book keeping methods are not focused on measuring the impact of cyber security incidents, it is highly likely that surveys are unable to capture the true impact of an attack. A concerning fact is that most C-level managers make budgetary decisions for security based on the losses reported in these surveys. Several inputs for security investment decision models such as return on security investment (ROSI) also depend on these figures. This makes the situation very similar to the parable of the blind men and the elephant, who try to conceptualise how the elephant looks like by touching it. Hence, it is important to develop methodologies that capture the true impact of DDoS attacks. In this thesis, we study the economic impact of DDoS attacks on public/private organisations by using an empirical approach.
Temporal Information Models for Real-Time Microblog Search
by Flávio Martins
Real-time search in Twitter and other social media services is often biased towards the most recent results due to the “in the moment” nature of topic trends and their ephemeral relevance to users and media in general. However, “in the moment”, it is often difficult to look at all emerging topics and single-out the important ones from the rest of the social media chatter. This thesis proposes to leverage on external sources to estimate the duration and burstiness of live Twitter topics. It extends preliminary research where it was shown that temporal re-ranking using external sources could indeed improve the accuracy of results. To further explore this topic we pursued three significant novel approaches:
(1) multi-source information analysis that explores behavioral dynamics of users, such as Wikipedia live edits and page view streams, to detect topic trends and estimate the topic interest over time;
(2) efficient methods for federated query expansion towards the improvement of query meaning; and
(3) exploiting multiple sources towards the detection of temporal query intent.
It differs from past approaches in the sense that it will work over real-time queries, leveraging on live user-generated content. This approach contrasts with previous methods that require an offline preprocessing step.
(Photo by @email@example.com)
Dueling bandits for online ranker evaluation
by Masrour Zoghi
In every domain where a service or a product is provided, an important question is that of evaluation: given a set of possible choices for deployment, what is the best one? An important example, which is considered in this work, is that of ranker evaluation from the field of information retrieval (IR). The goal of IR is to satisfy the information need of a user in response to a query issued by them, where this information need is typically satisfied by a document (or a small set of documents) contained in what is often a much larger collection. This goal is often attained by ranking the documents according to their usefulness to the issued query using an algorithm, called a ranker, a procedure that takes as input a query and a set of documents and specifies how the documents need to be ordered.
This thesis is concerned with ranker evaluation. The goal of ranker evaluation is to determine the quality of the rankers under consideration to allow us to choose the best option: given a finite set of possible rankers, which one of them leads to the highest level of user satisfaction? There are two main methods for carrying this out: absolute metrics and relative comparisons. This thesis is concerned with the second, relative form of ranker evaluation because it is more efficient at distinguishing between rankers of different quality: for instance interleaved comparisons take a fraction of the time required by A/B testing, but they produce the same outcome. More precisely, the problem of online ranker evaluation from relative feedback can be described as follows: given a finite set of rankers, choose the best using only pairwise comparisons between the rankers under consideration, while minimizing the number of comparisons involving sub-optimal rankers. This problem is an instance of what is referred to as the dueling bandit problem in the literature.
The main contribution of this thesis is devising a dueling bandit algorithm, called Copeland Confidence Bounds (CCB), that solves this problem under practically general assumptions and providing theoretical guarantees for its proper functioning. In addition to that, the thesis contains a number of other algorithms that are better suited for dueling bandit problems with particular properties.
by Mohammadreza Khelghati
Data is one of the keys to success. Whether you are a fraud detection officer in a tax office, a data journalist or a business analyst, your primary concern is to access all the relevant data to your topics of interest. In such an information-thirsty environment, accessing every source of information is valuable. This emphasizes the role of the web as one of the biggest and main sources of data. In accessing web data through either general search engines or direct querying of deep web sources, the laborious work of querying, navigating results, downloading, storing and tracking data changes is a burden on shoulders of users. To decrease this intensive labor work of accessing data, (semi-)automatic harvesters have a crucial role. However, they lack a number of functionalities that we discuss and address in this work.
In this thesis, we investigate the path towards a focused web harvesting approach which can automatically and efficiently query websites, navigate through results, download data, store it and track data changes over time. Such an approach can also facilitate users to access a complete collection of relevant data to their topics of interest and monitor it over time. To realize such a harvester, we focus on the following obstacles. First, we try to find methods that can achieve the best coverage in harvesting data for a topic. Although using a fully automatic general harvester facilitates accessing web data, it is not a complete solution to collect a thorough data coverage on a given topic. Some search engines, in both surface web and deep web, restrict the number of requests from a user or limit the number of returned results presented to him. We suggest an efficient approach which can pass these limitations and achieve a complete data coverage.
Second, we investigate reducing the cost of harvesting a website regarding the number of submitted requests by estimating its actual size. Harvesting tasks continue till they face the posed query submission limitations by search engines or consume all the allocated resources. To prevent this undesirable situation, we need to know the size of the targeted source. For a website that hides the true size of its residing data, we suggest an accurate method to estimate its size.
As the third challenge, we focus on monitoring data changes over time in web data repositories. This information is helpful in providing the most up-to-date answers to information needs of users. The fast evolving web adds extra challenges for having an up-to-date data collection. Considering the costly process of harvesting, it is important to find methods which facilitate efficient re-harvesting processes.
Lastly, we combine our experiences in harvesting with the studies in the literature to suggest a general designing and developing framework for a web harvester. It is important to know how to configure harvesters so that they can be applied to different websites, domains and settings.
These steps bring further improvements to data coverage and monitoring functionalities of web harvesters and can help users such as journalists, business analysts, organizations and governments to reach the data they need without requiring extreme software and hardware facilities. With this thesis, we hope to have contributed to the goal of focused web harvesting and monitoring topics over time.
Co-occurrence Rate Networks: Towards separate training for undirected graphical models
by Zhemin Zhu
Dependence is a universal phenomenon which can be observed everywhere. In machine learning, probabilistic graphical models (PGMs) represent dependence relations with graphs. PGMs find wide applications in natural language processing (NLP), speech processing, computer vision, biomedicine, information retrieval, etc. Many traditional models, such as hidden Markov models (HMMs), Kalman filters, can be put under the umbrella of PGMs. The central idea of PGMs is to decompose (factorize) a joint probability into a product of local factors. Learning, inference and storage can be conducted efficiently over the factorization representation.
Two major types of PGMs can be distinguished: (i) Bayesian networks (directed graphs), and (ii) Markov networks (undirected graphs). Bayesian networks represent directed dependence with directed edges. Local factors of Bayesian networks are conditional probabilities. Directed dependence, directed edges and conditional probabilities are all asymmetric notions. In contrast, Markov networks represent mutual dependence with undirected edges. Both of mutual dependence and undirected edges are symmetric notions. For general Markov networks, based on the Hammersley–Clifford theorem, local factors are positive functions over maximum cliques. These local factors are explained using intuitive notions like ‘compatibility’ or ‘affinity’. Specially, if a graph forms a clique tree, the joint probability can be reparameterized into a junction tree factorization.
In this thesis, we propose a novel framework motivated by the Minimum Shared Information Principle (MSIP): We try to find a factorization in which the information shared between factors is minimum. In other words, we try to make factors as independent as possible.
The benefit of doing this is that we can train factors separately without paying a lot of efforts to guarantee consistency between them. To achieve this goal, we develop a theoretical framework called co-occurrence rate networks (CRNs) to obtain such a factorization. Briefly, given a joint probability, the CRN factorization is obtained as follows. We first strip off singleton probabilities from the joint probability. The quantity left is called co-occurrence rate (CR). CR is a symmetric quantity which measures mutual dependence among variables involved. Then we further decompose the joint CR into smaller and indepen dent CRs. Finally, we obtain a CRN factorization whose factors consist of all singleton probabilities and CR factors. There exist two kinds of independencies between these factors: (i) a singleton probability is independent (Here independent means two factors do not share information.) of other singleton probabilities; (ii) a CR factor is independent of other CR factors conditioned by singleton probabilities. Based on a CRN factorization, we propose an efficient two-step separate training method: (i) in the first step, we train a separate model for each singleton probability; (ii) given singleton probabilities, we train a separate model for each CR factor. Experimental results on three important natural language processing tasks show that our separate training method is two orders of magnitude faster than conditional random fields, while achieving competitive quality (often better on the overall quality metric F1).
The second contribution of this thesis is applying PGMs to a real-world NLP application: open relation extraction (ORE). In open relation extraction, two entities in a sentence are given, and the goal is to automatically extract their relation expression. ORE is a core technique, especially in the age of big data, for transforming unstructured information into structured data. We propose our model SimpleIE for this task. The basic idea is to decompose an extraction pattern into a sequence of simplification operations (components). The benefit by doing this is that these components can be re-combined in a new way to generate new extraction patterns. Hence SimpleIE can represent and capture diverse extraction patterns. This model is essentially a sequence labeling model. Experimental results on three benchmark data sets show that SimpleIE boosts recall and F1 by at least 15% comparing with seven ORE systems.
As tangible outputs of this thesis, we contribute open source implementations of our research results as well as an annotated data set.
Grouping by association: Using associative networks for document categorization
by Niels Bloom
In this thesis we describe a method of using associative networks for automatic document grouping. Associative networks are networks of concepts in which each concept is linked to concepts that are semantically similar to it. By activating concepts in the network based on the text of a document and spreading this activation to related concepts, we can determine what concepts are related to the document, even if the document itself does not contain words linked directly to those concepts. Based on this information, we can group documents by the concepts they refer to.
In the first part of the thesis we describe the method itself, as well as the details of various algorithms used in the implementation. We additionally discuss the theory upon which the method is based and compare it to various related methods. In the second part of the thesis we present several improvements to the method. We evaluate techniques to create associative networks from easily accessible knowledge sources, as well as different methods for the training of the associative network. Additionally, we evaluate techniques to improve the extraction of concepts from documents, we compare methods of spreading activation from concept to concept, and we present a novel technique by which the extracted concepts can be used to categorize documents. We also extend the method of associative networks to enable the application to multilingual document libraries and compare the method to other state-of-the-art methods. Finally, we present a practical application of associative networks, as implemented in a corporate environment in the form of the Pagelink Knowledge Centre. We demonstrate the practical usability of our work, and discuss the various advantages and disadvantages that the method of associative networks offers.
Sergio Duarte Torres' PhD defense last Friday February 14th resulted in a exceptional PhD degree cum laude. His PhD thesis: “Information Retrieval for Children: Search Behavior and Solutions” was written at the Database Group as part of the European project PuppyIR, a joint project with amongst others Human Media Interaction. Sergio's research shows an extraordinary diversity and heterogeneity, touching many areas of computer science, including Information Retrieval, Big Data analysis, and Machine Learning. Sergio sought cooperation with leading search engine companies in the field: Yahoo and Yandex. He did a three-month internship at Yahoo Research in Barcelona. Sergio's work is well-received. His paper on vertical selection for search for children was nominated for the Best Student Paper Award at the joint ACM/IEEE conference on Digital Libraries in Indianapolis, USA. His work is accepted at two important journals in the field: the ACM Transactions on the Web, and the Journal of the American Society of Information Science and Technology. Specifically worth mentioning is the user study with children aged 8 to 10 years old done by Sergio to evaluate the child-friendly search approaches that he developed. We are proud of the achievements of Sergio Duarte Torres. He will be an excellent ambassador of the University of Twente.
Today, Ilya Markov successfully defended his PhD thesis at the Università della Svizzera italiana in Lugano, Switzerland.
Uncertainty in Distributed Information Retrieval
by Ilya Markov
Large amounts of available digital information call for distributed processing and management solutions. Distributed Information Retrieval (DIR), also known as Federated Search, provides techniques for performing retrieval over such distributed data. In particular, it studies approaches to aggregating multiple searchable sources of information within a single interface.
DIR provides an efficient and low-cost solution to a distributed retrieval problem. As opposed to a centralized retrieval system, which acquires, stores and processes all available information locally, DIR delegates the search task to distributed sources. This way, DIR lowers the storage and processing costs and provides a user with up-to-date information even if this information is not crawlable (i.e. cannot be reached using hyperlinks).
DIR is usually based on a brokered architecture, according to which distributed retrieval is managed by a single broker. The broker-based DIR can be divided into five steps: resource discovery, resource description, resource selection, score normalization and results presentation. Among these steps, resource description, resource selection and score normalization are actively studied within DIR research, while the resource discovery step is addressed by the database community and results presentation is studied within aggregated search.
Despite the large volume of research on resource selection and score normalization, no unified framework of developed techniques exists, which makes difficult the application and comparison of available methods. The first goal of this dissertation is to summarize, analyze and evaluate existing resource selection and score normalization techniques within a unified framework. This should improve the understanding of available methods, reveal their underlying assumptions and limitations and describe their properties. This, in turn, will help to improve existing resource selection and score normalization techniques and to apply the right method in the right setting.
The second and the main contribution of this dissertation is in stating and addressing the problem of uncertainty in DIR. In Information Retrieval (IR) this problem has been recognized for a long time and numerous techniques have been proposed to deal with uncertainty in various IR tasks. This dissertation raises the question of uncertainty in DIR, outlines the sources of uncertainty on different DIR phases and proposes methods for measuring and reducing this uncertainty.
Distributed Deep Web Search
by Kien Tjin-Kam-Jet
The World Wide Web contains billions of documents (and counting); hence, it is likely that some document will contain the answer or content you are searching for. While major search engines like Bing and Google often manage to return relevant results to your query, there are plenty of situations in which they are less capable of doing so. Specifically, there is a noticeable shortcoming in situations that involve the retrieval of data from the deep web. Deep web data is difficult to crawl and index for today's web search engines, and this is largely due to the fact that the data must be accessed via complex web forms. However, deep web data can be highly relevant to the information-need of the end-user. This thesis overviews the problems, solutions, and paradigms for deep web search. Moreover, it proposes a new paradigm to overcome the apparent limitations in the current state of deep web search, and makes the following scientific contributions:
- A more specific classification scheme for deep web search systems, to better illustrate the differences and variation between these systems.
- Virtual surfacing, a new, and in our opinion better, deep web search paradigm which tries to combine the benefits of the two already existing paradigms, surfacing and virtual integration, and which also raises new research opportunities.
- A stack decoding approach which combines rules and statistical usage information for interpreting the end-user's free-text query, and to subsequently derive filled-out web forms based on that interpretation.
- A practical comparison of the developed approach against a well-established text-processing toolkit.
- Empirical evidence that, for a single site, end-users would rather use the proposed free-text search interface instead of a complex web form.
Analysis of data obtained from user studies shows that the stack decoding approach works as well as, or better than, today’s top-performing alternatives.