Temporal Information Models for Real-Time Microblog Search
by Flávio Martins
Real-time search in Twitter and other social media services is often biased towards the most recent results due to the “in the moment” nature of topic trends and their ephemeral relevance to users and media in general. However, “in the moment”, it is often difficult to look at all emerging topics and single-out the important ones from the rest of the social media chatter. This thesis proposes to leverage on external sources to estimate the duration and burstiness of live Twitter topics. It extends preliminary research where it was shown that temporal re-ranking using external sources could indeed improve the accuracy of results. To further explore this topic we pursued three significant novel approaches:
(1) multi-source information analysis that explores behavioral dynamics of users, such as Wikipedia live edits and page view streams, to detect topic trends and estimate the topic interest over time;
(2) efficient methods for federated query expansion towards the improvement of query meaning; and
(3) exploiting multiple sources towards the detection of temporal query intent.
It differs from past approaches in the sense that it will work over real-time queries, leveraging on live user-generated content. This approach contrasts with previous methods that require an offline preprocessing step.
(Photo by @firstname.lastname@example.org)
Dueling bandits for online ranker evaluation
by Masrour Zoghi
In every domain where a service or a product is provided, an important question is that of evaluation: given a set of possible choices for deployment, what is the best one? An important example, which is considered in this work, is that of ranker evaluation from the field of information retrieval (IR). The goal of IR is to satisfy the information need of a user in response to a query issued by them, where this information need is typically satisfied by a document (or a small set of documents) contained in what is often a much larger collection. This goal is often attained by ranking the documents according to their usefulness to the issued query using an algorithm, called a ranker, a procedure that takes as input a query and a set of documents and specifies how the documents need to be ordered.
This thesis is concerned with ranker evaluation. The goal of ranker evaluation is to determine the quality of the rankers under consideration to allow us to choose the best option: given a finite set of possible rankers, which one of them leads to the highest level of user satisfaction? There are two main methods for carrying this out: absolute metrics and relative comparisons. This thesis is concerned with the second, relative form of ranker evaluation because it is more efficient at distinguishing between rankers of different quality: for instance interleaved comparisons take a fraction of the time required by A/B testing, but they produce the same outcome. More precisely, the problem of online ranker evaluation from relative feedback can be described as follows: given a finite set of rankers, choose the best using only pairwise comparisons between the rankers under consideration, while minimizing the number of comparisons involving sub-optimal rankers. This problem is an instance of what is referred to as the dueling bandit problem in the literature.
The main contribution of this thesis is devising a dueling bandit algorithm, called Copeland Confidence Bounds (CCB), that solves this problem under practically general assumptions and providing theoretical guarantees for its proper functioning. In addition to that, the thesis contains a number of other algorithms that are better suited for dueling bandit problems with particular properties.
by Mohammadreza Khelghati
Data is one of the keys to success. Whether you are a fraud detection officer in a tax office, a data journalist or a business analyst, your primary concern is to access all the relevant data to your topics of interest. In such an information-thirsty environment, accessing every source of information is valuable. This emphasizes the role of the web as one of the biggest and main sources of data. In accessing web data through either general search engines or direct querying of deep web sources, the laborious work of querying, navigating results, downloading, storing and tracking data changes is a burden on shoulders of users. To decrease this intensive labor work of accessing data, (semi-)automatic harvesters have a crucial role. However, they lack a number of functionalities that we discuss and address in this work.
In this thesis, we investigate the path towards a focused web harvesting approach which can automatically and efficiently query websites, navigate through results, download data, store it and track data changes over time. Such an approach can also facilitate users to access a complete collection of relevant data to their topics of interest and monitor it over time. To realize such a harvester, we focus on the following obstacles. First, we try to find methods that can achieve the best coverage in harvesting data for a topic. Although using a fully automatic general harvester facilitates accessing web data, it is not a complete solution to collect a thorough data coverage on a given topic. Some search engines, in both surface web and deep web, restrict the number of requests from a user or limit the number of returned results presented to him. We suggest an efficient approach which can pass these limitations and achieve a complete data coverage.
Second, we investigate reducing the cost of harvesting a website regarding the number of submitted requests by estimating its actual size. Harvesting tasks continue till they face the posed query submission limitations by search engines or consume all the allocated resources. To prevent this undesirable situation, we need to know the size of the targeted source. For a website that hides the true size of its residing data, we suggest an accurate method to estimate its size.
As the third challenge, we focus on monitoring data changes over time in web data repositories. This information is helpful in providing the most up-to-date answers to information needs of users. The fast evolving web adds extra challenges for having an up-to-date data collection. Considering the costly process of harvesting, it is important to find methods which facilitate efficient re-harvesting processes.
Lastly, we combine our experiences in harvesting with the studies in the literature to suggest a general designing and developing framework for a web harvester. It is important to know how to configure harvesters so that they can be applied to different websites, domains and settings.
These steps bring further improvements to data coverage and monitoring functionalities of web harvesters and can help users such as journalists, business analysts, organizations and governments to reach the data they need without requiring extreme software and hardware facilities. With this thesis, we hope to have contributed to the goal of focused web harvesting and monitoring topics over time.
Co-occurrence Rate Networks: Towards separate training for undirected graphical models
by Zhemin Zhu
Dependence is a universal phenomenon which can be observed everywhere. In machine learning, probabilistic graphical models (PGMs) represent dependence relations with graphs. PGMs find wide applications in natural language processing (NLP), speech processing, computer vision, biomedicine, information retrieval, etc. Many traditional models, such as hidden Markov models (HMMs), Kalman filters, can be put under the umbrella of PGMs. The central idea of PGMs is to decompose (factorize) a joint probability into a product of local factors. Learning, inference and storage can be conducted efficiently over the factorization representation.
Two major types of PGMs can be distinguished: (i) Bayesian networks (directed graphs), and (ii) Markov networks (undirected graphs). Bayesian networks represent directed dependence with directed edges. Local factors of Bayesian networks are conditional probabilities. Directed dependence, directed edges and conditional probabilities are all asymmetric notions. In contrast, Markov networks represent mutual dependence with undirected edges. Both of mutual dependence and undirected edges are symmetric notions. For general Markov networks, based on the Hammersley–Clifford theorem, local factors are positive functions over maximum cliques. These local factors are explained using intuitive notions like ‘compatibility’ or ‘affinity’. Specially, if a graph forms a clique tree, the joint probability can be reparameterized into a junction tree factorization.
In this thesis, we propose a novel framework motivated by the Minimum Shared Information Principle (MSIP): We try to find a factorization in which the information shared between factors is minimum. In other words, we try to make factors as independent as possible.
The benefit of doing this is that we can train factors separately without paying a lot of efforts to guarantee consistency between them. To achieve this goal, we develop a theoretical framework called co-occurrence rate networks (CRNs) to obtain such a factorization. Briefly, given a joint probability, the CRN factorization is obtained as follows. We first strip off singleton probabilities from the joint probability. The quantity left is called co-occurrence rate (CR). CR is a symmetric quantity which measures mutual dependence among variables involved. Then we further decompose the joint CR into smaller and indepen dent CRs. Finally, we obtain a CRN factorization whose factors consist of all singleton probabilities and CR factors. There exist two kinds of independencies between these factors: (i) a singleton probability is independent (Here independent means two factors do not share information.) of other singleton probabilities; (ii) a CR factor is independent of other CR factors conditioned by singleton probabilities. Based on a CRN factorization, we propose an efficient two-step separate training method: (i) in the first step, we train a separate model for each singleton probability; (ii) given singleton probabilities, we train a separate model for each CR factor. Experimental results on three important natural language processing tasks show that our separate training method is two orders of magnitude faster than conditional random fields, while achieving competitive quality (often better on the overall quality metric F1).
The second contribution of this thesis is applying PGMs to a real-world NLP application: open relation extraction (ORE). In open relation extraction, two entities in a sentence are given, and the goal is to automatically extract their relation expression. ORE is a core technique, especially in the age of big data, for transforming unstructured information into structured data. We propose our model SimpleIE for this task. The basic idea is to decompose an extraction pattern into a sequence of simplification operations (components). The benefit by doing this is that these components can be re-combined in a new way to generate new extraction patterns. Hence SimpleIE can represent and capture diverse extraction patterns. This model is essentially a sequence labeling model. Experimental results on three benchmark data sets show that SimpleIE boosts recall and F1 by at least 15% comparing with seven ORE systems.
As tangible outputs of this thesis, we contribute open source implementations of our research results as well as an annotated data set.
Grouping by association: Using associative networks for document categorization
by Niels Bloom
In this thesis we describe a method of using associative networks for automatic document grouping. Associative networks are networks of concepts in which each concept is linked to concepts that are semantically similar to it. By activating concepts in the network based on the text of a document and spreading this activation to related concepts, we can determine what concepts are related to the document, even if the document itself does not contain words linked directly to those concepts. Based on this information, we can group documents by the concepts they refer to.
In the first part of the thesis we describe the method itself, as well as the details of various algorithms used in the implementation. We additionally discuss the theory upon which the method is based and compare it to various related methods. In the second part of the thesis we present several improvements to the method. We evaluate techniques to create associative networks from easily accessible knowledge sources, as well as different methods for the training of the associative network. Additionally, we evaluate techniques to improve the extraction of concepts from documents, we compare methods of spreading activation from concept to concept, and we present a novel technique by which the extracted concepts can be used to categorize documents. We also extend the method of associative networks to enable the application to multilingual document libraries and compare the method to other state-of-the-art methods. Finally, we present a practical application of associative networks, as implemented in a corporate environment in the form of the Pagelink Knowledge Centre. We demonstrate the practical usability of our work, and discuss the various advantages and disadvantages that the method of associative networks offers.
Sergio Duarte Torres' PhD defense last Friday February 14th resulted in a exceptional PhD degree cum laude. His PhD thesis: “Information Retrieval for Children: Search Behavior and Solutions” was written at the Database Group as part of the European project PuppyIR, a joint project with amongst others Human Media Interaction. Sergio's research shows an extraordinary diversity and heterogeneity, touching many areas of computer science, including Information Retrieval, Big Data analysis, and Machine Learning. Sergio sought cooperation with leading search engine companies in the field: Yahoo and Yandex. He did a three-month internship at Yahoo Research in Barcelona. Sergio's work is well-received. His paper on vertical selection for search for children was nominated for the Best Student Paper Award at the joint ACM/IEEE conference on Digital Libraries in Indianapolis, USA. His work is accepted at two important journals in the field: the ACM Transactions on the Web, and the Journal of the American Society of Information Science and Technology. Specifically worth mentioning is the user study with children aged 8 to 10 years old done by Sergio to evaluate the child-friendly search approaches that he developed. We are proud of the achievements of Sergio Duarte Torres. He will be an excellent ambassador of the University of Twente.
Today, Ilya Markov successfully defended his PhD thesis at the Università della Svizzera italiana in Lugano, Switzerland.
Uncertainty in Distributed Information Retrieval
by Ilya Markov
Large amounts of available digital information call for distributed processing and management solutions. Distributed Information Retrieval (DIR), also known as Federated Search, provides techniques for performing retrieval over such distributed data. In particular, it studies approaches to aggregating multiple searchable sources of information within a single interface.
DIR provides an efficient and low-cost solution to a distributed retrieval problem. As opposed to a centralized retrieval system, which acquires, stores and processes all available information locally, DIR delegates the search task to distributed sources. This way, DIR lowers the storage and processing costs and provides a user with up-to-date information even if this information is not crawlable (i.e. cannot be reached using hyperlinks).
DIR is usually based on a brokered architecture, according to which distributed retrieval is managed by a single broker. The broker-based DIR can be divided into five steps: resource discovery, resource description, resource selection, score normalization and results presentation. Among these steps, resource description, resource selection and score normalization are actively studied within DIR research, while the resource discovery step is addressed by the database community and results presentation is studied within aggregated search.
Despite the large volume of research on resource selection and score normalization, no unified framework of developed techniques exists, which makes difficult the application and comparison of available methods. The first goal of this dissertation is to summarize, analyze and evaluate existing resource selection and score normalization techniques within a unified framework. This should improve the understanding of available methods, reveal their underlying assumptions and limitations and describe their properties. This, in turn, will help to improve existing resource selection and score normalization techniques and to apply the right method in the right setting.
The second and the main contribution of this dissertation is in stating and addressing the problem of uncertainty in DIR. In Information Retrieval (IR) this problem has been recognized for a long time and numerous techniques have been proposed to deal with uncertainty in various IR tasks. This dissertation raises the question of uncertainty in DIR, outlines the sources of uncertainty on different DIR phases and proposes methods for measuring and reducing this uncertainty.
Distributed Deep Web Search
by Kien Tjin-Kam-Jet
The World Wide Web contains billions of documents (and counting); hence, it is likely that some document will contain the answer or content you are searching for. While major search engines like Bing and Google often manage to return relevant results to your query, there are plenty of situations in which they are less capable of doing so. Specifically, there is a noticeable shortcoming in situations that involve the retrieval of data from the deep web. Deep web data is difficult to crawl and index for today's web search engines, and this is largely due to the fact that the data must be accessed via complex web forms. However, deep web data can be highly relevant to the information-need of the end-user. This thesis overviews the problems, solutions, and paradigms for deep web search. Moreover, it proposes a new paradigm to overcome the apparent limitations in the current state of deep web search, and makes the following scientific contributions:
- A more specific classification scheme for deep web search systems, to better illustrate the differences and variation between these systems.
- Virtual surfacing, a new, and in our opinion better, deep web search paradigm which tries to combine the benefits of the two already existing paradigms, surfacing and virtual integration, and which also raises new research opportunities.
- A stack decoding approach which combines rules and statistical usage information for interpreting the end-user's free-text query, and to subsequently derive filled-out web forms based on that interpretation.
- A practical comparison of the developed approach against a well-established text-processing toolkit.
- Empirical evidence that, for a single site, end-users would rather use the proposed free-text search interface instead of a complex web form.
Analysis of data obtained from user studies shows that the stack decoding approach works as well as, or better than, today’s top-performing alternatives.
Adele Lu Jia successfully defended her PhD thesis at Delft University of Technology,
Online Networks as Societies: User Behaviors and Contribution Incentives
by Adele Lu Jia
Online networks like Facebook and BitTorrent have become popular and powerful infrastructures for users to communicate, to interact, and to share social lives with each other. These networks often rely on the cooperation and the contribution of their users. Nevertheless, users in online networks are often found to be selfish, lazy, or even ma- licious, rather than cooperative, and therefore need to be incentivized for contributions. To date, great effort has been put into designing effective contribution incentive policies, which range from barter schemes to monetary schemes. In this thesis, we conduct an analysis of user behaviors and contribution incentives in online networks. We approach online networks as both computer systems and societies, hoping that this approach will, on the one hand, motivate computer scientists to think about the similarities between their artificial computer systems and the natural world, and on the other hand, help people outside the field understand online networks more smoothly.
To summarize, in this thesis we provide theoretical and practical insights into the correlation between user behaviors and contribution incentives in online networks. We demonstrate user behaviors and their consequences at both the system and the individual level, we analyze barter schemes and their limitations in incentivizing users to contribute, we evaluate monetary schemes and their risks in causing the collapse of the entire system, and we examine user interactions and their implications in inferring user relationships. Above all, unlike the offline human society that has evolved for thousands of years, online networks only emerged two decades ago and are still in a primitive state. Yet with their ever-improving technologies we have already obtained many exciting results. This points the way to a promising future for the study of online networks, not only in analyzing online behaviors, but also in cross reference with offline societies.
When Complexity becomes Interesting: An Inquiry into the Information eXperience
by Frans van der Sluis
To date, most research in information retrieval and related fields has been concerned primarily with efficiency and effectiveness of either the information system or the interaction of the user with the information system. At the same time, understanding the experience of a user during information interaction is recognized as a grand challenge for the development of information systems. There is a widely shared intuition that the value of the retrieved information is dependent on more than system characteristics such as the topical overlap between a query and a document. As it is not obvious how to embrace this intuition, this challenge has mostly been left ignored. This dissertation embarked upon the challenge of describing and developing an operational model of the Information eXperience (IX) – the experience during the interaction with information. This task was decomposed into three sub-challenges:
- Transform the fuzzy concept of the IX into a formalized one.
- Develop a model of textual complexity that enables an information system to influence a user's IX.
- Identify and influence the causes of the experience of interest in text.