Vincent van Donselaar graduates on database synchronization

Low latency asynchronous database synchronization and data transformation using the replication log

by Vincent van Donselaar

Analytics firm Distimo offers a web based product that allows mobile app developers to track the performance of their apps across all major app stores. The Distimo backend system uses web scraping techniques to retrieve the market data which is stored in the backend master database: the data warehouse (DWH). A batch-oriented program periodically synchronizes relevant data to the frontend database that feeds the customer-facing web interface.
The synchronization program poses limitations due to its batch-oriented design. The relevant metadata that must be calculated before and after each batch results in overhead and increased latency. The goal of this research is to streamline the synchronization process by moving to a continuous, replication-like solution, combined with principles seen in the field of data warehousing. The binary transaction log of the master database is used to feed the synchronization program that is also responsible for implicit data transformations like aggregation and metadata generation. In contrast to traditional homogeneous database replication, this design allows synchronization across heterogeneous database schemas. The prototype demonstrates that a composition of replication and data warehousing techniques can offer an adequate solution for robust and low latency data synchronization software.

[download pdf]

#SupportTheCause: Identifying Motivations to Participate in Online Health Campaigns

by Dong Nguyen, Tijs van den Broek, Claudia Hauff (TU Delft), Djoerd Hiemstra, and Michel Ehrenhard

We consider the task of automatically identifying participants’ motivations in the public health campaign Movember and investigate the impact of the different motivations on the amount of campaign donations raised. Our classification scheme is based on the Social Identity Model of Collective Action (van Zomeren et al., 2008). We find that automatic classification based on Movember profiles is fairly accurate, while automatic classification based on tweets is challenging. Using our classifier, we find a strong relation between types of motivations and donations. Our study is a first step towards scaling-up collective action research methods.

The paper will be presented at the Conference on Empirical Methods in Natural Language Processing (EMNLP) on September 17-21, in Lisbon, Portugal.

[download pdf]

What country was this tweeted from?

Determine the User Country of a Tweet

by Han van der Veen, Djoerd Hiemstra, Tijs van den Broek, Michel Ehrenhard, and Ariana Need

In the widely used message platform Twitter, about 2% of the tweets contains the geographical location through exact GPS coordinates (latitude and longitude). Knowing the location of a tweet is useful for many data analytics questions. This research is looking at the determination of a location for tweets that do not contain GPS coordinates. An accuracy of 82% was achieved using a Naive Bayes model trained on features such as the users' timezone, the user's language, and the parsed user location. The classiffier performs well on active Twitter countries such as the Netherlands and United Kingdom. An analysis of errors made by the classiffier shows that mistakes were made due to limited information and shared properties between countries such as shared timezone. A feature analysis was performed in order to see the effect of different features. The features timezone and parsed user location were the most informative features.

[download pdf]

Mike Kolkman graduates on cross-domain geocoding

Cross-domain textual geocoding: influence of domain-specific training data

by Mike Kolkman

Modern technology is more and more able to understand natural language. To do so, unstructured texts need to be analysed and structured. One such structuring method is geocoding, which is aimed at recognizing and disambiguating references to geographical locations in text. These locations can be countries and cities, but also streets and buildings, or even rivers and lakes. A word or phrase that refers to a location is called a toponym. Approaches to tackle the geocoding task mainly use natural language processing techniques and machine learning. The difficulty of the geocoding task is dependent of multiple aspects, one of which is the data domain. The domain of a text describes the type of the text, like its goal, degree of formality, and target audience. When texts come from two (or more) different domains, like a Twitter post and a news item, they are said to be cross-domain.
An analysis of baseline geocoding systems shows that identifying toponyms on cross-domain data has still room for improvement, as existing systems depend significantly on domain-specific metadata. Systems focused on Twitter data are often dependent on account information of the author and other Twitter specific metadata. This causes the performance of these systems to drop significantly when applied on news item data.
This thesis presents a geocoding system, called XD-Geocoder, aimed at robust cross-domain performance by using text-based and lookup list based features only. Such a lookup list is called a gazetteer and contains a vast amount of geographical locations and information about these locations. Features are built up using word shape, part-of-speech tags, dictionaries and gazetteers. The features are used to train SVM and CRF classifiers.
Both classifiers are trained and evaluated on three corpora from three domains: Twitter posts, news items and historical documents. These evaluations show Twitter data to be the best for training out of the tested data sets, because both classifiers show the best overall performance when trained on tweets. However, this good performance might also be caused by the relatively high toponym to word ratio in the used Twitter data.
Furthermore, the XD-Geocoder was compared to existing geocoding systems. Although the XD-Geocoder is outperformed by state-of-the-art geocoders on single-domain evaluations (trained and evaluated on data from the same domain), it outperforms the baseline systems on cross-domain evaluations.

[download pdf]

Niels Bloom defends PhD thesis on associative networks for document categorization

Grouping by association: Using associative networks for document categorization

by Niels Bloom

In this thesis we describe a method of using associative networks for automatic document grouping. Associative networks are networks of concepts in which each concept is linked to concepts that are semantically similar to it. By activating concepts in the network based on the text of a document and spreading this activation to related concepts, we can determine what concepts are related to the document, even if the document itself does not contain words linked directly to those concepts. Based on this information, we can group documents by the concepts they refer to.

In the first part of the thesis we describe the method itself, as well as the details of various algorithms used in the implementation. We additionally discuss the theory upon which the method is based and compare it to various related methods. In the second part of the thesis we present several improvements to the method. We evaluate techniques to create associative networks from easily accessible knowledge sources, as well as different methods for the training of the associative network. Additionally, we evaluate techniques to improve the extraction of concepts from documents, we compare methods of spreading activation from concept to concept, and we present a novel technique by which the extracted concepts can be used to categorize documents. We also extend the method of associative networks to enable the application to multilingual document libraries and compare the method to other state-of-the-art methods. Finally, we present a practical application of associative networks, as implemented in a corporate environment in the form of the Pagelink Knowledge Centre. We demonstrate the practical usability of our work, and discuss the various advantages and disadvantages that the method of associative networks offers.

[download pdf]

Where to go on your next trip?

Optimizing Travel Destinations Based on User Preferences

by Julia Kiseleva (TU Eindhoven), Melanie MĂĽller (, Lucas Bernardi (, Chad Davis (, Ivan Kovacek (, Mats Stafseng Einarsen (, Jaap Kamps (University of Amsterdam), Alexander Tuzhilin (New York University), Djoerd Hiemstra

Recommendation based on user preferences is a common task for e-commerce websites. New recommendation algorithms are often evaluated by offline comparison to baseline algorithms such as recommending random or the most popular items. Here, we investigate how these algorithms themselves perform and compare to the operational production system in large scale online experiments in a real-world application. Specifically, we focus on recommending travel destinations at, a major online travel site, to users searching for their preferred vacation activities. To build ranking models we use multi-criteria rating data provided by previous users after their stay at a destination. We implement three methods and compare them to the current baseline in random, most popular, and Naive Bayes. Our general conclusion is that, in an online A/B test with live users, our Naive-Bayes based ranker increased user engagement significantly over the current online system.

To be presented at SIGIR 2015, the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, on 12 August in Santiago de Chile.

[download preprint]

How to build Google in an Afternoon

How many machines do we need to search and manage an index of billions of documents? In this lecture, I will discuss basic techniques for indexing very large document collections. I will discuss inverted files, index compression, and top-k query optimization techniques, showing that a single desktop PC suffices for searching billions of documents. An important part of the lecture will be spend on estimating index sizes and processing times. At the end of the afternoon, students will have a better understanding of the scale of the web and its consequences for building large-scale web search engines, and students will be able to implement a cheap but powerful new 'Google'.

To be presented at the SIKS Course Advances in Information Retrieval on 18, 19 June in Vught, The Netherlands.

Data Science Day

On 20 April, we organize a Data Science Day in the DesignLab. Invited speakers at the Data Science Colloquium are Piet Daas, methodologist and big data research coordinator of the CBS (Centraal Bureau Statistiek) who will talk about big data from Twitter and Facebook as a data source for official statistics; Rolf de By and Raul Zurita Milla, professors of ITC Geo-Information Science and Earth Observation, will talk about remote sensing techniques, using satelites and drones for helping economies in poor areas in the world, a prestigious project funded by the Bill and Melinda Gates Foundation; and Jan Willem Tulp creator of interactive data visualisations for magazines like Scientific American and Popular Science, as well as companies, for instance the Tax Free Retail Analysis Tool for Schiphol Amsterdam Airport.

The Data Science colloquia are kindly sponsored by the CTIT and the Netherlands Research School for Information and Knowledge Systems (SIKS) and part of the SIKS educational program.

[more information]

FedWeb Greatest Hits

Presenting the New Test Collection for Federated Web Search

by Thomas Demeester (Ghent University), Dolf Trieschnigg, Ke Zhou (Yahoo!), Dong Nguyen, and Djoerd Hiemstra

This paper presents FedWeb Greatest Hits, a large new test collection for research in web information retrieval. As a combination and extension of the datasets used in the TREC Federated Web Search Track, this collection opens up new research possibilities on federated web search challenges, as well as on various other problems.

The paper will be presented at the 24th International World Wide Web Conference (WWW 2015) in Florence, Italy on 18-22 May 2015.

[download pdf]

To obtain the dataset go to: