SIKS/CBS Data Camp & Advanced Course on Managing Big Data

On December 06 and 07 2016 The Netherlands School for Information and Knowledge Systems (SIKS) and Statistics Netherlands (CBS) organize a two day tutorial on the management of Big Data, the DataCamp, hosted at the University of Twente.
The Data Camp's objective is to use big data sets to produce valuable and innovative answers to research questions with societal relevance. SIKS PhD students and CBS data analysts will learn about big data technologies and create, in small groups, feasibility studies for a research question of their choice.
Participants get access to predefined CBS research questions and massive datasets, including a large collection of Dutch Tweets, traffic data from Dutch high ways, and AIS data from ships. Participants will get access to the Twente Hadoop cluster, a 56 node cluster with almost 1 petabyte of storage space. The tutorial focuses on hands-on experience. The Data Camp participants will work in small, mixed teams in an informal setting, which stimulates intense contact with technologies and research questions. Experienced data scientists will support the teams by short lectures and hands-on support. Short lectures will introduce technologies to manage and visualize big data, that were first adopted by Google and are now used by many companies that manage large datasets. The tutorial teaches how to process terabytes of data on large clusters of commodity machines using new programming styles like MapReduce and Spark. The tutorial will be given in English and is part of the educational program for SIKS PhD students.

Also see the SIKS announcement.

Niek Tax wins the ENIAC thesis award

Another thesis prize for Niek Tax: Best master thesis in computer science in 2014/2015 at the University of Twente, awarded by Alumni Association ENIAC. Photo: Niek Tax receives the award from Johan Noltes on behalf of the ENIAC jury. Congrats, Niek! Other nominees were Justyna Chromik (DACS), Vincent Bloemen (FMT), Maarten Brilman (HMI), Tim Paauw (IEBIS), and Moritz Müller (SCS).

Niek Tax

CBS / UT Data Camp 2015

On 23-27 November 2015, the Data Camp, a joint event organized by the Central Bureau for Statistics of the Netherlands (CBS) and the University of Twente (UT). During the camp, a set of CBS data analysts and UT researchers will answer research questions about statistics using big data technologies. On Monday, the participants will be presented with overview presentations about the research questions and technologies. The data camp participants will work in small, mixed teams in an informal setting. Experienced data scientists will support the teams by short mini-workshops and hands-on support. The hope is that the intense contact with the research question in an informal and spontaneous environment will produce valuable and innovative answers to the posed questions.

Guest speakers are Erik Tjong Kim Sang (Meertens Institute, Amsterdam) and David González (Vizzuality, Madrid).

[download report]

CTIT Hadoop cluster open

CTIT Hadoop cluster with Djoerd, Frederik, Maurice, and Robin

The new CTIT Hadoop cluster, 512 cores, 2 TB ram, and 0.5 PB storage, is now open for researchers and students in Twente. We started by giving accounts to the 60 students that follow the course Managing Big Data. From left to right in the (not so) cold aisle: Djoerd, Frederik, Maurice, and Robin.

Overview of the TREC 2014 Federated Web Search Track

by Thomas Demeester, Dolf Trieschnigg, Dong Nguyen, Ke Zhou, and Djoerd Hiemstra

The TREC Federated Web Search track facilitates research in topics related to federated web search, by providing a large realistic data collection sampled from a multitude of online search engines. The FedWeb 2013 challenges of Resource Selection and Results Merging challenges are again included in FedWeb 2014, and we additionally introduced the task of vertical selection. Other new aspects are the required link between the Resource Selection and Results Merging, and the importance of diversity in the merged results. After an overview of the new data collection and relevance judgments, the individual participants' results for the tasks are introduced, analyzed, and compared.

[download pdf]

Presented at the 23rd Text Retrieval Conference (TREC) in Gaithersburg, USA

Niek Tax graduates on scaling learning to rank to big data

Scaling Learning to Rank to Big Data: Using MapReduce to Parallelise Learning to Rank

by Niek Tax

Niek Tax

Learning to rank is an increasingly important task within the scientific fields of machine learning and information retrieval, that comprises the use of machine learning for the ranking task. New learning to rank methods are generally evaluated in terms of ranking accuracy on benchmark test collections. However, comparison of learning to rank methods based on evaluation results is hindered by non-existence of a standard set of evaluation benchmark collections. Furthermore, little research is done in the field of scalability of the training procedure of Learning to Rank methods, to prepare us for input data sets that are getting larger and larger. This thesis concerns both the comparison of Learning to Rank methods using a sparse set of evaluation results on benchmark data sets, as well as the speed-up that can be achieved by parallelising Learning to Rank methods using MapReduce.

In the first part of this thesis we propose a way to compare learning to rank methods based on a sparse set of evaluation results on a set of benchmark datasets. Our comparison methodology consists of two components: 1) Normalized Winning Number, which gives insight in the ranking accuracy of the learning to rank method, and 2) Ideal Winning Number, which gives insight in the degree of certainty concerning its ranking accuracy. Evaluation results of 87 learning to rank methods on 20 well-known benchmark datasets are collected through a structured literature search. ListNet, SmoothRank, FenchelRank, FSMRank, LRUF and LARF were found to be the best performing learning to rank methods in increasing order of Normalized Winning Number and decreasing order of Ideal Winning Number. Of these ranking algorithms, FenchelRank and FSMRank are pairwise ranking algorithms and the others are listwise ranking algorithms.

In the second part of this thesis we analyse the speed-up of the ListNet training algorithm when implemented in the MapReduce computing model. We found that running ListNet on MapReduce comes with a job scheduling overhead in the range of 150-200 seconds per training iteration. This makes MapReduce very inefficient to process small data sets with ListNet, compared to a single-machine implementation of the algorithm. The MapReduce implementation of ListNet was found to be able to offer improvements in processing time for data sets that are larger than the physical memory of the single machine otherwise available for computation. In addition we showed that ListNet tends to converge faster when a normalisation preprocessing procedure is applied to the input data. The training time of our cluster version of ListNet was found to grow linearly in terms of data size increase. This shows that the cluster implementation of ListNet can be used to scale the ListNet training procedure to arbitrarily large data sets, given that enough data nodes are available for computation.

[download pdf]

The Importance of Prior Probabilities for Entry Page Search

by Wessel Kraaij, Thijs Westerveld, and Djoerd Hiemstra

An important class of searches on the world-wide-web has the goal to find an entry page (homepage) of an organisation. Entry page search is quite different from Ad Hoc search. Indeed a plain Ad Hoc system performs disappointingly. We explored three non-content features of web pages: page length, number of incoming links and URL form. Especially the URL form proved to be a good predictor. Using URL form priors we found over 70% of all entry pages at rank 1, and up to 89% in the top 10. Non-content features can easily be embedded in a language model framework as a prior probability

[download pdf]

SIGIR 2014 Test of Time Honourable Mention

The paper was published at SIGIR 2002 and received an Honourable Mention for the ACM SIGIR Test of Time award at the 37th Annual ACM SIGIR conference on Research & development in information retrieval in Gold Coast Australia on 9 July 2014.

Overview of TREC FedWeb 2013

Overview of the TREC 2013 Federated Web Search Track

by Thomas Demeester, Dolf Trieschnigg, Dong Nguyen, Djoerd Hiemstra

The TREC Federated Web Search track is intended to promote research related to federated search in a realistic web setting, and hereto provides a large data collection gathered from a series of online search engines. This overview paper discusses the results of the first edition of the track, FedWeb 2013. The focus was on basic challenges in federated search: (1) resource selection, and (2) results merging. After an overview of the provided data collection and the relevance judgments for the test topics, the participants' individual approaches and results on both tasks are discussed. Promising research directions and an outlook on the 2014 edition of the track are provided as well.

TREC FedWeb
Ellen Voorhees presenting FedWeb at TREC 2013

The FedWeb task is organized as part of the Text REtrieval Conference (TREC)

[download pdf]