TOM module Data and Information finished

Today we successfully finished the TOM module Data & Information with final presentations. The best group won five FirefoxOS phones, one for each member. (TOM = “Twents Onderwijs Model”, or Twente Educational Model.)
End result:

  • 269 forum posts,
  • 113 git repositories,
  • 93 students (71 students finished successfully),
  • 30 scrum 'daily' stand-ups and project meetings,
  • 28 lectures,
  • 19 project groups,
  • 10 student assistants,
  • 9 tutorials,
  • 8 seminars,
  • 7 guest speakers,
  • 6 scrum mentors,
  • 5 professors,
  • 5 exams,
  • 5 sprint review meetings,
  • and 1 online lecture.

Yes, we have to work on our open, online material, so let me start by adding our project manual (in Dutch).

Data and Information project manual

(Edit: April 2015) Thanks, Luís Ferreira Pires: The project manual was translated to English this year.

Data and Information project manual

Yoran Heling graduates on peer selection in Direct Connect

by Yoran Heling

In a distributed peer-to-peer (P2P) system such as Direct Connect, files are often distributed over multiple source peers. It is up to the downloading peer to decide from how many and from which source peers to download the file of interest. Biased Random Period Switching (BRPS) is an algorithm, implemented at the downloading peer, that determines at what point to download from which source peer. The number of source peers that a downloading peer downloads from at a certain point is called the Degree of Parallelism (DoP). This research focused on implementing BRPS in an existing Direct Connect client and comparing the downloading performance against an unmodified client. Two implementations of BRPS in Direct Connect have been made. The first is a simple implementation that follows the original BRPS algorithm as closely as possible, with minor modifications that were required to ensure that the downloading process would not get stuck on an unavailable source peer. The second is an improved implementation with slight modifications to the original BRPS algorithm: it incorporates two improvements to ensure that the DoP does not drop below its desired value in the face of unavailable source peers.
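The idea of keeping the DoP at its desired value when some source peers are unavailable can be sketched in a few lines. This is a simplified illustration, not the BRPS algorithm from the thesis: the function `pick_sources`, the availability flags, and the optional `weights` argument (standing in for some form of bias) are all assumptions made for this example.

```python
import random

def pick_sources(peers, dop, rng, weights=None):
    """Pick up to `dop` distinct available source peers at random.

    peers:   dict mapping peer name -> True if the peer is currently available
    dop:     desired Degree of Parallelism
    rng:     a random.Random instance
    weights: optional dict of peer name -> positive weight (hypothetical bias)
    """
    # Unavailable peers are never candidates, so they cannot lower the DoP.
    candidates = [p for p, available in peers.items() if available]
    chosen = []
    while candidates and len(chosen) < dop:
        w = [weights.get(p, 1.0) for p in candidates] if weights else None
        pick = rng.choices(candidates, weights=w, k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)  # sample without replacement
    return chosen
```

A client would call such a function again at the start of every switching period, so the set of sources changes over time while the number of active sources stays at the desired DoP whenever enough peers are available.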

The original client and the two BRPS implementations have been evaluated in a controlled Direct Connect network with 50 downloading peers and a variable number of source peers. The source peers have been configured to throttle their available bandwidth to an average of 500 KB/s, following a realistic bandwidth distribution based on measurements from the Tor P2P network. The experiments consisted of all downloading peers downloading the same file at the same time, with measurements taken on the side of these downloading peers. Four experiments have been performed, with one varying parameter in each experiment. The first experiment varied the size of the file being downloaded between 100 MB and 1024 MB, and the second varied the DoP between 1 and 15. The third varied the number of source peers between 10 and 100, and in the last experiment between 0% and 80% unavailable source peers were added to the network.

In all experiments, both BRPS implementations performed close to the optimal average download time, and were consistently faster than the original client by a factor of 2 to 5. In the last experiment, the improved BRPS implementation did keep the measured DoP closer to its desired value than the simple implementation, but this has not resulted in a significant difference in the measured download times.

[download pdf]

Norvig Web Data Science Award 2014

The Norvig Web Data Science Award is organized by Common Crawl and SURFsara for researchers and students in the Benelux. SURFsara provides free access to their Hadoop cluster with a copy of the full Common Crawl web crawl from March 2014 – almost 3 billion web pages. Participants are completely free in choosing their research question. For example, last year there were submissions looking at concept association, connections between languages, readability and more. Be creative and think outside of the box!

The award is named after Peter Norvig, Director of Research at Google, who chairs the jury that will select the winning submission. The contest will run until July 31, 2014. The winning team will be announced at the award ceremony in September 2014 and will get a tablet, a smart watch and a GitHub small plan for a year.

Sign up on:

Cancer Early Detection Campaigns on Twitter

It is official! Twitter awards the University of Twente with a prestigious Twitter #DataGrant (with Tijs van den Broek, Michel Ehrenhard and Ariana Need). Twitter awarded 6 out of 1,300 proposals.

Our research project aims to study the diffusion process and effectiveness of cancer early detection campaigns. We plan to analyse popular Twitter campaigns covering different types of cancer and geographical scopes, such as #Mamming (breast cancer), #Movember (prostate cancer), #DaveDay (pancreatic cancer) and #HPVReport (cervical cancer). We aim to map the diffusion process in detail by determining key events and actors that accelerate the diffusion process. Social network analysis will reveal if and when the campaign leads to word-of-mouth discussion, promotion and responses. We also aim to assess the effectiveness of the campaigns by comparing the frequency and sentiment of mentions of a particular type of cancer (e.g. breast cancer in the case of #Mamming) before and after the campaign.
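The before/after comparison of mention frequency could look roughly like the sketch below. It is only an illustration of the idea, not the project's actual analysis: the function `mention_rate`, the `(date, text)` tweet representation, and simple substring matching are assumptions, and sentiment is left out entirely.

```python
from datetime import date

def mention_rate(tweets, term, start, end):
    """Fraction of tweets posted in [start, end) whose text mentions `term`.

    tweets: iterable of (date, text) pairs (a hypothetical, simplified format)
    """
    window = [text for day, text in tweets if start <= day < end]
    if not window:
        return 0.0
    hits = sum(term.lower() in text.lower() for text in window)
    return hits / len(window)
```

Calling it once for the period before the campaign start and once for the period after gives two rates that can be compared directly. A real study would also need to control for overall tweet volume and for seasonal effects.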

Analysis of Search and Browsing Behavior of Young Users on the Web

by Sergio Duarte Torres, Ingmar Weber, and Djoerd Hiemstra

The Internet is increasingly used by young children for all kinds of purposes. Nonetheless, there are not many resources especially designed for children on the Internet and most of the content online is designed for grown-up users. This situation is problematic if we consider the large differences between young users and adults since their topic interests, computer skills, and language capabilities evolve rapidly during childhood. There is little research aimed at exploring and measuring the difficulties that children encounter on the Internet when searching for information and browsing for content. In the first part of this work, we employed query logs from a commercial search engine to quantify the difficulties children of different ages encounter on the Internet and to characterize the topics that they search for. We employed query metrics (e.g., the fraction of queries posed in natural language), session metrics (e.g., the fraction of abandoned sessions), and click activity (e.g., the fraction of ad clicks). The search logs were also used to retrace stages of child development. Concretely, we looked for changes in interests (e.g., the distribution of topics searched) and language development (e.g., the readability of the content accessed and the vocabulary size).

[download pdf]

Published in ACM Transactions on the Web (TWEB) Volume 8 Issue 2.

Expert group formation using facility location analysis

by Mahmood Neshati, Hamid Beigy, and Djoerd Hiemstra

In this paper, we propose an optimization framework to retrieve an optimal group of experts to perform a multi-aspect task. Since a diverse set of skills is needed to perform a multi-aspect task, the group of assigned experts should be able to collectively cover all the required skills. We consider three types of multi-aspect expert group formation problems and propose a unified framework to solve these problems accurately and efficiently. The first problem is concerned with finding the top k experts for a given task, where the required skills of the task are implicitly described. In the second problem, the required skills of the tasks are explicitly described using keywords, but each expert has a limited capacity to perform these tasks and should therefore be assigned to a limited number of them. Finally, the third problem is the combination of the first and the second problems. Our proposed optimization framework is based on Facility Location Analysis, a well-known branch of Operations Research. In our experiments, we compare the accuracy and efficiency of the proposed framework with state-of-the-art approaches to the group formation problems. The experimental results show the effectiveness of our proposed methods in comparison with state-of-the-art approaches.
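To give a feel for how facility location applies to group formation, the sketch below treats experts as facilities and required skills as customers, and greedily picks up to k experts so that each skill is "served" by the best selected expert. This is a standard greedy heuristic for a facility-location-style coverage objective, not the optimization framework from the paper; the function `greedy_group` and the expertise scores are hypothetical.

```python
def greedy_group(expertise, skills, k):
    """Greedily select up to k experts that together cover `skills`.

    expertise: dict expert -> dict skill -> score (assumed to lie in [0, 1])
    Returns (group, objective), where the objective is the sum over skills
    of the best score among the selected experts.
    """
    group = []
    best = {s: 0.0 for s in skills}  # best score per skill so far

    def gain(expert):
        # Improvement in the objective if this expert joins the group.
        return sum(max(expertise[expert].get(s, 0.0) - best[s], 0.0)
                   for s in skills)

    for _ in range(k):
        remaining = [e for e in expertise if e not in group]
        if not remaining:
            break
        pick = max(remaining, key=gain)
        if gain(pick) <= 0:
            break  # nobody left adds coverage
        group.append(pick)
        for s in skills:
            best[s] = max(best[s], expertise[pick].get(s, 0.0))
    return group, sum(best.values())
```

The capacity constraints of the second problem (each expert can take on only a limited number of tasks) are not modelled here; they would require tracking assignments across several tasks.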

Published in Information Processing & Management 50(2), March 2014, Pages 361–383

[download pdf]

Query Recommendation in the Information Domain of Children

by Sergio Duarte Torres, Djoerd Hiemstra, Ingmar Weber, and Pavel Serdyukov

Children represent an increasing part of web users. Among the key problems that hamper their search experience are their limited vocabulary, their difficulty in choosing the right keywords, and the inappropriateness of general-purpose query suggestions. In this work we propose a method that utilizes tags from social media to suggest queries related to children's topics. Concretely, we propose a simple yet effective approach to bias a random walk defined on a bipartite graph of web resources and tags through keywords that are more commonly used to describe resources for children. We evaluate our method using a large query log sample of queries submitted by children. We show that our method outperforms by a large margin the query suggestions of modern search engines and state-of-the-art query suggestions based on random walks. We further improve the quality of the ranking by combining the score of the random walk with topical and language modeling features to emphasize even more the child-related aspects of the query suggestions.
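A biased random walk on a bipartite resource-tag graph can be illustrated with a random walk with restart, where the restart distribution is concentrated on child-related tags. The sketch below is a generic version of that idea, not the method from the paper: the function `biased_walk`, the restart probability, and the bias weights are assumptions.

```python
def biased_walk(edges, bias, restart=0.3, steps=50):
    """Random walk with restart on a bipartite graph of resources and tags.

    edges: list of (resource, tag) pairs
    bias:  dict node -> nonnegative weight; the walk restarts at nodes in
           proportion to these weights (at least one graph node must have
           a positive weight)
    Returns a dict node -> score; the scores sum to 1.
    """
    neighbours = {}
    for resource, tag in edges:
        neighbours.setdefault(resource, []).append(tag)
        neighbours.setdefault(tag, []).append(resource)

    total = sum(bias.get(n, 0.0) for n in neighbours)
    jump = {n: bias.get(n, 0.0) / total for n in neighbours}

    score = dict(jump)
    for _ in range(steps):
        nxt = {n: restart * jump[n] for n in neighbours}
        for node, mass in score.items():
            share = (1 - restart) * mass / len(neighbours[node])
            for m in neighbours[node]:
                nxt[m] += share
        score = nxt
    return score
```

Tags that end up with a high score are close, in the graph, to the child-related seed tags, which is what makes them candidates for query suggestions in this kind of approach.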

To appear in the Journal of the American Society for Information Science and Technology (JASIST).

[download preprint]

Overview of TREC FedWeb 2013

Overview of the TREC 2013 Federated Web Search Track

by Thomas Demeester, Dolf Trieschnigg, Dong Nguyen, Djoerd Hiemstra

The TREC Federated Web Search track is intended to promote research related to federated search in a realistic web setting, and to this end provides a large data collection gathered from a series of online search engines. This overview paper discusses the results of the first edition of the track, FedWeb 2013. The focus was on basic challenges in federated search: (1) resource selection, and (2) results merging. After an overview of the provided data collection and the relevance judgments for the test topics, the participants' individual approaches and results on both tasks are discussed. Promising research directions and an outlook on the 2014 edition of the track are provided as well.

Ellen Voorhees presenting FedWeb at TREC 2013

The FedWeb task is organized as part of the Text REtrieval Conference (TREC).

[download pdf]