CBS / UT Data Camp 2015

The Data Camp, a joint event organized by the Central Bureau for Statistics of the Netherlands (CBS) and the University of Twente (UT), takes place on 23-27 November 2015. During the camp, CBS data analysts and UT researchers will answer research questions about statistics using big data technologies. On Monday, the participants will get overview presentations about the research questions and technologies. They will then work in small, mixed teams in an informal setting. Experienced data scientists will support the teams with short workshops and hands-on help. The hope is that intense contact with the research questions in an informal and spontaneous environment will produce valuable and innovative answers.

Guest speakers are Erik Tjong Kim Sang (Meertens Institute, Amsterdam) and David González (Vizzuality, Madrid).

[download report]

Welcome to Advanced Database Systems

Welcome to the M.Sc. course Advanced Database Systems. In this course you will learn what it takes to be a Database Administrator, analysing and improving the performance of databases. You will also learn in detail how transactions are handled by database management systems. The course has a little bit of everything: ordinary lectures, a little practicum, and a small project about handling sensor data. We hope to see you Thursday September 3rd, at 10.45h. in CR-3D.

Mena Badieh Habib Morgan, Maurice van Keulen, and Djoerd Hiemstra.

More info at: (access still restricted 🙁 )

Welcome to Information Retrieval

Welcome to the course Information Retrieval. We will introduce some exciting new things in the course: This year's practical assignments are motivated by use cases of MyDataFactory, a company specialized in product data. The course uses the book “Introduction to Information Retrieval” by Christopher Manning, Prabhakar Raghavan and Hinrich Schütze. Have a look at the schedule on Blackboard under “Course Information” for an overview of the first quarter of the course. In the second quarter, students will research a specific topic in depth. We hope to see you at the first lecture on Wednesday 2 September at 13.45h. in RA4334.

Theo Huibers, Dolf Trieschnigg and Djoerd Hiemstra.

More info at: (access restricted)


SIKS/Twente Seminar on Searching and Ranking

Together with SIKS and the CTIT we have organized several one day seminars, usually in conjunction with a PhD defense here in Twente.

  • SSR-1: Searching and Ranking in Structured Text Repositories on 27 June 2008, with Debora Donato, and Ricardo Baeza-Yates, both from Yahoo! Research Barcelona
  • SSR-2: Searching and Ranking in Enterprises, on 24 June 2009 with David Hawking (Funnelback & Australian National University), Iadh Ounis (University of Glasgow), and Maarten de Rijke (University of Amsterdam)
  • SSR-3: Effectiveness of Searching and Ranking on 29 January 2010, with Leif Azzopardi from University of Glasgow, UK, and Vanessa Murdock from Yahoo! Research, Barcelona
  • SSR-4: Searching and Ranking Multimedia on 2 July 2010 with Alexander Hauptmann from Carnegie Mellon University, USA.
  • SSR-5: Biomedical Text Mining on 1 September 2010 with Dietrich Rebholz-Schuhmann (European Bioinformatics Institute, UK) and Martijn Schuemie (Erasmus MC/Leiden University Medical Center, Rotterdam)
  • SSR-6: Searching Speech: Evaluation of Speech Recognition in Context on 5 July 2012 with Gareth Jones (Dublin City University, Ireland), David van Leeuwen (Radboud University Nijmegen, Netherlands Forensic Institute), and Lori Lamel (Limsi – CNRS, France)
  • SSR-7: Distributing Search on 26 September 2012 with Jamie Callan (Carnegie Mellon University, USA), Fabio Crestani (University of Lugano, Switzerland), Johan Pouwelse (Delft University of Technology)
  • SSR-8: Explorations in interactive retrieval and information experience on 29 August 2013 with Peter Ingwersen (Royal School of Library and Information Science, Copenhagen, Denmark), Ian Ruthven (Strathclyde University, Glasgow, Scotland), and Richard Glassey (Robert Gordon University, Aberdeen, Scotland)
  • SSR-9: Understanding the Web on 19 December 2013 with Weiyi Meng (State University of New York at Binghamton, USA) and Gertjan van Noord (University of Groningen, The Netherlands)
  • SSR-10: Learning for Information Retrieval on 14 February 2014 with Alan Smeaton (Dublin City University, Ireland) and Arjen de Vries (CWI, Amsterdam)
  • SSR-11: Monitoring and preventing Cyberbullying on 12 September 2014 with Debra Pepler (York University, Canada) and Veronique Hoste (Ghent University, Belgium)
  • SSR-12: Probabilistic Approaches to Smart Discovery on 16 October 2015 with Ingo Frommholz (University of Bedfordshire, UK) and Tom Heskes (Radboud University Nijmegen, the Netherlands)
  • SSR-13: Deep Web Entity Monitoring on 2 June 2016 with Gianluca Demartini (University of Sheffield, UK), Andrea Calì (Birkbeck, University of London, UK) and Pierre Senellart (Télécom ParisTech, France)
  • SSR-14: Text as social and cultural data on 10 March 2017 with Anders Søgaard (University of Copenhagen), Jacob Eisenstein (Georgia Institute of Technology), Lysbeth Jongbloed-Faber (De Fryske Akademy), Leonie Cornips (Meertens Institute), Tom Kenter (University of Amsterdam), Folgert Karsdorp (Meertens Institute), John Nerbonne (University of Groningen/Albert-Ludwigs-Universität Freiburg)

Harvesting all matching information to a given query from a deep website

by Mohammadreza Khelghati, Djoerd Hiemstra, and Maurice van Keulen

In this paper, the goal is to harvest all documents matching a given (entity) query from a deep web source. The objective is to retrieve all information about, for instance, “Denzel Washington”, “Iran Nuclear Deal”, or “FC Barcelona” from data hidden behind web forms. Policies of web search engines usually do not allow access to all search results matching a given query. They limit the number of returned documents and the number of user requests. In this work, we propose a new approach which automatically collects information related to a given query from a search engine, given the search engine's limitations. The approach minimizes the number of queries that need to be sent by applying information from a large external corpus. The new approach outperforms existing approaches when tested on Google, measuring the total number of unique documents found per query.
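To make the setting concrete, here is a minimal sketch of harvesting under a result limit. It is not the method from the paper: it simply assumes that refinement terms are taken from an external corpus in order of document frequency and appended to the entity query until a query budget runs out. The names `make_toy_engine` and `harvest`, the toy documents and the budget are all hypothetical.

```python
from collections import Counter


def make_toy_engine(documents, limit):
    """Hypothetical search engine that returns at most `limit` matching ids."""
    def search(query):
        terms = query.lower().split()
        hits = [doc_id for doc_id, text in documents.items()
                if all(term in text.lower().split() for term in terms)]
        return hits[:limit]  # the engine hides everything beyond the limit
    return search


def harvest(search, entity, corpus_docs, max_queries=10):
    """Collect results for `entity` by adding refinement terms to the query.

    Assumption: candidate terms are ranked by document frequency in an
    external corpus (ties broken alphabetically). The paper's actual
    selection strategy may differ.
    """
    doc_freq = Counter(term for doc in corpus_docs
                       for term in set(doc.lower().split()))
    candidates = sorted(doc_freq, key=lambda term: (-doc_freq[term], term))

    found = set(search(entity))  # start with the plain entity query
    queries_sent = 1
    for term in candidates:
        if queries_sent >= max_queries:
            break
        found.update(search(f"{entity} {term}"))
        queries_sent += 1
    return found, queries_sent


# Toy data: three documents match "barcelona", but only one is returned per query.
documents = {
    1: "barcelona wins match",
    2: "barcelona signs player",
    3: "barcelona stadium tour",
    4: "madrid wins match",
}
search = make_toy_engine(documents, limit=1)
corpus = ["wins match", "signs player", "stadium tour"]

found, sent = harvest(search, "barcelona", corpus, max_queries=10)
print(sorted(found), sent)
```

With a limit of one result per query, the plain entity query finds a single document, while the refined queries reach all three matching documents. The interesting part in practice is the choice of terms, which is where the external corpus comes in.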

To be presented at the 1st International Workshop on Knowledge Discovery on the Web (KDWeb 2015) on 3-5 September in Cagliari, Italy.

[download pdf]

The Influence of Prosocial Norms and Online Network Structure on Prosocial Behavior

The Influence of Prosocial Norms and Online Network Structure on Prosocial Behavior: An Analysis of Movember’s Twitter Campaign in 24 Countries

by Tijs van den Broek, Ariana Need, Michel Ehrenhard, Anna Priante and Djoerd Hiemstra

Sociological research points to norms and social networks as antecedents of prosocial behavior. To date, the literature remains undecided on how these factors jointly influence prosocial behavior. Furthermore, the use of social media by campaign organizations may change the need for formal networks to organize large-scale collective action. Hence, in this paper we examine the interplay of prosocial norms and the structure of online social networks on offline prosocial behavior. For this purpose we use donation data from the global Movember campaign, messages about the Movember campaign on the online social networking site Twitter, and data from the World Giving Index. A multi-level analysis of Movember’s campaigns in 24 countries finds support for the logic of connective action: larger and more decentralized networks raise more donations. Furthermore, we find that the effect of prosocial norms on donations is decreased by larger and denser campaign networks.

To be presented at Social media, Activism, and Organizations 2015 (SMAO) on 6 November in London, UK.

Anne van de Venis graduates on Recommendations using DBpedia

Recommendations using DBpedia: How your Facebook profile can be used to find your next greeting card

by Anne van de Venis

Recommender systems (RS) are systems that provide suggestions that users may find interesting. In this thesis we present our Interest-Based Recommender System (IBRS) that can recommend tagged item sets from any domain. This RS is validated with item sets from two different domains, namely postcards and holiday homes. While postcards and holiday homes are very different items with different characteristics, IBRS uses the same recommender engine to create recommendations for both. IBRS solves several problems that are present in classic RSs, such as the cold-start problem and language dependence. The cold-start problem for new users is solved by using Facebook likes to create a user profile. IBRS uses information in DBpedia to create recommendations in a tag-based item set for multiple domains, independent of the language. Using both external knowledge sources and user content makes our system a hybrid of a knowledge-based and content-based RS. We validated our system through an online evaluation system in two evaluation rounds with test user groups of approximately 71 and 44 people. The main contributions in this thesis are:

  • a literature study of existing recommendation approaches;
  • a language-independent mapping approach for tags and social media resources onto DBpedia resources;
  • a domain-independent algorithm for detecting related concepts in the DBpedia graph;
  • a recommendation approach based on both Facebook and DBpedia;
  • a validation of our recommendation approach.
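
As a rough illustration of the idea of recommending tagged items through related concepts, consider the sketch below. It is not the IBRS algorithm from the thesis: it assumes that a user's likes and the item tags have already been mapped to concept names, that the concept graph is a small hand-made adjacency dictionary instead of DBpedia, and that items are scored by the number of tags that fall within one hop of a liked concept. All names and data are hypothetical.

```python
def related_concepts(graph, seeds, max_hops=1):
    """Breadth-first expansion of seed concepts over an adjacency dict."""
    seen = set(seeds)
    frontier = set(seeds)
    for _ in range(max_hops):
        # Follow outgoing links from the current frontier, skipping known concepts.
        frontier = {neighbour
                    for concept in frontier
                    for neighbour in graph.get(concept, ())} - seen
        seen |= frontier
    return seen


def recommend(graph, liked_concepts, items, top_n=3):
    """Rank items by how many of their tags are related to the user's likes."""
    interests = related_concepts(graph, liked_concepts)
    scored = [(len(interests & set(tags)), name) for name, tags in items.items()]
    # Keep only items with at least one related tag; break ties by name.
    ranked = sorted((pair for pair in scored if pair[0] > 0),
                    key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in ranked[:top_n]]


# Hand-made stand-in for a concept graph (hypothetical data).
graph = {
    "Football": ["Sport", "FC_Barcelona"],
    "Cat": ["Pet", "Animal"],
}
items = {
    "goal card": ["Sport", "Football"],
    "kitten card": ["Cat", "Pet"],
    "stadium card": ["FC_Barcelona"],
}

print(recommend(graph, ["Football"], items))
```

A new user who only likes "Football" still gets recommendations, because the likes stand in for a rating history. That is the intuition behind using Facebook likes against the cold-start problem.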

[download pdf]

A cross-benchmark comparison of 87 learning to rank methods

by Niek Tax (Eindhoven University), Sander Bockting (Avanade), and Djoerd Hiemstra

Learning to rank is an increasingly important scientific field that comprises the use of machine learning for the ranking task. New learning to rank methods are generally evaluated on benchmark test collections. However, comparison of learning to rank methods based on evaluation results is hindered by the absence of a standard set of evaluation benchmark collections. In this paper we propose a way to compare learning to rank methods based on a sparse set of evaluation results on a set of benchmark datasets. Our comparison methodology consists of two components: (1) Normalized Winning Number, which gives insight into the ranking accuracy of the learning to rank method, and (2) Ideal Winning Number, which gives insight into the degree of certainty concerning its ranking accuracy. Evaluation results of 87 learning to rank methods on 20 well-known benchmark datasets are collected through a structured literature search. ListNet, SmoothRank, FenchelRank, FSMRank, LRUF and LARF are Pareto optimal learning to rank methods in the Normalized Winning Number and Ideal Winning Number dimensions, listed in increasing order of Normalized Winning Number and decreasing order of Ideal Winning Number.
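
The two measures can be sketched in a few lines of Python. The definitions used here are assumptions, not taken from the text above: the Winning Number of a method counts the pairwise comparisons it wins against other methods on shared datasets, the Ideal Winning Number counts the comparisons that could be made at all, and the Normalized Winning Number is their ratio. The paper gives the exact definitions. The scores below are made up.

```python
def winning_numbers(results):
    """Compute winning statistics from sparse results.

    `results` maps method -> {dataset: score}, where higher is better and
    a method may lack results on some datasets.
    """
    stats = {}
    for method, scores in results.items():
        wins = 0   # comparisons won by this method
        ideal = 0  # comparisons that were possible at all
        for other, other_scores in results.items():
            if other == method:
                continue
            for dataset, score in scores.items():
                if dataset in other_scores:
                    ideal += 1
                    if score > other_scores[dataset]:
                        wins += 1
        stats[method] = {
            "wn": wins,
            "iwn": ideal,
            "nwn": wins / ideal if ideal else 0.0,
        }
    return stats


def pareto_optimal(stats):
    """Methods not dominated in both normalized and ideal winning number."""
    front = []
    for name, own in stats.items():
        dominated = any(
            other["nwn"] >= own["nwn"] and other["iwn"] >= own["iwn"]
            and (other["nwn"] > own["nwn"] or other["iwn"] > own["iwn"])
            for other_name, other in stats.items() if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Made-up scores for four methods on two datasets.
results = {
    "A": {"d1": 0.5, "d2": 0.6},
    "B": {"d1": 0.4, "d2": 0.7},
    "C": {"d1": 0.6},
    "D": {"d1": 0.3},
}
stats = winning_numbers(results)
print(stats["A"], stats["C"])
print(pareto_optimal(stats))
```

In this toy example, method C wins every comparison it takes part in but was evaluated on fewer datasets, while A and B have a lower ratio backed by more comparisons. None of the three dominates the others, which is why a Pareto front over both dimensions is a natural way to report the outcome.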

To appear in November in Information Processing and Management 51(6), pages 757–772

[download preprint]

On the Impact of Twitter-based Health Campaigns

A Cross-Country Analysis of Movember

by Nugroho Dwi Prasetyo (TU Delft), Claudia Hauff (TU Delft), Dong Nguyen, Tijs van den Broek, Djoerd Hiemstra

Health campaigns that aim to raise awareness and subsequently raise funds for research and treatment are commonplace. While many local campaigns exist, very few attract the attention of a global audience. One of those global campaigns is Movember, an annual campaign during the month of November that is directed at men's health, with a special focus on cancer and mental health. Health campaigns routinely use social media portals to capture people’s attention. Recently, researchers began to consider to what extent social media is effective in raising the awareness of health campaigns. In this paper we expand on those works by conducting an investigation across four different countries, considering not only the impact on awareness but also the impact on fund-raising. To that end, we analyze the 2013 Movember Twitter campaigns in Canada, Australia, the United Kingdom and the United States.

To be presented at the 6th International Workshop on Health Text Mining and Information Analysis (Louhi 2015) at EMNLP 2015 on September 17 in Lisbon, Portugal.

[download pdf]

Han van der Veen graduates on composing a more complete and relevant Twitter dataset

Composing a more complete and relevant Twitter dataset

by Han van der Veen

Social data is widely used by many researchers. Facebook, Twitter and other social networks produce huge amounts of social data, which can be used for analyzing human behavior. Social datasets are typically created using a hashtag; however, not all relevant data includes the hashtag. A better overview can be constructed with more data. This research focuses on creating a more complete and relevant dataset, using additional keywords to find more relevant tweets and a filtering mechanism to filter out the irrelevant tweets. Three methods for finding additional keywords are proposed and evaluated: one based on word frequency, one based on the probability of a word in a dataset, and one using estimates of the volume of tweets. Two classifiers for filtering tweets are compared: a Naive Bayes classifier and a Support Vector Machine classifier. Our method increases the size of the dataset by 105%. The average precision was reduced from 95% when only using a hashtag to 76% for the resulting dataset. These evaluations were executed on two TV shows and two sports events. A tool was developed that automatically executes all parts of the program. It takes the hashtag of an event as input and outputs a more complete and relevant dataset than the original hashtag alone would give. This is useful for social researchers who use tweets, but also for other researchers who use tweets as their data.
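
To get a feeling for what these numbers mean, the small calculation below applies the reported averages to a single dataset. That is an assumption: the figures are averages over four events, and the starting size of 10,000 tweets is hypothetical.

```python
def relevant_tweet_gain(base_size, size_increase, base_precision, new_precision):
    """Estimate how many relevant tweets the larger dataset contains."""
    base_relevant = base_size * base_precision
    new_size = base_size * (1 + size_increase)
    new_relevant = new_size * new_precision
    gain = new_relevant / base_relevant - 1  # relative growth in relevant tweets
    return new_size, base_relevant, new_relevant, gain


# Hypothetical hashtag-only dataset of 10,000 tweets, with the reported
# 105% size increase and the drop in precision from 95% to 76%.
new_size, base_relevant, new_relevant, gain = relevant_tweet_gain(
    base_size=10_000, size_increase=1.05, base_precision=0.95, new_precision=0.76
)
print(round(new_size), round(base_relevant), round(new_relevant), f"{gain:.0%}")
```

Under these assumptions the number of relevant tweets grows by roughly 64%, from about 9,500 to about 15,580, even though the precision drops. The price is that about a quarter of the resulting dataset is irrelevant.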

[download pdf]