2013’s DB colloquium

Below is a list of last year's DB colloquia, usually held on Tuesdays from 13:45h. – 14:30h. in ZI-3126.

15 January 2013
Zhemin Zhu
Separate Training for Conditional Random Fields Using Co-occurrence Rate Factorization
The standard training method of Conditional Random Fields (CRFs) is very slow for large-scale applications. In this paper, we present separate training for undirected models based on the novel Co-occurrence Rate Factorization (CR-F). Separate training is a local training method. In contrast to piecewise training, separate training is exact. In contrast to MEMMs, separate training is unaffected by the label bias problem. Experiments show that separate training (i) is unaffected by the label bias problem; (ii) reduces the training time from weeks to seconds; and (iii) obtains results competitive with standard and piecewise training on linear-chain CRFs.

18 January 2013 (Friday from 13.45h. to 15.30h. in Dutch)
Edo Drenthen (Avanade)
Big Data & Hadoop
Big Data stands for data that is so large and complex that it is difficult to store and process with traditional database management tools. Big datasets differ from traditional datasets along three axes: 1) Volume: more than ever, we are able to generate and store data. As a result, datasets are becoming orders of magnitude larger; 2) Variety: data analysis is needed over a combination of structured data (e.g. RDBMS) and unstructured data (e.g. logs, tweets); 3) Velocity: data is generated ever faster, for example by mobile devices, sensor networks, logs, and streaming systems. This makes it increasingly difficult for companies to distinguish relevant from irrelevant data. Organizations face the challenge of extracting the right information from Big Data and turning it into value. Depending on, among other things, the objectives and the maturity of an organization, a strategy will have to be devised to achieve this. Avanade helps its customers answer these questions. This presentation is about Big Data integration and Hadoop on the Microsoft platform. Various aspects of Big Data will be explained using a business case and illustrated with a demo. For BI, data warehousing and data mining enthusiasts, this is a presentation you should definitely not miss!

Jasper Stoop
Process Mining and Fraud Detection:
A case study on the theoretical and practical value of using process mining for the detection of fraudulent behavior in the procurement process
The presentation is about how process mining can be utilized in fraud detection and what the benefits of using process mining for fraud detection are. The concepts of fraud and fraud detection are discussed. These results are combined with an analysis of existing case studies on the application of process mining and fraud detection to construct an initial setup of two case studies, in which process mining is applied to detect possible fraudulent behavior in the procurement process. Based on the experiences and results of these case studies, the 1+5+1 methodology is presented as a first step towards operationalizing principles with advice on how process mining techniques can be used in practice when trying to detect fraud.

23 January 2013 (Wednesday)
Ander de Keijzer (Windesheim University of Applied Sciences)
Securing Data
With computers everywhere and pretty much always online and/or accessible, securing data is of the utmost importance. In this presentation we highlight several points of interest when thinking of security and suggest some possible (and practical) solutions.

5 February 2013 (15:45h. – 17.30h. in Waaier 1&2)
Ed Brinksma
Try out Euclid
Prof. Brinksma has prepared an amazing lecture for next year's bachelor programme. He will take you to exciting corners of mathematics where you have never been before. But, above all, he will challenge you with some remarkable brain teasers. Solutions may be submitted live through your smartphone or tablet. Among the ten best solutions we will allot nine Spotify Gift Cards. The first prize will be a surprise.

5 March 2013
Robin Aly
A Unified Framework to Evaluate Plurality Oriented Retrieval
A standard way to evaluate search engines is the Cranfield paradigm. The paradigm assumes well-defined information needs, gold standard relevance assessments, and an evaluation measure on graded relevance. The literature suggests that the Cranfield paradigm is too poor a model of reality and that better paradigms should include phenomena such as the following: first, users issue ambiguous queries; second, information needs require diverse documents; and finally, relevance assessments could be erroneous. Until now, each approach in the literature addresses only a subset of these phenomena. We argue, however, that these phenomena will almost always co-occur and that a search evaluation approach should account for this. In this talk I present a unified framework of approaches from the literature that has exactly one component for each phenomenon. Using the framework, we investigate how the comparison of existing search engines changes under a simulated mixture of these phenomena.

18 March 2013 (Monday at 15.45 h. in the SmartXP lab)
Peter Norvig (Google)
Norvig Web Data Science Award Ceremony
In a live video connection from California, USA, Peter Norvig, Director of Research at Google, will award the winners of the Web Data Science Award that was named in his honor: Lesley Wevers, Oliver Jundt, and Wanno Drijfhout from the University of Twente. The Norvig Web Data Science Award was created by Common Crawl and SURFsara to encourage research in web data science.

26 March 2013
Vincent van Donselaar
TF×IDF ranking: a physical approach
This paper tries to find similarities and differences between TF×IDF ranking and the theory of network analysis. A simplified network model based on the principle of an electrical circuit acts as a guide to gain understanding of the model’s operation. The correctness of this model is tested by implementing it as a function of the Terrier Information Retrieval System, whereupon it is evaluated against Terrier’s predefined TF×IDF model.

2 April 2013
This day is a follow-up to the successful COMMIT Kick-Off meeting of last year. COMMIT/ted to you! is meant only for people who are officially participating in the COMMIT program.

9 April 2013
Paul Stapersma
Efficient Query Evaluation on Probabilistic XML Data
In this thesis, we instruct MayBMS to cope with probabilistic XML (P-XML) in order to evaluate XPath queries on P-XML data as SQL queries on uncertain relational data. This approach entails two aspects: (1) a data mapping from P-XML to U-Rel that ensures that the same information is represented by database instances of both data structures, and (2) a query mapping from XPath to SQL that ensures that the same question is specified in both query languages.

16 April 2013
Maurice van Keulen
Parels der Informatica (Pearls of Computer Science)
Maurice will talk about the plans for the new Computer Science bachelor curriculum, in which students learn about computer science's most impressive achievements: the pearls of Computer Science.

22 April 2013 (15:00 h. Atrium Ravelijn)
with Maurice van Keulen and Djoerd Hiemstra
TOM inspiration market
On Monday, 22nd April, from 15.00 – 17.00 (Atrium, Ravelijn), there will be an Inspiration market about Twente's Educational Model. You can walk in and out as you wish; there is no fixed schedule. Maurice and Djoerd will both be present at the market, talking about using digital whiteboards, online lectures, and challenges.

1 May 2013 (Wednesday)
Rezwan Huq
Inference-based Framework managing Data Provenance
Data provenance allows scientists to validate their model as well as to investigate the origin of an unexpected value. Furthermore, it can be used as a replication recipe of output data products. However, capturing provenance requires enormous effort by scientists in terms of time and training. First, they need to design the workflow of the scientific model, i.e., workflow provenance, which requires both time and training. Second, they need to capture provenance while the model is running, i.e., fine-grained data provenance. Explicit documentation of fine-grained provenance is not feasible because of the massive storage consumption by provenance data in the applications where data is continuously arriving and is processed. We propose an inference-based framework which provides both workflow and fine-grained data provenance at a minimal cost in terms of time, training and disk consumption. Our framework is applicable to any given scientific model and is capable of handling different system dynamics such as variation in the processing time as well as input data products arrival pattern. We evaluate the framework on two different use cases. Our evaluation shows that the proposed framework can infer accurate provenance information at reduced costs and therefore, is relevant and suitable for scientists in different domains.

27 May 2013 (Monday)
Mena Habib
University of Twente at #MSM2013
Twitter messages are a potentially rich source of continuously and instantly updated information. Shortness and informality of such messages are challenges for Natural Language Processing tasks. In this paper we present a hybrid approach for Named Entity Extraction (NEE) and Classification (NEC) for tweets. The system uses the power of the Conditional Random Fields (CRF) and the Support Vector Machines (SVM) in a hybrid way to achieve better results. For named entity type classification we used the AIDA disambiguation system to disambiguate the extracted named entities and hence find their type.

4 June 2013 (10.00h. in Waaier 4)
CTIT Symposium on Big Data
The CTIT Symposium Big Data and the emergence of Data Science addresses the multidisciplinary opportunities and challenges of Big Data from the perspectives of industrial research, academic research, and education.

10 June 2013 (Monday)
Dan Ionita and Niek Tax
An API-based Search System for One Click Access to Information
This paper proposes a prototype One Click access system, based on previous work in the field and the related 1CLICK-2@NTCIR10 task. The proposed solution integrates methods from previous such attempts into a three-tier algorithm (query categorization, information extraction and output generation) and offers suggestions on how each of these can be implemented. Finally, a thorough user-based evaluation concludes that such an information retrieval system outperforms the textual preview collected from Google search results, based on a paired sign test. Based on the validation results, suggestions for future improvements are proposed.

18 June 2013
Design projects
Online leeromgeving voor derde wereld landen (Online learning environment for Third World countries)
We will present the result of our ‘Ontwerpproject’ (design project). Over the last semester, we built a system that helps students in Third World countries with their Master's thesis. These students will graduate at their own university, but can use our system to get in touch with supervisors from around the world to help them with their research. This way, the limitations on resources and expertise in those areas are addressed, so that students can reach their potential. During this presentation, we will demonstrate the system and discuss some techniques we used to accomplish our goals.
Science Challenges
For the 'Ontwerpproject' we set out to build a system that will support the Challenge initiative. This initiative strives to organize and promote science challenges for students of the University of Twente. Challenges are short, engaging and fun projects – often combined with a competition – in which students can participate in teams or alone. The presentation will briefly describe the initiative, our workflow and the system that we built.

26 June 2013 (Wednesday)
Ilya Markov (Università della Svizzera italiana)
Reducing the Uncertainty in Resource Selection
The distributed retrieval process is plagued by uncertainty. Sampling, selection, merging and ranking are all based on very limited information compared to centralized retrieval. This talk will focus on reducing the uncertainty within the resource selection phase by obtaining a number of estimates, rather than relying upon only one point estimate. Three methods for reducing uncertainty will be proposed, which will be compared against state-of-the-art baselines across three distributed retrieval testbeds. The experimental results show that the proposed methods significantly improve over the baselines, reduce the uncertainty and improve the robustness of resource selection.

2 July 2013
Christian Wartena (Hochschule Hannover)
Distributional Similarity of Words with Different Frequencies
In this talk I will present three case studies in which we deal with the frequency bias of distributional similarity. In the first study we compare context vectors of words with word distributions of documents in order to find words that can represent the content of a document. Here we explicitly require that the context vectors of the words are similar to the word distribution of the document and dissimilar to the general word distribution of the collection. In two recent studies we used distributional similarity to determine the semantic equivalence of words and short phrases. Here we eliminate the frequency bias by modeling the dependency of vector similarity on the word frequency.
We are not yet able to give a final solution to the frequency problem of distributional semantics, but the three case studies show that distributional similarity can be substantially improved when we take the frequency bias seriously. Thus the topic is definitely worth investigating further.

9 July 2013
Robin Aly
Taily: Shard Selection Using the Tail of Score Distributions
I propose Taily, a novel shard selection algorithm that models a query’s score distribution in each shard as a Gamma distribution and selects shards with highly scored documents in the tail of the distribution. Taily estimates the parameters of score distributions based on the mean and variance of the score function’s features in the collections and shards. Because Taily operates on term statistics instead of document samples, it is efficient and has deterministic effectiveness. Experiments on large web collections (Gov2, CluewebA and CluewebB) show that Taily achieves similar effectiveness to sample-based approaches, and improves upon their efficiency by roughly 20% in terms of used resources and response time.
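As a rough illustration of how a Gamma distribution can be derived from a mean and a variance, the sketch below uses the method of moments. The abstract does not say which fitting procedure is used, so the method of moments is an assumption here, and the function name and the numbers are hypothetical.

```python
def gamma_from_moments(mean: float, variance: float) -> tuple[float, float]:
    """Return (shape, scale) of a Gamma distribution with the given mean and variance.

    Method of moments (an assumption, not taken from the abstract):
    mean = shape * scale and variance = shape * scale**2.
    """
    if mean <= 0 or variance <= 0:
        raise ValueError("mean and variance must be positive")
    shape = mean * mean / variance
    scale = variance / mean
    return shape, scale


# Hypothetical score statistics for one shard and one query.
shape, scale = gamma_from_moments(mean=4.0, variance=8.0)
```

With the shape and scale in hand, the probability mass above a score threshold could then be read from the Gamma survival function, for instance with `scipy.stats.gamma.sf` if SciPy is available.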

See also: 2012's DB colloquium.

2012’s DB colloquium

Below is a list of last year's DB colloquia, usually held on Tuesdays from 13:45h. – 14:30h. in ZI-3126.

26 January 2012 (Thursday at 16.00 – in Dutch)
Design Project Group
Quetis – a self-configuring interface between devices
Quetis is a software tool that allows communication with paralyzed patients by interactively calibrating and configuring any given set of human input devices that have a compatible middleware driver for the patient. Quetis detects user capabilities per device, configures itself to utilize only the subset of actions that the patient can actually use, and maps these to a configuration driving a premade GUI, though it could use any generic output system with proper middleware. In short: Quetis is a generalized and self-configuring interface between specialized input devices and the environment for paralyzed patients in ICUs.

27 February 2012
Tjitze Rienstra (University of Luxembourg)
Argumentation Theory
In the theory of abstract argumentation, the acceptance status of arguments is normally determined for the complete set of arguments at once, under a single semantics. However, this is not always desired. For example, in multi-agent systems, the provenance of the arguments and the competence of the agents often suggest different evaluation criteria for different arguments.

13 March 2012
Robin Aly
AXES: Access to Audiovisual Archives
The EU Project AXES aims at opening large audio-visual archives to a variety of user groups. My main task within the project is to provide search and linking functionality. Meanwhile, the first year of the project has passed and its progress will be reviewed next week. This colloquium is a rehearsal of the presentation I will give there. I will also provide a general overview of the project and demos of existing work.

28 March 2012 (Wednesday at 15.00h.-15.45h.)
Dong Nguyen
Evaluating federated search on the Web
Federated search systems have previously been evaluated by reusing existing TREC datasets. However, these datasets do not reflect realistic search systems found on the Web. As a result, it has been difficult to assess whether these systems are suitable for federated search on the Web. We therefore introduce a new dataset containing more than a hundred actual search engines. We first discuss the design and present several analyses of the dataset. We then compare several popular resource selection methods and discuss the results. Several suggestions/modifications to incorporate more Web-specific features are then presented.

28 March 2012 (Wednesday at 10.45h.-11.45h. in CR-1B)
Henning Rode (Textkernel)
Structured Retrieval in Practice
In this talk I will give a demo of a CV search system built for job recruiters, and describe the challenges of building the system, such as user-friendly faceted search, synonym handling, and location search. When searching on richly structured documents such as CVs we also encountered a number of ranking problems using the standard language modelling approach for retrieval. The second part of the presentation will therefore discuss these issues in more detail and explain why they require field-specific solutions. Finally, I will share some ideas on how to further improve the search experience by making use of large domain knowledge sources.

3 April 2012
Maarten Fokkinga
Database Design
what you’ve been doing always but never was fully aware of
We show how to construct (in an almost algorithmic way) a query formulation for a database schema, out of an (arguably simpler) query formulation in terms of an Entity-Relationship diagram. To do so requires first a thorough understanding of the construction of a database schema out of the ER diagram. For this latter task, we show how to express the relations between the various steps in the development of the database schema, and what proof obligations exist.

17 April 2012
Lesley Wevers
A functional database programming language
We explore the possibilities of using functional languages in database management. We will develop a prototype implementation and compare it to the traditional approach of general purpose language and database management systems on aspects of performance and usability.

1 May 2012
Juan Amiguet, Rezwan Huq, and Andreas Wombacher
Data Processing – provenance and propagation
We want to briefly introduce two case studies we are currently working on and which Juan and Rezwan will use for evaluating their research. We hope this talk will allow us to discuss the differences between propagation, investigated by Juan, and provenance, investigated by Rezwan.

2 May 2012 (Wednesday at 11:30h. in ZI-4126)
Dolf Trieschnigg
An Exploration of Language Identification Techniques for Dutch Folktales
The Dutch Folktale Database contains fairy tales, traditional legends, urban legends, and jokes written in a large variety and combination of languages including (Middle and 17th century) Dutch, Frisian and a number of Dutch dialects. In this work we compare a number of approaches to automatic language identification for this collection. We show that in comparison to typical language identification tasks, classification performance for highly similar languages with little training data is low.

12 June 2012
Robin Aly
Statistical Shard Selection
Large search engines partition their documents into so-called shards (each shard contains the index of many documents). For a query, the search engine has to decide which shards should be used for searching (usually the top-n). To decide which shards to use, the shards are represented by sample documents which are retrieved in a first retrieval step. However, generating good document samples is not trivial and requires storage space. In this talk, I want to present my ideas for the shard selection problem. The ideas are based on the fact that most current retrieval models are simply weighted sums of features, for which simple statistical laws exist: the expectation of the sum is the sum of its expectations, and the variance of the sum is the sum of its variances (plus covariance terms when the features are correlated). I propose to represent shards by their expected feature values, the feature variances and covariances. Using this representation, I hope one can determine the score distribution for the current query in each shard. The shards to select are those which have a fair chance of containing documents with a score higher than a certain threshold, according to this distribution.
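The statistical laws mentioned in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions: the score is taken to be a weighted sum of features, and the function names, weights, and shard statistics are all hypothetical rather than taken from the talk.

```python
from typing import Sequence


def score_mean(weights: Sequence[float], feature_means: Sequence[float]) -> float:
    """Expectation of a weighted sum: the weighted sum of the feature expectations."""
    return sum(w * m for w, m in zip(weights, feature_means))


def score_variance(weights: Sequence[float], cov: Sequence[Sequence[float]]) -> float:
    """Variance of a weighted sum: sum over i, j of w_i * w_j * cov[i][j].

    The diagonal of cov holds the feature variances; the off-diagonal
    entries hold the covariances between features.
    """
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j] for i in range(n) for j in range(n))


# Hypothetical per-shard statistics for a two-feature retrieval model.
weights = [1.0, 2.0]
feature_means = [0.5, 0.25]
cov = [[0.04, 0.01],
       [0.01, 0.09]]

shard_mean = score_mean(weights, feature_means)
shard_var = score_variance(weights, cov)
```

Only the per-shard means and the covariance matrix need to be stored, which is what makes this representation cheaper than keeping document samples.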

18 June 2012 (Monday all day)
CTIT Symposium 2012
ICT: The Innovation Highway
In EU’s Horizon 2020 three objectives have been set: excellent research, competitive industries, and better society. ICT plays a central role in reaching these goals. At this year’s CTIT symposium we will take up the challenges defined at the national and European level. Challenges which, when solved, will lead us to a better future. Together with you we will show that ICT is the true innovation highway.

19 June 2012
Design Projects
Yasona: a peer-to-peer social network
Our goal was to design a peer-to-peer network structure to create a decentralized social network. The design should not focus on social aspects of networking, or the various possibilities a social medium could have, per se, but should instead offer a platform on which such elements can be built. Yasona was developed in order to give people the opportunity to communicate with each other and share media without being dependent on a central server. We will demonstrate our prototype and discuss our goals, design choices and recommendations.

Design, implementation and evaluation of the Bata App
For the Batavierenrace an Android smart-phone app has been designed and implemented which –in a well organized way– shows all available information from the official Batavierenrace organization (like teams, running times, standing, etc). After some initial updates, the app has performed satisfactorily, without bugs, and was downloaded over two thousand times (being in the top 10 of Google market). The design, problems encountered, and experiences will be discussed during the talk.

17 July 2012
Brend Wanders
Semi-structured data in a wiki
Wikis offer free form editing of, and collaboration on, texts. These texts are usually of an informative nature, and are intended for consumption by people. By embedding semi-structured information in a wiki, the information can also be used by other systems. In this short talk I will present my take on using a wiki as a basis for the collaborative creation and curation of data sets by offering ad-hoc data entry and querying.

15 August 2012 (Wednesday at 13.30h.)
Nick Barkas (Spotify, Sweden)
Search at Spotify
Nick Barkas is a software developer at Spotify in Stockholm, Sweden, working mostly with backend/server-side systems. He studied scientific computing at KTH in Stockholm and the University of Washington in Seattle. Barkas will talk about how Spotify serves music metadata to users and how that relates to search.

11 September 2012
Mohammed Salem (Humboldt University, Berlin)
Journalistic Multimedia Data Analytics
In this project we propose to develop applications and tools around content-based journalistic data management, analysis, retrieval and visualization. New algorithms are needed for automatic extraction of content-related metadata and annotation, not only for text documents but also for news videos, images, audio signals and animations. Moreover, new retrieval methods are needed that utilize the multimodal nature of the news data and are able to return different materials related to a certain news story.

25 September 2012
Mena Habib
Toponym extraction and disambiguation
Toponym extraction and disambiguation are key topics recently addressed by the fields of Information Extraction and Geographical Information Retrieval. Toponym extraction and disambiguation are highly dependent processes. Not only does toponym extraction effectiveness affect disambiguation, but disambiguation results may also help improve extraction accuracy. In this paper we propose a hybrid toponym extraction approach based on Hidden Markov Models (HMM) and Support Vector Machines (SVM). A Hidden Markov Model is used for extraction with high recall and low precision. Then an SVM is used to find false positives based on informativeness features and coherence features derived from the disambiguation results. Experiments conducted on a set of descriptions of holiday homes, with the aim of extracting and disambiguating toponyms, showed that the proposed approach outperforms the state-of-the-art extraction methods and also proved to be robust. Robustness is shown in three respects: language independence, high and low HMM threshold settings, and limited training data.

28 September 2012 (Friday at 14.00h.)
Hans Wormer (Almere Data Capital)
Growing with Big Data
Hans Wormer is program manager of Almere Data Capital. The term Data Capital refers to the concentration of companies, services, knowledge and facilities that support the collection, storage, access, sharing, editing and visualization of big data. The program Almere Data Capital brings together supply and demand safely and efficiently. Hans Wormer will address: Almere's vision on the developments around Big Data; Almere's approach to stimulate new activities; the creation of new jobs for the city and region; and finally, how to get involved in Almere Data Capital.

1 October 2012 (Monday 13.30h. in ZI-3126)
Victor de Graaff (with an introduction from Djoerd Hiemstra)
The theory behind scrum
Djoerd will give a “pecha kucha” introduction of 6 minutes and 40 seconds on the plans for the module “Data & Information” of the new Computer Science bachelor.
Victor will give a 45-minute presentation on the theory behind Scrum, an increasingly popular software development methodology. Scrum is an implementation of Agile development, and is based on the team's ability to plan and review its own work.

23 October 2012
Rezwan Huq
From Scripts Towards Provenance Inference
Scientists require provenance information either to validate their model or to investigate the origin of an unexpected value. However, they do not maintain any provenance information and even designing the processing workflow is rare in practice. Therefore, in this paper, we propose a solution that can build the workflow provenance graph by interpreting the scripts used for actual processing. Further, scientists can request fine-grained provenance information facilitating the inferred workflow provenance. We also provide a guideline to customize the workflow provenance graph based on user preferences. Our evaluation shows that the proposed approach is relevant and suitable for scientists to manage provenance.

12 November 2012 (Monday, 12.30h. in ZI-2126)
Iwe Muiser
Cleaning up and Standardizing a Folktale Corpus for Humanities Research
Recordings in the field of folk narrative have been made around the world for many decades. By digitizing and annotating these texts, they are frozen in time and are better suited for searching, sorting and performing research on. This paper describes the first steps of the process of standardization and preparation of digital folktale metadata for scientific use and improving availability of the data for humanities and, more specifically, folktale research. The Dutch Folktale Database has been used as case study but, since these problems are common in all corpora with manually created metadata, the explanation of the process is kept as general as possible.

14 November 2012 (Wednesday, 13.45h. in CR-3E)
Thijs Westerveld (Teezir, Utrecht)
Analysing Online Sentiments: Big Data, Small Building Blocks
The term big data has become mainstream to the point that it's showing up in lists of most annoying management buzzwords. Teezir helps its customers to find value in big data beyond the hype. By collecting and analysing almost half a million documents on a daily basis and ordering, summarizing and aggregating the gathered information, we turn big data into valuable insights. To process a continuous stream of Tweets, Facebook updates, forum and blog posts and online and offline news articles we have developed a series of building blocks. In this talk I will discuss some of these, including our smart crawlers that learn which links to follow based on user interaction, our text analysis components to detect the language and sentiment of a document, and the index structures we use to quickly produce suitable aggregates in a faceted-search-like fashion. To conclude, I will give a demonstration of our analytics dashboards and show some examples of how our customers interact with this data and how they incorporate our technology in their daily process.

20 November 2012
Sergio Duarte
Query Recommendation for Children
In this work we propose a method that utilizes tags from social media to suggest queries related to children's topics. Concretely, we propose a simple yet effective approach to bias a random walk defined on a bipartite graph of web resources and tags through keywords that are more commonly used to describe resources for children. We evaluate our method using a large query log sample of queries aimed at retrieving information for children. We show that our method outperforms query suggestions of state-of-the-art search engines and state-of-the-art query suggestions based on random walks.

27 November 2012
Mohammad Khelghati
Size Estimation of Non-Cooperative Data Collections
In this paper, the approaches for estimating the size of non-cooperative databases and search engines are categorized and reviewed. The most recent approaches are implemented and compared in a real environment. Finally, four methods based on the modification of the available techniques are introduced and evaluated. In one of the modifications, the estimations from other approaches could be improved by 35 to 65 percent.

18 December 2012
Fabian Panse (University of Hamburg)
Indeterministic Handling of Uncertain Decisions in Deduplication
In this paper, we present an indeterministic approach for deduplication by using a probabilistic target model including techniques for proper probabilistic interpretation of similarity matching results. Thus, instead of deciding for a most likely situation, all realistic situations are modeled in the resultant data. This approach minimizes the negative impact of false decisions. Furthermore, the deduplication process becomes almost fully automatic and human effort can be reduced to a large extent. To increase applicability, we introduce several semi-indeterministic methods that heuristically reduce the set of indeterministically handled decisions in several meaningful ways. We also describe a full-indeterministic method for theoretical and presentational reasons.

See also: 2011's DB colloquium.

2011’s DB colloquium

Below is a list of last year’s DB colloquia, usually held on Tuesdays from 13:45h. – 14:30h. in ZI-3126.

6 January 2011 (Thursday at 15.00 h.)
Andreas Wombacher
Data driven inter-model conformance checking
Sensor data document changes in the physical world, which can be understood based on metadata that models part of the physical world. In the digital world, information systems are used for handling exchange of information between different actors, where some information is related to physical objects. Since these objects are potentially the same as those observed by sensors, the sensor model (metadata) and the information system should describe the handling of physical objects in the same way, i.e., the information system and the sensor model should conform. So far conformance checking has been done at the model level. I propose to use observed sensor data, and potentially information system data, to check conformance.

13 January 2011 (Thursday, 10:45-12:30h. in CR-3E)
Data Warehousing and Data Mining guest lectures

Tom Jansen (Distimo)
Data warehousing for app store analytics. Distimo is an innovative app store analytics company built to solve the challenges created by a widely fragmented app store marketplace filled with equally fragmented information and statistics.

Jacques Niehof and Alexandra Molenaar (SIOD)
Data Mining for selection profiles for fraud and misuse of social security. The Social Intelligence and Investigation Service (SIOD) of the Ministry of Social Affairs and Employment (Ministerie van Sociale Zaken en Werkgelegenheid) fights criminality in the field of social security.

8 February 2011
Paul Stapersma
A probabilistic XML database on top of MayBMS
We use the probabilistic XML model proposed by Van Keulen and De Keijzer to create a prototype of a probabilistic XML database. One disadvantage of the XML data model is that queries cannot be executed as efficiently as in the relational database model. Many non-probabilistic mapping techniques have been developed to map semi-structured data into relational databases to overcome this disadvantage. In this research, we use the schema-less mapping technique ‘XPath Accelerator’ to build a probabilistic XML database (PXML-DBMS) based on a URDBMS. A working prototype can be found at http://code.google.com/p/pxmlconverter/.
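The idea of a schema-less, pre/post-order mapping of XML into a relation can be sketched as follows. This is a minimal illustration only, assuming the commonly described encoding in which every node receives a preorder and a postorder rank. The function names `encode` and `descendants` and the sample document are hypothetical, probabilities are not modelled, and nothing here is taken from the prototype itself.

```python
import xml.etree.ElementTree as ET

def encode(root):
    """Return one (pre, post, tag) row per element, as a relational table could store it."""
    rows = []
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]          # rank assigned when the node is entered
        counters["pre"] += 1
        for child in node:
            visit(child)
        post = counters["post"]        # rank assigned when the node is left
        counters["post"] += 1
        rows.append((pre, post, node.tag))

    visit(root)
    return sorted(rows)

def descendants(rows, context):
    """Descendant axis as a range predicate: larger pre rank, smaller post rank."""
    pre, post, _ = context
    return [r for r in rows if r[0] > pre and r[1] < post]

# Hypothetical sample document.
doc = ET.fromstring("<a><b><c/><d/></b><e/></a>")
table = encode(doc)
b = next(r for r in table if r[2] == "b")
print([r[2] for r in descendants(table, b)])
```

Because the descendant test is a plain comparison on two integer columns, it could be evaluated by an ordinary relational query, which is the kind of efficiency gain the mapping is meant to provide.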

24 May 2011 (11:30h. in ZI-5126)
Almer Tigelaar
Search Result Caching in P2P Information Retrieval Networks
We explore the solution potential of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that a small bounded cache offers performance comparable to an unbounded cache. Furthermore, we explore partially centralised and fully distributed scenarios, and find that in the most realistic distributed case caching can reduce the query load by thirty-three percent. With optimisations this can be boosted to nearly seventy percent.
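A bounded result cache of the kind discussed above can be sketched in a few lines. This is an illustrative toy under stated assumptions: the least-recently-used eviction policy, the class name `BoundedCache`, and the query stream are all hypothetical, and the abstract does not say which policy or workload the simulations used.

```python
from collections import OrderedDict

class BoundedCache:
    """Least-recently-used cache holding at most `capacity` query results (None = unbounded)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, query):
        """Return True on a hit; on a miss, store the query and evict the oldest entry if full."""
        if query in self.entries:
            self.entries.move_to_end(query)
            return True
        self.entries[query] = "results for " + query
        if self.capacity is not None and len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return False

def hit_rate(queries, capacity):
    """Fraction of queries answered from the cache."""
    cache = BoundedCache(capacity)
    hits = sum(cache.lookup(q) for q in queries)
    return hits / len(queries)

# Made-up query stream in which a few queries are very popular.
stream = ["a", "b", "a", "c", "a", "b", "d", "a", "b", "a"]
print(hit_rate(stream, capacity=3), hit_rate(stream, capacity=None))
```

On this toy stream a cache of three entries reaches the same hit rate as an unbounded one, because the popular queries are never evicted; real query logs would of course behave differently.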

31 May 2011 (11:30h. in ZI-5126)
Kien Tjin-Kam-Jet
Free-Text Search over Complex Web Forms
This paper investigates the problem of using free-text queries as an alternative means for searching ‘behind’ web forms. We introduce a novel specification language for specifying free-text interfaces, and report the results of a user study where we evaluated our prototype in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at completing their search tasks.

7 June 2011
CTIT Symposium Security and Privacy – something to worry about?
The list of invited speakers includes Prof.dr.ir. Vincent Rijmen (TU Graz, Austria and KU Leuven), Dr. George Danezis (Microsoft Research), Dr. Steven Murdoch (University of Cambridge), Prof. Bert-Jaap Koops (TILT), Prof.mr.dr. Mireille Hildebrandt (RU Nijmegen), Dr.ir. Martijn van Otterlo (KU Leuven) and Prof. Pieter Hartel (UT).

21 June 2011
Andreas Wombacher
Sensor Data Visualization & Aggregation: A self-organizing approach
In this year's Advanced Database Systems course the students had the assignment to design and implement the database functionality to visualize sensor data based on user requests in a Web-based system. To guarantee good response times of the database it is necessary to pre-aggregate data. The core of the assignment was to find good pre-aggregations which minimize the query response times while using only a limited amount of storage space. The pre-aggregation levels must adjust when the characteristics of the user requests change. In this talk I will present the different approximation approaches of the students and present an optimal solution to this NP-hard problem.

28 June 2011
Marijn Koolen (University of Amsterdam)
Relevance, Diversity and Redundancy in Web Search Evaluation
Diversity performance is often evaluated with a measure that combines relevance, subtopic coverage and redundancy. Although this is understandable from a user perspective, it is problematic when analysing the impact of diversifying techniques on diversity performance. Such techniques not only affect subtopic coverage, but often the underlying relevance ranking as well. An evaluation measure that conflates these aspects hampers our progress in developing systems that provide diverse search results. In this talk, I argue that to further our understanding of how system components affect diversity, we need to look at relevance, coverage and redundancy individually. Using the official runs of the TREC 2009 Diversity task, I show that differences in diversity performance are mainly due to differences in the relevance ranking, with only minimal differences in how the relevant documents are ordered amongst themselves. If we measure diversity independent of the relevance ranking, we find that some of the systems that perform badly on conflated measures have the most diverse ordering of relevant documents.

30 June 2011 (Thursday, 10:30-11:30h. in ZI-3126)
Design Project Presentations

Tristan Brugman, Maarten Hoek, Mustafa Radha, Iwan Timmer, Steven van der Vegt
UT-Search. UT Search was built as a replacement for Google Custom Search. We set out to build a search engine that uses knowledge of the various systems within the UT to also find the large amount of information that cannot be found through custom search. We use aggregated search to query the university's various systems. With faceted search, the user can select which systems to search and so refine the query. The goal is a central search engine for the university from which all the different systems can be searched, making the information more accessible.

Ralph Broenink, Rick van Galen, Jarmo van Lenthe, Bas Stottelaar, Niek Tax
Alexia: the drinks management system for Stichting Borrelbeheer Zilverling. With the activities of five associations, graduation drinks and other events, both drinks rooms of the Educafé are heavily booked. This high occupancy has created a need for a management system that can schedule drinks events, keep track of stock, manage bartenders, and register consumption during the events. As an extra, it is also possible to drink on account using RFID cards. During the presentation we show how, from the requirements of these five associations, we developed a web application that fulfils the functions above, using the latest technologies such as Django, CSS3 and HTML5.

12 July 2011
Sergio Duarte
Sergio will talk about his internship at Yahoo! Research, Barcelona.

23 August 2011
Rezwan Huq
Inferring Fine-grained Data Provenance in Stream Data Processing: Reduced Storage Cost, High Accuracy
Fine-grained data provenance ensures reproducibility of results in decision making, process control and e-science applications. However, maintaining this provenance is challenging in stream data processing because of its massive storage consumption, especially with large overlapping sliding windows. In this paper, we propose an approach to infer fine-grained data provenance by using a temporal data model and coarse-grained data provenance of the processing. The approach has been evaluated on a real dataset and the result shows that our proposed inferring method provides provenance information as accurate as explicit fine-grained provenance at reduced storage consumption.

30 August 2011 at 14:30h.
Maarten Fokkinga
Aggregation – polymorphic and polytypic
Repeating the work of Meertens “Calculate Polytypically!” we show how to define in a few lines a very general “aggregation” function. Our intention is to give a self-contained exposition that is, compared to Meertens' work, more accessible for the uninitiated reader who wants to see the idea with a minimum of formal details.

5 September 2011 (Monday at 13.30h.)
Robin Aly
Towards a Better Understanding of the Relationship Between Probabilistic Models in IR
Probability of relevance (PR) models are generally assumed to implement the Probability Ranking Principle (PRP) of IR, and recent publications claim that PR models and language models are similar. However, a careful analysis reveals two gaps in the chain of reasoning behind this statement. First, the PRP considers the relevance of particular documents, whereas PR models consider the relevance of any query-document pair. Second, unlike PR models, language models consider draws of terms and documents. We bridge the first gap by showing how the probability measure of PR models can be used to define the probabilistic model of the PRP. Furthermore, we argue that given the differences between PR models and language models, the second gap cannot be bridged at the probabilistic model level. We instead define a new PR model based on logistic regression, which has a score function similar to that of the query likelihood model. The performance of both models is strongly correlated, hence providing a bridge for the second gap at the functional and ranking level. Understanding language models in relation to logistic regression models opens ample new research directions, which we propose as future work.

20 September 2011
Juan Amiguet
Annotation propagation and topology based approaches
When making sensor data stream applications more robust to changes in the sensing environment by introducing annotations that represent those changes, we find ourselves with the need to propagate such annotations across processing elements. We present here a technique for performing such propagation, exploiting the relations amongst the inputs and outputs from both an information-theoretic and a topological perspective. Topologies are used to describe the structure of the inputs and the outputs separately, whilst information theory techniques are used to model the transform as a channel, enabling the topological transformations to be treated as optimisation problems. We present here a framework of functions which is generic with respect to all transforms, and which enables the maximisation of the entropy across the transform.

26 October 2011
Sergio Duarte
What and How Children Search on the Web
In this work we employed a large query log sample from a commercial web search engine to identify the struggles and search behavior of users ranging from children of the age of 6 to young adults of the age of 18. Concretely, we hypothesized that the large and complex volume of information to which children are exposed leads to ill-defined searches and to disorientation during the search process. For this purpose, we quantified their search difficulties based on query metrics (e.g. fraction of queries posed in natural language), session metrics (e.g. fraction of abandoned sessions) and click activity (e.g. fraction of ad clicks). We also used the search logs to retrace stages of child development. Concretely, we looked for changes in the user interests (e.g. distribution of topics searched), language development (e.g. readability of the content accessed) and cognitive development (e.g. sentiment expressed in the queries) among children and adults. We observed that these metrics clearly demonstrate an increased level of confusion and unsuccessful search sessions among children. We also found a clear relation between the reading level of the clicked pages and the demographic characteristics of the users, such as age and the average educational attainment of the zone in which the user is located.

29 November 2011
Rezwan Huq
Adaptive Inference of Fine-grained Data Provenance to Achieve High Accuracy at Lower Storage Costs
In stream data processing, data arrives continuously and is processed by decision making, process control and e-science applications. To control and monitor these applications, reproducibility of results is a vital requirement. However, it requires a massive amount of storage space to store fine-grained provenance data, especially for those transformations with overlapping sliding windows. In this paper, we propose techniques which can significantly reduce storage costs and can achieve high accuracy. Our evaluation shows that the adaptive inference technique can achieve more than 90% accurate provenance information for a given dataset at lower storage costs than the other techniques. Moreover, we present a guideline on the usage of the different provenance collection techniques described in this paper, based on the transformation operation and stream characteristics.

See also: 2010's DB colloquium.

2010’s DB colloquium

Below you find this year's DB colloquia, usually on Tuesdays from 13:45h. – 14:30h. in ZI-3126.

24 March 2010 (Wednesday)
Robin Aly
Beyond shot retrieval: Searching for Broadcast News Items Using Language Models of Concepts
In this paper we use a method to evaluate the performance of story retrieval, based on the TRECVID shot-based retrieval ground truth. Our experiments on the TRECVID 2005 collection show a significant performance improvement against four standard methods.

30 March 2010
Maarten Fokkinga
A Greedy Algorithm for Team Formation that is Fair over Time
In terms of a concrete example we derive a “fast” so-called greedy algorithm for a “hard” problem (having exponential time complexity). The concrete problem is: the formation of teams from a given set of players such that, when repeated many times, each player is equally often teammate of each other player. We also formalize our greedy algorithm in a general setting.
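To make the problem statement concrete, one possible greedy heuristic is sketched below. It is not the algorithm derived in the talk: the rule used here (always add the unassigned player who has been a teammate of the current team members least often) and the names `form_teams` and `together` are assumptions made for illustration.

```python
from itertools import combinations

def form_teams(players, team_size, together):
    """Greedily split players into teams, preferring pairs that were teammates least often."""
    unassigned = list(players)
    teams = []
    while unassigned:
        team = [unassigned.pop(0)]
        while len(team) < team_size and unassigned:
            # Pick the player with the fewest past pairings with the current team.
            best = min(unassigned,
                       key=lambda p: sum(together[frozenset((p, q))] for q in team))
            unassigned.remove(best)
            team.append(best)
        teams.append(team)
    # Record the new pairings so that later rounds can compensate for them.
    for team in teams:
        for p, q in combinations(team, 2):
            together[frozenset((p, q))] += 1
    return teams

players = ["A", "B", "C", "D"]
together = {frozenset(pair): 0 for pair in combinations(players, 2)}
for round_no in range(3):
    print(round_no, form_teams(players, 2, together))
```

With four players and teams of two, three rounds are enough for every pair to have played together exactly once. For larger inputs this heuristic gives no such guarantee; it only illustrates the fairness-over-time objective.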

6 April 2010
Sergio Duarte
An analysis of queries intended to search information for children
In this paper we analyze queries and groups of queries intended to satisfy children’s information needs using a large-scale query log to compare the characteristics of these queries. The aim of this analysis is twofold: i) to identify differences in the query space, content space, user sessions and user behavior of these two types of queries; ii) to enhance this query log by including annotations on children's queries, sessions and actions.

13 April 2010
Djoerd Hiemstra
MapReduce information retrieval experiments
We propose to use MapReduce to quickly test new retrieval approaches on a cluster of machines by sequentially scanning all documents. We present a small case study in which we use a cluster of 15 low cost machines to search a web crawl of 0.5 billion pages (ClueWeb09, Category A) showing that sequential scanning is a viable approach to running large-scale information retrieval experiments with little effort.

20 April 2010
Maurice van Keulen
Climbing trees in parallel universes
When doing research on data uncertainty, one enters the realm of science fiction, where the same data can co-exist in parallel universes in slightly different forms. An uncertain XML document is like a tree that exists in many parallel universes where its leaves and branches may exist in one, but possibly not in another. Querying XML is navigating through XML trees … so I'm going to teach you how to efficiently climb trees in parallel universes. The trick is to simultaneously climb the tree in all parallel universes at once, while not stepping on branches that do not exist … because then you will fall down in one universe and not in the other, splitting yourself into a living and a dead person like Schrödinger's cat … sorry, my imagination carries me away …

27 April 2010
Almer Tigelaar
Query-Based Sampling using Only Snippets
Query-based sampling is a popular approach to model the content of an uncooperative server. It works by sending queries to the server and downloading the returned documents in the search results in full. This sample of documents then represents the server’s content. We present an approach that uses the document snippets as samples instead of downloading entire documents. This yields more stable results at the same amount of bandwidth usage as the full document approach. Additionally, we show that using snippets does not necessarily incur more latency, but can actually save time.
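The sampling loop can be sketched as below. This is a toy under stated assumptions: the in-memory `DOCS` collection and the `search` function stand in for a real uncooperative server, the next query is drawn uniformly from the terms learned so far, and none of these choices are claimed to match the setup used in the paper.

```python
import random
from collections import Counter

DOCS = {  # hypothetical server contents
    1: "probabilistic xml databases store uncertain data",
    2: "peer to peer retrieval networks cache search results",
    3: "sensor data streams need provenance and annotations",
    4: "search engines rank documents for uncertain queries",
}

def search(query, snippet_words=4):
    """Stand-in for the server: returns only a short snippet around the query term."""
    results = []
    for doc_id, text in DOCS.items():
        words = text.split()
        if query in words:
            i = words.index(query)
            start = max(0, i - snippet_words // 2)
            results.append((doc_id, " ".join(words[start:start + snippet_words])))
    return results

def sample_with_snippets(seed_term, rounds, rng):
    """Build a term-frequency model of the server from snippets only."""
    model = Counter()
    seen = set()
    query = seed_term
    for _ in range(rounds):
        for doc_id, snippet in search(query):
            if (doc_id, snippet) not in seen:   # count each distinct snippet once
                seen.add((doc_id, snippet))
                model.update(snippet.split())
        # Next query: a random term from what has been learned so far.
        query = rng.choice(sorted(model)) if model else seed_term
    return model

model = sample_with_snippets("search", rounds=10, rng=random.Random(0))
print(model.most_common(5))
```

No full document is ever downloaded here: the model grows from the snippets alone, which is where the bandwidth saving described above would come from.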

4 May 2010
Riham Abdel Kader
Run-time Optimization for Pipelined Systems
Traditional optimizers fail to pick good execution plans when faced with increasingly complex queries and large data sets. This failure is even more acute in the context of XQuery, due to the structured nature of the XML language. To overcome the vulnerabilities of traditional optimizers, we have previously proposed ROX, a Run-time Optimizer for XQueries, which interleaves optimization and execution of full tables. ROX has proved to be robust, even in the presence of strong correlations, but it has one limitation: it uses full materialization of intermediate results, making it unsuitable for pipelined systems. Therefore, this paper proposes ROX-sampled, a variant of ROX, which executes small data samples, thus generating smaller intermediates. We conduct extensive experiments which show that ROX-sampled is comparable to ROX in performance, and that it is still robust against correlations. The main benefit of ROX-sampled is that it allows the large number of pipelined databases to import the ROX idea into their optimization paradigm.

11 May 2010
Peter Apers
How bright looks your research future.
In the beginning of your research career you set out your own goals and take decisions that may negatively affect your career. With this presentation I want to make you aware of this. Topics that are discussed: Research Challenges, Trends in Research Funding, Funding agencies with their own programs, What is expected from you?

18 May 2010
Robin Aly
Uncertainty in Information Retrieval
Today, information retrieval systems take many things for granted. For example: (1) A classifier decides whether a concept occurs in a medical document. (2) Every document that contains an expert's email address describes his competence. (3) The system's parameters in a retrieval function are constant for all queries. Finally, a score is the final number to rank documents by, ignoring the other documents in the ranking. In this talk, I will first identify three main sources of uncertainty in an information retrieval system. Afterwards, I will describe existing approaches to this uncertainty and propose future directions in this field of research. This talk contains the core ideas of the research I would like to conduct in the future and therefore has a more visionary character.

25 May 2010
Andreas Wombacher
Uncertainty principle in stream processing
In environmental applications sensor data are processed online as a stream of measurements for warning, decision support, forecasting and controlling applications. The stream processing can be (i) accurate but delayed, since it requires waiting for delayed measurements, or it can be (ii) timely but inaccurate, since the processing is done based on available data. In my talk I will discuss this uncertainty principle and present research challenges derived from it.

1 June 2010
CTIT Symposium Dependable ICT: who cares?
The central theme of CTIT's annual symposium is Dependable ICT. A system is dependable if we can justifiably rely on its services. A dependable system should be robust against unavoidable physical faults, for instance a jammed communication channel. Also, a dependable system should resist human error, be it during operation or at design time, for instance software errors. Dependable ICT systems should even defend themselves against malicious attacks by intrusion or abuse.

8 June 2010
Rezwan Huq
Facilitating Fine Grained Data Provenance using Temporal Data Model
E-science applications use fine-grained data provenance to maintain the reproducibility of scientific results, i.e., for each processed data tuple, the data used to process the data tuple as well as the approach used is documented. Since most e-science applications perform on-line processing of sensor data using overlapping time windows, the overhead of maintaining fine-grained data provenance is huge, especially in longer data processing chains. This is because data items are used by many time windows. Here, we propose an approach to reduce storage costs for fine-grained data provenance by maintaining data provenance on the relation level instead of on the tuple level and making the content of the used database reproducible. The approach has been implemented as a prototype for streaming and manually sampled data.

15 June 2010
Juan Amiguet Vercher
Annotations: Purposeful Stream Data Processing
In E-Science, data provenance and technical data quality measurements are two major recent contributions, aiming to make data processing more accountable and verifiable. Both techniques have a series of drawbacks. Important changes impacting data interpretation cannot be recognised from changes in data quality measurements. A sensor may continue reporting correctly yet its environment can change without it being reflected in its data. Annotations can address this by conveying information about the data. Annotations take the form of tokens, manually or automatically generated, which are streamed separately from the data. The information they convey can help drive the data transform or explain the impact of the latter. Issues discussed are: Stream non-update principle violation, Stream synchronisation, Incomplete annotation understanding, and Data Invalidation through partial annotation implementation.

22 June 2010
Dolf Trieschnigg
A Cross-lingual Framework for Monolingual Biomedical Information Retrieval
We approach the incorporation of a concept-based representation in monolingual biomedical IR from a cross-lingual perspective. In the proposed framework, this is realized by translating and matching between text and concept-based representations. We compare six translation models and measure their effectiveness in the biomedical domain. We demonstrate that the approach can result in significant improvements in retrieval effectiveness over word-based retrieval. Moreover, we demonstrate increased effectiveness of a cross-lingual IR framework for monolingual biomedical IR if basic translation models are combined.

29 June 2010
Robin Aly
From Stars and Planets to Information Retrieval:
Events, Event Spaces and Random Variables in IR.
Recently, a discussion about the event spaces used for probability functions in IR emerged in the research community. Based on practical examples I will explain the different assumptions of a selection of models.

6 July 2010
Design Project Groups
Design project: jury management for the Nederlandse AtletiekUnie:
Five students developed a jury management application for the AtletiekUnie. For this purpose they built a framework and, on top of it, a web application that administers the deployment of jury members at competitions and tackles scheduling problems.
XML-DataSet Converter
An adapter was developed for converting an XML document into a DataSet (relational database). Another application must be able to manipulate the data in the DataSet (changing, adding and deleting records). The adapter must then convert the DataSet back into an XML document. The structure of the XML documents has to stay the same throughout.

28 September 2010 (Wednesday in ZI-2126)
Robin Aly
Guest lecture on Multimedia Information Retrieval
More and more information is stored as multimedia: from rap courses for kids and historical documents to academic publications in video format. Because of this data explosion, multimedia information retrieval is quickly gaining importance. This talk will give an overview of the field, starting from the main difference to text information retrieval: the data format of multimedia documents is incomprehensible to humans. Four different approaches for the understanding of multimedia documents are presented: human annotations, low-level feature vectors, spoken document words and concept-based representations. The overview of the first three approaches is kept at a high level and the focus of the talk is on concept-based retrieval.

6 October 2010 (Wednesday in ZI-2126)
Thijs Westerveld (Teezir B.V., Utrecht)
Guest lecture: Automatically Analyzing Word of Mouth
In this talk I will demonstrate Teezir's Opinion Analysis dashboards and discuss the underlying technology. For collecting content from web sites we developed advanced crawling technology that automatically identifies relevant news, blog and forum pages and extracts the relevant content and metadata. The collected content is then further analyzed to identify the main sentiments before everything is indexed to be disclosed in the online dashboards. Various sentiment analysis variants that have proven successful in an academic setting have been evaluated on our live collections. I will demonstrate that success on academic test collections does not necessarily imply the practical use of a sentiment analysis algorithm.

20 October 2010 (Wednesday in ZI-2126)
Arjen de Vries (CWI, Amsterdam)
How search logs can help improve future searches
In the European project Vitalas, we had the opportunity to analyze the search log data from a commercial picture portal of a European news agency, which offers access to photographic images to professional users. I will discuss how these logs can be used in various ways to improve image search: to expand the image representation, to make suggestions of alternative queries, to adapt the search results to user context, and to automatically build concept detectors for content-based image retrieval.

10 November 2010 (Wednesday, 11:30h. – 12.15h.)
Robin Aly
Exploiting Uncertainty about the Knowledge of Objects for Searching
The aim of this project is to improve the experience of users searching the internet for complex objects by exploiting the uncertainty a search engine has about the object's representation.

17 November 2010 (Wednesday, 11:30h. – 12.15h.)
Mena Habib
Neogeography: The Challenge of Channeling Large and Ill-behaved Data Streams
In this project, our broad objective is to propose a new portable, domain-independent XML-based technology that involves a set of free services that: enable end-user communities to express and share their spatial knowledge using free text; extract specific spatial information from this text; build a database from all the users’ contributions; and make use of this collective knowledge to answer users’ natural language questions through a question answering service.

2 December 2010 (Thursday)
Thomas Demeester (Ghent University, Belgium)
INTEC's Broadband Communication Networks group and Information Retrieval
The main purpose of this short and high-level presentation is to present our research group at Ghent University, in the light of a future collaboration. An overview of the different activities within the group will be followed by our work in the field of Information Retrieval, for a project with the Flemish digital audiovisual archive. Furthermore, as our collaboration might be initiated with a short stay of mine in your group, it could be interesting for you to know my background. I will briefly introduce myself, and explain how I made the change from the field of electromagnetics (my Ph.D.) to machine learning and information retrieval.

Beyond Shot Retrieval

Searching for Broadcast News Items Using Language Models of Concepts

by Robin Aly, Aiden Doherty, Djoerd Hiemstra, and Alan Smeaton

Current video search systems commonly return video shots as results. We believe that users may better relate to longer, semantic video units and propose a retrieval framework for news story items, which consist of multiple shots. The framework is divided into two parts: (1) a concept-based language model which ranks news items with known occurrences of semantic concepts by the probability that an important concept is produced from the concept distribution of the news item and (2) a probabilistic model of the uncertain presence, or risk, of these concepts. In this paper we use a method to evaluate the performance of story retrieval, based on the TRECVID shot-based retrieval ground truth. Our experiments on the TRECVID 2005 collection show a significant performance improvement against four standard methods.

The paper will be presented at the 32nd European Conference on Information Retrieval (ECIR) in Milton Keynes, UK (and in the DB colloquium of 24 March).


Guest lecture: Henke Pons of Arcadis


Hands on GIS by ARCADIS

Who: Henke Pons (Arcadis)
When: Friday 17 October, 8.30 h.
Where: HO-B1228

Henke Pons from ARCADIS Nederland will give a guest lecture, Hands on GIS by ARCADIS. Henke Pons is project leader for Geographic Information Systems at ARCADIS Spatial Information in Apeldoorn. He will talk about several special applications of GIS at ARCADIS.

More information on TeleTOP

DB colloquium: Suzan Verberne of Radboud University

Using Structural Information for Improving Why-Question Answering

Who: Suzan Verberne (Radboud University Nijmegen)
When: Tuesday September 30, 2008
Where: ZI-3126

My PhD research project “In Search of the Why” aims at developing a system for answering why-questions. Today I will present my recent work on extending a simple passage retrieval approach with structural information. The starting point is Lemur's TFIDF, which retrieves a relevant answer in the top 150 for 79% of the test questions. However, only 45% of the questions are answered in the top 10. We aim to improve the ranking by adding a re-ranking module. For re-ranking we consider a set of 31 features representing structural information of the question and answer candidate: syntactic structure as well as document structure. We find a significant improvement over the baseline for both MRR and Success@10, which is now 55%. The most important features for re-ranking are TFIDF (the baseline score), the presence of cue words, the question's main verb, and the relation between question focus and document title.

Joost de Wit graduates on evaluating recommender systems

Recommender systems use knowledge about a user’s preferences (and those of others) to recommend items that the user is likely to enjoy. Recommender system evaluation has proven to be challenging since a recommender system’s performance depends on, and is influenced by, many factors. The data set on which a recommender system operates, for example, has great influence on its performance. Furthermore, the goal for which a system is evaluated may differ and therefore require different evaluation approaches. Another issue is that the quality of a system recorded by the evaluation is only a snapshot in time since it may change gradually. Although there exists no consensus among researchers on which recommender system attributes to evaluate, accuracy is by far the most popular dimension to measure. However, some researchers believe that user satisfaction is the most important quality attribute of a recommender and that greater user satisfaction is not achieved by an ever increasing accuracy. Other dimensions for recommender system evaluation that are described in literature are coverage, confidence, diversity, learning rate, novelty and serendipity. It is believed that these dimensions contribute in some way to the user satisfaction achieved by a recommender system.

Joost performed a user study for which 133 people subscribed to an evaluation application specially designed and built for this purpose. The user study consisted of two phases. During the first phase users had to rate TV programmes they were familiar with or that they recently watched. This phase resulted in 36,353 programme ratings for 7,844 TV programmes. Based on this data, the recommender system that was part of the evaluation application could start generating recommendations. In phase two of the study the application displayed recommendations for tonight’s TV programmes to its users. These recommendation lists were deliberately varied with respect to the accuracy, diversity, novelty and serendipity dimensions. Another dimension that was altered was programme overlap. Users were asked to provide feedback on how satisfied they were with the list. Over a period of four weeks 70 users provided 9,762 ratings for the recommendation lists. For each of the recommendation lists that were rated in the second phase of the user study, the five dimensions (accuracy, diversity, novelty, serendipity and programme overlap) were measured using 15 different metrics. For each of these metrics its correlation with user satisfaction was determined using Spearman’s rank correlation. These correlation coefficients indicate whether there exists a relation between that metric and user satisfaction and how strong this relation is. It appeared that accuracy is indeed the most important dimension in relation to user satisfaction. Other metrics that had a strong correlation were user’s diversity, series level diversity, user’s serendipity and effective overlap ratio. This indicates that diversity, serendipity and programme overlap are important dimensions as well, although to a lesser extent.
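For readers unfamiliar with it, Spearman’s rank correlation between a metric and user satisfaction can be computed as sketched below. The implementation is a plain textbook version (Pearson correlation of the ranks, with tied values sharing an average rank), and the metric values and satisfaction ratings are made up for illustration; they are not data from the study.

```python
def ranks(values):
    """Rank values from 1 to n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the rank-transformed values."""
    return pearson(ranks(xs), ranks(ys))

# Made-up numbers: a metric value and a satisfaction rating for five lists.
metric = [0.10, 0.35, 0.40, 0.80, 0.95]
satisfaction = [1, 3, 2, 4, 5]
print(round(spearman(metric, satisfaction), 2))
```

A coefficient close to 1 or -1 indicates a strong monotonic relation between the metric and satisfaction, while a value near 0 indicates none, which is how the coefficients in the study are read.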


DB Master Students Colloquium

Next Friday, 25 April, there will be a DB master students colloquium at 13.45 h. in ZI-3126 with two speakers:

  • Alex van Oostrum will talk about: “The design of an object- and aspect oriented framework to facilitate software development of enterprise components”
  • Matthijs Ooms will talk about: “Provenance of Biomedical data”