2011’s DB colloquium – Djoerd Hiemstra

Below you find last year’s DB colloquia, usually held on Tuesdays from 13:45h to 14:30h in ZI-3126.

6 January 2011 (Thursday at 15.00 h.)

Andreas Wombacher
Data driven inter-model conformance checking
Sensor data document changes in the physical world; these changes can be understood using metadata that models part of the physical world. In the digital world, information systems are used for handling the exchange of information between different actors, where some information is related to physical objects. Since these objects are potentially the same as those observed by sensors, the sensor model (metadata) and the information system should describe the handling of physical objects in the same way, i.e., the information system and the sensor model should conform. So far, conformance checking has been done at the model level. I propose to use observed sensor data, and potentially information system data, to check conformance.

13 January 2011 (Thursday, 10:45-12:30h. in CR-3E)

Data Warehousing and Data Mining guest lectures

Tom Jansen (Distimo)
Data warehousing for app store analytics. Distimo is an innovative app store analytics company built to solve the challenges created by a widely fragmented app store marketplace filled with equally fragmented information and statistics.

Jacques Niehof and Alexandra Molenaar (SIOD)
Data Mining for selection profiles for fraud and misuse of social security. The Social Intelligence and Investigation Service (SIOD) of the Ministry of Social Affairs and Employment (Ministerie van Sociale Zaken en Werkgelegenheid) fights criminality in the field of social security.

8 February 2011

Paul Stapersma
A probabilistic XML database on top of MayBMS
We use the probabilistic XML model proposed by Van Keulen and De Keijzer to create a prototype of a probabilistic XML database. One disadvantage of the XML data model is that queries cannot be executed as efficiently as in the relational database model. Many non-probabilistic mapping techniques have been developed to map semi-structured data into relational databases to overcome this disadvantage. In this research, we use the schema-less mapping technique ‘XPath Accelerator’ to build a probabilistic XML database (PXML-DBMS) based on a URDBMS. A working prototype can be found at http://code.google.com/p/pxmlconverter/.

24 May 2011 (11:30h. in ZI-5126)

Almer Tigelaar
Search Result Caching in P2P Information Retrieval Networks
We explore the solution potential of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that a small bounded cache offers performance comparable to an unbounded cache. Furthermore, we explore partially centralised and fully distributed scenarios, and find that in the most realistic distributed case caching can reduce the query load by thirty-three percent. With optimisations this can be boosted to nearly seventy percent.
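The idea of a small bounded cache can be illustrated with a minimal sketch. The abstract does not say which eviction policy or data structures were used in the simulations; the least-recently-used policy, the class name BoundedResultCache and the helper cache_hit_fraction below are assumptions made purely for illustration.

```python
from collections import OrderedDict


class BoundedResultCache:
    """Bounded cache mapping a query string to its search results.

    Assumption: least-recently-used (LRU) eviction. The talk abstract
    does not state the policy that was actually simulated.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query):
        """Return cached results, or None when the query is not cached."""
        if query in self._entries:
            self._entries.move_to_end(query)  # mark as most recently used
            self.hits += 1
            return self._entries[query]
        self.misses += 1
        return None

    def put(self, query, results):
        """Store results and evict the least recently used entry if full."""
        self._entries[query] = results
        self._entries.move_to_end(query)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)


def cache_hit_fraction(cache, queries, search):
    """Replay a query stream and return the fraction answered from cache.

    `search` is a hypothetical callable standing in for the network lookup.
    """
    if not queries:
        return 0.0
    for query in queries:
        if cache.get(query) is None:
            cache.put(query, search(query))
    return cache.hits / len(queries)
```

Every query answered from the cache is one that does not have to be forwarded to other peers, which is how caching reduces query load in this kind of simulation.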

31 May 2011 (11:30h. in ZI-5126)

Kien Tjin-Kam-Jet
Free-Text Search over Complex Web Forms
This paper investigates the problem of using free-text queries as an alternative means for searching ‘behind’ web forms. We introduce a novel specification language for specifying free-text interfaces, and report the results of a user study where we evaluated our prototype in a travel planner scenario. Our results show that users prefer this free-text interface over the original web form and that they are about 9% faster on average at completing their search tasks.

7 June 2011

CTIT Symposium Security and Privacy – something to worry about?
The list of invited speakers includes Prof.dr.ir. Vincent Rijmen (TU Graz, Austria and KU Leuven), Dr. George Danezis (Microsoft Research), Dr. Steven Murdoch (University of Cambridge), Prof. Bert-Jaap Koops (TILT), Prof.mr.dr. Mireille Hildebrandt (RU Nijmegen), Dr.ir. Martijn van Otterlo (KU Leuven) and Prof. Pieter Hartel (UT).

21 June 2011

Andreas Wombacher
Sensor Data Visualization & Aggregation: A self-organizing approach
In this year's Advanced Database Systems course the students had the assignment to design and implement the database functionality to visualize sensor data based on user requests in a Web-based system. To guarantee good response times of the database it is necessary to pre-aggregate data. The core of the assignment was to find good pre-aggregations which minimize the query response times while using only a limited amount of storage space. The pre-aggregation levels must adjust when the characteristics of the user requests change. In this talk I will present the different approximation approaches of the students and present an optimal, NP-hard solution.
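As a rough illustration of what a single pre-aggregation level looks like, the sketch below groups raw readings into fixed-width time buckets and answers average queries from the buckets alone. It is not one of the student solutions; the function names, the (timestamp, value) tuple layout and the choice of storing count and sum per bucket are all assumptions.

```python
from collections import defaultdict


def pre_aggregate(readings, bucket_seconds):
    """Group (timestamp, value) readings into fixed-width time buckets.

    Returns a dict mapping each bucket start time to (count, sum).
    Storing count and sum lets averages over several buckets be
    combined exactly.
    """
    buckets = defaultdict(lambda: [0, 0.0])
    for timestamp, value in readings:
        start = (timestamp // bucket_seconds) * bucket_seconds
        buckets[start][0] += 1
        buckets[start][1] += value
    return {start: (count, total) for start, (count, total) in buckets.items()}


def average_over(buckets, start, end):
    """Average of all readings in buckets whose start lies in [start, end).

    Assumes start and end are aligned to bucket boundaries; otherwise
    the answer is only an approximation of the true range average.
    """
    count = 0
    total = 0.0
    for bucket_start, (n, s) in buckets.items():
        if start <= bucket_start < end:
            count += n
            total += s
    return total / count if count else None
```

Wider buckets need less storage but make queries over short or unaligned ranges less precise, which is the trade-off the assignment asks the students to balance.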

28 June 2011

Marijn Koolen (University of Amsterdam)
Relevance, Diversity and Redundancy in Web Search Evaluation
Diversity performance is often evaluated with a measure that combines relevance, subtopic coverage and redundancy. Although this is understandable from a user perspective, it is problematic when analysing the impact of diversifying techniques on diversity performance. Such techniques not only affect subtopic coverage, but often the underlying relevance ranking as well. An evaluation measure that conflates these aspects hampers our progress in developing systems that provide diverse search results. In this talk, I argue that to further our understanding of how system components affect diversity, we need to look at relevance, coverage and redundancy individually. Using the official runs of the TREC 2009 Diversity task, I show that differences in diversity performance are mainly due to differences in the relevance ranking, with only minimal differences in how the relevant documents are ordered amongst themselves. If we measure diversity independent of the relevance ranking, we find that some of the systems that perform badly on conflated measures have the most diverse ordering of relevant documents.

30 June 2011 (Thursday, 10:30-11:30h. in ZI-3126)

Design Project Presentations

Tristan Brugman, Maarten Hoek, Mustafa Radha, Iwan Timmer, Steven van der Vegt
UT-Search. UT Search was built as a replacement for Google Custom Search. We set out to build a search engine that uses knowledge of the various systems within the UT to also find the large amount of information that cannot be found through Custom Search. We use aggregated search to query the university's various systems. Through faceted search, the user can select which systems to search in order to refine the query. The goal is a central search engine for the university from which all the different systems can be searched, making the information more accessible.

Ralph Broenink, Rick van Galen, Jarmo van Lenthe, Bas Stottelaar, Niek Tax
Alexia: The drinks management system for Stichting Borrelbeheer Zilverling. With the activities of five associations, graduation drinks and other events, both drinks rooms of the Educafé are heavily booked. This high occupancy has created a need for a management system that can schedule drinks events, keep track of stock, manage bartenders and also register consumption during the events. As an extra, it is also possible to drink on account using RFID cards. During the presentation we show how, starting from the requirements of these five associations, we developed a web application that fulfils the functions above, using the latest technologies such as Django, CSS3 and HTML5.

12 July 2011

Sergio Duarte
Sergio will talk about his internship at Yahoo! Research, Barcelona.

23 August 2011

Rezwan Huq
Inferring Fine-grained Data Provenance in Stream Data Processing: Reduced Storage Cost, High Accuracy
Fine-grained data provenance ensures reproducibility of results in decision making, process control and e-science applications. However, maintaining this provenance is challenging in stream data processing because of its massive storage consumption, especially with large overlapping sliding windows. In this paper, we propose an approach to infer fine-grained data provenance by using a temporal data model and coarse-grained data provenance of the processing. The approach has been evaluated on a real dataset and the result shows that our proposed inferring method provides provenance information as accurate as explicit fine-grained provenance at reduced storage consumption.
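The general idea of inferring, rather than storing, which input tuples contributed to an output can be sketched as follows. This is a simplified illustration and not the method from the paper: it assumes a time-based sliding window of known size, that every input tuple carries a timestamp, and that an output produced at time t depends on exactly the inputs in the interval (t - window_size, t]. The function name and tuple layout are hypothetical.

```python
def infer_provenance(input_tuples, output_time, window_size):
    """Infer which timestamped inputs contributed to one output tuple.

    input_tuples: iterable of (timestamp, payload) pairs.
    output_time:  time at which the window produced the output.
    window_size:  length of the time-based sliding window.

    Assumption: the window covers the half-open interval
    (output_time - window_size, output_time], and processing delay
    is ignored.
    """
    lower = output_time - window_size
    return [t for t in input_tuples if lower < t[0] <= output_time]
```

Only the window size (coarse-grained provenance) and the timestamps have to be kept, so overlapping windows no longer cause the same input to be recorded once per output.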

30 August 2011 at 14:30h.

Maarten Fokkinga
Aggregation – polymorphic and polytypic
Repeating the work of Meertens “Calculate Polytypically!” we show how to define in a few lines a very general “aggregation” function. Our intention is to give a self-contained exposition that is, compared to Meertens' work, more accessible for the uninitiated reader who wants to see the idea with a minimum of formal details.

5 September 2011 (Monday at 13.30h.)

Robin Aly
Towards a Better Understanding of the Relationship Between Probabilistic Models in IR
Probability of relevance (PR) models are generally assumed to implement the Probability Ranking Principle (PRP) of IR, and recent publications claim that PR models and language models are similar. However, a careful analysis reveals two gaps in the chain of reasoning behind this statement. First, the PRP considers the relevance of particular documents, whereas PR models consider the relevance of any query-document pair. Second, unlike PR models, language models consider draws of terms and documents. We bridge the first gap by showing how the probability measure of PR models can be used to define the probabilistic model of the PRP. Furthermore, we argue that given the differences between PR models and language models, the second gap cannot be bridged at the probabilistic model level. We instead define a new PR model based on logistic regression, which has a similar score function to the one of the query likelihood model. The performance of both models is strongly correlated, hence providing a bridge for the second gap at the functional and ranking level. Understanding language models in relation with logistic regression models opens ample new research directions which we propose as future work.

20 September 2011

Juan Amiguet
Annotation propagation and topology based approaches
When making sensor data stream applications more robust to changes in the sensing environment by introducing annotations that represent the changes, we find ourselves needing to propagate such annotations across processing elements. We present a technique for performing such propagation that exploits the relations amongst the inputs and outputs from both an information-theoretic and a topological perspective. Topologies are used to describe the structure of the inputs and the outputs separately, whilst information theory techniques are used to model the transform as a channel, enabling the topological transformations to be treated as optimisation problems. We present a framework of functions which is generic across all transforms and which enables the maximisation of the entropy across the transform.

26 October 2011

Sergio Duarte
What and How Children Search on the Web
In this work we employed a large query log sample from a commercial web search engine to identify the struggles and search behavior of users ranging from children aged 6 to young adults aged 18. Concretely, we hypothesized that the large and complex volume of information to which children are exposed leads to ill-defined searches and to disorientation during the search process. For this purpose, we quantified their search difficulties based on query metrics (e.g. fraction of queries posed in natural language), session metrics (e.g. fraction of abandoned sessions) and click activity (e.g. fraction of ad clicks). We also used the search logs to retrace stages of child development. Concretely, we looked for changes in the user interests (e.g. distribution of topics searched), language development (e.g. readability of the content accessed) and cognitive development (e.g. sentiment expressed in the queries) among children and adults. We observed that these metrics clearly demonstrate an increased level of confusion and unsuccessful search sessions among children. We also found a clear relation between the reading level of the clicked pages and the demographic characteristics of the users, such as age and the average educational attainment of the zone in which the user is located.

29 November 2011

Rezwan Huq
Adaptive Inference of Fine-grained Data Provenance to Achieve High Accuracy at Lower Storage Costs
In stream data processing, data arrives continuously and is processed by decision making, process control and e-science applications. To control and monitor these applications, reproducibility of results is a vital requirement. However, storing fine-grained provenance data requires a massive amount of storage space, especially for transformations with overlapping sliding windows. In this paper, we propose techniques which can significantly reduce storage costs and can achieve high accuracy. Our evaluation shows that the adaptive inference technique can achieve more than 90% accurate provenance information for a given dataset at lower storage costs than the other techniques. Moreover, we present a guideline on the usage of the different provenance collection techniques described in this paper, based on the transformation operation and stream characteristics.